
Issues related to S3 COPY command for Redshift

We use the S3 COPY command to move data from S3 into Redshift tables. Make sure the schema for the Redshift table is created before running your COPY command. I'm writing this post to log all the errors related to the COPY command that I faced, which might help others save some time.
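
For example, the target table might be created up front like the minimal sketch below (table and column names are hypothetical; adjust the types to your data):

-- Hypothetical target table; create it before running COPY
create table if not exists public.orders (
    order_id     bigint,
    customer_id  bigint,
    order_note   varchar(256),
    created_at   timestamp
);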

Copy command that I have used for a while:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
NULL AS 'NULL'
EMPTYASNULL
delimiter ','
IGNOREHEADER 1;

Problem 1: Delimiter issues with a CSV file. The default delimiter for a CSV file is the comma, but keeping a plain comma as the delimiter fails with the error below when the data itself also contains commas.

Error Code:  1213, Error Description: Missing data for not-null field
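
When a COPY fails like this, the stl_load_errors system table usually shows the offending file, line, and column; a quick check (assuming your user can read the system tables) looks like:

-- Most recent load errors, newest first
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;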

AWS Documentation link that helped in solving this issue:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv

Solution 1.1: Since the issue is with the comma delimiter, I specified the file format as CSV (which handles quoted fields) and ignored the header row. This fixed the issue.

copy <Redshift Table Name>
from 's3://<location of the file>'
iam_role '<iam role arn>'
csv
IGNOREHEADER 1;

Solution 1.2: Using ESCAPE together with REMOVEQUOTES should work as well:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
delimiter ','
escape
REMOVEQUOTES
IGNOREHEADER 1;


Problem 2: COPY cannot load data into views (and, at the time of writing, Redshift did not support materialized views). I tried copying a view directly from S3 to Redshift, which resulted in the error below:

Error: psycopg2.errors.InternalError_: Invalid table
DETAIL:
-----------------------------------------------
error:    Invalid table
code:     8001
context:  ************ is not a table. Views cannot be loaded with the COPY command.
query:    0
location: compression_analyzer.cpp:432
process:  padbmaster [pid=16256]
-----------------------------------------------

Solution 2.1: Materialize the view as a table and then COPY it, if you don't care whether it is a view or a table on Redshift.

Solution 2.2: First COPY all the underlying tables, and then CREATE VIEW on Redshift.
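
A minimal sketch of Solution 2.2, assuming a hypothetical underlying table public.orders and a view built on top of it (names and columns are illustrative):

-- 1. Load the underlying table first
copy public.orders
from '<s3 file location>'
iam_role '<iam role arn>'
csv
IGNOREHEADER 1;

-- 2. Then recreate the view on top of the loaded table
create view public.recent_orders as
select order_id, customer_id, created_at
from public.orders
where created_at > dateadd(day, -30, current_date);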

FYI: Problem with JSON format
The format of the file you are copying also plays an important role in the issues you hit with the COPY command. I'm using CSV files for most of my test runs because I ran into a 4 MB per-object limit while dealing with JSON, and Redshift's JSON parsing has more quirks to deal with. FYI, here is the error info I got related to JSON:


psycopg2.errors.InternalError_: The total size of the JSON object exceeds the max limit of 4194304 bytes
DETAIL: 
 ----------------------------------------------
error:  The total size of the JSON object exceeds the max limit of 4194304 bytes
code:      8001
--------------------------------------------------
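
For reference, a JSON load typically looks like the sketch below (using the 'auto' column mapping; table name, file location, and role are placeholders). Each individual JSON object in the file still has to stay under the 4 MB limit:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
json 'auto';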


Problem 3: "Delimiter not found" error while running the COPY command.

Try FILLRECORD, which loads missing trailing columns as zero-length strings or NULLs:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
csv
IGNOREHEADER 1
FILLRECORD;


Problem 4: Issues with timestamps when the file is comma separated but not loaded with the csv format option.

Solution:

copy <redshift table name>
from 's3://location' 
iam_role 'iam role arn'
delimiter ','
escape
NULL as 'null'
REMOVEQUOTES
TIMEFORMAT 'auto';
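
If 'auto' does not pick up your timestamps, an explicit format string is worth trying; a sketch assuming values like 2021-01-31 23:59:59 (the format string is an assumption, match it to your data):

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
delimiter ','
escape
REMOVEQUOTES
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS';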
