Destination Redshift: (Performance) Add parallel loading to Redshift via manifest file #6583
Conversation
…y command - rather than a series of copy commands per AWS Redshift best practices
- Also made sure to clean up the manifest file on cleanup
@@ -8,5 +8,5 @@ COPY build/distributions/${APPLICATION}*.tar ${APPLICATION}.tar

 RUN tar xf ${APPLICATION}.tar --strip-components=1

-LABEL io.airbyte.version=0.3.5
+LABEL io.airbyte.version=0.3.6
I wasn't sure whether I needed to bump this, since the changes here are passive with respect to this connector.
public void copyStagingFileToTemporaryTable() {
  var possibleManifest = Optional.ofNullable(createManifest());
  LOGGER.info("Starting copy to tmp table: {} in destination for stream: {}, schema: {}, .", tmpTableName, streamName, schemaName);
  possibleManifest.stream()
      .map(this::putManifest)
      .forEach(this::executeCopy);
  LOGGER.info("Copy to tmp table {} in destination for stream {} complete.", tmpTableName, streamName);
}
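(For reference, a Redshift COPY manifest is just a small JSON document listing the staged S3 objects to load in a single command. Below is a minimal, hypothetical sketch of assembling one with Jackson; the names are illustrative and are not the PR's actual Manifest/Entry classes.)

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Set;

public class ManifestSketch {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Redshift expects: {"entries": [{"url": "s3://bucket/key", "mandatory": true}, ...]}
  public static String createManifest(final Set<String> stagingFileUrls) throws Exception {
    if (stagingFileUrls.isEmpty()) {
      return null; // nothing staged for this stream; the caller can skip the COPY entirely
    }
    final ObjectNode manifest = MAPPER.createObjectNode();
    final ArrayNode entries = manifest.putArray("entries");
    for (final String url : stagingFileUrls) {
      final ObjectNode entry = entries.addObject();
      entry.put("url", url);
      entry.put("mandatory", true); // fail the COPY if a listed file is missing
    }
    return MAPPER.writeValueAsString(manifest);
  }
}
```

Returning null for an empty file set lines up with the Optional.ofNullable(createManifest()) pattern in the diff above: no manifest means no COPY is executed.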
  @Override
- public void copyS3CsvFileIntoTable(JdbcDatabase database, String s3FileLocation, String schema, String tableName, S3Config s3Config)
-     throws SQLException {
+ public void copyS3CsvFileIntoTable(JdbcDatabase database, String s3FileLocation, String schema, String tableName, S3Config s3Config) {
+   throw new RuntimeException("Redshift Stream Copier should not copy individual files without use of a manifest");
  }
I know this breaks the initial abstraction, but it was necessary, and I wanted to keep this change surgical rather than making structural changes to all related S3 connectors.
  private final ExtendedNameTransformer nameTransformer;
  private final SqlOperations sqlOperations;
- private final Set<String> s3StagingFiles = new HashSet<>();
+ protected final Set<String> s3StagingFiles = new HashSet<>();
This was probably the only variable that definitely needed to be changed to protected for child-class access, but I changed the others as well because it seemed cleaner than redefining them all in RedshiftStreamCopier.java.
Thanks @archaean, we're going to review and give you feedback about your implementation. Did you deploy as …
tableName,
s3FileLocation,
+ "STATUPDATE OFF\n"
+ "MANIFEST;",
Hi, @archaean. I only have one question. I deliberately split the single temporary file on S3 or GCS for each stream into several smaller files to avoid a timeout exception in the COPY query from file to temp table; this happened when working with a large amount of data. Are you sure that using the manifest won't bring us back to this issue?
@andriikorotkov, I guess there are edge cases where it could, but given the parallelism the data would likely need to be an order of magnitude larger than whatever caused the timeout in your case.
I don't know exactly which timeout exception you are referring to, but if it came from the Redshift connection it was likely ERROR: Query (150) canceled on user's request, which is generally raised when a query hits the query timeout limit in Redshift. I assume the COPY command is also subject to this limit, but the timeout is configurable.
Other timeouts are possible for long-running COPY commands due to firewall configurations.
One possible implementation would be to further batch the files into some arbitrary number of groups (hopefully close to the number of shards on the Redshift cluster), build a manifest file for each batch, and run serial COPY commands that way.
So to answer your question: it won't be the same issue, because a manifest-based COPY should run an order of magnitude faster than serial COPY commands, depending on the configuration of the Redshift cluster. The query timeout limit can also be configured to avoid this issue. And while batching the files into multiple manifests and COPY commands is an option, at this point it seems to me like a premature optimization.
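(To make that batching idea concrete, here is a rough sketch, with illustrative names only and not part of this PR, of partitioning the staged files into several groups so that each group gets its own manifest and its own serial COPY.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BatchedManifestSketch {

  // Partition the staged S3 object URLs into at most `batchCount` roughly equal groups.
  public static List<List<String>> partition(final Set<String> stagingFileUrls, final int batchCount) {
    final List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < batchCount; i++) {
      batches.add(new ArrayList<>());
    }
    int i = 0;
    for (final String url : stagingFileUrls) {
      batches.get(i++ % batchCount).add(url);
    }
    batches.removeIf(List::isEmpty); // drop unused groups when there are few files
    return batches;
  }

  // Hypothetical usage: one manifest and one COPY per batch, executed serially.
  // for (List<String> batch : partition(s3StagingFiles, batchCount)) {
  //   executeCopy(putManifest(createManifest(batch)));
  // }
}
```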
@marcosmarxm I have not tested it with any datasets locally; I just ran the acceptance tests. I plan on running it against a medium-sized table today and benchmarking it against the old implementation.
I started some small load tests with a 1M-row table, beginning by benchmarking the current Redshift destination (0.3.14).
However, against master (not even my code) it kept failing after about 200K records (two separate failure logs below).
I didn't have time to dig in more, but this seems like a blocking issue.
@andriikorotkov @marcosmarxm, I am not going to have much time to debug this issue on my end this week. I am curious whether this problem can be recreated on master with a medium-sized dataset on your end. I am pretty sure my code has nothing to do with it (per my tests of …
Hello @archaean and @andriikorotkov! The tests are passing in our CI. Should we merge this, or wait to correct and improve the tests?
@marcosmarxm, I think we can merge this pull request. |
…via manifest file (airbytehq#6583)
* 4871 - Modified RedshiftStreamCopier to use manifest and a single copy command - rather than a series of copy commands per AWS Redshift best practices
* 4871 - Add manifest keyword to COPY Statement
* 4871 - added fault tolerance for empty entries in the manifest - Also made sure to clean up the manifest file on cleanup
* 4871 - Fixed erroneous documentation in RedshiftStreamCopier
* 4781 - Update outdated `RedshiftCopyS3Destination` documentation.
What
Issue: 4871
Add parallelization for copying multiple files into Redshift.
How
RedshiftStreamCopier now builds a single manifest file covering all of a stream's staged S3 objects and issues one COPY command with the MANIFEST keyword, rather than a series of COPY commands, per AWS Redshift best practices.
Recommended reading order
RedshiftStreamCopier.java
Manifest.java
Entry.java
S3StreamCopier.java
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with airbyte_secret
- Unit & integration tests added and passing: ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- Documentation updated:
  - Connector's README.md
  - Connector's bootstrap.md. See description and examples
  - docs/SUMMARY.md
  - docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  - docs/integrations/README.md
  - airbyte-integrations/builds.md

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- /test connector=connectors/<name> command is passing
- /publish command described here

Updating a connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with airbyte_secret
- Unit & integration tests added and passing: ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- Documentation updated:
  - Connector's README.md
  - Connector's bootstrap.md. See description and examples
  - docs/integrations/<source or destination>/<name>.md including changelog. See changelog example

Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- /test connector=connectors/<name> command is passing
- /publish command described here

Connector Generator
- The connector templates (those with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates, then checking in your changes