Source S3: speed up discovery #22500
/test connector=connectors/source-s3
Build Failed
@@ -52,6 +55,7 @@ def fileformatparser_map(self) -> Mapping[str, type]:
     ab_file_name_col = "_ab_source_file_url"
     airbyte_columns = [ab_additional_col, ab_last_mod_col, ab_file_name_col]
     datetime_format_string = "%Y-%m-%dT%H:%M:%S%z"
+    parallel_tasks_size = 256
parallel_tasks_size = 128: 6 min 50 s
parallel_tasks_size = 256: 6 min 30 s
parallel_tasks_size = 512: almost the same
☝️ The failing test is unrelated to this change. It fails on a mismatch between the actual and expected schema (avro).
9789607 to a70d6e8
/publish connector=connectors/source-s3 run-tests=false
If you have connectors that successfully published but failed definition generation, follow step 4 here.
What
https://github.com/airbytehq/oncall/issues/1470
How
When discover is called, the connector reads the files one by one, infers the schema of each, and then merges them into a single final schema. With many files this takes a long time. This PR introduces a threaded approach to reading the files.

For the connector mentioned in the oncall issue, it used to take about 1 second to read each of the 3.3k files in the bucket (roughly 55 minutes of reads in total), plus 60-90 seconds to fetch the list of files. Now the whole discover takes about 6.5 minutes.
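
For readers who want a picture of the approach, here is a minimal sketch of threaded schema inference in batches. It is not the code from this PR: `discover_schema`, `infer_schema`, `merge_schemas`, and the "fall back to string on a type conflict" rule are hypothetical names and assumptions made for illustration. Only `parallel_tasks_size` and its default of 256 come from the diff.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Schema = Dict[str, str]  # column name -> type name (simplified for the sketch)


def merge_schemas(schemas: Iterable[Schema]) -> Schema:
    """Merge per-file schemas into one.

    Assumed policy (not taken from the PR): when two files disagree on a
    column's type, widen that column to "string".
    """
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                merged[column] = "string"
            else:
                merged[column] = dtype
    return merged


def discover_schema(
    file_keys: List[str],
    infer_schema: Callable[[str], Schema],  # hypothetical per-file reader
    parallel_tasks_size: int = 256,
) -> Schema:
    """Infer schemas for all files using a thread pool, one batch at a time."""
    merged: Schema = {}
    with ThreadPoolExecutor(max_workers=parallel_tasks_size) as pool:
        for start in range(0, len(file_keys), parallel_tasks_size):
            batch = file_keys[start:start + parallel_tasks_size]
            # pool.map runs infer_schema concurrently and yields results
            # in the same order as the input batch.
            merged = merge_schemas([merged, *pool.map(infer_schema, batch)])
    return merged
```

Threads fit here because reading files from S3 is I/O-bound, so workers spend most of their time waiting on the network. That may also explain why raising the batch size from 256 to 512 gave almost no further gain in the timings reported above.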