Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File CDK: Allow skipping unparseable file types #32092

Merged
merged 6 commits into from
Nov 9, 2023

Conversation

flash1293
Copy link
Contributor

@flash1293 flash1293 commented Nov 2, 2023

Allow skipping file types that can't be parsed:

Screenshot 2023-11-03 at 10 57 13

This will still fail

I didn't add the mime type filter because not all file-cdk-based sources support it so it might lead to unexpected behaviors and I'm not sure how critical this feature is.

Biggest open question is whether this should default to true or false.

Note: Until #31605 the behavior is a bit weird anyway at the moment:

  • In case an unprocessable file is encountered during check or discover, the job will fail
  • In case an unprocessable file is encountered during read, it will skip the file and continue with the next one

So, until #31605 is done, this flag will only take effect for check and discover (during the connection setup), not during the actual sync.

IMHO this makes it less critical and we can also wait for #31605 to be implemented before introducing it (which is planned for Q4b at the moment)

Copy link

vercel bot commented Nov 2, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 9, 2023 3:16pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Nov 2, 2023
.set_expected_records([])
).build()

# TODO When working on https://github.com/airbytehq/airbyte/issues/31605, this test should be split into two tests:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind putting a link to this in here so whoever implements it knows to update this?

@@ -14,3 +16,10 @@ class Config:
"unstructured",
const=True,
)

skip_unprocessable_file_types: Optional[bool] = Field(
default=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to default to False to be consistent with existing connections. Otherwise, if a user re-runs a sync they could end up syncing different files than they would have previously, despite not actually having signed up for a config change. I believe that would be considered a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't change existing connections - these don't have the property at all yet, so it will use the default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of an existing connection that has a sync error for whatever reason, then the connector is upgraded to this version, then the sync is retried. The files that will be processed are different in the two cases.

I don' think this is a huge deal though so if you feel that it's a better UX to default to True that's fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in slack. I'm comfortable either way on this. 👍

@octavia-squidington-iv octavia-squidington-iv requested a review from a team November 6, 2023 15:03
@@ -14,3 +16,10 @@ class Config:
"unstructured",
const=True,
)

skip_unprocessable_file_types: Optional[bool] = Field(
default=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of an existing connection that has a sync error for whatever reason, then the connector is upgraded to this version, then the sync is retried. The files that will be processed are different in the two cases.

I don' think this is a huge deal though so if you feel that it's a better UX to default to True that's fine.

Copy link
Collaborator

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. 🚀

As noted in slack, I really like the UX. Very easy to follow and the toggle is very inviting.

@@ -14,3 +16,10 @@ class Config:
"unstructured",
const=True,
)

skip_unprocessable_file_types: Optional[bool] = Field(
default=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in slack. I'm comfortable either way on this. 👍

@flash1293 flash1293 merged commit f1a11e1 into master Nov 9, 2023
15 checks passed
@flash1293 flash1293 deleted the flash1293/unstructured-filter-options branch November 9, 2023 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants