destination-s3: add file transfer #46302

Merged
merged 12 commits into master from stephane/10-01-destination-s3_add_file_transfer
Oct 30, 2024

stephane-airbyte
Contributor

@stephane-airbyte stephane-airbyte commented Oct 1, 2024

Adding file transfer to destination-s3.

File transfer and record-based sync are mutually exclusive. The platform sets the environment variables USE_FILE_TRANSFER to true and AIRBYTE_STAGING_DIRECTORY to the mount point of the staging directory when the destination supports file transfer and the source has enabled it in its config.
destination-s3 checks USE_FILE_TRANSFER to decide whether to enable file transfer or record-based sync.
Record-based integration tests are all passing, and there's an extra test that makes sure file-based transfer is disabled.
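
As a rough illustration of the gating described above, here is a minimal Kotlin sketch. Only the two environment variable names come from the PR description; `SyncPath`, `resolveSyncPath`, and the choice to fail when the staging directory is missing are assumptions made for the example, not the connector's actual code.

```kotlin
/** Which write path the destination should take (hypothetical type). */
enum class SyncPath { FILE_TRANSFER, RECORD_BASED }

/**
 * Hypothetical helper: picks the write path from an environment map.
 * Only the variable names USE_FILE_TRANSFER and AIRBYTE_STAGING_DIRECTORY
 * are taken from the PR description; everything else is illustrative.
 */
fun resolveSyncPath(env: Map<String, String> = System.getenv()): SyncPath {
    // String?.toBoolean() is true only for a non-null value equal to "true" (ignoring case).
    if (!env["USE_FILE_TRANSFER"].toBoolean()) return SyncPath.RECORD_BASED

    // Assumption: file transfer is unusable without a staging directory, so fail fast.
    val stagingDir = env["AIRBYTE_STAGING_DIRECTORY"]
    require(!stagingDir.isNullOrBlank()) {
        "USE_FILE_TRANSFER is true but AIRBYTE_STAGING_DIRECTORY is not set"
    }
    return SyncPath.FILE_TRANSFER
}
```

Passing the environment as a map keeps the decision easy to test without touching real process variables.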

Contributor Author

stephane-airbyte commented Oct 1, 2024

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit labels Oct 1, 2024
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 5649507 to 0a94310 Compare October 1, 2024 23:51
@@ -36,6 +36,11 @@ object UploadFormatConfigFactory {
FileUploadFormat.PARQUET -> {
UploadParquetFormatConfig(formatConfig)
}
FileUploadFormat.RAW_FILES ->
Contributor

Oh nice. This sidesteps any questions w/r/t conversion.

@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 0a94310 to 7baeb75 Compare October 3, 2024 00:31
@stephane-airbyte stephane-airbyte changed the base branch from stephane/09-30-cdk-java_add_file_transfer_mount_to_destinationacceptancetest to stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods October 3, 2024 00:31
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 6f87147 to 52c0fe1 Compare October 3, 2024 15:59
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 7baeb75 to 2dd1a87 Compare October 3, 2024 15:59
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 52c0fe1 to 95a7d03 Compare October 7, 2024 22:05
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 2dd1a87 to 75cc841 Compare October 7, 2024 22:05
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 95a7d03 to 721ddfa Compare October 8, 2024 18:27
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 75cc841 to f0f2536 Compare October 8, 2024 18:28
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch 8 times, most recently from e2bb0c0 to e1dd9ce Compare October 9, 2024 01:29
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 721ddfa to 8a78c22 Compare October 9, 2024 17:19
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch 2 times, most recently from a371928 to cd74813 Compare October 9, 2024 18:57
@stephane-airbyte stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 8a78c22 to 48a9e9e Compare October 9, 2024 20:51
}
val flushFunction =
if (featureFlags.useFileTransfer()) {
FileTransferDestinationFlushFunction(
Contributor

What determines whether the feature flag is set? The fact that the source is flagged as a file source? Explicit opt-in at the sync level?

Contributor Author

Fair question.

Basically, the source configuration needs to have a specific parameter enabled (I don't know the details of the parameter) AND the destination needs to have supportsFileTransfer set to true in its metadata.yaml. If both conditions are true, then the 2 environment variables are set accordingly, a common volume is mounted on both containers, and all records are expected to be file-based instead of record-based.
If the source config has the parameter set to true and the destination doesn't support file transfer, the platform will throw an exception.
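
The decision described in this reply can be sketched as a small truth table. This is an illustrative Kotlin sketch only: `SyncSetup` and `shouldUseFileTransfer` are hypothetical names, and the exception type is an assumption, since the comment only says the platform throws.

```kotlin
/** Hypothetical summary of the two inputs the platform is said to look at. */
data class SyncSetup(
    val sourceFileTransferEnabled: Boolean,        // the source config parameter
    val destinationSupportsFileTransfer: Boolean,  // supportsFileTransfer in metadata.yaml
)

/**
 * Returns true when file transfer should be used, false for record-based sync.
 * Throws when the source asks for file transfer but the destination cannot do it.
 */
fun shouldUseFileTransfer(setup: SyncSetup): Boolean {
    if (setup.sourceFileTransferEnabled && !setup.destinationSupportsFileTransfer) {
        // Assumption: the real platform's exception type is not stated in the thread.
        throw IllegalStateException("Source requested file transfer but destination does not support it")
    }
    return setup.sourceFileTransferEnabled && setup.destinationSupportsFileTransfer
}
```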

Contributor

@johnny-schmidt johnny-schmidt left a comment

Looks good. The shim seems like it's in the best place, and the file flush function is straightforward. I didn't have enough time to go over the tests in detail, but high-level how we're adding the file option to the docker env is clear.

One question about the env variables just to help me plan for the new CDK, but that's my own curiosity.

Contributor

@edgao edgao left a comment

:shipit: had one question about the protocol

java.util.List.of(
ConfiguredAirbyteStream()
.withSyncMode(SyncMode.INCREMENTAL)
.withDestinationSyncMode(DestinationSyncMode.APPEND_DEDUP)
Contributor

are we just ignoring this sync mode? (.... are we expected to behave differently in overwrite/append mode?)

Collaborator

@aaronsteers aaronsteers Oct 25, 2024

@edgao
I spoke with Stephane about this. Plan is to add as follow-up. For now:

  1. Same file synced twice overwrites/updates the prior version written.
  2. No support for purging old files via reset.

This matches the business requirements as I understand them in this first iteration, so I think we are good for now.

Contributor

nice. so in particular - that means we can do a blind "write <file> to <path>", i.e. we don't need to check if the file already exists 🚛
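
To make the "blind write" idea concrete, here is a small Kotlin sketch that uses the local filesystem as a stand-in for the bucket. It is not the S3 API and not the connector's code; `blindWrite` is a hypothetical name. It only shows the semantics being discussed: writing the same path twice replaces the earlier copy, with no existence check beforehand.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

/**
 * Hypothetical helper: copies source to target without checking whether
 * target already exists. A second call for the same target overwrites it.
 */
fun blindWrite(source: Path, target: Path): Path {
    // Create missing parent directories so the copy itself never needs a pre-check.
    target.parent?.let { Files.createDirectories(it) }
    return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING)
}
```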

Contributor Author

yeah, I think there are 2 roadblocks to being "better" about sync modes:

  1. SFTP doesn't let us see deletes. It only exposes the current state.
  2. destination state would really allow us to know which files we saved without slow and expensive S3 calls (we could even store a hash in there, and add that to the file transfer protocol)
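
The second point is a proposal rather than something this PR implements. Purely as a sketch of what "store a hash in destination state" could look like, assuming an in-memory map in place of real destination state (`UploadedFileState` and `sha256Hex` are hypothetical names):

```kotlin
import java.security.MessageDigest

/** Hex-encoded SHA-256 of the given bytes. */
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

/**
 * Hypothetical stand-in for destination state: remembers the hash of each
 * file already written so unchanged files can be skipped without asking S3.
 */
class UploadedFileState(private val hashesByPath: MutableMap<String, String> = mutableMapOf()) {
    /** True when the path is unknown or its content hash has changed. */
    fun needsUpload(path: String, content: ByteArray): Boolean =
        hashesByPath[path] != sha256Hex(content)

    /** Records the hash after a successful write. */
    fun recordUpload(path: String, content: ByteArray) {
        hashesByPath[path] = sha256Hex(content)
    }
}
```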

Collaborator

@aaronsteers aaronsteers left a comment

For my part, this looks good to go when ready!

@stephane-airbyte
Contributor Author

stephane-airbyte commented Oct 26, 2024

I'll merge this on monday morning HI time (so probably around 9AM PST) if everyone is OK with the timing (especially @johnny-schmidt and @edgao as they would have to deal with potential oncall issues). Please 👍 or 👎 this post to confirm timing of merge (I've added a 👎 and a 👍 so it's easier for everyone. Doesn't mean I'm against the timing I'm suggesting, obviously)

This reverts commit d8a3bb0.
@benmoriceau benmoriceau force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 42d9f6d to 53a8182 Compare October 29, 2024 14:27
@benmoriceau
Contributor

@edgao @johnny-schmidt I made the fix in DetectStreamToFlush and have re-requested a review.

@@ -4,7 +4,7 @@ plugins {
}

airbyteJavaConnector {
- cdkVersionRequired = '0.46.1'
+ cdkVersionRequired = '0.48.0'
features = ['db-destinations', 's3-destinations']
useLocalCdk = true
Contributor

I need to change that once the CDK is published.

Contributor

@edgao edgao left a comment

@johnny-schmidt in case you have thoughts - IMO (a) in general we don't care that much about the async framework, and (b) in particular I don't care enough to figure out why the existing queue size tracker stuff isn't working as expected

lgtm from my side, had a few style nitpicks

runningFlushWorkers,
AtomicBoolean(false),
flusher,
true
Contributor

nit: use named parameters for primitive arguments

Suggested change
- true
+ isFileTransfer = true

Contributor

Done
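
For readers unfamiliar with the nit, here is a tiny Kotlin sketch of the difference. `FlushConfig` is a hypothetical stand-in for the constructor under review; only the parameter names mentioned in this PR (`isFileTransfer`, later `flushOnEveryMessage`) come from the thread.

```kotlin
// Hypothetical stand-in class; not the real CDK constructor.
class FlushConfig(
    val workerCount: Int = 5,
    val flushOnEveryMessage: Boolean = false,
)

// Positional: the reader has to look up what `true` means.
fun positionalCall(): FlushConfig = FlushConfig(5, true)

// Named: the call site documents itself and survives parameter reordering.
fun namedCall(): FlushConfig = FlushConfig(workerCount = 5, flushOnEveryMessage = true)
```

Both calls build the same object; the named form is only about readability at the call site.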

@@ -54,6 +54,7 @@ constructor(
workerPool: ExecutorService = Executors.newFixedThreadPool(5),
private val airbyteMessageDeserializer: AirbyteMessageDeserializer =
AirbyteMessageDeserializer(),
private val isFileTransfer: Boolean = false,
Contributor

nit: rename to flushOnEveryMessage (to reflect functionality rather than usage)

Contributor

Done

@@ -63,7 +63,7 @@ class BufferDequeue(

// otherwise pull records until we hit the memory limit.
val newSize: Long = (memoryItem.size) + bytesRead.get()
- if (newSize <= optimalBytesToRead) {
+ if (newSize <= optimalBytesToRead || output.isEmpty()) {
Contributor

What is this accomplishing? Is this because other changes caused optimalBytesToRead to be zero?

Contributor

@benmoriceau benmoriceau Oct 30, 2024

It is not 0 (it's 1), but yes, it allows a record to be added to the output regardless of its size.

Contributor Author

The problem here was that if bytesRead == 0 && memoryItem.size > optimalBytesToRead, we never add anything to the output. So now, regardless of memoryItem.size or optimalBytesToRead, if the output is still empty, we add the current item.
For fileTransfer we set optimalBytesToRead to 1 so that we force a flush for each message, but with such a small value any message is bigger than the optimal size, which causes an infinite loop.
Note that the infinite loop could also happen if memoryItem.size was big enough and optimalBytesToRead was small enough. With our current settings, I don't believe it's possible, but it's just a couple of setting tweaks away...
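
The effect of the extra `output.isEmpty()` condition can be reproduced in a simplified model. The Kotlin sketch below is not the real BufferDequeue; `MemoryItem` and `takeBatch` are hypothetical stand-ins that keep only the size accounting discussed in this thread.

```kotlin
/** Simplified stand-in for a queued message; only its size matters here. */
data class MemoryItem(val payload: String, val size: Long)

/**
 * Pulls items from the front of the queue until the byte budget is reached.
 * The `output.isEmpty()` clause guarantees that a non-empty queue always
 * yields at least one item, even when that item alone exceeds the budget.
 */
fun takeBatch(queue: ArrayDeque<MemoryItem>, optimalBytesToRead: Long): List<MemoryItem> {
    val output = mutableListOf<MemoryItem>()
    var bytesRead = 0L
    while (true) {
        val next = queue.firstOrNull() ?: break
        val newSize = next.size + bytesRead
        if (newSize <= optimalBytesToRead || output.isEmpty()) {
            output.add(queue.removeFirst())
            bytesRead = newSize
        } else {
            break // budget reached; leave the rest for the next batch
        }
    }
    return output
}
```

With `optimalBytesToRead = 1`, each call returns exactly one item, which is the flush-per-message behaviour described for file transfer. Without the `output.isEmpty()` clause, the same call would return an empty batch every time and the caller would spin.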

@benmoriceau
Contributor

benmoriceau commented Oct 30, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/11599674873
❌ Publish Java CDK version=0.48.0 failed!

@benmoriceau
Contributor

benmoriceau commented Oct 30, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/11599816233
✅ Successfully published Java CDK version=0.48.1!

@benmoriceau benmoriceau enabled auto-merge (squash) October 30, 2024 19:26
@benmoriceau benmoriceau merged commit 8957119 into master Oct 30, 2024
38 checks passed
@benmoriceau benmoriceau deleted the stephane/10-01-destination-s3_add_file_transfer branch October 30, 2024 19:29
Labels
area/connectors: Connector related issues
area/documentation: Improvements or additions to documentation
CDK: Connector Development Kit
connectors/destination/s3-glue
connectors/destination/s3-v2
connectors/destination/s3

6 participants