destination-s3: add file transfer #46302

stephane-airbyte · 2024-10-01T23:33:44Z

Adding file transfer to destination-s3.

File transfer and record-based sync are mutually exclusive. When the destination supports file transfer and the source has enabled it in its config, the platform sets the environment variable USE_FILE_TRANSFER to true and AIRBYTE_STAGING_DIRECTORY to the mount point of the staging directory.
destination-s3 checks USE_FILE_TRANSFER to decide whether to use file transfer or record-based sync.
Record-based integration tests are all passing, and there's an extra test that makes sure file-based transfer is disabled.
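
As a rough illustration of the check described above, here is a minimal Kotlin sketch of how a destination could pick its mode from the two environment variables. The names SyncStyle, TransferSettings and resolveTransferSettings are hypothetical and are not part of the connector or the CDK. Treating a missing staging directory as an error is also an assumption made for the sketch.

```kotlin
// Hypothetical helper, not the connector's actual code. It takes the environment
// as a map so the decision can be exercised without touching the real process env.
enum class SyncStyle { FILE_TRANSFER, RECORD_BASED }

data class TransferSettings(val style: SyncStyle, val stagingDirectory: String?)

fun resolveTransferSettings(env: Map<String, String>): TransferSettings {
    // USE_FILE_TRANSFER and AIRBYTE_STAGING_DIRECTORY are the variables named in the PR description.
    // String?.toBoolean() is true only when the value is non-null and equals "true" (case-insensitive).
    val useFileTransfer = env["USE_FILE_TRANSFER"].toBoolean()
    if (!useFileTransfer) {
        return TransferSettings(SyncStyle.RECORD_BASED, null)
    }
    // Assumption: file transfer without a staging directory is a configuration error.
    val stagingDirectory = requireNotNull(env["AIRBYTE_STAGING_DIRECTORY"]) {
        "AIRBYTE_STAGING_DIRECTORY must be set when USE_FILE_TRANSFER is true"
    }
    return TransferSettings(SyncStyle.FILE_TRANSFER, stagingDirectory)
}
```

In a real process this could be called as resolveTransferSettings(System.getenv()).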

stephane-airbyte · 2024-10-01T23:33:55Z

destination-s3 introduce a sleep to allow testing #48112
destination-s3: fix doc formatting #46695
destination-s3: add file transfer #46302 👈
simple split of DestinationAcceptanceTest #46689
destination-s3: fix tests #46281 : 2 other dependent PRs (#46325 , #46564 )
master

johnny-schmidt · 2024-10-02T00:04:47Z

...ions/src/main/kotlin/io/airbyte/cdk/integrations/destination/s3/UploadFormatConfigFactory.kt

@@ -36,6 +36,11 @@ object UploadFormatConfigFactory {
            FileUploadFormat.PARQUET -> {
                UploadParquetFormatConfig(formatConfig)
            }
+            FileUploadFormat.RAW_FILES ->


Oh nice. This sidesteps any questions w/r/t conversion.

johnny-schmidt · 2024-10-23T23:52:32Z

...destinations/src/main/kotlin/io/airbyte/cdk/integrations/destination/s3/S3ConsumerFactory.kt

+            }
+        val flushFunction =
+            if (featureFlags.useFileTransfer()) {
+                FileTransferDestinationFlushFunction(


What determines whether the feature flag is set? The fact that the source is flagged as a file source? Explicit opt-in at the sync level?

Fair question.

Basically, we need the source configuration to have a specific parameter enabled (I don't know the details of the parameter) AND the destination needs to have supportsFileTransfer set to true in its metadata.yaml. If those 2 conditions are true, then the 2 variables are set accordingly, a common volume is mounted on both containers, and it's expected that all records are file-based instead of record-based.
If the source config has the parameter set to true and the destination doesn't support file transfer, the platform will throw an exception.
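
The rule described in this reply can be sketched as a small decision function. This is only an illustration under stated assumptions: planSync, SyncPlan, RecordBasedSync and FileTransferSync are hypothetical names, and the real platform code is not shown in this thread. The only facts taken from the discussion are the two conditions, the two environment variables, and the exception when the source asks for file transfer but the destination does not support it.

```kotlin
// Hypothetical model of the platform-side decision; not the platform's actual code.
sealed interface SyncPlan
object RecordBasedSync : SyncPlan
data class FileTransferSync(val env: Map<String, String>) : SyncPlan

fun planSync(
    sourceRequestsFileTransfer: Boolean,      // the source-config parameter mentioned in the thread
    destinationSupportsFileTransfer: Boolean, // supportsFileTransfer in the destination's metadata.yaml
    stagingMount: String,                     // assumed mount point of the shared volume
): SyncPlan = when {
    // The source did not ask for file transfer: plain record-based sync.
    !sourceRequestsFileTransfer -> RecordBasedSync
    // The source asked for it but the destination cannot do it: fail loudly.
    !destinationSupportsFileTransfer ->
        throw IllegalStateException("Source requested file transfer but the destination does not support it")
    // Both conditions hold: set the two variables for the containers.
    else -> FileTransferSync(
        mapOf(
            "USE_FILE_TRANSFER" to "true",
            "AIRBYTE_STAGING_DIRECTORY" to stagingMount,
        )
    )
}
```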

johnny-schmidt

Looks good. The shim seems like it's in the best place, and the file flush function is straightforward. I didn't have enough time to go over the tests in detail, but high-level how we're adding the file option to the docker env is clear.

One question about the env variables just to help me plan for the new CDK, but that's my own curiosity.

edgao

had one question about the protocol

edgao · 2024-10-24T14:49:26Z

...estFixtures/kotlin/io/airbyte/cdk/integrations/destination/s3/S3DestinationAcceptanceTest.kt

+                    java.util.List.of(
+                        ConfiguredAirbyteStream()
+                            .withSyncMode(SyncMode.INCREMENTAL)
+                            .withDestinationSyncMode(DestinationSyncMode.APPEND_DEDUP)


are we just ignoring this sync mode? (.... are we expected to behave differently in overwrite/append mode?)

@edgao
I spoke with Stephane about this. Plan is to add as follow-up. For now:

Same file synced twice overwrites/updates the prior version written.

No support for purging old files via reset.

This matches the business requirements as I understand them in this first iteration, so I think we are good for now.

nice. so in particular - that means we can do a blind "write <file> to <path>", i.e. we don't need to check if the file already exists 🚛

yeah, I think there's 2 roadblocks to being "better" about sync modes :

SFTP doesn't let us see deletes. It only sees the current state.

destination state would really allow us to know which files we saved without slow and expensive S3 calls (we could even store a hash in there, and add that to the file transfer protocol)
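
To make the first-iteration semantics concrete, here is a small sketch of the blind "write <file> to <path>" behavior using an in-memory stand-in for a bucket. FakeBucket and blindWrite are hypothetical names for illustration; a real destination would issue S3 calls instead. The sketch shows only the two properties stated above: syncing the same path twice overwrites the earlier version, and nothing is purged.

```kotlin
// In-memory stand-in for a bucket (hypothetical; not an S3 client).
class FakeBucket {
    private val objects = mutableMapOf<String, ByteArray>()

    fun put(key: String, bytes: ByteArray) {
        objects[key] = bytes // last write wins, as with an object overwrite
    }

    fun get(key: String): ByteArray? = objects[key]

    fun keys(): Set<String> = objects.keys.toSet()
}

// Blind write: no existence check, no comparison with a previous version.
fun blindWrite(bucket: FakeBucket, path: String, content: ByteArray) {
    bucket.put(path, content)
}
```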

aaronsteers

For my part, this looks good to go when ready!

stephane-airbyte · 2024-10-26T18:39:56Z

I'll merge this on Monday morning HI time (so probably around 9AM PST) if everyone is OK with the timing (especially @johnny-schmidt and @edgao, as they would have to deal with potential on-call issues). Please 👍 or 👎 this post to confirm the timing of the merge. (I've added a 👎 and a 👍 so it's easier for everyone. That doesn't mean I'm against the timing I'm suggesting, obviously.)

This reverts commit d8a3bb0.

benmoriceau · 2024-10-30T15:18:13Z

@edgao @johnny-schmidt I made the fix in DetectStreamToFlush and have re-requested a review.

benmoriceau · 2024-10-30T15:39:27Z

airbyte-integrations/connectors/destination-s3/build.gradle

@@ -4,7 +4,7 @@ plugins {
 }

 airbyteJavaConnector {
-    cdkVersionRequired = '0.46.1'
+    cdkVersionRequired = '0.48.0'
    features = ['db-destinations', 's3-destinations']
    useLocalCdk = true


I need to change that once the CDK is published.

edgao

@johnny-schmidt in case you have thoughts - IMO (a) in general we don't care that much about the async framework, and (b) in particular I don't care enough to figure out why the existing queue size tracker stuff isn't working as expected

lgtm from my side, had a few style nitpicks

edgao · 2024-10-30T16:01:30Z

...ore/src/test/kotlin/io/airbyte/cdk/integrations/destination/async/DetectStreamToFlushTest.kt

+                runningFlushWorkers,
+                AtomicBoolean(false),
+                flusher,
+                true


nit: use named parameters for primitive arguments

Suggested change

-                true
+                isFileTransfer = true

edgao · 2024-10-30T16:04:09Z

...dk/core/src/main/kotlin/io/airbyte/cdk/integrations/destination/async/AsyncStreamConsumer.kt

@@ -54,6 +54,7 @@ constructor(
    workerPool: ExecutorService = Executors.newFixedThreadPool(5),
    private val airbyteMessageDeserializer: AirbyteMessageDeserializer =
        AirbyteMessageDeserializer(),
+    private val isFileTransfer: Boolean = false,


nit: rename to flushOnEveryMessage (to reflect functionality rather than usage)

johnny-schmidt · 2024-10-30T18:39:56Z

.../core/src/main/kotlin/io/airbyte/cdk/integrations/destination/async/buffers/BufferDequeue.kt

@@ -63,7 +63,7 @@ class BufferDequeue(

                // otherwise pull records until we hit the memory limit.
                val newSize: Long = (memoryItem.size) + bytesRead.get()
-                if (newSize <= optimalBytesToRead) {
+                if (newSize <= optimalBytesToRead || output.isEmpty()) {


What is this accomplishing? Is this because other changes caused optimalBytesToRead to be zero?

It is not 0 (it's 1), but yes: it allows a record to be added to the output regardless of its size.

The problem here was that if bytesRead == 0 && memoryItem.size > optimalBytesToRead, we never add anything to the queue. So now, regardless of memoryItem.size or optimalBytesToRead, if there's no item in the queue, we add the current one.
For file transfer we set optimalBytesToRead to 1 so that we force a flush for each message. But with such a small value, any message is bigger than the optimal size, which causes an infinite loop.
Note that the infinite loop could also happen if memoryItem.size was big enough and optimalBytesToRead was small enough. With our current settings, I don't believe it's possible, but it's just a couple of setting tweaks away...
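
The effect of the extra output.isEmpty() condition can be shown with a simplified model of the take loop. This is a sketch, not the actual BufferDequeue code: MemoryItem, takeBatch and the alwaysTakeFirst switch are hypothetical and exist only so the behavior before and after the change can be compared side by side.

```kotlin
// Simplified model of a size-limited batch read (hypothetical; not the CDK's BufferDequeue).
data class MemoryItem(val payload: String, val size: Long)

fun takeBatch(
    queue: ArrayDeque<MemoryItem>,
    optimalBytesToRead: Long,
    alwaysTakeFirst: Boolean, // true models the patched condition, false the original one
): List<MemoryItem> {
    val output = mutableListOf<MemoryItem>()
    var bytesRead = 0L
    while (true) {
        val next = queue.firstOrNull() ?: break
        val newSize = next.size + bytesRead
        // Patched check: an empty output always accepts one item, whatever its size.
        if (newSize <= optimalBytesToRead || (alwaysTakeFirst && output.isEmpty())) {
            output.add(queue.removeFirst())
            bytesRead = newSize
        } else {
            break
        }
    }
    return output
}
```

With optimalBytesToRead = 1 and a 10-byte first message, the original condition returns an empty batch and leaves the queue untouched, so a caller that retries until the queue drains never makes progress. With the patched condition each call returns exactly one message, which matches the "flush for each message" intent for file transfer.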

benmoriceau · 2024-10-30T18:44:35Z

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/11599674873
❌ Publish Java CDK version=0.48.0 failed!

benmoriceau · 2024-10-30T18:54:30Z

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/11599816233
✅ Successfully published Java CDK version=0.48.1!

octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit labels Oct 1, 2024

stephane-airbyte mentioned this pull request Oct 1, 2024

destination-s3: fix tests #46281

Merged

octavia-squidington-iii added the connectors/destination/s3 label Oct 1, 2024

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 5649507 to 0a94310 Compare October 1, 2024 23:51

johnny-schmidt reviewed Oct 2, 2024

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 0a94310 to 7baeb75 Compare October 3, 2024 00:31

stephane-airbyte changed the base branch from stephane/09-30-cdk-java_add_file_transfer_mount_to_destinationacceptancetest to stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods October 3, 2024 00:31

stephane-airbyte mentioned this pull request Oct 3, 2024

cdk-java: reorganize the DestinationAccptanceTest to split out the actual tests from all the util methods #46325

Closed

2 tasks

stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 6f87147 to 52c0fe1 Compare October 3, 2024 15:59

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 7baeb75 to 2dd1a87 Compare October 3, 2024 15:59

stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 52c0fe1 to 95a7d03 Compare October 7, 2024 22:05

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 2dd1a87 to 75cc841 Compare October 7, 2024 22:05

stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 95a7d03 to 721ddfa Compare October 8, 2024 18:27

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 75cc841 to f0f2536 Compare October 8, 2024 18:28

stephane-airbyte mentioned this pull request Oct 8, 2024

split DestinationAcceptanceTest #46651

Closed

2 tasks

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch 8 times, most recently from e2bb0c0 to e1dd9ce Compare October 9, 2024 01:29

stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 721ddfa to 8a78c22 Compare October 9, 2024 17:19

stephane-airbyte force-pushed the stephane/10-01-destination-s3_add_file_transfer branch 2 times, most recently from a371928 to cd74813 Compare October 9, 2024 18:57

stephane-airbyte force-pushed the stephane/10-02-cdk-java_reorganize_the_destinationaccptancetest_to_split_out_the_actual_tests_from_all_the_util_methods branch from 8a78c22 to 48a9e9e Compare October 9, 2024 20:51

johnny-schmidt reviewed Oct 23, 2024

johnny-schmidt approved these changes Oct 23, 2024

edgao approved these changes Oct 24, 2024

Log file copy info

d8a3bb0

aaronsteers approved these changes Oct 25, 2024

Revert "Log file copy info"

53a8182

This reverts commit d8a3bb0.

benmoriceau force-pushed the stephane/10-01-destination-s3_add_file_transfer branch from 42d9f6d to 53a8182 Compare October 29, 2024 14:27

benmoriceau added 5 commits October 29, 2024 08:36

Force the stream to flush

7ee9838

Use local CDK

75486ef

Use FF and test

415b532

Format

ca957df

Fix build

2000516

benmoriceau requested review from edgao and johnny-schmidt October 30, 2024 15:17

benmoriceau reviewed Oct 30, 2024

edgao approved these changes Oct 30, 2024

benmoriceau added 2 commits October 30, 2024 09:17

PR comments

07f7b66

Remove local CDK

bac819b

johnny-schmidt reviewed Oct 30, 2024

johnny-schmidt approved these changes Oct 30, 2024

Bump CDK version

1df388d

Bump s3 cdk version

4d742a8

benmoriceau enabled auto-merge (squash) October 30, 2024 19:26

benmoriceau merged commit 8957119 into master Oct 30, 2024
38 checks passed

benmoriceau deleted the stephane/10-01-destination-s3_add_file_transfer branch October 30, 2024 19:29

stephane-airbyte mentioned this pull request Nov 1, 2024

destination-s3 introduce a sleep to allow testing #48112

Draft

2 tasks
