
Commits and parquet file sink #197

Merged
merged 10 commits, Jul 21, 2023
Conversation

@jacksonrnewhouse (Contributor) commented Jul 11, 2023

This adds a Parquet sink and introduces a committing phase during checkpointing, which enables exactly-once sinks; the first such sink writes typed Parquet files to an S3 directory. Files are created with S3's multipart upload API, and a single Parquet file can span multiple checkpoints.
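The commit-phase idea can be sketched as a small two-phase pattern: at checkpoint time the sink hands back recoverable descriptions of pending work, and only after the whole checkpoint is durable does a commit make that work visible. All names below are hypothetical stand-ins, not Arroyo's actual trait or types:

```rust
// Hypothetical sketch of the two-phase sink pattern described in this PR.
// checkpoint() durably records what *would* be written; commit() makes it
// visible only after the checkpoint as a whole has completed.
trait TwoPhaseSink {
    type PreCommit;

    // Phase 1: flush buffered data, return recoverable pre-commit state
    // (e.g. in-flight multipart upload descriptions).
    fn checkpoint(&mut self) -> Vec<Self::PreCommit>;

    // Phase 2: finalize the pre-committed work (e.g. complete the uploads).
    fn commit(&mut self, pre_commits: Vec<Self::PreCommit>);
}

// Toy in-memory "file store" standing in for S3, to show the ordering.
struct MemorySink {
    buffer: Vec<String>,    // rows written since the last checkpoint
    committed: Vec<String>, // "files" visible to downstream readers
}

impl TwoPhaseSink for MemorySink {
    type PreCommit = String;

    fn checkpoint(&mut self) -> Vec<String> {
        // Drain the buffer into pre-commit state; nothing is visible yet.
        std::mem::take(&mut self.buffer)
    }

    fn commit(&mut self, pre_commits: Vec<String>) {
        // Only now does the data become visible.
        self.committed.extend(pre_commits);
    }
}

fn main() {
    let mut sink = MemorySink { buffer: vec!["a".into(), "b".into()], committed: vec![] };
    let pending = sink.checkpoint();
    assert!(sink.committed.is_empty()); // nothing visible before commit
    sink.commit(pending);
    assert_eq!(sink.committed, vec!["a".to_string(), "b".to_string()]);
    println!("committed {} files", sink.committed.len());
}
```

The key property is that a crash between the two phases leaves durable pre-commit state behind, which is what lets a restoring subtask finish or re-commit the work.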

When restoring from checkpoints the subtask with id 0 will be responsible for finishing any files that were previously in-flight, as well as committing any files that were ready to be committed.

Right now we assume idempotency for the completion of multipart uploads, which seems to work. According to the S3 Glacier documentation, "Complete Multipart Upload is an idempotent operation. After your first successful complete multipart upload, if you call the operation again within a short period, the operation will succeed and return the same archive ID." However, the normal S3 docs say that trying to complete an already finished upload might return a 404: "Description: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed."
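Given the ambiguity between the two sets of docs, one defensive way to treat completion as effectively idempotent is to only swallow a "no such upload" error when the target object verifiably exists. This is a sketch under that assumption; `S3Error`, `complete`, and `object_exists` are illustrative stand-ins, not real SDK calls:

```rust
// Hypothetical sketch: treat CompleteMultipartUpload as idempotent on the
// restore path by distinguishing "already completed" from "aborted".
#[derive(Debug, PartialEq)]
enum S3Error {
    NoSuchUpload,  // the 404 case from the S3 docs quoted above
    Other(String),
}

fn finish_upload(
    complete: impl Fn() -> Result<(), S3Error>,
    object_exists: impl Fn() -> bool,
) -> Result<(), S3Error> {
    match complete() {
        Ok(()) => Ok(()),
        // The upload id is gone: either a previous attempt already completed
        // it (safe to ignore) or it was aborted (must surface the error).
        Err(S3Error::NoSuchUpload) if object_exists() => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() {
    // A pre-crash attempt already completed the upload: 404, but the object exists.
    assert!(finish_upload(|| Err(S3Error::NoSuchUpload), || true).is_ok());
    // The upload was aborted: 404 and no object, so this is a real failure.
    assert!(finish_upload(|| Err(S3Error::NoSuchUpload), || false).is_err());
    println!("ok");
}
```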

I've manually tested the behavior under a variety of cases, including stopping it through the UI, uncontrolled stopping, and failing to log commits. In all cases it was able to resume writes without losing or duplicating data.

@jacksonrnewhouse requested a review from mwylde July 11, 2023 20:54
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch 2 times, most recently from 5941084 to cdb52e6 on July 17, 2023 20:52
@jacksonrnewhouse marked this pull request as ready for review July 17, 2023 20:59
arroyo-api/src/pipelines.rs (outdated, resolved)
connection_type: ConnectionType::Sink,
schema: schema
.map(|s| s.to_owned())
.ok_or_else(|| anyhow!("No schema defined for SSE source"))?,
Member:
SSE => Parquet

arroyo-console/src/routes/connections/JsonForm.tsx (outdated, resolved)
arroyo-controller/src/compiler.rs (resolved)
data_recovery: Vec<Self::DataRecovery>,
) -> Result<()>;
async fn insert_record(&mut self, record: &Record<K, T>) -> Result<()>;
// TODO: figure out how to have the relevant vectors be of pointers across async boundaries.
Member:
They can be pointers across async boundaries so long as they are &mut (because &T is Send only if T is Sync, but &mut T is Send if T is Send).
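The Send/Sync rule cited here can be demonstrated in a few lines: `Cell<i32>` is Send but not Sync, so `&Cell<i32>` is not Send while `&mut Cell<i32>` is. A minimal compile-time check:

```rust
// &T: Send requires T: Sync, while &mut T: Send only requires T: Send.
// Cell<i32> is Send but not Sync, so only the &mut borrow crosses threads.
use std::cell::Cell;

fn require_send<T: Send>(_: T) {}

fn main() {
    let mut c = Cell::new(1);
    // require_send(&c);   // would NOT compile: Cell<i32> is not Sync
    require_send(&mut c);  // compiles: Cell<i32> is Send
    c.set(2);
    assert_eq!(c.get(), 2);
    println!("ok");
}
```

The same reasoning is why holding `&mut` references to the relevant vectors across an `.await` is fine as long as the element type itself is Send.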


async fn on_close(&mut self, ctx: &mut crate::engine::Context<(), ()>) {
info!("waiting for commit message");
if let Some(ControlMessage::Commit { epoch }) = ctx.control_rx.recv().await {
Member:
This logic assumes that the next control message on the queue is a commit message -- is that guaranteed?

Contributor (author):
It is, except when the source data ends, which right now only happens for test sources.

arroyo-worker/src/connectors/two_phase_committer.rs (outdated, resolved)
arroyo-connectors/src/kafka.rs (resolved)
arroyo-controller/src/job_controller/checkpointer.rs (outdated, resolved)
@mwylde mentioned this pull request Jul 19, 2023
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch 4 times, most recently from 85184df to 8c147ce on July 19, 2023 21:57
@mwylde (Member) commented Jul 19, 2023

arroyo-connectors/resources/parquet.svg (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch from 57a5eef to eefcbe4 on July 20, 2023 22:05
@@ -283,6 +283,26 @@ export function FormInner({
</Stack>
</fieldset>
);
} else if (values[key] > 0) {
@mwylde (Member) commented Jul 20, 2023:
I think this should be (values[key].properties?.length || 0) > 0 -- values[key] is an object and can't be (meaningfully) compared to a number

@jacksonrnewhouse enabled auto-merge (squash) July 21, 2023 00:20