
Commits and parquet file sink #197

Merged
merged 10 commits, Jul 21, 2023
Conversation

@jacksonrnewhouse (Contributor) commented Jul 11, 2023

This adds a Parquet sink and introduces a committing phase during checkpointing, which enables exactly-once sinks; the first such sink writes typed Parquet files to an S3 directory. Files are created with S3's multipart upload API, and a single Parquet file can span multiple checkpoints.
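The commit-phase idea can be sketched as a small two-phase pattern: at checkpoint time the sink hands back recoverable descriptions of pending work, and only after the whole checkpoint is durable does a commit make that work visible. All names below are hypothetical stand-ins, not Arroyo's actual trait or types:

```rust
// Hypothetical sketch of the two-phase sink pattern described in this PR.
// checkpoint() durably records what *would* be written; commit() makes it
// visible only after the checkpoint as a whole has completed.
trait TwoPhaseSink {
    type PreCommit;

    // Phase 1: flush buffered data, return recoverable pre-commit state
    // (e.g. in-flight multipart upload descriptions).
    fn checkpoint(&mut self) -> Vec<Self::PreCommit>;

    // Phase 2: finalize the pre-committed work (e.g. complete the uploads).
    fn commit(&mut self, pre_commits: Vec<Self::PreCommit>);
}

// Toy in-memory "file store" standing in for S3, to show the ordering.
struct MemorySink {
    buffer: Vec<String>,    // rows written since the last checkpoint
    committed: Vec<String>, // "files" visible to downstream readers
}

impl TwoPhaseSink for MemorySink {
    type PreCommit = String;

    fn checkpoint(&mut self) -> Vec<String> {
        // Drain the buffer into pre-commit state; nothing is visible yet.
        std::mem::take(&mut self.buffer)
    }

    fn commit(&mut self, pre_commits: Vec<String>) {
        // Only now does the data become visible.
        self.committed.extend(pre_commits);
    }
}

fn main() {
    let mut sink = MemorySink { buffer: vec!["a".into(), "b".into()], committed: vec![] };
    let pending = sink.checkpoint();
    assert!(sink.committed.is_empty()); // nothing visible before commit
    sink.commit(pending);
    assert_eq!(sink.committed, vec!["a".to_string(), "b".to_string()]);
    println!("committed {} files", sink.committed.len());
}
```

The key property is that a crash between the two phases leaves durable pre-commit state behind, which is what lets a restoring subtask finish or re-commit the work.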

When restoring from checkpoints the subtask with id 0 will be responsible for finishing any files that were previously in-flight, as well as committing any files that were ready to be committed.

Right now we assume idempotency for the completion of multipart uploads, which seems to work. According to the S3 Glacier documentation, "Complete Multipart Upload is an idempotent operation. After your first successful complete multipart upload, if you call the operation again within a short period, the operation will succeed and return the same archive ID." However, the normal S3 docs say that trying to complete an already finished upload might return a 404: "Description: The specified multipart upload does not exist. The upload ID might be invalid, or the multipart upload might have been aborted or completed."
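Given the ambiguity between the two sets of docs, one defensive way to treat completion as effectively idempotent is to only swallow a "no such upload" error when the target object verifiably exists. This is a sketch under that assumption; `S3Error`, `complete`, and `object_exists` are illustrative stand-ins, not real SDK calls:

```rust
// Hypothetical sketch: treat CompleteMultipartUpload as idempotent on the
// restore path by distinguishing "already completed" from "aborted".
#[derive(Debug, PartialEq)]
enum S3Error {
    NoSuchUpload,  // the 404 case from the S3 docs quoted above
    Other(String),
}

fn finish_upload(
    complete: impl Fn() -> Result<(), S3Error>,
    object_exists: impl Fn() -> bool,
) -> Result<(), S3Error> {
    match complete() {
        Ok(()) => Ok(()),
        // The upload id is gone: either a previous attempt already completed
        // it (safe to ignore) or it was aborted (must surface the error).
        Err(S3Error::NoSuchUpload) if object_exists() => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() {
    // A pre-crash attempt already completed the upload: 404, but the object exists.
    assert!(finish_upload(|| Err(S3Error::NoSuchUpload), || true).is_ok());
    // The upload was aborted: 404 and no object, so this is a real failure.
    assert!(finish_upload(|| Err(S3Error::NoSuchUpload), || false).is_err());
    println!("ok");
}
```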

I've manually tested the behavior under a variety of cases, including stopping it through the UI, uncontrolled stopping, and failing to log commits. In all cases it was able to resume writes without losing or duplicating data.

@jacksonrnewhouse requested a review from mwylde July 11, 2023 20:54
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch 2 times, most recently from 5941084 to cdb52e6 on July 17, 2023 20:52
@jacksonrnewhouse marked this pull request as ready for review July 17, 2023 20:59
arroyo-api/src/pipelines.rs (outdated, resolved)
connection_type: ConnectionType::Sink,
schema: schema
.map(|s| s.to_owned())
.ok_or_else(|| anyhow!("No schema defined for SSE source"))?,
Member:
SSE => Parquet

arroyo-console/src/routes/connections/JsonForm.tsx (outdated, resolved)
arroyo-controller/src/compiler.rs (resolved)
data_recovery: Vec<Self::DataRecovery>,
) -> Result<()>;
async fn insert_record(&mut self, record: &Record<K, T>) -> Result<()>;
// TODO: figure out how to have the relevant vectors be of pointers across async boundaries.
Member:
They can be pointers across async boundaries so long as they are &mut (because &T is Send only if T is Sync, but &mut T is Send if T is Send).
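The Send/Sync rule cited here can be demonstrated in a few lines: `Cell<i32>` is Send but not Sync, so `&Cell<i32>` is not Send while `&mut Cell<i32>` is. A minimal compile-time check:

```rust
// &T: Send requires T: Sync, while &mut T: Send only requires T: Send.
// Cell<i32> is Send but not Sync, so only the &mut borrow crosses threads.
use std::cell::Cell;

fn require_send<T: Send>(_: T) {}

fn main() {
    let mut c = Cell::new(1);
    // require_send(&c);   // would NOT compile: Cell<i32> is not Sync
    require_send(&mut c);  // compiles: Cell<i32> is Send
    c.set(2);
    assert_eq!(c.get(), 2);
    println!("ok");
}
```

The same reasoning is why holding `&mut` references to the relevant vectors across an `.await` is fine as long as the element type itself is Send.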


async fn on_close(&mut self, ctx: &mut crate::engine::Context<(), ()>) {
info!("waiting for commit message");
if let Some(ControlMessage::Commit { epoch }) = ctx.control_rx.recv().await {
Member:
This logic assumes that the next control message on the queue is a commit message -- is that guaranteed?

Contributor (author):
It is, except when the source data ends, which right now only happens for test sources.

arroyo-worker/src/connectors/two_phase_committer.rs (outdated, resolved)
arroyo-connectors/src/kafka.rs (resolved)
arroyo-controller/src/job_controller/checkpointer.rs (outdated, resolved)
@mwylde mentioned this pull request Jul 19, 2023
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch 4 times, most recently from 85184df to 8c147ce on July 19, 2023 21:57
@mwylde (Member) commented Jul 19, 2023

arroyo-connectors/resources/parquet.svg (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
arroyo-connectors/src/filesystem.rs (outdated, resolved)
@jacksonrnewhouse force-pushed the commits_and_parquet_file_sink branch from 57a5eef to eefcbe4 on July 20, 2023 22:05
@@ -283,6 +283,26 @@ export function FormInner({
</Stack>
</fieldset>
);
} else if (values[key] > 0) {
@mwylde (Member) commented Jul 20, 2023:
I think this should be (values[key].properties?.length || 0) > 0 -- values[key] is an object and can't be (meaningfully) compared to a number

@jacksonrnewhouse enabled auto-merge (squash) July 21, 2023 00:20