Implement at-least-once option that utilizes default stream #1007

agrawal-siddharth
Collaborator

Adds an at-least-once write option that uses the default stream, to support streaming scenarios in which small batches of data are written.
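For readers unfamiliar with the option, here is a minimal sketch of how a caller might assemble the writer options. The `writeAtLeastOnce` key is the configuration name discussed in this PR; the `writeMethod` key with the value `direct` is an assumption about how the Storage Write API path is selected, and the `AtLeastOnceOptions` helper is hypothetical. It uses plain Java collections so it runs without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects the writer options a caller might pass to the connector. */
final class AtLeastOnceOptions {

  // "writeAtLeastOnce" is the configuration name discussed in this PR.
  // "writeMethod" = "direct" is an assumption, not taken from this conversation.
  static final String WRITE_METHOD = "writeMethod";
  static final String WRITE_AT_LEAST_ONCE = "writeAtLeastOnce";

  /** Builds the option map, preserving insertion order for readable output. */
  static Map<String, String> build(boolean atLeastOnce) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put(WRITE_METHOD, "direct");
    options.put(WRITE_AT_LEAST_ONCE, Boolean.toString(atLeastOnce));
    return options;
  }

  public static void main(String[] args) {
    // With Spark available, each entry would be applied through
    // DataFrameWriter.option(key, value) before calling save().
    build(true).forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```

In a real job the same key/value pairs would be set on the DataFrame writer rather than collected in a map.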

@davidrabinowitz
Member

/gcbrun

Member

@davidrabinowitz davidrabinowitz left a comment

Can you please add an integration test for testing the writeAtLeastOnce configuration?

@agrawal-siddharth agrawal-siddharth marked this pull request as draft June 26, 2023 22:57
@agrawal-siddharth agrawal-siddharth force-pushed the use_default_stream branch 2 times, most recently from a5b0ad5 to 1c48241 Compare June 28, 2023 05:14
@agrawal-siddharth
Collaborator Author

Added an integration test to exercise the following scenarios with writeAtLeastOnce turned on:

(a) append to a new file
(b) append to an existing file
(c) overwrite on a new file
(d) overwrite on an existing file
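The four scenarios above form a two-by-two matrix of save mode and destination state. A rough sketch of how that matrix could be enumerated follows; the `Mode` enum is a stand-in for Spark's `SaveMode`, and every name here is hypothetical rather than taken from the connector's test suite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical scenario matrix; none of these names come from the connector's tests. */
final class AtLeastOnceScenarios {

  // Stand-in for Spark's SaveMode, limited to the two modes listed above.
  enum Mode { APPEND, OVERWRITE }

  // Whether the destination already exists before the write.
  enum Target { NEW, EXISTING }

  static final class Scenario {
    final Mode mode;
    final Target target;

    Scenario(Mode mode, Target target) {
      this.mode = mode;
      this.target = target;
    }

    String describe() {
      return mode.name().toLowerCase(Locale.ROOT)
          + " / " + target.name().toLowerCase(Locale.ROOT)
          + " / writeAtLeastOnce=true";
    }
  }

  /** Returns the four combinations in the order (a) through (d). */
  static List<Scenario> all() {
    List<Scenario> scenarios = new ArrayList<>();
    for (Mode mode : Mode.values()) {
      for (Target target : Target.values()) {
        scenarios.add(new Scenario(mode, target));
      }
    }
    return scenarios;
  }

  public static void main(String[] args) {
    for (Scenario scenario : all()) {
      System.out.println(scenario.describe());
    }
  }
}
```

A parameterized integration test could iterate over such a list and run one write per entry.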

@@ -208,6 +208,23 @@ public boolean deleteTable(TableId tableId) {
     return bigQuery.delete(tableId);
   }

+  private Job buildQueryJob(
+      TableId temporaryTableId,
Member

Should be sourceTableId

Collaborator Author

Done

README.md: outdated review thread (resolved)
@agrawal-siddharth agrawal-siddharth force-pushed the use_default_stream branch 2 times, most recently from 9a12708 to ddcb019 Compare July 8, 2023 01:40
@davidrabinowitz
Member

/gcbrun

@davidrabinowitz
Member

/gcbrun

@davidrabinowitz
Member

/gcbrun

@agrawal-siddharth agrawal-siddharth force-pushed the use_default_stream branch 2 times, most recently from 000429b to b7cb303 Compare July 13, 2023 01:57
@davidrabinowitz
Copy link
Member

/gcbrun

}

-protected void writeToBigQuery(Dataset<Row> df, SaveMode mode, String format) {
+protected void writeToBigQuery(Dataset<Row> df, SaveMode mode, String writeAtLeastOnce) {
+  writeToBigQuery(df, mode, "avro", writeAtLeastOnce);
Member

Please keep the format parameter and avoid overriding it. To maintain backward compatibility, this should be:

  protected void writeToBigQuery(Dataset<Row> df, SaveMode mode, String format) {
    writeToBigQuery(df, mode, format, "False");
  }

@@ -162,17 +162,23 @@ private StandardTableDefinition testPartitionedTableDefinition() {
}

 protected void writeToBigQuery(Dataset<Row> df, SaveMode mode) {
-  writeToBigQuery(df, mode, "avro");
+  writeToBigQuery(df, mode, "False");
Member

replace with writeToBigQuery(df, mode, "avro", "False");
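Taken together, the two review suggestions describe a delegation chain in which every shorter overload forwards to the four-argument form with explicit defaults. The sketch below illustrates that shape only. It drops the `Dataset<Row>` argument, uses a `String` in place of `SaveMode`, and records the resolved values instead of writing to BigQuery; the `WriteOverloads` class and its fields are hypothetical.

```java
/** Hypothetical stand-in for the test base class; records arguments instead of writing. */
final class WriteOverloads {

  // Last values that reached the full overload, kept for inspection.
  String lastFormat;
  String lastWriteAtLeastOnce;

  // Shortest form: defaults to "avro" and at-least-once disabled.
  void writeToBigQuery(String mode) {
    writeToBigQuery(mode, "avro", "False");
  }

  // Existing callers that pass a format keep their behavior.
  void writeToBigQuery(String mode, String format) {
    writeToBigQuery(mode, format, "False");
  }

  // Full form: the only place where the write would actually happen.
  void writeToBigQuery(String mode, String format, String writeAtLeastOnce) {
    this.lastFormat = format;
    this.lastWriteAtLeastOnce = writeAtLeastOnce;
  }

  public static void main(String[] args) {
    WriteOverloads writer = new WriteOverloads();
    writer.writeToBigQuery("append", "orc");
    System.out.println(writer.lastFormat + " " + writer.lastWriteAtLeastOnce);
  }
}
```

Keeping the format-taking overload intact means existing tests that pass a format compile and behave as before, while new tests opt in to at-least-once through the longest overload.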

@davidrabinowitz
Member

/gcbrun

@davidrabinowitz davidrabinowitz merged commit 38b0ef2 into GoogleCloudDataproc:master Jul 13, 2023
6 checks passed
@agrawal-siddharth agrawal-siddharth deleted the use_default_stream branch July 13, 2023 21:19