Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template #1730

Merged
Abacn merged 5 commits into GoogleCloudPlatform:main from the SchemaUpdateIT branch on Sep 4, 2024

ShuranZhang (Contributor) commented Jul 12, 2024

Original feature description from @ChangyuLi28:
This PR supports schema updates made after the pipeline starts running.

  • Before you add a new column to an existing tracked Cloud Spanner table, first add the column to the BigQuery changelog table. The new column must be NULLABLE. Then add the column to the Cloud Spanner table. The new column is populated automatically when the pipeline receives a new record that contains it. Because of this issue, wait more than 10 minutes after updating the schema in BigQuery before starting any load that includes the new column.
  • To add a new table, add the table to the Cloud Spanner database first. The BigQuery changelog table is created automatically when the pipeline receives a record for the new table.
  • The template doesn't drop tables or columns from BigQuery. If a column is dropped from the Cloud Spanner table, records generated after the drop have null values in the corresponding changelog column, unless you manually drop the column from BigQuery.
  • The template doesn't support column type updates or column nullable mode updates.

Follow this guide to deploy pipelines with this template. More specific instructions.
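
The add-column ordering described above (BigQuery changelog table first, Cloud Spanner table second) can be sketched as a small helper that emits the two DDL statements in that order. This is an illustration only, not code from this PR: the class `AddColumnPlan`, its methods, and the dataset, table, and column names are all hypothetical, and the Spanner statement assumes a GoogleSQL-dialect database.

```java
import java.util.List;

/**
 * Hypothetical helper (not part of the template): builds the DDL for adding a
 * column in the order the description requires, BigQuery before Cloud Spanner.
 */
final class AddColumnPlan {
  private AddColumnPlan() {}

  /**
   * BigQuery DDL for the changelog table. A column added with
   * ALTER TABLE ... ADD COLUMN and no NOT NULL constraint is NULLABLE.
   */
  static String bigQueryDdl(String dataset, String changelogTable, String column, String bqType) {
    return String.format(
        "ALTER TABLE `%s.%s` ADD COLUMN %s %s", dataset, changelogTable, column, bqType);
  }

  /** Cloud Spanner DDL for the tracked source table (GoogleSQL dialect assumed). */
  static String spannerDdl(String table, String column, String spannerType) {
    return String.format("ALTER TABLE %s ADD COLUMN %s %s", table, column, spannerType);
  }

  /** Returns the statements in the order they should be applied. */
  static List<String> orderedSteps(
      String dataset,
      String changelogTable,
      String spannerTable,
      String column,
      String bqType,
      String spannerType) {
    return List.of(
        bigQueryDdl(dataset, changelogTable, column, bqType), // step 1: BigQuery
        spannerDdl(spannerTable, column, spannerType)); // step 2: Cloud Spanner
  }

  public static void main(String[] args) {
    // All names below are placeholders chosen for illustration.
    orderedSteps("my_dataset", "Singers_changelog", "Singers", "Nickname", "STRING", "STRING(MAX)")
        .forEach(System.out::println);
  }
}
```

The helper only produces the statements. Running them, and waiting the recommended 10+ minutes after the BigQuery change before writing rows that carry the new column, is left to the operator.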

Add integration tests (ITs) for handling schema updates while a Spanner CDC to BigQuery pipeline is running.
Previously, a base-case test with no schema change was added in #1705. This pull request adds two more ITs that follow the workflow above:

  • Add a column (disabled): add a column to the Spanner source table -> add the column to the corresponding BigQuery table -> wait 15 minutes (this is a BigQuery limitation, and the BigQuery team recommends waiting more than 10 minutes, see more here) -> insert a row into the source table with the new column filled -> verify that the new column in the BigQuery table is populated. Because of this BigQuery limitation, the test must sleep for at least 10 minutes in the test body to avoid flakiness, so I disabled it rather than hang the thread on such a long hardcoded sleep.
  • Add a table: add a source table in Spanner -> insert a row into the new source table -> verify that the new changelog table is created in BigQuery and filled with the correct information. Unlike updating the schema of an existing BigQuery table, adding a completely new table does not suffer from the high latency.
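
The "verify" step in both workflows amounts to waiting until a condition in BigQuery becomes true. The following is a minimal sketch of bounded polling for such a condition, as an alternative to a fixed sleep. It is not the test code from this PR: `PollUntil` and `await` are hypothetical names, and the condition in `main` is a stand-in for a real check such as "the changelog table exists and contains the expected row".

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper: re-checks a condition until it holds or a timeout expires. */
final class PollUntil {
  private PollUntil() {}

  /**
   * Returns true as soon as the condition holds, or false if the timeout
   * elapses first or the waiting thread is interrupted.
   */
  static boolean await(BooleanSupplier condition, Duration timeout, Duration interval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out without the condition becoming true
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in condition: becomes true on the third check.
    int[] checks = {0};
    boolean met = await(() -> ++checks[0] >= 3, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println("condition met: " + met + " after " + checks[0] + " checks");
  }
}
```

Polling bounds the wait and returns early on success, which fits the add-a-table case. It does not remove the BigQuery latency that led to the add-a-column test being disabled.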

I ran all the ITs together in parallel more than 15 times locally and observed no flakiness. We can add more test cases later if the tests in this PR run stably.

I also added a 3-minute DLQ retry parameter to all ITs in case of availability or other transient errors on the Spanner side.

ShuranZhang force-pushed the SchemaUpdateIT branch 3 times, most recently from 2a1f6c0 to 30bc294, on July 16, 2024 16:08
ShuranZhang marked this pull request as ready for review on July 16, 2024 16:53
ShuranZhang force-pushed the SchemaUpdateIT branch 2 times, most recently from 381b528 to 776929d, on July 30, 2024 19:14
ShuranZhang force-pushed the SchemaUpdateIT branch 2 times, most recently from 82e9deb to 1612c84, on August 6, 2024 17:55

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 51.20275% with 142 lines in your changes missing coverage. Please review.

Project coverage is 42.93%. Comparing base (6b094c2) to head (8209708).
Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tobigquery/schemautils/SpannerToBigQueryUtils.java | 53.52% | 24 Missing and 9 partials ⚠️ |
| ...eamstobigquery/SpannerChangeStreamsToBigQuery.java | 0.00% | 28 Missing ⚠️ |
| ...reamstobigquery/schemautils/SchemaUpdateUtils.java | 60.00% | 16 Missing and 6 partials ⚠️ |
| ...hangestreamstobigquery/schemautils/TypesUtils.java | 58.33% | 17 Missing and 3 partials ⚠️ |
| ...erchangestreamstobigquery/model/ModColumnType.java | 33.33% | 18 Missing ⚠️ |
| ...bigquery/FailsafeModJsonToTableRowTransformer.java | 0.00% | 12 Missing ⚠️ |
| ...streamstobigquery/BigQueryDynamicDestinations.java | 0.00% | 4 Missing ⚠️ |
| ...ates/spannerchangestreamstobigquery/model/Mod.java | 62.50% | 2 Missing and 1 partial ⚠️ |
| ...gestreamstobigquery/model/TrackedSpannerTable.java | 83.33% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1730      +/-   ##
============================================
+ Coverage     42.83%   42.93%   +0.10%     
- Complexity     3424     3455      +31     
============================================
  Files           824      827       +3     
  Lines         48147    48275     +128     
  Branches       5168     5196      +28     
============================================
+ Hits          20624    20728     +104     
- Misses        25847    25860      +13     
- Partials       1676     1687      +11     
| Components | Coverage Δ |
| --- | --- |
| spanner-templates | 62.81% <ø> (ø) |
| spanner-import-export | 63.90% <ø> (ø) |
| spanner-live-forward-migration | 75.05% <ø> (ø) |
| spanner-live-reverse-replication | 67.63% <ø> (ø) |
| spanner-bulk-migration | 83.75% <ø> (ø) |
| Files with missing lines | Coverage Δ |
| --- | --- |
| .../beam/it/gcp/bigquery/BigQueryResourceManager.java | 72.85% <ø> (ø) |
| ...ngestreamstobigquery/schemautils/OptionsUtils.java | 61.90% <ø> (-1.74%) ⬇️ |
| ...igquery/schemautils/SpannerChangeStreamsUtils.java | 80.08% <100.00%> (+9.18%) ⬆️ |
| ...gestreamstobigquery/model/TrackedSpannerTable.java | 61.53% <83.33%> (+13.39%) ⬆️ |
| ...ates/spannerchangestreamstobigquery/model/Mod.java | 48.07% <62.50%> (+48.07%) ⬆️ |
| ...streamstobigquery/BigQueryDynamicDestinations.java | 0.00% <0.00%> (ø) |
| ...bigquery/FailsafeModJsonToTableRowTransformer.java | 0.00% <0.00%> (ø) |
| ...erchangestreamstobigquery/model/ModColumnType.java | 33.33% <33.33%> (ø) |
| ...hangestreamstobigquery/schemautils/TypesUtils.java | 58.33% <58.33%> (ø) |
| ...reamstobigquery/schemautils/SchemaUpdateUtils.java | 60.00% <60.00%> (ø) |

... and 2 more

Abacn (Contributor) commented Sep 4, 2024

DLPTextToBigQueryStreamingIT.testDLPTextToBigQuery failed due to #1836

Abacn merged commit cfb4be7 into GoogleCloudPlatform:main on Sep 4, 2024
11 of 13 checks passed
ShuranZhang changed the title from "Add IT tests for handling schema update during spanner cdc to big query pipeline running" to "Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template" on Sep 11, 2024
asthamohta pushed a commit to asthamohta/DataflowTemplates that referenced this pull request Sep 15, 2024
…ry pipeline running (GoogleCloudPlatform#1730)

* Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template.

* Resolve merge conflicts

* Fix missing columns in bq when initial record is UPDATE or DELETE

* Improve test coverage for TypesUtils

* Add IT tests for handling schema update during a running pipeline

---------

Co-authored-by: Changyu Li <sherryl1780@gmail.com>