Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template #1730

ShuranZhang · 2024-07-12T21:39:58Z

Original feature description from @ChangyuLi28 :
This PR supports schema updates made after the pipeline starts running.

Before you add a new column to an existing tracked Cloud Spanner table, first add the column to the BigQuery changelog table. The new column must be NULLABLE. Then add the column to the Cloud Spanner table. The new column is populated automatically when the pipeline receives a new record that contains it. Because of this issue, wait more than 10 minutes after updating the schema in BigQuery before loading data that includes the new column.
To add a new table, add the table to the Cloud Spanner database first. The corresponding BigQuery table is created automatically when the pipeline receives a record for the new table.
The template doesn't drop tables or columns from BigQuery. If a column is dropped from the Cloud Spanner table, the corresponding changelog columns are populated with null values for records generated after the drop, unless you manually drop the column from BigQuery.
The template doesn't support column type updates or column nullable mode updates.
Follow this guide to deploy pipelines with this template. More specific instructions.
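
The dropped-column rule above can be illustrated with a small, self-contained sketch. This is not the template's code: the class `ChangelogRowSketch` and both method names are hypothetical, and the record is modelled as a plain `Map`. It shows only the idea that a row is built against the BigQuery table's columns, so a column that no longer arrives from Spanner ends up as null, and it lists Spanner columns that BigQuery does not have yet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not classes from the template. */
final class ChangelogRowSketch {

  /**
   * Builds a row keyed by the BigQuery table's columns. A column that is
   * absent from the Spanner record (for example, dropped in Spanner but still
   * present in BigQuery) is written as null.
   */
  static Map<String, Object> toChangelogRow(
      List<String> bigQueryColumns, Map<String, Object> spannerRecord) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (String column : bigQueryColumns) {
      row.put(column, spannerRecord.get(column)); // get() returns null when absent
    }
    return row;
  }

  /** Lists record columns that the BigQuery table does not have yet. */
  static List<String> columnsMissingInBigQuery(
      List<String> bigQueryColumns, Map<String, Object> spannerRecord) {
    List<String> missing = new ArrayList<>();
    for (String column : spannerRecord.keySet()) {
      if (!bigQueryColumns.contains(column)) {
        missing.add(column);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // "legacy_flag" was dropped in Spanner but is still a BigQuery column.
    List<String> bigQueryColumns = List.of("id", "name", "legacy_flag");
    Map<String, Object> record = Map.of("id", 1L, "name", "a");
    System.out.println(toChangelogRow(bigQueryColumns, record));
    System.out.println(columnsMissingInBigQuery(bigQueryColumns, record));
  }
}
```

Keying the row on the BigQuery columns rather than on the incoming record is what makes the null fill fall out naturally; how the real pipeline reacts to columns missing in BigQuery is not covered by this sketch.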

Add IT tests for handling schema updates while a Spanner CDC to BigQuery pipeline is running.
Previously, a base-case test with no schema changes was added in #1705. This pull request adds two more IT tests that follow the workflow above:

Add a column (disabled): add a column to the Spanner source table -> add the column to the corresponding BigQuery table -> wait 15 minutes (this is due to a BigQuery limitation; the BQ team recommends waiting >10 minutes, see more here) -> insert a row with the new column filled into the source table -> verify that the new column in the BigQuery table is populated. Because of this BigQuery limitation, the test must sleep for at least 10 minutes in the test body to avoid flakiness, so I disabled it rather than block the thread on a hardcoded sleep of that length.
Add a table: add a source table in Spanner -> insert a row into the new source table -> verify that the new changelog table is created in BigQuery and filled with the correct information. Unlike updating the schema of an existing BigQuery table, adding a completely new table does not suffer from the high latency.
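
The verification step in both workflows amounts to waiting, with an upper bound, until some condition on the BigQuery side holds. The sketch below is a generic bounded-polling helper in plain Java. It is an assumption about how such a check could be written, not the PR's test code, and `PollingSketch` and `waitUntil` are hypothetical names.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Illustrative only: a hypothetical helper, not part of the PR's test code. */
final class PollingSketch {

  /**
   * Re-checks the condition every interval until it holds or the timeout
   * elapses. Returns true if the condition held, false on timeout or if the
   * waiting thread was interrupted.
   */
  static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (deadline - System.nanoTime() <= 0) {
        return false; // timed out without the condition holding
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
  }

  public static void main(String[] args) {
    int[] checks = {0};
    // Stand-in condition: "the expected row is visible" on the third check.
    boolean found = waitUntil(() -> ++checks[0] >= 3, Duration.ofSeconds(2), Duration.ofMillis(10));
    System.out.println(found);
  }
}
```

For the add-a-column case the timeout would still have to exceed the 10-minute window described above, so polling shortens the wait only when the condition becomes true early; it does not remove the underlying latency.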

I ran all IT tests together in parallel more than 15 times locally and observed no flakiness. We can add more test cases later if the tests in this PR run stably.

I also added a 3-minute DLQ retry parameter to all IT tests in case of availability or other transient errors on the Spanner side.

codecov · 2024-09-04T01:05:25Z

Codecov Report

Attention: Patch coverage is 51.20275% with 142 lines in your changes missing coverage. Please review.

Project coverage is 42.93%. Comparing base (6b094c2) to head (8209708).
Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tobigquery/schemautils/SpannerToBigQueryUtils.java | 53.52% | 24 Missing and 9 partials ⚠️ |
| ...eamstobigquery/SpannerChangeStreamsToBigQuery.java | 0.00% | 28 Missing ⚠️ |
| ...reamstobigquery/schemautils/SchemaUpdateUtils.java | 60.00% | 16 Missing and 6 partials ⚠️ |
| ...hangestreamstobigquery/schemautils/TypesUtils.java | 58.33% | 17 Missing and 3 partials ⚠️ |
| ...erchangestreamstobigquery/model/ModColumnType.java | 33.33% | 18 Missing ⚠️ |
| ...bigquery/FailsafeModJsonToTableRowTransformer.java | 0.00% | 12 Missing ⚠️ |
| ...streamstobigquery/BigQueryDynamicDestinations.java | 0.00% | 4 Missing ⚠️ |
| ...ates/spannerchangestreamstobigquery/model/Mod.java | 62.50% | 2 Missing and 1 partial ⚠️ |
| ...gestreamstobigquery/model/TrackedSpannerTable.java | 83.33% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1730      +/-   ##
============================================
+ Coverage     42.83%   42.93%   +0.10%     
- Complexity     3424     3455      +31     
============================================
  Files           824      827       +3     
  Lines         48147    48275     +128     
  Branches       5168     5196      +28     
============================================
+ Hits          20624    20728     +104     
- Misses        25847    25860      +13     
- Partials       1676     1687      +11

| Components | Coverage Δ |
|---|---|
| spanner-templates | `62.81% <ø> (ø)` |
| spanner-import-export | `63.90% <ø> (ø)` |
| spanner-live-forward-migration | `75.05% <ø> (ø)` |
| spanner-live-reverse-replication | `67.63% <ø> (ø)` |
| spanner-bulk-migration | `83.75% <ø> (ø)` |

| Files with missing lines | Coverage Δ |
|---|---|
| .../beam/it/gcp/bigquery/BigQueryResourceManager.java | `72.85% <ø> (ø)` |
| ...ngestreamstobigquery/schemautils/OptionsUtils.java | `61.90% <ø> (-1.74%)` ⬇️ |
| ...igquery/schemautils/SpannerChangeStreamsUtils.java | `80.08% <100.00%> (+9.18%)` ⬆️ |
| ...gestreamstobigquery/model/TrackedSpannerTable.java | `61.53% <83.33%> (+13.39%)` ⬆️ |
| ...ates/spannerchangestreamstobigquery/model/Mod.java | `48.07% <62.50%> (+48.07%)` ⬆️ |
| ...streamstobigquery/BigQueryDynamicDestinations.java | `0.00% <0.00%> (ø)` |
| ...bigquery/FailsafeModJsonToTableRowTransformer.java | `0.00% <0.00%> (ø)` |
| ...erchangestreamstobigquery/model/ModColumnType.java | `33.33% <33.33%> (ø)` |
| ...hangestreamstobigquery/schemautils/TypesUtils.java | `58.33% <58.33%> (ø)` |
| ...reamstobigquery/schemautils/SchemaUpdateUtils.java | `60.00% <60.00%> (ø)` |
... and 2 more

Abacn · 2024-09-04T18:44:29Z

DLPTextToBigQueryStreamingIT.testDLPTextToBigQuery failed due to #1836

…ry pipeline running (GoogleCloudPlatform#1730)
* Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template.
* Resolve merge conflicts
* Fix missing columns in bq when initial record is UPDATE or DELETE
* Improve test coverage for TypesUtils
* Add IT tests for handling schema update during a running pipeline
---------
Co-authored-by: Changyu Li <sherryl1780@gmail.com>

pull-request-size bot added the size/XXL label Jul 12, 2024

ShuranZhang force-pushed the SchemaUpdateIT branch 3 times, most recently from 2a1f6c0 to 30bc294 Compare July 16, 2024 16:08

ShuranZhang marked this pull request as ready for review July 16, 2024 16:53

ShuranZhang force-pushed the SchemaUpdateIT branch from 30bc294 to c83d18a Compare July 16, 2024 20:04

ShuranZhang force-pushed the SchemaUpdateIT branch 2 times, most recently from 381b528 to 776929d Compare July 30, 2024 19:14

nancyxu123 approved these changes Aug 2, 2024

ShuranZhang force-pushed the SchemaUpdateIT branch 2 times, most recently from 82e9deb to 1612c84 Compare August 6, 2024 17:55

ChangyuLi28 and others added 5 commits August 26, 2024 22:54

Seamlessly propagate schema changes made after pipelines starts runni…

f2d6c5c

…ng for Spanner Change Streams to BigQuery template.

Resolve merge conflicts

ca2d649

Fix missing columns in bq when initial record is UPDATE or DELETE

9c53b3e

Improve test coverage for TypesUtils

1e87547

Add IT tests for handling schema update during a running pipeline

8209708

ShuranZhang force-pushed the SchemaUpdateIT branch from 1612c84 to 8209708 Compare August 27, 2024 01:18

Abacn approved these changes Sep 4, 2024

Abacn merged commit cfb4be7 into GoogleCloudPlatform:main Sep 4, 2024
11 of 13 checks passed

ShuranZhang changed the title ~~Add IT tests for handling schema update during spanner cdc to big query pipeline running~~ Seamlessly propagate schema changes made after pipelines starts running for Spanner Change Streams to BigQuery template Sep 11, 2024
