Refactor BQ table listings to a side input #1841

jklukas · 2021-09-29T20:05:35Z

One known failure mode for ingestion-beam is rate limiting by the BQ API when we list datasets and tables to check whether destination tables exist. See https://mozilla-hub.atlassian.net/browse/DSRE-194 and mozilla/bigquery-backfill#15

Currently, every worker makes these API calls independently, which triggers rate limiting when we scale up the number of machines for backfills. I think it should be possible to express this table listing as a slowly updating global window side input, which would make it run on a single machine.

Currently, we look up the tables in a dataset only when we see a record with a destination table in that dataset. For the side input case, we'd need to list all datasets and then list all tables within each dataset, so we'd need to provide information about which project to list datasets from.
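
To make the eager listing concrete, here is a minimal sketch in plain Java of the "list all datasets, then all tables in each dataset" step. The `Lister` interface, the class name, and the `dataset.table` key format are all hypothetical stand-ins for illustration; they are not ingestion-beam or BigQuery client APIs. The set it returns is the kind of value a side input could carry, assuming the listing is refreshed periodically from a single place.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TableListingSketch {

  /** Hypothetical stand-in for the BigQuery listing calls; not a real client API. */
  interface Lister {
    List<String> listDatasets(String project);

    List<String> listTables(String project, String dataset);
  }

  /**
   * Lists every dataset in the given project, then every table in each dataset,
   * and returns the results as "dataset.table" strings (an assumed key format).
   */
  static Set<String> listAllTables(String project, Lister lister) {
    Set<String> tables = new TreeSet<>();
    for (String dataset : lister.listDatasets(project)) {
      for (String table : lister.listTables(project, dataset)) {
        tables.add(dataset + "." + table);
      }
    }
    return tables;
  }
}
```

With a set like this available to every worker, the existence check for a record's destination table becomes a local `contains` lookup rather than an API call. The project to list from has to be supplied up front, since there is no incoming record to infer a dataset from.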
