[Feature Request]: Refresh side input from BigQuery #26196

Closed
3 of 15 tasks
andreigurau opened this issue Apr 10, 2023 · 5 comments
Comments

@andreigurau
Contributor

What would you like to happen?

When a Pub/Sub stream is joined with BigQuery data through a side input, the side input is loaded once and kept for the lifetime of the Dataflow job; there is currently no way to refresh it. Please add a way to refresh this cached data.

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@lostluck
Contributor

A solution for this is coming in 2.47.0 (the in-progress release).

The approach is to use the Slowly Updating Side Inputs pattern with PeriodicImpulse.

After the release, the examples in the linked doc will be updated accordingly.
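
For reference, a minimal Python sketch of that pattern might look like the following. It assumes the refreshed data fits in memory as a single dict. `load_lookup`, `enrich`, and `build` are hypothetical names, and `load_lookup` is a stand-in for the real read rather than a BigQuery call. The side input and the main input are both placed in fixed windows of the same length so that each window of events maps onto one refresh.

```python
def load_lookup(_tick):
    """Hypothetical stand-in for the real lookup read; returns static data."""
    return {"a": "alpha", "b": "beta"}


def enrich(event, lookup):
    """Attach the looked-up label for the event's key (None if the key is absent)."""
    return {**event, "label": lookup.get(event["key"])}


def build(pipeline, events, refresh_seconds=600):
    """Wire a periodically refreshed side input into `events`.

    Assumes `events` is a timestamped PCollection of dicts with a "key" field.
    """
    # Imported lazily so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.periodicsequence import PeriodicImpulse

    lookup = (
        pipeline
        # One impulse per interval; apply_windowing=True places each impulse
        # in its own fixed window of length fire_interval.
        | "Refresh tick" >> PeriodicImpulse(
            fire_interval=refresh_seconds, apply_windowing=True)
        | "Load lookup" >> beam.Map(load_lookup)
    )
    return (
        events
        # Same window length as the side input so the windows line up.
        | "Window events" >> beam.WindowInto(window.FixedWindows(refresh_seconds))
        | "Enrich" >> beam.Map(enrich, lookup=beam.pvalue.AsSingleton(lookup))
    )
```

The two helpers are plain functions, so the enrichment logic can be checked without running a pipeline.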

@BostjanBozic

BostjanBozic commented Mar 19, 2024

Did you get this working with ReadFromBigQuery? I am getting an error saying that the input to the Impulse transform must be a PBegin, not a PCollection. And it seems that if I move ReadFromBigQuery inside a Map (e.g. beam.Map(lambda x: ReadFromBigQuery(...))), no data is actually fetched.

As example code:

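import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.transforms.periodicsequence import PeriodicImpulse

pipeline = beam.Pipeline()

# Raises the TypeError below while the pipeline is being constructed:
# ReadFromBigQuery is applied here to the PCollection from PeriodicImpulse.
(
    pipeline
    | "Periodic BigQuery Read" >> PeriodicImpulse(fire_interval=600)
    | "Reading BigQuery Data"
    >> ReadFromBigQuery(
        query="SELECT col1, col2, col3 FROM test_dataset.test_table;",
        use_standard_sql=True,
        flatten_results=False,
        # WriteToBigQueryWithDeadLetter is a class from my own code base.
        temp_dataset=WriteToBigQueryWithDeadLetter.temp_dataset,
    )
)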

The following error occurs:
TypeError: Input to Impulse transform must be a PBegin but found PCollection[Periodic BigQuery Read/MapToTimestamped.None]

@liferoad
Collaborator

| 'PeriodicImpulse' >> PeriodicImpulse(
should work. Have you tried this?

@BostjanBozic

@liferoad thanks for the help here. I first tried to fetch the data with plain ReadFromBigQuery, but it seems that ReadFromBigQueryRequest together with ReadAllFromBigQuery is required. This does fetch data (though not locally, only when running on Dataflow, which is also stated in the code). I guess the output is a bit different than it would be with plain ReadFromBigQuery, so I will have to troubleshoot that part. But at least the read works for now :)

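import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadAllFromBigQuery, ReadFromBigQueryRequest
from apache_beam.transforms.periodicsequence import PeriodicImpulse

# `pipeline` is an existing beam.Pipeline.
lookup_table = (
    pipeline
    # Emit an element every 600 seconds to retrigger the read.
    | "Retrigger BigQuery Data Read" >> PeriodicImpulse(fire_interval=600)
    # Turn each impulse into a read request for ReadAllFromBigQuery.
    | "Prepare BigQuery Data Read"
    >> beam.Map(
        lambda x: ReadFromBigQueryRequest(
            query="SELECT col1, col2, col3 FROM test.test_table;",
            use_standard_sql=True,
            flatten_results=False,
        )
    )
    | "Reading BigQuery Data"
    >> ReadAllFromBigQuery(
        temp_dataset="temp"
    )
    | "Reshuffling BigQuery Data" >> beam.Reshuffle()
    # Key each row by col1 so it can be looked up later.
    | "Keying BigQuery Data" >> beam.Map(lambda row: (row.get("col1"), row))
)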
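
One way the keyed rows above could then be consumed, as a hedged sketch: pass `lookup_table` to the main stream as a dict side input. `attach_lookup` and `join_with_lookup` are hypothetical helpers. The sketch assumes that `col1` is unique, that events are dicts carrying a `col1` field, and that `events` and `lookup_table` have already been windowed compatibly; none of this was verified against the actual ReadAllFromBigQuery output.

```python
def attach_lookup(event, rows_by_key, key_field="col1"):
    """Return a copy of `event` with the matching lookup row (or None) attached."""
    row = rows_by_key.get(event.get(key_field))
    return {**event, "lookup": row}


def join_with_lookup(events, lookup_table):
    """Join `events` against `lookup_table`, a PCollection of (key, row) pairs.

    Assumption: both PCollections are in compatible windows, so each event
    window maps onto one refreshed copy of the lookup data.
    """
    # Imported lazily so attach_lookup can be used without Beam installed.
    import apache_beam as beam

    return events | "Join with lookup" >> beam.Map(
        attach_lookup,
        # AsDict materializes the (key, row) pairs as a dict per window.
        rows_by_key=beam.pvalue.AsDict(lookup_table),
    )
```

Keeping `attach_lookup` as a plain function makes the join logic easy to unit test separately from the pipeline wiring.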

@liferoad
Collaborator

This is supported by #13170 with ReadAllFromBigQuery.

@github-actions github-actions bot added this to the 2.56.0 Release milestone Mar 25, 2024