Add timestamp precision option to bigquery storage read for TIMESTAMP(12) columns. #37079
Summary of Changes
Hello @claudevdm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the BigQueryIO connector by introducing explicit control over timestamp precision when reading data using the BigQuery Storage Read API. This allows users to specify whether TIMESTAMP(12) columns should be read with microsecond (the default), nanosecond, or picosecond precision, preserving high-precision timestamp values. The changes add a new configuration option, integrate it into the BigQuery Storage API request generation, and add tests covering the supported data formats and precision levels.
/gemini review
Code Review
This pull request introduces a new option to specify timestamp precision when reading from BigQuery using the Storage Read API. The changes are well-implemented across the relevant classes, and the addition of comprehensive tests is excellent. I have a couple of suggestions: one to correct the default value mentioned in a Javadoc, and another to refactor a new method to improve its structure and reduce code duplication.
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
...ud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageSourceBase.java
R: @Abacn
R: @ahmedabu98
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
/gemini review
damccorm left a comment
Mostly LGTM, just had a naming question
...d-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageQuerySource.java
Code Review
This pull request introduces a valuable feature for controlling timestamp precision during BigQuery storage reads for TIMESTAMP(12) columns. The implementation is well-structured, propagating the new option from the user-facing API down to the storage read session creation. The accompanying tests are comprehensive and cover a wide range of scenarios. I've identified one critical issue regarding serialization that could break portability, along with a few medium-severity suggestions to improve code robustness and maintainability. Overall, this is a solid contribution.
Add read timestamp precision setting for storage api reads.
The storage API allows reading TIMESTAMP(12) columns with MICRO (default), NANOS or PICOS precision for both AVRO and ARROW formats.
This propagates the read precision setting to the storage API, and adds relevant tests.
Known Issue:
With the Arrow format, readTableRows and readTableRowsWithSchema convert Arrow records to Beam rows via ArrowConversion.java.
ArrowConversion is a generic Arrow -> Beam schema utility and does not take the BigQuery schema into account.
Even before this PR, the Arrow format with readTableRows truncated timestamps to millisecond precision, because millis and micros timestamps were historically both mapped to FieldType.DATETIME:
beam/sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
Line 210 in 15b50e2
For Avro readTableRowsWithSchema this is not an issue, because timestamp-micros can be mapped to the timestamp logical type when the BigQuery schema is TIMESTAMP(12) and the read precision is micros:
beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java
Line 415 in 15b50e2
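To make the precision levels concrete, here is a minimal, self-contained sketch of which fractional-second digits survive at each level. It is an illustration only: `TimestampPrecisionSketch`, its `Precision` enum, and `truncate` are hypothetical names, not part of Beam or the BigQuery client. It assumes truncation (not rounding) of a UTC timestamp string with up to 12 fractional digits. The `MILLIS` case corresponds to the truncation described in the known issue above.

```java
/** Illustration only: hypothetical names, not part of Beam or the BigQuery client. */
final class TimestampPrecisionSketch {

  /** Number of fractional-second digits kept at each precision. */
  enum Precision {
    MILLIS(3),
    MICROS(6),
    NANOS(9),
    PICOS(12);

    final int fractionalDigits;

    Precision(int fractionalDigits) {
      this.fractionalDigits = fractionalDigits;
    }
  }

  /**
   * Truncates (never rounds) the fractional seconds of a UTC timestamp string such as
   * "2024-01-01T00:00:00.123456789012Z" to the given precision.
   */
  static String truncate(String utcTimestamp, Precision precision) {
    int dot = utcTimestamp.indexOf('.');
    int end = utcTimestamp.length() - 1;
    if (dot < 0 || utcTimestamp.charAt(end) != 'Z') {
      throw new IllegalArgumentException(
          "Expected a UTC timestamp with fractional seconds: " + utcTimestamp);
    }
    // Right-pad the fraction to 12 digits so every precision can take a fixed-width prefix.
    StringBuilder fraction = new StringBuilder(utcTimestamp.substring(dot + 1, end));
    while (fraction.length() < 12) {
      fraction.append('0');
    }
    return utcTimestamp.substring(0, dot + 1)
        + fraction.substring(0, precision.fractionalDigits)
        + "Z";
  }

  public static void main(String[] args) {
    // A value as it might be stored in a TIMESTAMP(12) column (assumed sample data).
    String stored = "2024-01-01T00:00:00.123456789012Z";
    for (Precision p : Precision.values()) {
      System.out.println(p + " -> " + truncate(stored, p));
    }
  }
}
```

Running `main` prints one line per precision. Only the `PICOS` line keeps all 12 fractional digits, which is why a read at a lower precision, or a conversion through a millisecond-based type, loses information for TIMESTAMP(12) values.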