Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Pandas DataFrame support to TabularDataset #1185
feat: add Pandas DataFrame support to TabularDataset #1185
Changes from 9 commits
f5785dd
ae01b66
e2de699
1983471
80b2ef4
19d8565
0f81b3d
c6b4ed9
833d9f5
87ac7f9
791ca84
a43b923
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the location requirements should also be documented or a reference to this documentation should be provided: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations
If possible they should be validated but not a hard requirement. Is it possible for the dataset create to fail because of the regional requirements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I just tested it and it can fail if the dataset location doesn't match the project location or the service doesn't have the right access to the dataset. I'll update the docstring to link to that page.
In terms of validating, the BQ client throws this error: google.api_core.exceptions.FailedPrecondition: 400 BigQuery Dataset location
eu
must be in the same location as the service locationus-central1
.Do you think we should validate as well or let the BQ client handle validation? If we do validation, we'd need to use the BQ client to check the location of the provided BQ dataset string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with relying on BQ client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this imply when the staging table does not exit, user need to pass in BQ schema? No existing staging table is the most common case, and this would be a bit hard to use - users need to learn how to create BQ schema, etc.
Given df_source already has a schema, can we auto generate the BQ schema for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best case is BQ client support inferencing schema natively, could you check on that? BQ UI supports inferencing schema for JSONL. Haven't tried Parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BQ client will autodetect the schema using the DataFrame column types if the user doesn't provide this
bq_schema
parameter, this happens here. So I'm guessing most users won't use this schema param to override the autodetect.I think my docstring was confusing. I updated it, let me know if this is clearer:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this config infer all the types? I see the enable_list_inference but I couldn't find a reference in the BQ docs for non list type inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this will infer the data types from the DF. From the BQ docs:
I added a
bq_schema
param to give the user the option to override the data type autodection, but I think I may need to add more client-side validation on that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can rely on BQ client validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabular_dataset should be appended to
shared_state['resources']
so it's deleted after the testsee example: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/test_e2e_tabular.py#L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated both new tests to use
shared_state['resources']