[BUG] Fix StructuredDataset empty-str `file_format` in dc attr access #3027
base: master
Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3027      +/-   ##
==========================================
+ Coverage   79.61%   79.82%   +0.21%
==========================================
  Files         202      303     +101
  Lines       21430    25811    +4381
  Branches     2760     2760
==========================================
+ Hits        17061    20604    +3543
- Misses       3601     4408     +807
- Partials      768      799      +31
    if file_format != GENERIC_FORMAT:
        sdt.format = file_format
Should the file format always be copied over?
Suggested change:
    - if file_format != GENERIC_FORMAT:
    -     sdt.format = file_format
    + sdt.format = file_format
Hi @thomasjpfan,
Thanks for your suggestion! I don't think we can always copy it over. Consider the following use case:
    from typing import Annotated

    from flytekit import task
    from flytekit.types.structured import StructuredDataset

    @task
    def modify_format(sd: Annotated[StructuredDataset, {}, "task-format"]) -> StructuredDataset:
        return sd

    sd = StructuredDataset(uri="s3://my-s3-bucket/df.parquet", file_format="user-format")
    sd2 = modify_format(sd=sd)
In this case, we expect `sd2.file_format` to be `task-format` (as declared in `Annotated`), not `user-format`. If we always use the user-specified `file_format`, the information set in `Annotated` will be lost.
Considering that `sdt.format` can be set here, would it be better to use a stricter condition, as follows?
    if sdt.format == GENERIC_FORMAT and file_format != GENERIC_FORMAT:
        sdt.format = file_format
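To make the trade-off in this thread concrete, here is a minimal sketch that compares the three candidate conditions on plain strings. It does not call flytekit: `GENERIC_FORMAT` is redefined locally as the empty string, and the three policy function names are hypothetical labels for the variants discussed above, not names from the codebase.

```python
# Stand-in for flytekit's generic format, which the PR describes as "".
GENERIC_FORMAT = ""


def always_copy(sdt_format: str, file_format: str) -> str:
    """Reviewer's suggestion: always take the dataset's file_format."""
    return file_format


def copy_if_user_set(sdt_format: str, file_format: str) -> str:
    """Original diff: copy whenever the dataset's file_format is non-generic."""
    if file_format != GENERIC_FORMAT:
        return file_format
    return sdt_format


def copy_if_type_generic(sdt_format: str, file_format: str) -> str:
    """Stricter proposal: copy only when the type carries no format of its own."""
    if sdt_format == GENERIC_FORMAT and file_format != GENERIC_FORMAT:
        return file_format
    return sdt_format


# (format on the type, file_format on the dataset) for each scenario.
scenarios = {
    "bug report: type is generic, user set parquet": (GENERIC_FORMAT, "parquet"),
    "Annotated task-format vs user-format": ("task-format", "user-format"),
    "Annotated task-format, dataset left generic": ("task-format", GENERIC_FORMAT),
}

for label, (sdt_format, file_format) in scenarios.items():
    results = {
        policy.__name__: policy(sdt_format, file_format)
        for policy in (always_copy, copy_if_user_set, copy_if_type_generic)
    }
    print(label, results)
```

Under these assumptions, all three policies preserve `parquet` in the bug-report scenario, but only the stricter condition keeps `task-format` when the annotation and the dataset disagree.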
Tracking issue
Closes flyteorg/flyte#6096.
Why are the changes needed?
When users create a `StructuredDataset` with a specified `file_format` (e.g., `parquet`), the `file_format` information is accidentally discarded during the `async_to_literal` call. To be concrete, `StructuredDatasetType`'s `format` attribute is set to `GENERIC_FORMAT`, which is an empty string `""`.
What changes were proposed in this pull request?
Override `StructuredDatasetType`'s `format` attribute when users explicitly set the `file_format` of a Python-native `StructuredDataset`.
How was this patch tested?
This patch is tested through the newly added integration test and double checked by observing the flyte console I/O and the task pod stdout.
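As a rough illustration of what such a check verifies, the sketch below models the bug and the fix with stand-in dataclasses. Everything here is an assumption made for illustration: `FakeDataset`, `FakeDatasetType`, `literal_format_before_fix`, and `literal_format_after_fix` are hypothetical names and do not reproduce flytekit's classes or the PR's integration test.

```python
from dataclasses import dataclass

# Stand-in for flytekit's generic format, described in the PR as "".
GENERIC_FORMAT = ""


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a user-created StructuredDataset."""
    uri: str
    file_format: str = GENERIC_FORMAT


@dataclass
class FakeDatasetType:
    """Hypothetical stand-in for StructuredDatasetType."""
    format: str = GENERIC_FORMAT


def literal_format_before_fix(sd: FakeDataset, sdt: FakeDatasetType) -> str:
    # Models the reported bug: only the type's format reaches the literal,
    # so a user-specified file_format is silently dropped.
    return sdt.format


def literal_format_after_fix(sd: FakeDataset, sdt: FakeDatasetType) -> str:
    # Models the fix as first proposed in this PR: an explicitly set
    # file_format overrides the type's format.
    if sd.file_format != GENERIC_FORMAT:
        sdt.format = sd.file_format
    return sdt.format


sd = FakeDataset(uri="s3://my-s3-bucket/df.parquet", file_format="parquet")

before = literal_format_before_fix(sd, FakeDatasetType())
after = literal_format_after_fix(sd, FakeDatasetType())

assert before == GENERIC_FORMAT  # format lost
assert after == "parquet"        # format preserved
```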
Setup process
For local run, the setup process is summarized as follows:
After installation, run the following command:
Screenshots
The following results are expected:
Flyte console input
Flyte console output
Task pod stdout
Summary by Bito
This PR implements comprehensive improvements to Flytekit, including:
1) Enhanced execution engine with improved error handling and worker queue implementation
2) Fixed StructuredDataset bug regarding file_format preservation
3) Introduced new Optuna plugin for hyperparameter optimization with async capabilities
4) Added Environment class and improved thread safety mechanisms
Unit tests added: True
Estimated effort to review (1-5, lower is better): 5