[data][api] implement HudiDataSource
#46273
Took a first pass and left some mostly nit comments. Overall looks good, let us know / re-request a review when it is ready for a re-review!
        return read_tasks

    def estimate_inmemory_data_size(self) -> Optional[int]:
        return None
Is there any estimate that can be provided / returned here? Perhaps using the size_bytes from above? Maybe that could be cached, similar to what is done here.
Agreed that we should provide an estimate here. However, with the current implementation, loading this info during init using HudiTable is not a lightweight operation, and the size bytes are storage sizes that would still need some translation to in-memory size. I've added a TODO here to support this info through the HudiTable API.
That sounds reasonable!
@xushiyan we can use Parquet metadata fetcher here (fine if we do as follow-up)
@alexeykudinkin yes, will do follow up to enhance this part.
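For illustration, the idea discussed in this thread (sum the per-file storage sizes, translate them to an in-memory estimate, and cache the result) could be sketched roughly as below. This is only a sketch under stated assumptions, not the code in this PR: `FileStat`, `SizeEstimator`, and `ASSUMED_INMEMORY_RATIO` are hypothetical names, and the expansion ratio is a placeholder value rather than a measured one.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import List, Optional

# Hypothetical ratio of in-memory size to on-disk size.
# This value is an assumption for illustration, not a number from the PR.
ASSUMED_INMEMORY_RATIO = 5.0


@dataclass(frozen=True)
class FileStat:
    """Hypothetical stand-in for per-file metadata (path and storage size)."""

    path: str
    size_bytes: Optional[int]


class SizeEstimator:
    """Sketch of a datasource-like object that caches its size estimate."""

    def __init__(self, files: List[FileStat], ratio: float = ASSUMED_INMEMORY_RATIO):
        self._files = files
        self._ratio = ratio

    @cached_property
    def _storage_size(self) -> Optional[int]:
        # Computed once on first access, then cached on the instance.
        sizes = [f.size_bytes for f in self._files]
        if not sizes or any(s is None for s in sizes):
            # No files, or at least one unknown size: no reliable total.
            return None
        return sum(sizes)

    def estimate_inmemory_data_size(self) -> Optional[int]:
        storage = self._storage_size
        if storage is None:
            return None
        # Translate storage bytes into a rough in-memory estimate.
        return int(storage * self._ratio)
```

Returning `None` when any file size is unknown keeps the behavior equivalent to the current `return None` in that case, so callers that already handle a missing estimate would not need to change.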
@xushiyan LGTM, mostly minor comments
@xushiyan LGTM
@MicroCheck //python:ray/data/tests/test_hudi
Signed-off-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Support read from Hudi table into Ray dataset.
---------
Signed-off-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
Why are these changes needed?
Support reading from a Hudi table into a Ray dataset.
Related issue number
Closes #46272
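Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.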