-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Try get size_bytes
from metadata and consolidate metadata methods
#46862
Conversation
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
size_bytes
from metadata and consolidate metadata methods
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
if all(bundle.num_rows() is not None for bundle in self.input_data): | ||
return sum(bundle.num_rows() for bundle in self.input_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not a big deal especially since it elongates the code. technically this is a double for loop, which could be optimized into one loop and early exit if we run into a Non
num_rows. something like:
sum_num_rows = 0
for bundle in self.input_data:
if not bundle.num_rows():
return None
sum_num_rows += bundle.num_rows()
return sum_num_rows
and the same could be applied in other similar places, e.g. _size_bytes
method and methods in AbstractFrom
. it may be worthwhile to put this logic in a shared utility function, for example as a static method in BlockMetadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, although I prefer keeping as-is since we don't know if this is a performance issue and I want to avoid premature optimization
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Why are these changes needed?
LogicalOperator
andDatasource
expose methods likeschema()
to make metadata efficiently available to Dataset APIs likeDataset.schema()
. Currently,LogicalOperator
exposes three such methds:num_rows()
,schema()
, andinput_files()
.This PR adds
size_bytes()
because it was missing. To simplify the interface, it also consolidates the metadata methods into a singleaggregate_output_metadata()
method.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.