[Data] Update Data progress bars to use `row` as the iteration unit
#46579
Labels
data
Ray Data-related issues
enhancement
Request for new feature and/or capability
P1
Issue that should be fixed within a few weeks
Description
As a follow-up to #46432, we want to further improve the clarity of the progress bar output. Currently, we use `bundle` (corresponding to `RefBundle`) and `block` (corresponding to `Block`) as the iteration units, both of which are internal concepts that some Ray Data users may be unfamiliar with.

With some more involved code changes, we can replace these with `row`, corresponding to rows in the output Dataset. This is the most atomic unit of the Dataset and one that all users should be familiar with, since rows are a fundamental concept in almost all data processing libraries.

Code pointers:
- `MapOperator._task_done_callback()`: Add logic to estimate the number of output rows from the operator, based on the number of rows from the tasks completed so far. This can follow the existing logic for estimating the number of output bundles.
- `PhysicalOperator.estimated_output_num_rows`: to get the estimated number of rows described above.
- `op.estimated_output_num_rows`: to update progress bars.

Use case
Further improve clarity of Ray Data progress bar
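
To make the estimation step in the code pointers concrete, here is a minimal sketch of one way to extrapolate a total row count from the tasks completed so far. It is not Ray Data's implementation: the `RowEstimator` class, its fields, and the `task_done()` method are hypothetical names for illustration, and it assumes the total number of tasks is known up front and that the remaining tasks produce roughly as many rows as the finished ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowEstimator:
    """Hypothetical helper for illustration; not part of Ray Data."""

    total_tasks: int                   # assumed to be known in advance
    tasks_finished: int = 0
    rows_from_finished_tasks: int = 0

    def task_done(self, num_output_rows: int) -> None:
        # Stand-in for what a task-done callback might record.
        self.tasks_finished += 1
        self.rows_from_finished_tasks += num_output_rows

    @property
    def estimated_output_num_rows(self) -> Optional[int]:
        # No estimate is possible until at least one task has finished.
        if self.tasks_finished == 0:
            return None
        # Extrapolate: average rows per finished task times total tasks.
        avg_rows = self.rows_from_finished_tasks / self.tasks_finished
        return round(avg_rows * self.total_tasks)


estimator = RowEstimator(total_tasks=10)
estimator.task_done(100)
estimator.task_done(120)
print(estimator.estimated_output_num_rows)
```

A progress bar could then use this estimate as its total and advance by the number of rows each completed task reports, so the displayed unit is `row` rather than `bundle` or `block`.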