[Datasets] Support out-of-band serialization. #22616
A few things to simplify the logic here:
- Can we avoid the "shoe-horning" of read tasks into a block list? The plan object can have an explicit list of read tasks we extract from the LazyBlockList. This avoids needing to change the block list classes.
- Can we avoid having an "index" of completed vs not? It would be clearer to instead split the stages into "prev_stages" and "stages".
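To make the two suggestions above concrete, here is a minimal sketch of what such a plan object could look like. Everything in it (`Plan`, `read_tasks`, `prev_stages`, `stages`, `execute`) is a hypothetical stand-in written for illustration, not Ray's actual `ExecutionPlan`; it only shows read tasks held explicitly on the plan, with executed and pending stages kept in two separate lists instead of behind an index.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-ins for illustration; these are not Ray's classes.
Block = List[int]
ReadTask = Callable[[], Block]
Stage = Callable[[Block], Block]


@dataclass
class Plan:
    # Read tasks live on the plan itself rather than inside a block list.
    read_tasks: List[ReadTask]
    # Stages that have already run, kept only as lineage.
    prev_stages: List[Stage] = field(default_factory=list)
    # Stages that still need to run.
    stages: List[Stage] = field(default_factory=list)
    # Cached output of everything in prev_stages (None until first execution).
    snapshot: Optional[List[Block]] = None

    def execute(self) -> List[Block]:
        # Run the read tasks only once; later calls reuse the snapshot.
        if self.snapshot is None:
            self.snapshot = [task() for task in self.read_tasks]
        blocks = self.snapshot
        # Apply only the pending stages.
        for stage in self.stages:
            blocks = [stage(block) for block in blocks]
        # Move the pending stages into the executed list.
        self.prev_stages.extend(self.stages)
        self.stages = []
        self.snapshot = blocks
        return blocks
```

With this shape, "what has already run" is simply `prev_stages`, and no completed-stage index needs to be kept in sync with the stage list.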
Ping @clarkzinzow, any update on this PR?
@scv119 Actively working on it and on the integration with Xiaowei. I ran into some complications with the suggested refactor and am working on a solution that doesn't increase the scope of the PR. We should merge this by EOD Monday to make sure AIR is unblocked.
Can you split this up into smaller PRs?
…#23821)
This PR refactors `LazyBlockList` in service of out-of-band serialization (see [mono-PR](#22616)) and is a precursor to an execution plan refactor (PR #2) and adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:
1. `ReadTask`s are now a first-class concept, replacing calls;
2. read stage progress tracking is consolidated into `LazyBlockList._get_blocks_with_metadata()`, and more of the read task complexity, e.g. the read remote function, was pushed into `LazyBlockList` to make `ray.data.read_datasource()` simpler;
3. we are a bit smarter with how we progressively launch tasks and fetch and cache metadata, including fetching the metadata for read tasks in `.iter_blocks_with_metadata()` instead of relying on the pre-read task metadata (which will be less accurate), and we also fix some small bugs in the lazy ramp-up around progressive metadata fetching.
(1) is the most important item for supporting out-of-band serialization and fundamentally changes the `LazyBlockList` data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the [mono-PR](#22616) for more context.
Other changes:
1. Changed the stats actor to a global named actor singleton in order to obviate the need for serializing the actor handle with the Dataset stats; without this, we were encountering serialization failures.
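As a rough illustration of why item (1) matters for serializing lineage, the sketch below records a dataset as read-task descriptions plus stage names and rebuilds the blocks by replaying them. It is a toy model under stated assumptions, not the code in this PR: `STAGES`, `read_range`, `dump_lineage`, and `replay_lineage` are hypothetical names, and JSON stands in for whatever serialization format is actually used.

```python
import json
from typing import Callable, Dict, List

Block = List[int]

# Hypothetical registry of named transforms, so that stages can be referred
# to by name in the serialized lineage.
STAGES: Dict[str, Callable[[Block], Block]] = {
    "double": lambda block: [x * 2 for x in block],
    "drop_odd": lambda block: [x for x in block if x % 2 == 0],
}


def read_range(start: int, stop: int) -> Block:
    """Toy read task: produces one block from a description of its input."""
    return list(range(start, stop))


def dump_lineage(read_tasks: List[Dict[str, int]], stage_names: List[str]) -> bytes:
    """Serialize the lineage (read task descriptions and stage names).

    No block data or object references are written, so the payload remains
    meaningful outside the process or cluster that produced it.
    """
    lineage = {"read_tasks": read_tasks, "stages": stage_names}
    return json.dumps(lineage).encode("utf-8")


def replay_lineage(payload: bytes) -> List[Block]:
    """Rebuild the blocks by re-running the read tasks, then the stages."""
    lineage = json.loads(payload.decode("utf-8"))
    blocks = [read_range(t["start"], t["stop"]) for t in lineage["read_tasks"]]
    for name in lineage["stages"]:
        blocks = [STAGES[name](block) for block in blocks]
    return blocks
```

If only references to already-materialized blocks were kept, there would be nothing to replay once the original cluster is gone; keeping the read tasks themselves is what makes the lineage self-contained in this sketch.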
Superseded by stacked PRs; support added in #23932. Closing!
This PR adds support for out-of-band serialization of datasets, which is required for tuning a training dataset hyperparameter with cross-cluster stopping and resuming of experiments.
In the process of adding this feature, a refactor of the execution plan and `LazyBlockList` seemed prudent to meet the following set of requirements: […]
while adhering to the following constraints:
- […] (`ray.put()`) read tasks into a `BlockList` is untenable.
Solution
In addition to adding out-of-band serialization support, this PR:
- […] `ReadTask`s a first-class concept in `LazyBlockList`.
- […] `LazyBlockList` ramp-up, including around progressive schema/metadata fetching.
TODO
- […] `ExecutionPlan` and `LazyBlockList`.
Closes #22778
Checks
- […] `scripts/format.sh` to lint the changes in this PR.