Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Support out-of-band serialization. #22616

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Feb 24, 2022

This PR adds support for out-of-band serialization of datasets, which is required for tuning a training dataset hyperparameter with cross-cluster stopping and resuming of experiments.

In the process of adding this feature, a refactor of the execution plan and LazyBlockList seemed prudent to meet the following set of requirements:

  1. [P0] The full stage lineage must be maintained - we must always be able to reconstruct a Datasets from a lazy datasource.
  2. [P0] In order to be able to eagerly reclaim memory, we can’t hold on to object references from previous stages.
  3. [P0] We need to preserve existing eager mode semantics, both for plain datasets and for each stage in a dataset pipeline.
  4. [P0] To keep from unnecessarily duplicating reads, we should reuse computed block partitions when possible.

while adhering to the following constraints:

  1. Refactoring the pipelining implementation is out-of-scope.
  2. Deviating from the core execution model paradigm of BlockList --> BlockList stages is out-of-scope.
  3. Shoe-horning raw (no ray.put()) read tasks into a BlockList is untenable.

Solution

In addition to adding out-of-band serialization support, this PR:

  • Adds tracking of the full lineage for all Datasets, storing any intermediate or end-results as a snapshot of some stage in the lineage.
  • Makes ReadTasks a first-class concept in LazyBlockList.
  • Modifies read stage fusion to utilize already-computed blocks, eliminating redundant reads.
  • Optimization passes no longer mutate the input blocks or stages, allowing us to leverage read stage fusion while maintaining the requisite lineage to reconstruct the dataset from the lazy datasource.
  • Adds more careful logic around LazyBlockList ramp-up, including around progressive schema/metadata fetching.
  • Eagerly unlinks input block references to allow that memory to be reclaimed after the first stage finishes execution.

TODO

  • Fix transient bug in stats.
  • Add more test coverage for ExecutionPlan and LazyBlockList.
  • Try reverting back to always shoe-horning read tasks/block partitions into a plain block list at execution time to consolidate fusion logic into the stages.
  • Validate/clean up implementation.
  • Split into multiple easier-to-review PRs, if needed.

Closes #22778

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl self-assigned this Feb 24, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch 2 times, most recently from a76612e to 8387d7d Compare February 24, 2022 03:06
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/impl/block_list.py Outdated Show resolved Hide resolved
python/ray/data/impl/plan.py Outdated Show resolved Hide resolved
python/ray/data/impl/plan.py Outdated Show resolved Hide resolved
python/ray/data/impl/plan.py Outdated Show resolved Hide resolved
python/ray/data/impl/stats.py Outdated Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 24, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch from 40284c7 to d801b00 Compare February 28, 2022 21:50
@clarkzinzow clarkzinzow requested review from ericl and jjyao February 28, 2022 22:12
@clarkzinzow clarkzinzow changed the title [Datasets] [Prototype] Support out-of-band serialization. [Datasets] Support out-of-band serialization. Feb 28, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things to simplify the logic here:

  1. Can we avoid the "shoe-horning" of read tasks into a block list? The plan object can have an explicit list of read tasks we extract from the LazyBlockList. This avoids needing to change the block list classes.
  2. Can we avoid having an "index" of completed vs not? It would be clearer to instead split the stages into "prev_stages" and "stages".

@scv119
Copy link
Contributor

scv119 commented Mar 14, 2022

ping @clarkzinzow any update on this PR?

@clarkzinzow
Copy link
Contributor Author

@scv119 Actively working on it and the integration with Xiaowei, ran into some complications with the suggested refactor and I'm working on a solution that doesn't increase the scope of the PR.

We should merge this by EOD Monday to make sure AIR is unblocked.

@ericl ericl mentioned this pull request Mar 15, 2022
2 tasks
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch 2 times, most recently from 2b1774f to dfa88f1 Compare March 17, 2022 09:14
@clarkzinzow clarkzinzow added the do-not-merge Do not merge this PR! label Mar 17, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch 4 times, most recently from bad315d to 6b9398f Compare March 17, 2022 21:22
@Yard1 Yard1 requested review from Yard1 and removed request for Yard1 March 18, 2022 14:39
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch 2 times, most recently from 813d7c5 to 7368d13 Compare March 18, 2022 21:38
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch from 7368d13 to 2a68795 Compare March 18, 2022 22:09
@clarkzinzow clarkzinzow requested a review from ericl March 19, 2022 00:33
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split this up into smaller PRs?

@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch from 2a68795 to 2071303 Compare March 20, 2022 10:37
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch from 2071303 to 7576de4 Compare March 31, 2022 18:56
@clarkzinzow clarkzinzow removed the do-not-merge Do not merge this PR! label Mar 31, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/out-of-band-serialization branch 3 times, most recently from 5d58b71 to cbeea16 Compare April 1, 2022 20:21
ericl pushed a commit that referenced this pull request Apr 14, 2022
…#23821)

This PR refactors `LazyBlockList` in service of out-of-band serialization (see [mono-PR](#22616)) and is a precursor to an execution plan refactor (PR #2) and adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:
1. `ReadTask`s are now a first-class concept, replacing calls;
2. read stage progress tracking is consolidated into `LazyBlockList._get_blocks_with_metadta()` and more of the read task complexity, e.g. the read remote function, was pushed into `LazyBlockList` to make `ray.data.read_datasource()` simpler;
3. we are a bit smarter with how we progressively launch tasks and fetch and cache metadata, including fetching the metadata for read tasks in `.iter_blocks_with_metadata()` instead of relying on the pre-read task metadata (which will be less accurate), and we also fix some small bugs in the lazy ramp-up around progressive metadata fetching.

(1) is the most important item for supporting out-of-band serialization and fundamentally changes the `LazyBlockList` data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the [mono-PR](#22616) for more context.

Other changes:
1. Changed stats actor to a global named actor singleton in order to obviate the need for serializing the actor handle with the Dataset stats; without this, we were encountering serialization failures.
@clarkzinzow
Copy link
Contributor Author

Superseded by stacked PRs, supported added in #23932. Closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dataset] Create the stats actor as a global named singleton instead of passing by handle
5 participants