[Data] Revisiting `make_async_gen` to address issues with concurrency control for sequences of varying lengths #51661

alexeykudinkin · 2025-03-25T03:53:57Z

Why are these changes needed?

This change addresses potential deadlocks inside make_async_gen when it is used with functions producing sequences of wildly varying lengths.

Fundamentally, make_async_gen was trying to solve 2 problems whose respective solutions never actually overlapped:

Implement parallel processing that transforms an input iterator into an output one while preserving back-pressure semantics, where the input iterator should not outpace the consumption of the output iterator.
Implement parallel processing that preserves the ordering of the input iterator.

These requirements, coupled with the fact that the transformation is expected to receive and produce iterators, are what led to the erroneous deduction that both could be implemented by a single mechanism:

Transforming iterators is very different from a bijective mapping: we don't actually know how many input elements will result in a single output element (i.e., the transformation is a black box that could be anything from 1-to-1 to many-to-many).
Preserving the ordering of the transformation of iterators requires N input and N output queues (1 per worker), and requires both producer and consumer to fill/draw these queues in the same consistent order (without skipping!).
Because there can be no skipping (to preserve the order), some input AND output queues can become full at the same time, leaving both producer and consumer stuck and unable to make progress.
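
To make the "black box" point concrete, here is a small sketch of three transforms that all share the `Iterator -> Iterator` shape yet emit very different numbers of outputs for the same input. The function names are hypothetical and exist only for illustration; they are not part of Ray.

```python
from typing import Iterator, List


def one_to_one(it: Iterator[int]) -> Iterator[int]:
    # Exactly one output per input element.
    for x in it:
        yield x * 2


def many_to_one(it: Iterator[int], batch_size: int = 3) -> Iterator[List[int]]:
    # Emits nothing until `batch_size` inputs have been consumed.
    batch: List[int] = []
    for x in it:
        batch.append(x)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def one_to_many(it: Iterator[int]) -> Iterator[int]:
    # Emits `x` outputs for input `x` (zero outputs for x == 0).
    for x in it:
        for _ in range(x):
            yield x


# Same six inputs, three very different output counts.
counts = {
    fn.__name__: len(list(fn(iter(range(6)))))
    for fn in (one_to_one, many_to_one, one_to_many)
}
print(counts)  # {'one_to_one': 6, 'many_to_one': 2, 'one_to_many': 15}
```

A worker running something like `many_to_one` may need several more inputs before it can emit anything, which is the kind of imbalance that a strict fill/draw order over capped queues cannot absorb.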

To resolve that problem fundamentally, we are decoupling these 2 use cases into:

Preserving order: has N input and N output queues, with the input queues being uncapped (while the output queues are still capped at queue_buffer_size), meaning that the incoming iterator will be unrolled eagerly by the producer (until exhaustion).
Not preserving order: has 1 input queue and N output queues, with both input and output queues capped in size based on the queue_buffer_size configuration. This allows implementing back-pressure semantics where consumption speed limits production speed (and the amount of buffered data).
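
As a rough illustration of the non-order-preserving mode, the sketch below wires a producer, a pool of workers, and a consumer together with capped queues. It is a simplified stand-in and not the code in `python/ray/data/_internal/util.py`: the name `unordered_async_gen` is hypothetical, it uses a single shared output queue rather than N, it assumes `fn` exhausts its input iterator, and it omits exception propagation and cleanup.

```python
import queue
import threading
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

_SENTINEL = object()  # marks end-of-stream on both queues


def unordered_async_gen(
    base_iterator: Iterator[T],
    fn: Callable[[Iterator[T]], Iterator[U]],
    num_workers: int = 2,
    queue_buffer_size: int = 2,
) -> Iterator[U]:
    # Both queues are capped: a full queue blocks its writer, so a slow
    # consumer eventually stalls the workers and then the producer.
    input_queue: queue.Queue = queue.Queue(maxsize=queue_buffer_size)
    output_queue: queue.Queue = queue.Queue(maxsize=queue_buffer_size)

    def produce() -> None:
        for item in base_iterator:
            input_queue.put(item)
        # One end-of-stream marker per worker.
        for _ in range(num_workers):
            input_queue.put(_SENTINEL)

    def drain_input() -> Iterator[T]:
        # Presents the shared input queue to `fn` as a plain iterator.
        while True:
            item = input_queue.get()
            if item is _SENTINEL:
                return
            yield item

    def work() -> None:
        # `fn` is invoked once per worker and runs until its input ends.
        for out in fn(drain_input()):
            output_queue.put(out)
        output_queue.put(_SENTINEL)

    threads = [threading.Thread(target=produce, daemon=True)]
    threads += [threading.Thread(target=work, daemon=True) for _ in range(num_workers)]
    for t in threads:
        t.start()

    finished_workers = 0
    while finished_workers < num_workers:
        item = output_queue.get()
        if item is _SENTINEL:
            finished_workers += 1
        else:
            yield item


# Output order is not guaranteed, so sort before comparing.
doubled = sorted(unordered_async_gen(iter(range(10)), lambda it: (x * 2 for x in it)))
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because every worker pulls from the same input queue, a worker that needs many inputs before emitting simply takes more of them; nobody has to wait on a specific queue in a fixed order.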

Changes

Added a stress test that successfully reproduces deadlocks on the current implementation
Added preserve_ordering param
Adjusted semantics to handle the preserve_ordering=True/False scenarios separately
Beefed up existing tests
Tidying up

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin · 2025-03-25T03:56:58Z

python/ray/data/_internal/util.py

@@ -915,49 +918,14 @@ def make_async_gen(
    base_iterator: Iterator[T],
    fn: Callable[[Iterator[T]], Iterator[U]],
    num_workers: int = 1,
-    queue_buffer_size: int = 2,
+    queue_buffer_size: Optional[int] = None,


@raulchen I'd recommend reviewing this in isolation, as it was written from scratch


raulchen

LGTM

raulchen · 2025-03-26T20:45:53Z

python/ray/data/tests/block_batching/test_util.py

+        iter(range(3)),
+        _transform_b,
+    ):
+        pass


nit: maybe use multiple workers and test that the transform fn is entered at most once per worker.
Also, for simplicity in the multi-threading case, we can just use a counter instead of capturing the logs.
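
One possible shape for the counter-based check, as a sketch only: `CallCounter` and the thread harness below are hypothetical stand-ins, and in the actual test the wrapped fn would be handed to `make_async_gen` rather than to hand-rolled threads.

```python
import threading
from typing import Callable, Iterator, List


class CallCounter:
    """Wraps a transform fn and counts, thread-safely, how often it is called."""

    def __init__(self, fn: Callable[[Iterator[int]], Iterator[int]]) -> None:
        self._fn = fn
        self._lock = threading.Lock()
        self.calls = 0

    def __call__(self, it: Iterator[int]) -> Iterator[int]:
        # Count at call time, before the returned generator is iterated.
        with self._lock:
            self.calls += 1
        return self._fn(it)


def _double(it: Iterator[int]) -> Iterator[int]:
    for x in it:
        yield x * 2


num_workers = 4
counted = CallCounter(_double)
results: List[List[int]] = [[] for _ in range(num_workers)]


def _worker(idx: int) -> None:
    # Stand-in for one worker: call the transform once over its input slice.
    results[idx] = list(counted(iter(range(idx, 12, num_workers))))


threads = [threading.Thread(target=_worker, args=(i,)) for i in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The transform must have been entered at most once per worker.
assert counted.calls <= num_workers
```

Counting under a lock avoids parsing captured logs and stays correct when several workers enter the transform at the same time.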

raulchen · 2025-03-26T20:50:55Z

Can you update the PR description as the fix has changed?


alexeykudinkin added 4 commits March 24, 2025 20:18

Added test_make_async_gen_deadlock

906c17d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Cleaned up test_make_async_gen

034f5a1

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated fixtures

e1c138f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisiting make_async_gen to address issues with concurrency control

85b2093

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin requested a review from a team as a code owner March 25, 2025 03:53

Cleaned up

19aed03

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin commented Mar 25, 2025


alexeykudinkin added 3 commits March 24, 2025 22:07

Tidying up

e9c1d73

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Tidying up more

e82e1ce

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

ce1bc2f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin assigned raulchen Mar 25, 2025

alexeykudinkin changed the title ~~[WIP][Data] Revisiting make_async_gen to address issues with concurrency control for sequences of varying lengths~~ [Data] Revisiting make_async_gen to address issues with concurrency control for sequences of varying lengths Mar 25, 2025

alexeykudinkin added the go add ONLY when ready to merge, run all tests label Mar 25, 2025

raulchen approved these changes Mar 25, 2025


alexeykudinkin added 10 commits March 25, 2025 16:54

Added missing exception handling in the submitting thread

2ed9f1c

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Propagate exceptions from submitting thread to the consuming one

0832a4d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Pass failed future instead of passing exceptions directly

bff21e6

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added test asserting make_async_gen is not reentrant

d24c2a1

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated test;

56c2be2

Fixed fixture Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisited make_async_gen to make it non-reentrant

9034926

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Reverting to master

6ad7933

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed make_async_gen to utilize single input queue to avoid deadlocks

761c778

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Evolved test_make_async_gen_varying_seq_lengths

4ea49ce

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

e4339fa

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin commented Mar 26, 2025


Added comments

f443bf1

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

raulchen approved these changes Mar 26, 2025


alexeykudinkin added 2 commits March 26, 2025 14:40

Updating comments

08d1982

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Reverting back (to multiple queues)

2c4f223

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin added 11 commits March 26, 2025 17:30

Tidying up tests

23772b4

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisited make_async_gen to explicitly condition its semantic based…

87e9fa1

… on whether ordering has to be preserved; Updated docs Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated tests to validate both preserving/non-preserving cases

7d194b3

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed handling of empty queues

5779a84

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisited unordered iteration architecture to simply rely on a single…

aadab0b

… queue instead of N Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added assertions

7c03ede

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

0a1532d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed all usages;

e3f0a45

Tidying up Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

645f6c6

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Missing req param

819668d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Missing import

e891b27

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

raulchen merged commit b7dae2a into master Mar 27, 2025
5 checks passed

raulchen deleted the ak/asnc-gen-ddlk-fix branch March 27, 2025 18:23
