[data][train] Bug in SplitCoordinator: "assert self._output_iterator is not None" #45225

Closed
raulchen opened this issue May 9, 2024 · 0 comments · Fixed by #47176
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

raulchen commented May 9, 2024

This bug happens occasionally and looks like a race condition.

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 271, in prefetch_batches_locally
    next_block_ref_and_metadata = next(block_ref_iter)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 898, in __next__
    return next(self.it)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 79, in gen_blocks
    cur_epoch = ray.get(
ray.exceptions.RayTaskError(AssertionError): ray::SplitCoordinator.start_epoch() (pid=96843, ip=172.24.101.168, actor_id=4c22650eb39c06073f62b14408000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79550c01bf40>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 201, in start_epoch
    epoch_id = self._barrier(split_idx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 280, in _barrier
    assert self._output_iterator is not None
AssertionError
raulchen added the bug, P1, and data labels on May 9, 2024
raulchen self-assigned this on May 9, 2024
raulchen added a commit that referenced this issue May 9, 2024
Add debugging info to debug #45225.
The bug is hard to reproduce manually. Add debugging info, so that when
it happens, we have enough info to investigate the issue.
---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
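
For readers unfamiliar with the approach in the commit above, here is a minimal sketch of what "adding debugging info" to a bare assertion can look like. It is not the code from the referenced commit or from Ray: the class `DebuggableBarrier`, its fields, and the `debug_state` helper are all hypothetical names made up for illustration. The idea is only that a failing check should report the surrounding state (epoch, which splits have arrived, calling thread) instead of an empty `AssertionError`.

```python
import threading
from typing import Any, Dict, Optional, Set


class DebuggableBarrier:
    """Hypothetical stand-in for a coordinator; NOT Ray's SplitCoordinator."""

    def __init__(self, n: int) -> None:
        self._n = n
        self._lock = threading.Lock()
        self._output_iterator: Optional[Any] = None  # assumed shared state
        self._cur_epoch = -1
        self._arrived: Set[int] = set()

    def debug_state(self) -> Dict[str, Any]:
        # Snapshot of the fields that would help diagnose a race.
        return {
            "cur_epoch": self._cur_epoch,
            "arrived": sorted(self._arrived),
            "has_output_iterator": self._output_iterator is not None,
            "thread": threading.current_thread().name,
        }

    def check_iterator(self, split_idx: int) -> None:
        with self._lock:
            self._arrived.add(split_idx)
            if self._output_iterator is None:
                # Same condition as a bare assert, but with context attached.
                raise AssertionError(
                    f"output iterator missing for split {split_idx}: "
                    f"{self.debug_state()}"
                )
```

With a message like this, an intermittent failure leaves a record of the state at the moment the check failed, which matters when the bug cannot be reproduced on demand.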
HenryZJY pushed a commit to HenryZJY/ray that referenced this issue May 10, 2024