
Process OUT_OF_MEMORY metadata exceptions #191

Merged

omus merged 19 commits into main from cv/process-raylet-exceptions on Oct 13, 2023
Conversation

omus (Member) commented on Oct 7, 2023

Split out from #184.
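
A minimal sketch of the idea, assuming a simplified metadata encoding (the marker value, module name, and `check_metadata_exception` helper below are illustrative, not this PR's actual code): Ray flags error objects via their metadata, so the deserializer can inspect that field and throw a typed Julia exception instead of handing back opaque error bytes.

```julia
# Illustrative sketch only: the marker string and helper names are assumptions,
# not Ray.jl's real encoding or API.
module OOMSketch

struct OutOfMemoryError <: Exception
    msg::String
end

Base.showerror(io::IO, e::OutOfMemoryError) = print(io, "Ray.OutOfMemoryError: ", e.msg)

# Assumed marker: Ray tags error objects through the object's metadata field.
const OUT_OF_MEMORY_MARKER = "OUT_OF_MEMORY"

# Called while deserializing a fetched object: if the metadata flags an OOM
# kill, throw the typed exception (carrying the raylet's message) instead of
# returning data; otherwise let deserialization proceed.
function check_metadata_exception(metadata::AbstractString, error_msg::AbstractString)
    metadata == OUT_OF_MEMORY_MARKER && throw(OutOfMemoryError(error_msg))
    return nothing
end

end # module
```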

omus changed the title from "Process OutOfMemoryError metadata exceptions" to "Process OUT_OF_MEMORY metadata exceptions" on Oct 7, 2023
codecov bot commented on Oct 7, 2023

Codecov Report

Merging #191 (be76168) into main (c2d129a) will increase coverage by 0.24%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #191      +/-   ##
==========================================
+ Coverage   95.58%   95.82%   +0.24%     
==========================================
  Files          12       12              
  Lines         589      623      +34     
==========================================
+ Hits          563      597      +34     
  Misses         26       26              
Flag     Coverage Δ
Ray.jl   95.82% <100.00%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown.

Files                   Coverage Δ
src/Ray.jl              100.00% <ø> (ø)
src/exceptions.jl       100.00% <100.00%> (ø)
src/ray_serializer.jl   100.00% <100.00%> (ø)
src/runtime.jl          94.37% <100.00%> (+0.18%) ⬆️

... and 4 files with indirect coverage changes


omus (Member, Author) commented on Oct 10, 2023

Re-running the failure from #143 now looks like:

ERROR: LoadError: TaskFailedException

    nested task error: Ray.OutOfMemoryError: Task was killed due to the node running low on memory.
    Memory on the node (IP: 10.0.16.71, ID: 48c787db45bb6206e50b4558f42d955391e694f5a4d8887993c8b23e) where the task (task ID: 915865ef77f6566d16ddfbc75b39874da6b5731102000000, name=v0_4_4.process_segment, pid=159, memory used=3.29GB) was running was 51.34GB / 54.00GB (0.950803), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 045a239f45da7b905a0bbe85711208643ecc4f3248d5efd416de821f) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.16.71`. To see the logs of the worker, use `ray logs worker-045a239f45da7b905a0bbe85711208643ecc4f3248d5efd416de821f*out -ip 10.0.16.71. Top 10 memory users:
    PID	MEM(GB)	COMMAND
    417	3.86	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    493	3.69	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    162	3.68	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    415	3.38	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    165	3.32	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    159	3.29	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    158	3.17	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    164	3.16	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    498	3.11	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    157	3.07	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
    Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
    Stacktrace:
     [1] deserialize_from_ray_object(ray_obj::CxxWrap.StdLib.SharedPtrAllocated{Ray.ray_julia_jll.RayObject}, outer_object_ref::Ray.ObjectRef)
       @ Ray ~/dev/Ray/src/ray_serializer.jl:101
     [2] get(obj_ref::Ray.ObjectRef)
       @ Ray ~/dev/Ray/src/object_store.jl:31
     ...

...and 7 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:445
 [2] macro expansion
   @ ./task.jl:477 [inlined]
 ...
 [7] top-level scope
   @ /tmp/ray/session_2023-10-06_21-48-16_110249_9/runtime_resources/working_dir_files/_ray_pkg_738606109762db83/migration.jl:5
in expression starting at /tmp/ray/session_2023-10-06_21-48-16_110249_9/runtime_resources/working_dir_files/_ray_pkg_738606109762db83/migration.jl:5
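
With the metadata handling in place, callers can catch the typed exception directly. A hypothetical usage sketch (the `heavy_computation` task and the exact `submit_task` call shape are assumptions, not part of this PR):

```julia
# Hypothetical caller-side handling of an OOM-killed task.
using Ray

result_ref = Ray.submit_task(heavy_computation, (big_input,))  # illustrative task/args

try
    result = Ray.get(result_ref)
catch e
    e isa Ray.OutOfMemoryError || rethrow()
    # The raylet killed the worker for exceeding the node memory threshold;
    # reduce task parallelism or provision more memory before retrying.
    @warn "Task was OOM-killed" exception = (e, catch_backtrace())
end
```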

omus self-assigned this on Oct 10, 2023
omus marked this pull request as ready for review on Oct 10, 2023, 14:26
Resolved review threads: src/exceptions.jl (3), src/ray_serializer.jl (1, outdated)
Base branch automatically changed from cv/get_data_metadata to main on Oct 10, 2023, 19:26
omus merged commit 204dff1 into main on Oct 13, 2023
7 checks passed
omus deleted the cv/process-raylet-exceptions branch on Oct 13, 2023, 15:08