scv119 added the bug and triage labels on Sep 16, 2022
scv119 added the P0 and core labels and removed the triage label on Sep 16, 2022
The memory check is very strict, and cgroup readings can violate it. We should downgrade the check to a log message.
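A minimal sketch of what "downgrade the check to a log message" could look like, written in Python for illustration only (the actual code lives in memory_monitor.cc). It assumes the two values compared by the failing check are the cgroup total and used bytes, which this issue does not confirm; the function name `cgroup_memory_bytes` is hypothetical.

```python
import logging
from typing import Tuple

logger = logging.getLogger("memory_monitor_sketch")


def cgroup_memory_bytes(used_bytes: int, total_bytes: int) -> Tuple[int, int]:
    """Return (used, total), tolerating used >= total instead of aborting.

    Assumption: the fatal check compared total against used and required
    total > used strictly. Here that case is only logged, and the used
    value is clamped to the total.
    """
    if used_bytes >= total_bytes:
        logger.warning(
            "cgroup used bytes (%d) >= total bytes (%d); clamping instead of crashing",
            used_bytes,
            total_bytes,
        )
        used_bytes = total_bytes
    return used_bytes, total_bytes


# The values from the check failure in this issue: both sides were equal.
used, total = cgroup_memory_bytes(117896851456, 117896851456)
```

With this approach the monitor would report 100% usage for that sample and keep running, rather than terminating the raylet.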
Memory usage reached 100% even though the memory monitor was enabled: memory grew by another 1.54GB in the 10 seconds after the kill. The raylet logs are gone now, so we couldn't debug what happened.
(raylet) [2022-09-16 09:27:14,959 E 236 236] (raylet) node_manager.cc:2964: System memory low at node with IP 172.31.92.150. Used memory (108.26GB) / total capacity (109.80GB) (0.985935) exceeds threshold 0.95, killing latest task with name shuffle_map() and task ID 7ed7a67c70ea0389aaa0163194640bd63fc3fd4301000000 to avoid running out of memory.
2022-09-16 09:27:26,177 WARNING worker.py:1828 -- Raylet is terminated: ip=172.31.92.150, id=a6643e181ce30976f943473c5aefde654dff789f08cc0f69b7324b49. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access
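The numbers in the log line above can be checked with a short calculation. This is only a sketch using the rounded GB values printed in the log, so the fraction differs slightly from the 0.985935 that the raylet computed from exact byte counts; the helper names are hypothetical.

```python
def usage_fraction(used_gb: float, total_gb: float) -> float:
    """Fraction of total memory currently in use."""
    return used_gb / total_gb


def exceeds_threshold(used_gb: float, total_gb: float, threshold: float = 0.95) -> bool:
    """True when usage is above the kill threshold (0.95 in the log above)."""
    return usage_fraction(used_gb, total_gb) > threshold


USED_GB = 108.26   # "Used memory" from the log
TOTAL_GB = 109.80  # "total capacity" from the log

fraction = usage_fraction(USED_GB, TOTAL_GB)
headroom_gb = TOTAL_GB - USED_GB  # memory left before the node is full

print(f"usage fraction: {fraction:.4f}")
print(f"over threshold: {exceeds_threshold(USED_GB, TOTAL_GB)}")
print(f"headroom: {headroom_gb:.2f} GB")
```

The remaining headroom is about 1.54GB, which matches the growth described in the comment: the node went from the logged usage to full within roughly 10 seconds of the task kill.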
For the first issue: #28642
For the second issue, we will need to stress test to reproduce it.
What happened + What you expected to happen
https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_GASWyegwPsGW8V2H2iuFvt88?command-history-section=command_history
(raylet) [2022-09-16 09:27:25,719 C 236 236] (raylet) memory_monitor.cc:121: Check failed: (left > right) 117896851456 vs 117896851456
(raylet) *** StackTrace Information ***
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x485f6a) [0x55b0b8ce2f6a] ray::operator<<()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487a42) [0x55b0b8ce4a42] ray::SpdLogMessage::Flush()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487d57) [0x55b0b8ce4d57] ray::RayLog::~RayLog()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45cfc3) [0x55b0b8cb9fc3] ray::MemoryMonitor::GetCGroupMemoryBytes()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45ed9b) [0x55b0b8cbbd9b] ray::MemoryMonitor::GetMemoryBytes()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45edfc) [0x55b0b8cbbdfc] std::_Function_handler<>::_M_invoke()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4126bb) [0x55b0b8c6f6bb] ray::PeriodicalRunner::DoRunFnPeriodicallyInstrumented()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x413e10) [0x55b0b8c70e10] std::_Function_handler<>::_M_invoke()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x469da6) [0x55b0b8cc6da6] EventTracker::RecordExecution()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x41069e) [0x55b0b8c6d69e] ray::PeriodicalRunner::DoRunFnPeriodicallyInstrumented()::{lambda()#1}::operator()()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x41157f) [0x55b0b8c6e57f] boost::asio::detail::wait_handler<>::do_complete()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x994b6b) [0x55b0b91f1b6b] boost::asio::detail::scheduler::do_run_one()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996331) [0x55b0b91f3331] boost::asio::detail::scheduler::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996560) [0x55b0b91f3560] boost::asio::io_context::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x140ae4) [0x55b0b899dae4] main
(raylet) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f61a991e083] __libc_start_main
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x17d777) [0x55b0b89da777]
(raylet)
Versions / Dependencies
latest
Reproduction script
n/a
Issue Severity
High: It blocks me from completing my task.