[Core] memory_monitor crash #28579

Closed
scv119 opened this issue Sep 16, 2022 · 1 comment · Fixed by #28642
Labels: bug, core, P0

Comments

@scv119
Contributor

scv119 commented Sep 16, 2022

What happened + What you expected to happen

https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_GASWyegwPsGW8V2H2iuFvt88?command-history-section=command_history

(raylet) [2022-09-16 09:27:25,719 C 236 236] (raylet) memory_monitor.cc:121: Check failed: (left > right) 117896851456 vs 117896851456
(raylet) *** StackTrace Information ***
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x485f6a) [0x55b0b8ce2f6a] ray::operator<<()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487a42) [0x55b0b8ce4a42] ray::SpdLogMessage::Flush()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487d57) [0x55b0b8ce4d57] ray::RayLog::~RayLog()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45cfc3) [0x55b0b8cb9fc3] ray::MemoryMonitor::GetCGroupMemoryBytes()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45ed9b) [0x55b0b8cbbd9b] ray::MemoryMonitor::GetMemoryBytes()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x45edfc) [0x55b0b8cbbdfc] std::_Function_handler<>::_M_invoke()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4126bb) [0x55b0b8c6f6bb] ray::PeriodicalRunner::DoRunFnPeriodicallyInstrumented()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x413e10) [0x55b0b8c70e10] std::_Function_handler<>::_M_invoke()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x469da6) [0x55b0b8cc6da6] EventTracker::RecordExecution()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x41069e) [0x55b0b8c6d69e] ray::PeriodicalRunner::DoRunFnPeriodicallyInstrumented()::{lambda()#1}::operator()()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x41157f) [0x55b0b8c6e57f] boost::asio::detail::wait_handler<>::do_complete()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x994b6b) [0x55b0b91f1b6b] boost::asio::detail::scheduler::do_run_one()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996331) [0x55b0b91f3331] boost::asio::detail::scheduler::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996560) [0x55b0b91f3560] boost::asio::io_context::run()
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x140ae4) [0x55b0b899dae4] main
(raylet) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f61a991e083] __libc_start_main
(raylet) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x17d777) [0x55b0b89da777]
(raylet)

Versions / Dependencies

latest

Reproduction script

n/a

Issue Severity

High: It blocks me from completing my task.

@scv119 scv119 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 16, 2022
@scv119 scv119 added P0 Issues that should be fixed in short order core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 16, 2022
@clarng
Contributor

clarng commented Sep 20, 2022

Two issues:

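  • The memory check is too strict, and cgroup readings can violate it: the trace above shows both sides equal at 117896851456 bytes. We should downgrade the check to a log message.
  • Memory usage reached 100% even though the memory monitor was enabled. It grew by another 1.54GB (from 108.26GB to the 109.80GB capacity) in about 10 seconds after the monitor killed a task. The raylet logs are gone now, so we couldn't debug what happened.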

(raylet) [2022-09-16 09:27:14,959 E 236 236] (raylet) node_manager.cc:2964: System memory low at node with IP 172.31.92.150. Used memory (108.26GB) / total capacity (109.80GB) (0.985935) exceeds threshold 0.95, killing latest task with name shuffle_map() and task ID 7ed7a67c70ea0389aaa0163194640bd63fc3fd4301000000 to avoid running out of memory.
2022-09-16 09:27:26,177 WARNING worker.py:1828 -- Raylet is terminated: ip=172.31.92.150, id=a6643e181ce30976f943473c5aefde654dff789f08cc0f69b7324b49. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access

For the first issue: #28642
For the second issue, we will need a stress test to reproduce it.
