Ray crashes when writing large df to csv #2999
Comments
cc @modin-project/modin-ray |
Can you post this on the Ray GitHub? |
Also, what's the Ray version? |
The Ray version is 1.1.0. I'll make an issue on the Ray repo. |
Can you first verify whether the issue is reproducible with Ray master? |
Upgrading to 1.2.0 fixed this one for me (upgrading to 1.6.0 caused other issues, so I left it at that) |
Interesting, this appears closely related to #3256. Were you on Windows by chance @OliverColeman? |
@OliverColeman Can you tell me what the issues are now? |
@devin-petersohn Nope, Ubuntu 20.04 inside a Docker container running on an Ubuntu host |
So I retried this on the latest Ray and Modin (Modin 0.16.1 and Ray 2.0.0), and I got the following output:
(raylet) Spilled 2067 MiB, 8 objects, write throughput 43 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4314 MiB, 17 objects, write throughput 43 MiB/s.
(raylet) Spilled 8326 MiB, 36 objects, write throughput 44 MiB/s.
(raylet) Spilled 16902 MiB, 75 objects, write throughput 85 MiB/s.
(_deploy_ray_func pid=4747) E1018 22:41:54.592428436 4960 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4748) E1018 22:41:55.114165888 4839 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4746) E1018 22:42:02.685787618 4853 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4751) E1018 22:42:03.383642485 4961 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4750) E1018 22:42:03.410378847 4931 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(raylet) Spilled 32885 MiB, 157 objects, write throughput 88 MiB/s.
(raylet) Spilled 65677 MiB, 297 objects, write throughput 76 MiB/s.
2022-10-18 23:08:00,203 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: c0d14f93a0ea027bd2c3881341f021db17bec49901000000 Worker ID: 7be584e359d375b82f51588b29090b5034ec50556a50e4ecd6915be7 Node ID: 8e8681a1b6ba49acd447c72af8c9796005fcb29b10d17c6f87a2353f Worker IP address: 172.31.54.137 Worker port: 46763 Worker PID: 4750 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2022-10-18 23:08:41,887 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 303353265faac38e8d547085ef220bce000885c201000000 Worker ID: 92df65dd550ffc678a5e8c80cb374f4b500c30877e725ecbfc2fe637 Node ID: 8e8681a1b6ba49acd447c72af8c9796005fcb29b10d17c6f87a2353f Worker IP address: 172.31.54.137 Worker port: 44965 Worker PID: 4748 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
After that, it hung for about an hour before I killed it. When I checked afterwards, the file was not present. |
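For a sense of scale, the spill lines in the output above can be tallied with a short script. This is only a sketch: the regular expression and the helper name `parse_spill_lines` are illustrative, and it assumes the reported figures are cumulative totals (their roughly doubling pattern suggests so, but the log itself does not say).

```python
import re

# Matches raylet spill lines in the format seen in the log above.
SPILL_RE = re.compile(
    r"\(raylet\) Spilled (\d+) MiB, (\d+) objects, write throughput (\d+) MiB/s"
)


def parse_spill_lines(log_text):
    """Return (spilled_mib, objects, throughput_mib_s) tuples in log order."""
    return [tuple(int(g) for g in m.groups()) for m in SPILL_RE.finditer(log_text)]


# Spill lines copied from the output above.
LOG = """\
(raylet) Spilled 2067 MiB, 8 objects, write throughput 43 MiB/s.
(raylet) Spilled 4314 MiB, 17 objects, write throughput 43 MiB/s.
(raylet) Spilled 8326 MiB, 36 objects, write throughput 44 MiB/s.
(raylet) Spilled 16902 MiB, 75 objects, write throughput 85 MiB/s.
(raylet) Spilled 32885 MiB, 157 objects, write throughput 88 MiB/s.
(raylet) Spilled 65677 MiB, 297 objects, write throughput 76 MiB/s.
"""

records = parse_spill_lines(LOG)

# Assumption: the last line is the running total at the time the workers died.
last_mib, last_objects, _ = records[-1]
spilled_gib = last_mib / 1024
avg_object_mib = last_mib / last_objects

print(f"spilled so far: {spilled_gib:.1f} GiB across {last_objects} objects")
print(f"average spilled object size: {avg_object_mib:.0f} MiB")
```

Under that assumption, the last line corresponds to roughly 64 GiB spilled to disk, well above the roughly 23 GB dataframe described in the issue.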
Hey @RehanSD, could you create an issue on the Ray GitHub? This looks bad. |
Can do. We're getting similar bugs when sorting a 10M x 100 dataframe: Ray workers seem to OOM and die when the object store has 14.98 GB, and only quadrupling the object store gets rid of the OOMs. I opened an issue here: ray-project/ray#29668 |
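As a rough illustration of the "quadruple the object store" workaround, the sketch below computes the larger size and passes it to `ray.init` through its `object_store_memory` argument (in bytes) before Modin is used. The helper names are hypothetical, treating "14.98 GB" as decimal gigabytes is an assumption, and the comment above does not say how the object store was actually resized.

```python
GB = 10**9  # assumption: the "14.98 GB" figure is in decimal gigabytes


def scaled_object_store_bytes(current_gb, factor):
    """Return the object store size in bytes after scaling by `factor`."""
    return int(round(current_gb * GB * factor))


def init_ray_with_larger_store(current_gb=14.98, factor=4):
    """Start Ray with a larger object store (hypothetical helper).

    Ray is imported lazily so the size calculation above can be checked
    on a machine that does not have Ray installed.
    """
    import ray

    ray.init(object_store_memory=scaled_object_store_bytes(current_gb, factor))


# Size only; call init_ray_with_larger_store() to actually start Ray.
target_bytes = scaled_object_store_bytes(14.98, 4)
print(f"object store target: {target_bytes / GB:.2f} GB")
```

The machine needs enough RAM (or shared memory) to back the larger store, so this trades memory headroom for fewer spills rather than fixing the underlying problem.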
@RehanSD it really depends on your implementation. For example, the library (Modin) could be using a lot of heap memory during the sort, which would likely lead to an OOM. We have an experimental feature available in Ray master and in the next release (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html). I'd suggest enabling it; it will give you more information about whether the application is really using excessive memory. |
System information
Modin version (modin.__version__): pip installed from branch rehan/issues/2656
Describe the problem
Ray seems to crash when I try to write a large data frame to a CSV. This data frame is about 23 GB.
Source code / logs