
Ray crashes when writing large df to csv #2999

Open
RehanSD opened this issue Apr 20, 2021 · 13 comments
Labels
bug 🦗 Something isn't working · P1 Important tasks that we should complete soon · Performance 🚀 Performance related issues and pull requests.

Comments

@RehanSD
Collaborator

RehanSD commented Apr 20, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • Modin version (modin.__version__): pip installed from branch rehan/issues/2656
  • Python version: 3.8
  • Code we can use to reproduce:
import modin.pandas as pd

df = pd.concat([
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv", quoting=3),
    pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv", quoting=3),
])

df.to_csv('nyc_taxi.csv', index=False)
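
A possible knob to experiment with here (not part of the original report): Modin will attach to an already-initialized Ray instance, so initializing Ray explicitly before importing Modin lets you control the object store size for a write of this scale. A minimal sketch; the 32 GB figure is illustrative and assumes the machine has that much memory to spare:

import ray

# Size the object store explicitly; Modin will reuse this Ray instance
# instead of starting its own. 32 GB is an arbitrary illustrative value.
ray.init(object_store_memory=32 * 1024**3)  # in bytes

import modin.pandas as pd  # imported after ray.init so Modin connects to this cluster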

Describe the problem

Ray seems to crash when I try to write a large data frame to a CSV. This data frame is about 23 GB.

Source code / logs

(pid=86889) [2021-04-20 19:47:02,782 C 86889 88394] core_worker.cc:190:  Check failed: instance_ The core worker process is not initialized yet or already shutdown.
(pid=86889) *** StackTrace Information ***
(pid=86889)     @     0x7f3db0e9b795  google::GetStackTraceToString()
(pid=86889)     @     0x7f3db0e12efe  ray::GetCallTrace()
(pid=86889)     @     0x7f3db0e38304  ray::RayLog::~RayLog()
(pid=86889)     @     0x7f3db09fe4e2  ray::CoreWorkerProcess::EnsureInitialized()
(pid=86889)     @     0x7f3db0a07352  ray::CoreWorkerProcess::GetCoreWorker()
(pid=86889)     @     0x7f3db095f8c8  __pyx_pw_3ray_7_raylet_10CoreWorker_61profile_event()
(pid=86889)     @     0x55a3594e81c7  method_vectorcall_VARARGS_KEYWORDS
(pid=86889)     @     0x55a35944375e  _PyEval_EvalFrameDefault.cold.2790
(pid=86889)     @     0x55a3594cda92  _PyEval_EvalCodeWithName
(pid=86889)     @     0x55a3594ce943  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a35944377f  _PyEval_EvalFrameDefault.cold.2790
(pid=86889)     @     0x55a3594ce86b  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a35944375e  _PyEval_EvalFrameDefault.cold.2790
(pid=86889)     @     0x55a3594ce86b  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a3594ceee7  method_vectorcall
(pid=86889)     @     0x55a359480041  PyVectorcall_Call
(pid=86889)     @     0x55a35950599b  _PyEval_EvalFrameDefault
(pid=86889)     @     0x55a3594ce86b  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a35944375e  _PyEval_EvalFrameDefault.cold.2790
(pid=86889)     @     0x55a3594ce86b  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a35944375e  _PyEval_EvalFrameDefault.cold.2790
(pid=86889)     @     0x55a3594ce86b  _PyFunction_Vectorcall.localalias.355
(pid=86889)     @     0x55a3594ceee7  method_vectorcall
(pid=86889)     @     0x55a359480041  PyVectorcall_Call
(pid=86889)     @     0x55a3595788be  t_bootstrap
(pid=86889)     @     0x55a359525708  pythread_wrapper
(pid=86889)     @     0x7f3db20a6609  start_thread
(pid=86889)     @     0x7f3db1fcd293  clone
(pid=86889)
@RehanSD RehanSD added the bug 🦗 Something isn't working label Apr 20, 2021
@RehanSD RehanSD changed the title Modin crashes when writing large df to csv Ray crashes when writing large df to csv Apr 20, 2021
@devin-petersohn
Collaborator

cc @modin-project/modin-ray

@rkooo567
Collaborator

Can you guys post this to the Ray GitHub?

@rkooo567
Collaborator

Also, what's the Ray version?

@RehanSD
Collaborator Author

RehanSD commented Apr 20, 2021

The Ray version is 1.1.0. I'll make an issue on the Ray repo.

@rkooo567
Collaborator

Can you first verify whether the issue is reproducible with Ray master?

@OliverColeman

Upgrading to 1.2.0 fixed this one for me (upgrading to 1.6.0 caused other issues, so I left it at that).

@devin-petersohn
Collaborator

Interesting, this appears closely related to #3256. Were you on Windows by chance, @OliverColeman?

@rkooo567
Collaborator

Upgrading to 1.2.0 fixed this one for me (upgrading to 1.6.0 caused other issues, so I left it at that).

@OliverColeman Is it possible to tell me what those issues are?

@OliverColeman

@devin-petersohn Nope, Ubuntu 20.04 inside a Docker container running on an Ubuntu host

@mvashishtha mvashishtha added Performance 🚀 Performance related issues and pull requests. P1 Important tasks that we should complete soon labels Oct 12, 2022
@RehanSD
Collaborator Author

RehanSD commented Oct 18, 2022

So I retried this on the latest Ray and Modin (Modin 0.16.1 and Ray 2.0.0), and I got the following output:

(raylet) Spilled 2067 MiB, 8 objects, write throughput 43 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4314 MiB, 17 objects, write throughput 43 MiB/s.
(raylet) Spilled 8326 MiB, 36 objects, write throughput 44 MiB/s.
(raylet) Spilled 16902 MiB, 75 objects, write throughput 85 MiB/s.
(_deploy_ray_func pid=4747) E1018 22:41:54.592428436    4960 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4748) E1018 22:41:55.114165888    4839 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4746) E1018 22:42:02.685787618    4853 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4751) E1018 22:42:03.383642485    4961 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(_deploy_ray_func pid=4750) E1018 22:42:03.410378847    4931 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(raylet) Spilled 32885 MiB, 157 objects, write throughput 88 MiB/s.
(raylet) Spilled 65677 MiB, 297 objects, write throughput 76 MiB/s.
2022-10-18 23:08:00,203	WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: c0d14f93a0ea027bd2c3881341f021db17bec49901000000 Worker ID: 7be584e359d375b82f51588b29090b5034ec50556a50e4ecd6915be7 Node ID: 8e8681a1b6ba49acd447c72af8c9796005fcb29b10d17c6f87a2353f Worker IP address: 172.31.54.137 Worker port: 46763 Worker PID: 4750 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2022-10-18 23:08:41,887	WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 303353265faac38e8d547085ef220bce000885c201000000 Worker ID: 92df65dd550ffc678a5e8c80cb374f4b500c30877e725ecbfc2fe637 Node ID: 8e8681a1b6ba49acd447c72af8c9796005fcb29b10d17c6f87a2353f Worker IP address: 172.31.54.137 Worker port: 44965 Worker PID: 4748 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

and then it hung for about an hour before I killed it. After killing it, I checked, and the file was not present.

@rkooo567
Collaborator

Hey @RehanSD, is it possible to create an issue on the Ray GitHub? This looks bad.

@RehanSD
Collaborator Author

RehanSD commented Oct 25, 2022

Can do. We're hitting similar bugs when sorting a 10M x 100 dataframe: Ray workers seem to be OOMing and dying when the object store has 14.98 GB, and only by quadrupling the object store do we get rid of the OOMs.

I opened an issue here: ray-project/ray#29668
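
For context, a hedged sketch of the sort workload described above, using synthetic data (the actual contents of the 10M x 100 frame are not given in this thread); the object store can be enlarged with the ray.init pattern shown earlier in the thread before running it:

import numpy as np
import modin.pandas as pd

# Synthetic stand-in for the 10M x 100 dataframe from the comment above.
df = pd.DataFrame(np.random.rand(10_000_000, 100))

# Sorting is the shuffle-heavy operation that reportedly OOMs the workers.
df = df.sort_values(by=0)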

@scv119

scv119 commented Oct 26, 2022

@RehanSD it really depends on your implementation; i.e., it could be that the library (Modin) uses a lot of heap memory when doing the sort, which likely leads to OOM.

We have an experimental feature available in Ray master and the next release: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html. I'd suggest enabling it; it will give you a bit more information about whether the application is really using excessive memory.
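
For readers landing here later, a hedged sketch of turning that memory monitor on. The environment variable names below are taken from the linked memory-monitor documentation for recent Ray releases; they may differ in the experimental build referred to above, so treat them as assumptions and check the doc for your version:

import os

# Assumed variable names from the memory-monitor docs; set them before Ray
# starts so the raylet process inherits them.
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # non-zero enables the monitor
os.environ["RAY_memory_usage_threshold"] = "0.9"     # act when node memory exceeds 90%

import ray
ray.init()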
