dfsql tests failing on Windows/MacOS after the last modin update #3256
The problem is definitely Ray related, as disabling Ray fixes the tests. Might be related to this:
cc @modin-project/modin-ray
@btseytlin the Windows issue will likely take a bit longer to fix, but you seem to have run into this problem on macOS too? Could you provide a stack trace/relevant logs for that as well?
Unfortunately there is no stack trace on macOS: tests just hang until my timeout kills them. Disabling Ray fixes them on macOS too. My current hypothesis is that there is some kind of deadlock happening on the Ray side on certain OSes.
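For reference, "disabling Ray" here amounts to pointing Modin at a different execution engine, e.g. via Modin's documented `MODIN_ENGINE` environment variable (a sketch; the exact mechanism used in dfsql's tests may differ):

```shell
# Select Modin's execution engine before importing modin.pandas.
# "dask" keeps execution parallel; "python" falls back to serial pandas,
# which is useful for isolating Ray-specific hangs like this one.
export MODIN_ENGINE=dask    # or: export MODIN_ENGINE=python
```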
Is it possible to obtain the Ray logs from your GitHub Actions CI? (Log files are usually in /tmp/ray/session_latest/logs.) Also, is there a way to repro the environment?
Oh, and also, we've fixed one critical segfault issue recently. Is it possible to check whether the issue is reproducible on Ray master? It would also be great if you could open an issue on our GitHub: https://github.com/ray-project/ray/issues
@rkooo567 I will try to get the logs and try using Ray master. I am still not 100% sure it's a Ray issue, but if nothing works I will open an issue just in case.
@rkooo567 the issue persists when using the Ray nightly build
@rkooo567 you can get the saved Ray logs in the artifacts of this CI job:
Finally, the issue turned out to be Modin related. Ray hung when I was trying to write a Modin DataFrame to disk while using Ray. It seems there was some kind of deadlock, but I am still not sure. For now I resolved it by using pandas to write to disk, after which the issue was gone. Still, it's worth investigating why that was happening.
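A sketch of that workaround (helper name hypothetical): route the write through plain pandas instead of Modin's parallel `to_csv`. Modin DataFrames expose a private `_to_pandas()` conversion, and the guard below lets the same helper accept plain pandas frames too:

```python
import pandas as pd

def write_csv_via_pandas(df, path, **kwargs):
    # Hypothetical helper: if this is a Modin DataFrame, materialize it as a
    # plain pandas DataFrame first so the write bypasses the Ray workers.
    if hasattr(df, "_to_pandas"):
        df = df._to_pandas()
    df.to_csv(path, **kwargs)

write_csv_via_pandas(pd.DataFrame({"a": [1, 2, 3]}), "out.csv", index=False)
```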
Thanks @btseytlin! @anmyachev are you aware of this hanging behavior?
@devin-petersohn I will take a look
Hello @btseytlin! I tried to reproduce your problem. To do this, I took the current dfsql master. I saw messages like the ones described above, but I could not reproduce the hanging behavior. @btseytlin, can you try going back to using Ray and check whether it works now?
@anmyachev what do you mean by "going back to using Ray"? I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.
Is it happening on Windows? It could be a bug in Ray on Windows (I feel like I saw that error before). Also, is it possible to repro this issue with older Ray versions, like 1.2? If it doesn't repro there, we will figure out a way to fix it quickly.
I meant reverting the https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26 line to its previous state (to use Ray to write information to disk).
I also tried that.
Unfortunately I am on Linux, and this error doesn't happen on Linux. It did happen on Windows and macOS in GitHub CI. Since it's in CI, it's very awkward to reproduce and leaves no way to debug, but I can try. Switching from release to nightly Ray builds didn't change anything, but I didn't try older Ray versions.
We can try in CI again. I created PR mindsdb/dfsql#31.
Great, let's try.
I was able to reproduce the hanging behavior locally (reproducibility is not 100%).

Environment:

```
conda env create -f environment-dev.yml
set MODIN_CPUS=4
set MODIN_ENGINE=ray
pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s
```

Simplified reproducer (added as `test_hanging_behavior` to `TestCsv` in `modin/pandas/test/test_io.py`):

```python
def test_hanging_behavior(self):
    for i in range(16):
        # print("to_csv")
        pd.DataFrame([1, 2, 3, 4]).to_csv("initial-data.csv", index=False)
        # print("read_csv")
        df = pd.read_csv("initial-data.csv")
        # print("isnull, all, axis=1")
        df.index[df.isnull().all(axis=1)].values.tolist()
        # print("isnull, all, axis=0")
        df.columns[df.isnull().all(axis=0)].values.tolist()
```

Logs:

```
...\modin>pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s
=============================================== test session starts ===============================================
platform win32 -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- ...\Miniconda3\envs\modin\python.exe
cachedir: .pytest_cache
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: ...\modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.3.0, xdist-2.3.0
collected 1 item

modin/pandas/test/test_io.py::TestCsv::test_just_test to_csv
read_csv
(pid=20528) Windows fatal exception: access violation
(pid=20528)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=15432) Windows fatal exception: access violation
(pid=15432)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23128) Windows fatal exception: access violation
(pid=23128)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=3412) Windows fatal exception: access violation
(pid=3412)
isnull, all, axis=0
to_csv
read_csv
(pid=20984) Windows fatal exception: access violation
(pid=20984)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
(pid=7096) Windows fatal exception: access violation
(pid=7096)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23464) Windows fatal exception: access violation
(pid=23464)
isnull, all, axis=0
to_csv
read_csv
(pid=12504) Windows fatal exception: access violation
(pid=12504)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=11500) Windows fatal exception: access violation
(pid=11500)
isnull, all, axis=0
to_csv
read_csv
(pid=19948) Windows fatal exception: access violation
(pid=19948)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=20848) Windows fatal exception: access violation
(pid=20848)
isnull, all, axis=0
to_csv
2021-08-25 20:03:18,393 WARNING worker.py:1189 -- The actor or task with ID c6cf2fddfe5e7c90b398e5da6a4450ee63f746a18d1ec44e cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining
{4.000000/4.000000 CPU, 13.969839 GiB/13.969839 GiB memory, 13.969839 GiB/13.969839 GiB object_store_memory, 1.000000/1.000000 node:10.147.230.30}
. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources
being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
```

@rkooo567 did this clarify anything?
You said this is only happening on Windows, right? If so, it should be related to the access violation error. Unfortunately I don't have a Windows machine to fix it. Is it possible to file an issue in our GitHub?
@rkooo567 yes, only on Windows. Sure :) I will create the issue in the Ray repo.
@anmyachev @rkooo567 has this issue been resolved?
@RehanSD it should be
System information

- Modin version (`modin.__version__`): 0.10.1

Describe the problem
We have been getting hanging unit tests in GitHub Actions since upgrading to the latest Modin. I haven't been able to find what the problem is exactly; tests just hang forever.
I am creating this issue in case the problem is Modin-related.
Source code / logs
Adding timeouts to the tests revealed the following logs on Windows:
It might be related to Ray saving logs, as in this issue. It is strange that that issue is old, yet these messages didn't appear earlier.
On Windows + Python 3.7 (but not 3.8) this segfault happens, which seems to be Ray/Modin related:
The full logs are available here: https://github.com/mindsdb/dfsql/pull/19/checks?check_run_id=3112462155
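The per-test timeouts mentioned above can be sketched without any pytest plugin (a simplified illustration; function name is hypothetical): run the suspect call in a daemon thread and treat failure to finish within a deadline as a hang. The thread itself cannot be killed, but the test can fail fast instead of blocking CI forever:

```python
import threading

def finishes_within(target, timeout_s, *args, **kwargs):
    """Run target(*args, **kwargs) in a daemon thread and report whether
    it completed within timeout_s seconds (False suggests a hang)."""
    worker = threading.Thread(target=target, args=args, kwargs=kwargs, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return not worker.is_alive()

# A fast call completes well inside the deadline.
ok = finishes_within(lambda: sum(range(10_000)), timeout_s=5.0)
```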