dfsql tests failing on Windows/MacOS after the last modin update #3256

Closed
btseytlin opened this issue Jul 20, 2021 · 27 comments
Labels
Needs more information ❔ Issues that require more information from the reporter P3 Very minor bugs, or features we can hopefully add some day. Testing 📈 Issues related to testing

Comments

@btseytlin
Contributor

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows, MacOS
  • Modin version (modin.__version__): 0.10.1
  • Python version: 3.7, 3.8
  • Code we can use to reproduce:

Describe the problem

We have been getting hanging unit tests in GitHub Actions since upgrading to the latest Modin. I haven't been able to pin down exactly what the problem is; the tests just hang forever.

I am creating this issue in case the problem is Modin-related.

Source code / logs

Adding timeouts to the tests revealed logs like the following on Windows:

(pid=6528) Windows fatal exception: access violation
(pid=6528) 
(pid=420) Windows fatal exception: access violation
(pid=420) 
(pid=5080) Windows fatal exception: access violation
(pid=5080) 
(pid=3592) Windows fatal exception: access violation
(pid=3592) 
(pid=6112) Windows fatal exception: access violation
(pid=6112) 
(pid=5660) Windows fatal exception: access violation
(pid=5660) 
(pid=6404) Windows fatal exception: access violation
(pid=6404) 
(pid=3924) Windows fatal exception: access violation
(pid=3924) 
(pid=3684) Windows fatal exception: access violation
(pid=3684) 

It might be related to Ray saving logs, as in this issue. Oddly, that issue is old, yet these messages did not appear earlier.
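
For reference, per-test timeouts of the kind used above can be added with the pytest-timeout plugin; a minimal sketch (the 60-second limit and the test body are illustrative assumptions, not dfsql's actual setup):

# Requires the pytest-timeout plugin (pip install pytest-timeout).
import modin.pandas as pd
import pytest

@pytest.mark.timeout(60)  # abort this test after 60 seconds instead of hanging the CI job forever
def test_to_csv_does_not_hang(tmp_path):
    df = pd.DataFrame({"a": [1, 2, 3]})
    df.to_csv(str(tmp_path / "out.csv"), index=False)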

On Windows with Python 3.7 (but not 3.8), the following segfault happens, which seems to be Ray/Modin related:

Thread 0x00001b24 (most recent call first):
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\ray\worker.py", line 1637 in wait
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\ray\_private\client_mode_hook.py", line 62 in wrapper
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\engines\ray\generic\io.py", line 198 in to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\data_management\factories\factories.py", line 398 in _to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\data_management\factories\dispatcher.py", line 267 in to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\pandas\base.py", line 2513 in to_csv
  File "d:\a\dfsql\dfsql\dfsql\__init__.py", line 25 in sql_query
  File "d:\a\dfsql\dfsql\dfsql\extensions.py", line 66 in __call__
  File "D:\a\dfsql\dfsql\tests\test_extensions.py", line 47 in test_df_sql_nested_select_in
...
D:\a\_temp\1ac47c42-bfc7-4fcd-b689-944e647c7102.sh: line 1:  1841 Segmentation fault 

The full logs are available here: https://github.com/mindsdb/dfsql/pull/19/checks?check_run_id=3112462155

@btseytlin btseytlin added the bug 🦗 Something isn't working label Jul 20, 2021
@btseytlin
Contributor Author

btseytlin commented Jul 21, 2021

The problem is definitely Ray-related: using MODIN_ENGINE=dask removes the issue.
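
For anyone else hitting this, the engine is typically selected before modin.pandas is first imported; a minimal sketch of the workaround (the modin.config form is shown as a comment and assumes a Modin version that exposes it):

import os

# Select the execution engine before the first import of modin.pandas.
os.environ["MODIN_ENGINE"] = "dask"

# Roughly equivalent programmatic form in recent Modin versions:
# from modin.config import Engine
# Engine.put("dask")

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv("out.csv", index=False)  # now executed on the Dask engine instead of Ray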

Might be related to this:
ray-project/ray#17239

@devin-petersohn
Collaborator

cc @modin-project/modin-ray

@wuisawesome
Collaborator

@btseytlin the Windows issue will likely take a bit longer to fix, but you seem to have run into this problem on macOS too?

Could you provide a stack trace/relevant logs for that too?

@btseytlin
Contributor Author

@btseytlin the Windows issue will likely take a bit longer to fix, but you seem to have run into this problem on macOS too?

Could you provide a stack trace/relevant logs for that too?

Unfortunately there is no stack trace on macOS: the tests just hang until my timeout kills them:
https://github.com/mindsdb/dfsql/runs/3112681341

Disabling Ray fixes them on macOS too.

My current hypothesis is that some kind of deadlock is happening on the Ray side on certain OSes.

@rkooo567
Collaborator

rkooo567 commented Jul 23, 2021

Is it possible to obtain the Ray logs from your GitHub Actions CI? (Log files are usually in /tmp/ray/session_latest/logs.)
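
If it helps, a minimal sketch for bundling that directory into a single archive that CI can then upload as an artifact (the archive name is arbitrary; only the log path comes from the comment above):

import shutil

# Zip Ray's session logs so they can be attached to the CI run as an artifact.
shutil.make_archive("ray-session-logs", "zip", "/tmp/ray/session_latest/logs")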

Also, is there a way to repro the environment?

@rkooo567
Collaborator

Oh, and also: we've fixed one critical segfault issue recently. Is it possible to check whether the issue is reproducible on Ray master? It would also be great if you could open an issue in our GitHub: https://github.com/ray-project/ray/issues

@btseytlin
Contributor Author

Oh, and also: we've fixed one critical segfault issue recently. Is it possible to check whether the issue is reproducible on Ray master? It would also be great if you could open an issue in our GitHub: https://github.com/ray-project/ray/issues

@rkooo567 I will try to get the logs and try using Ray master. I am still not 100% sure it's a Ray issue, but if nothing works I will open an issue just in case.

@btseytlin
Contributor Author

@rkooo567 the issue persists when using the Ray nightly build.

@btseytlin
Contributor Author

@rkooo567 you can get the saved Ray logs from the artifacts of this CI job:
https://github.com/mindsdb/dfsql/actions/runs/1066780608

@btseytlin
Contributor Author

In the end the issue turned out to be Modin-related: Ray hung when I tried to write a Modin DataFrame to disk while using the Ray engine. It seems there was some kind of deadlock, but I am still not sure.

For now I resolved it by using pandas to write to disk, after which the issue was gone:
https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26
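
Roughly, the change in that diff rebuilds the frame with plain pandas before writing, so the CSV write no longer goes through Ray; a sketch of that shape (the function name and variable names are illustrative, not the exact dfsql code):

from pandas import DataFrame as PandasDataFrame

def write_to_disk(dataframe, tmp_fpath):
    # Rebuild the (possibly Modin) frame as a plain pandas DataFrame so the
    # write happens in the driver process instead of inside a Ray worker.
    PandasDataFrame(
        dataframe.values, columns=dataframe.columns, index=dataframe.index
    ).to_csv(tmp_fpath, index=False)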

Still, it's worth investigating why that was happening.

@devin-petersohn
Collaborator

Thanks @btseytlin!

@anmyachev Are you aware of hanging behavior on to_csv?

@anmyachev
Collaborator

@devin-petersohn I will take a look

@anmyachev
Collaborator

Hello @btseytlin!

I tried to reproduce your problem. To do this, I took the current dfsql master (dbe5979f2e9b79de196cc17143b66b11e06851ba) and reverted the line PandasDataFrame(dataframe.values, columns=dataframe.columns, index=dataframe.index).to_csv(tmp_fpath, index=False) to the old version. test_df_sql_nested_select_in[modin] was checked with ray==1.4.1 and ray==1.5.2.

I saw messages like Windows fatal exception: access violation, but this problem is related to Ray (as you wrote).

I could not reproduce the hanging behavior of to_csv. Moreover, I looked at the DataFrame used in this test. It is small (shape (9, 12)) and fits into one partition, so there should not be a deadlock on the Modin side, since writing to disk occurs in a single process.
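
As a sanity check, the partition count can be inspected from the API layer; a minimal sketch (assuming a Modin version that provides unwrap_partitions; the zero-filled frame only stands in for the test's actual data):

import numpy as np
import modin.pandas as pd
from modin.distributed.dataframe.pandas import unwrap_partitions

df = pd.DataFrame(np.zeros((9, 12)))   # same shape as the frame used in the dfsql test
parts = unwrap_partitions(df, axis=0)  # futures for the row partitions
print(len(parts))                      # per the comment above, a frame this small should land in one partition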

@btseytlin can you try going back to using Ray and check whether it works now?

@anmyachev anmyachev added the Needs more information ❔ Issues that require more information from the reporter label Aug 22, 2021
@btseytlin
Contributor Author

@anmyachev what do you mean by "going back to using Ray"?

I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.

@rkooo567
Collaborator

Is it happening on Windows? It could be a bug in Ray on Windows (I feel like I've seen that error before). Also, is it possible to repro this issue with older Ray versions, like 1.2? If it doesn't reproduce there, we will figure out a way to fix it quickly.

@anmyachev
Collaborator

@anmyachev what do you mean by "going back to using Ray"?

I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.

I meant returning the https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26 line to its previous state (i.e., using Ray to write the data to disk).

@anmyachev
Collaborator

Is it happening on Windows? It could be a bug in Ray on Windows (I feel like I've seen that error before). Also, is it possible to repro this issue with older Ray versions, like 1.2? If it doesn't reproduce there, we will figure out a way to fix it quickly.

I also tried ray==1.2.0. The segmentation fault is not reproducible there.

@btseytlin
Contributor Author

@anmyachev what do you mean by "going back to using Ray"?
I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.

I meant returning the https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26 line to its previous state (i.e., using Ray to write the data to disk).

Unfortunately I am on Linux, and on Linux this error doesn't happen. It did happen on Windows and macOS in GitHub CI. Since it's in CI, it's very awkward to reproduce and leaves no way to debug, but I can try.

Switching from release to nightly Ray builds didn't change anything, but I didn't try older Ray versions.

@anmyachev
Collaborator

@anmyachev what do you mean by "going back to using Ray"?
I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.

I meant returning the https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26 line to its previous state (i.e., using Ray to write the data to disk).

Unfortunately I am on Linux, and on Linux this error doesn't happen. It did happen on Windows and macOS in GitHub CI. Since it's in CI, it's very awkward to reproduce and leaves no way to debug, but I can try.

Switching from release to nightly Ray builds didn't change anything, but I didn't try older Ray versions.

We can try in CI again. I created PR mindsdb/dfsql#31.

@btseytlin
Contributor Author

@anmyachev what do you mean by "going back to using Ray"?
I tried to investigate the Ray side of things, but I couldn't find the problem; I don't have enough knowledge of Ray, unfortunately.

I meant returning the https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26 line to its previous state (i.e., using Ray to write the data to disk).

Unfortunately I am on Linux, and on Linux this error doesn't happen. It did happen on Windows and macOS in GitHub CI. Since it's in CI, it's very awkward to reproduce and leaves no way to debug, but I can try.
Switching from release to nightly Ray builds didn't change anything, but I didn't try older Ray versions.

We can try in CI again. I created PR mindsdb/dfsql#31.

Great, let's try.

@anmyachev
Collaborator

I was able to reproduce the hanging behavior locally (reproducibility is not 100%).

Environment:

conda env create -f environment-dev.yml
set MODIN_CPUS=4
set MODIN_ENGINE=ray
pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s

Simplified reproducer (to be added to the TestCsv class):

def test_hanging_behavior(self):
    # `pd` is modin.pandas (imported at the top of test_io.py); repeatedly
    # round-trip a small frame through to_csv/read_csv and scan it for nulls.
    for i in range(16):
        #print("to_csv")
        pd.DataFrame([1, 2, 3, 4]).to_csv("initial-data.csv", index=False)
        #print("read_csv")
        df = pd.read_csv("initial-data.csv")
        #print("isnull, all, axis=1")
        df.index[df.isnull().all(axis=1)].values.tolist()
        #print("isnull, all, axis=0")
        df.columns[df.isnull().all(axis=0)].values.tolist()

Logs:

...\modin>pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s
=============================================== test session starts ===============================================
platform win32 -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- ...\Miniconda3\envs\modin\python.exe
cachedir: .pytest_cache
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: ...\modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.3.0, xdist-2.3.0
collected 1 item

modin/pandas/test/test_io.py::TestCsv::test_just_test to_csv
read_csv
(pid=20528) Windows fatal exception: access violation
(pid=20528)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=15432) Windows fatal exception: access violation
(pid=15432)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23128) Windows fatal exception: access violation
(pid=23128)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=3412) Windows fatal exception: access violation
(pid=3412)
isnull, all, axis=0
to_csv
read_csv
(pid=20984) Windows fatal exception: access violation
(pid=20984)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
(pid=7096) Windows fatal exception: access violation
(pid=7096)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23464) Windows fatal exception: access violation
(pid=23464)
isnull, all, axis=0
to_csv
read_csv
(pid=12504) Windows fatal exception: access violation
(pid=12504)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=11500) Windows fatal exception: access violation
(pid=11500) 
isnull, all, axis=0
to_csv
read_csv
(pid=19948) Windows fatal exception: access violation
(pid=19948)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=20848) Windows fatal exception: access violation
(pid=20848) 
isnull, all, axis=0
to_csv
2021-08-25 20:03:18,393 WARNING worker.py:1189 -- The actor or task with ID c6cf2fddfe5e7c90b398e5da6a4450ee63f746a18d1ec44e cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining 
{4.000000/4.000000 CPU, 13.969839 GiB/13.969839 GiB memory, 13.969839 GiB/13.969839 GiB object_store_memory, 1.000000/1.000000 node:10.147.230.30}
. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources 
being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.

@rkooo567 did this clarify anything?

@rkooo567
Collaborator

You said this is only happening on Windows, right?

@rkooo567
Collaborator

If so, it should be related to the access violation error.

@rkooo567
Collaborator

Unfortunately I don't have a Windows machine to fix it. Is it possible to file an issue in our GitHub repo?

@anmyachev
Collaborator

anmyachev commented Aug 25, 2021

@rkooo567 yes, only on Windows.

Sure :) I will create the issue in the Ray repo.

@anmyachev anmyachev removed the Needs more information ❔ Issues that require more information from the reporter label Sep 9, 2021
@RehanSD
Collaborator

RehanSD commented Oct 12, 2022

@anmyachev @rkooo567 has this issue been resolved?

@RehanSD RehanSD added Needs more information ❔ Issues that require more information from the reporter P3 Very minor bugs, or features we can hopefully add some day. labels Oct 12, 2022
@mvashishtha mvashishtha added Testing 📈 Issues related to testing and removed bug 🦗 Something isn't working labels Oct 12, 2022
@anmyachev
Collaborator

@anmyachev @rkooo567 has this issue been resolved?

@RehanSD it should be
