pandas has no attribute 'compat': deserialization bug when running tasks very rarely (#7879)
Interesting, but you can see it show up in CI? Are those >40 workers on a single machine or on a cluster?
It is showing up in Modin's CI; the workers are on a single machine. We have only observed this in a large single-machine context.
I noticed that defining
To be clear, @ray-project/ray, the most consistent way we can reproduce this is in pytest with `pytest-xdist`.
I don't have any ideas on what this could be, but have you tried pinning all the processes to one core? E.g., with
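A minimal sketch of one way to do that pinning from Python, assuming `psutil` as the tool (the specific suggestion in the comment above is elided); `cpu_affinity` is supported on Linux and Windows:

```python
import psutil

proc = psutil.Process()      # the current process
proc.cpu_affinity([0])       # restrict scheduling of this process to CPU core 0
print(proc.cpu_affinity())   # -> [0]
```

For a whole-job experiment, each Ray worker process would need the same call.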
The bug looks like a race condition because it isn't triggered every time. Even with pytest-xdist it is possible to get lucky and see no test failures. I think the problem happens when pytest starts up workers: they are initialized and connect to the Ray pool, one or more workers may get unlucky, and after that the tests that run on them fail.
@suquark any ideas here?
Probably not related, but I came across this: tensorflow/tensorflow#26266 (comment). Can you see if that comment makes the error go away?
@robertnishihara Downgrading pandas is not an option, and it should not be a pandas version issue unless Ray is linking the wrong library for some workers and not others.
@devin-petersohn totally agree regarding downgrading pandas. This was more of a suggestion for trying to track down the issue, not a solution for the bug.
Bumping this; it's a serious blocker for us. Downgrading pandas did not solve the issue, and it broke other things.
I think a possible cause is a race condition between the import thread and the worker thread inside the worker. The worker tries to deserialize a pandas object, and somehow this runs in the middle of the import thread importing and deserializing a function, so two threads are deserializing and importing at the same time. A temporary workaround would be to run `import pandas` across all workers, but that won't be the long-term solution and might still hit this race condition. @robertnishihara is there any harm in making the import thread optional and turning it off with a flag?
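A minimal sketch of that temporary workaround, assuming a plain `ray.init()` session; the function name and the task count are illustrative, and with more tasks than workers most (though not provably all) workers end up importing pandas before real work arrives:

```python
import ray

ray.init()

@ray.remote
def preimport_pandas():
    import pandas  # side effect: pandas is now imported in this worker process
    return pandas.__version__

# Fire more tasks than there are workers so the import very likely
# happens everywhere before any real pandas object is deserialized.
ray.get([preimport_pandas.remote() for _ in range(32)])
```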
@simon-mo what do you mean by making it optional? Would the alternative be to synchronously fetch the definitions as we execute tasks?
@simon-mo great point! That (or something like that) could be the issue. Unfortunately, turning off the import thread would require a ton of changes (because that's how we ship remote function definitions around), though I suppose there is the "load_code_from_local" code path, which does something like what you're suggesting. @edoakes you had a branch that prototyped this and would potentially solve the problem that @simon-mo is mentioning, right? Was that in a working state and something that @devin-petersohn could try out?
@robertnishihara No, it's not in a fully working state.
I believe I get a similar error when running PBT hyperparameter optimization:
It happens after a long time; slowly but steadily, all the workers die.
@janblumenkamp can you give more information about the code you're running, the size of the cluster, the machine, etc.? We want to be able to track this one down! If you have something reproducible, that would obviously be ideal.
Sure! I use a custom environment based on this example, but I am not importing pandas in my application. I am not sure I can reproduce this error with the given example, since I don't own a MuJoCo license and I can't spend computation time on that right now :/
I do not want to publish my whole training script right now, but if you think it would be helpful in debugging this, I'd be happy to send it to you. It takes a few hours until the errors occur, though.
Here are also the logs of three failed trials:
Oh, this is nasty... One quick workaround is to make the pandas import a "soft import" (or get rid of this import in the first place). In your fork, maybe move this code out:
into a common module with the soft import (a sketch follows below).
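A minimal sketch of such a soft import, under the assumption that the moved code lives in one shared module; `to_dataframe` is a hypothetical helper showing how callers would guard the pandas-dependent path:

```python
# Soft import: pandas is only loaded if it is available, so a worker that
# never touches pandas objects never triggers the import at all.
try:
    import pandas
except ImportError:
    pandas = None

def to_dataframe(records):
    # Hypothetical helper: fail only when the pandas-dependent path is hit.
    if pandas is None:
        raise ImportError("pandas is required for to_dataframe()")
    return pandas.DataFrame(records)
```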
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
@devin-petersohn Is it resolved?
We have a workaround that @simon-mo implemented. I think the issue can be marked resolved from my perspective. There was also the issue @janblumenkamp ran into, which seems related.
I don't know if the error persists; I haven't used PBT in a long time. I suggest closing the issue for now, and if it occurs again we can always re-open :)
SGTM @janblumenkamp
I'm really late to the party, but things like "doing imports in a parallel thread breaks things" sound like incorrect locking to me.
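To illustrate that locking point, a minimal sketch (not Ray's actual fix): guard dynamic imports behind one process-wide lock so an import thread and a worker thread can never import concurrently. `locked_import` is a hypothetical helper:

```python
import importlib
import threading

_import_lock = threading.Lock()

def locked_import(name):
    # Serializing imports behind a single lock rules out the race between
    # the import thread and a worker thread described earlier in the thread.
    with _import_lock:
        return importlib.import_module(name)
```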
What is the problem?
`pandas has no attribute 'compat'` gets thrown on deserialization; see the traceback below.

Ray version and other system information (Python version, TensorFlow version, OS): 0.8.3 and later, including the latest wheels.
I tried including PR #7406, which was reverted in #7437. This did not fix the issue.
Reproduction (REQUIRED)
This will require external dependencies because it's a bug with serialization of the pandas library. It is not reproducible in every environment, though we have a way to reproduce it regularly in Modin's CI.
Some observations about the bug:
- Reproducible in pytest with `pytest-xdist` with >40 workers or so; below that it doesn't really fail. We start Ray with 1 CPU in these conditions, which again points to a concurrency issue in deserialization (a minimal sketch of this setup follows the list).
- Other errors we have seen include `ImportError: cannot import name 'DataFrame' from 'pandas.core.frame' (/path/to/python3.7/site-packages/pandas/core/frame.py)` and `int is not callable`.
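A minimal sketch of that reproduction setup, not Modin's actual CI suite; the test body and names are illustrative:

```python
# test_repro.py -- run under pytest-xdist, e.g.: pytest -n 48 test_repro.py
import pandas as pd
import pytest
import ray


@pytest.fixture(scope="module")
def ray_session():
    ray.init(num_cpus=1)  # Ray started with 1 CPU, as in the observation above
    yield
    ray.shutdown()


@ray.remote
def round_trip(df):
    # The worker has to deserialize the DataFrame here; this is where
    # "pandas has no attribute 'compat'" was reported to surface.
    return df


def test_round_trip(ray_session):
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert ray.get(round_trip.remote(df)).equals(df)
```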
cc @gshimansky