
pandas has no attribute 'compat' Deserialization bug when running tasks very rarely #7879

devin-petersohn opened this issue Apr 2, 2020 · 28 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical)

@devin-petersohn
Member

What is the problem?

pandas has no attribute 'compat' gets thrown on deserialization; see the traceback below.
Ray version and other system information (Python version, TensorFlow version, OS): 0.8.3 and later, including the latest wheels.

I tried to include this PR #7406, which was reverted in #7437. This did not fix the issue.

Reproduction (REQUIRED)

This requires external dependencies because it's a bug in the serialization of the pandas library. It is not reproducible in every environment, though we have a way to reproduce it regularly in Modin's CI. A rough sketch of the kind of workload involved follows; the log and traceback we observe are below it.
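For reference, here is a minimal sketch of the kind of workload involved (a hypothetical pytest file, not confirmed to trigger the race on its own; it would be run under pytest-xdist, e.g. python3 -m pytest -n=40 test_repro.py):

import pandas as pd
import ray

@ray.remote
def roundtrip(df):
    # the failure in this issue occurs while the worker deserializes the pandas argument
    return df.sum()

def test_pandas_roundtrip():
    ray.init(num_cpus=1, ignore_reinit_error=True)
    df = pd.DataFrame({"a": range(10)})
    assert ray.get(roundtrip.remote(df)) is not None
    ray.shutdown()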

2020-04-02 14:59:40,210 INFO resource_spec.py:212 -- Starting Ray with 200.0 GiB memory available for workers and up to 200.0 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-02 14:59:40,554 INFO services.py:1123 -- View the Ray dashboard at localhost:8265
2020-04-02 14:59:40,557 WARNING services.py:1455 -- WARNING: object_store_memory is not verified when plasma_directory is set.
Running on Modin on Ray with tmp directory /path and memory 214748364800
Traceback (most recent call last):
  File "/modin/modin/engines/ray/pandas_on_ray/frame/partition.py", line 46, in get
    return ray.get(self.oid)
  File "/path/to/python3.7/site-packages/ray/worker.py", line 1502, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::IDLE (pid=81352, ip=XXX.XXX.XXX.XXX)
  File "python/ray/_raylet.pyx", line 433, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 312, in ray._raylet.deserialize_args
ray.exceptions.RayTaskError: ray::IDLE (pid=81333, ip=XXX.XXX.XXX.XXX)
  File "python/ray/_raylet.pyx", line 433, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 306, in ray._raylet.deserialize_args
  File "/path/to/python3.7/site-packages/ray/serialization.py", line 323, in deserialize_objcts
    self._deserialize_object(data, metadata, object_id))
  File "/path/to/python3.7/site-packages/ray/serialization.py", line 271, in _deserialize_obect
    return self._deserialize_pickle5_data(data)
  File "/path/to/python3.7/site-packages/ray/serialization.py", line 260, in _deserialize_pikle5_data
    obj = pickle.loads(in_band, buffers=buffers)
  File "/path/to/python3.7/site-packages/pandas/__init__.py", line 193, in <module>
    if pandas.compat.PY37:
AttributeError: module 'pandas' has no attribute 'compat'

Some observations about the bug:

  • We only see it happen on machines with >100 cores, and it happens rarely, which suggests a concurrency issue that only triggers occasionally.
  • It happens when we run the Ray tests on CI in parallel using pytest-xdist with roughly >40 workers; below that it doesn't really fail. We start Ray with 1 CPU in these conditions, which again points to a concurrency issue during deserialization.
  • The error message is not always the same, but it is always an import error, for example: ImportError: cannot import name 'DataFrame' from 'pandas.core.frame' (/path/to/python3.7/site-packages/pandas/core/frame.py)
  • We even see some errors such as int is not callable.

cc @gshimansky

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@devin-petersohn added the bug label on Apr 2, 2020
@bllchmbrs
Contributor

Interesting, so you can see it show up in CI? Are those >40 workers on a single machine or on a cluster?

@devin-petersohn
Member Author

It is showing up in Modin's CI; the workers are on a single machine. We have only observed this in a large single-machine context.

@gshimansky

I noticed that setting MODIN_CPUS=1 greatly increases the chances of triggering this bug compared to the default mode. When running the pytest tests, even with -n=4 (or greater), failures appear very often:
MODIN_ENGINE=ray MODIN_CPUS=1 python3 -m pytest -n=4 modin/pandas/test/test_series.py modin/pandas/test/test_dataframe.py modin/pandas/test/test_concat.py modin/pandas/test/test_groupby.py modin/pandas/test/test_reshape.py modin/pandas/test/test_general.py

@devin-petersohn
Member Author

To be clear, @ray-project/ray, the most consistent way we can reproduce this is in pytest with pytest-xdist, but it also happens rarely in normal environments. Being able to reproduce it consistently is important for fixing the bug in large-machine environments.

@stephanie-wang
Contributor

I don't have any ideas on what this could be, but have you tried pinning all the processes to one core? E.g., with taskset on Linux: https://linux.die.net/man/1/taskset
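For example (a hypothetical invocation that pins the whole pytest run to core 0):

taskset -c 0 python3 -m pytest -n=4 modin/pandas/test/test_series.py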

@gshimansky

The bug looks like a race condition because it isn't triggered every time; even with pytest-xdist it is possible to get lucky and see no test failures. I think the problem happens when pytest starts up its workers: they are initialized and connect to the Ray pool, one or more of them may get unlucky, and after that the tests that run on those workers fail.
When I ran with taskset 0x00000001 to pin all processes to one CPU core, I didn't get any failures in 3 consecutive runs.

@edoakes
Contributor

edoakes commented Apr 3, 2020

@suquark any ideas here?

@robertnishihara
Collaborator

Probably not related, but I came across this: tensorflow/tensorflow#26266 (comment). Can you see if that comment makes the error go away?

@devin-petersohn
Member Author

@robertnishihara Downgrading pandas is not an option, and it should not be a pandas version issue unless Ray is linking the wrong library for some workers and not others.

@robertnishihara
Collaborator

robertnishihara commented Apr 5, 2020

@devin-petersohn totally agree regarding downgrading pandas. This was more of a suggestion for trying to track down the issue and not a solution for the bug.

@devin-petersohn
Member Author

Bumping this; it's a serious blocker for us. Downgrading pandas did not solve the issue, and it broke other things.

@simon-mo
Contributor

I think the likely cause is a race condition between the import thread and the task execution thread in the worker. The worker tries to deserialize a pandas object while the import thread is in the middle of importing and deserializing a function, so two threads end up deserializing and importing at the same time.

A temporary workaround would be to run import pandas across all workers up front, but that won't be a long-term solution and might still hit the race condition; a sketch of that warm-up idea is below.
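A minimal sketch of that warm-up idea (the warm_import task is hypothetical, and launching one task per CPU is not guaranteed to reach every idle worker):

import ray

ray.init()

@ray.remote
def warm_import():
    # finish importing pandas inside this worker before any real task
    # has to deserialize a pandas object
    import pandas
    return pandas.__version__

num_cpus = int(ray.cluster_resources().get("CPU", 1))
ray.get([warm_import.remote() for _ in range(num_cpus)])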

@robertnishihara is there any harm in making the import thread optional and turning it off with a flag?

@edoakes
Contributor

edoakes commented Apr 11, 2020

@simon-mo what do you mean by making it optional? Would the alternative be to synchronously fetch the definitions as we execute tasks?

@robertnishihara
Collaborator

@simon-mo great point! That (or something like that) could be the issue. Unfortunately, turning off the import thread would require a ton of changes (because that's how we ship remote function definitions around). Though I suppose there is the "load_code_from_local" code path, which does something like what you're suggesting.

@edoakes you had a branch that prototyped this and would potentially solve the problem that @simon-mo is mentioning, right? Was that in a working state and something that @devin-petersohn could try out?

@edoakes
Contributor

edoakes commented Apr 12, 2020

@robertnishihara no, it's not in a fully working state

@janblumenkamp
Contributor

janblumenkamp commented May 7, 2020

I believe I get a similar error when running PBT hyperparameter optimization:

Failure # 1 (occurred at 2020-05-07_15-33-12)          
Traceback (most recent call last):                                              
  File "[...]/ray/python/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)                        
  File "[...]/ray/python/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)                               
  File "[...]/ray/python/ray/worker.py", line 1515, in get           
    raise value.as_instanceof_cause()                                                               
ray.exceptions.RayTaskError(AttributeError): ray::MultiPPO.train() (pid=51742, ip=128.232.69.20)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task                                       
  File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task.function_executor
  File "[...]/ray/python/ray/rllib/agents/trainer.py", line 495, in train                       
    raise e                                   
  File "[...]/ray/python/ray/rllib/agents/trainer.py", line 484, in train                                  
    result = Trainable.train(self)       
  File "[...]/ray/python/ray/tune/trainable.py", line 261, in train                                     
    result = self._train()                  
  File "[...]/ray/python/ray/rllib/agents/trainer_template.py", line 151, in _train                 
    fetches = self.optimizer.step()            
  File "[...]/ray/python/ray/rllib/optimizers/sync_samples_optimizer.py", line 59, in step
    for e in self.workers.remote_workers()
  File "[...]/ray/python/ray/rllib/utils/memory.py", line 32, in ray_get_and_free
    return ray.get(object_ids)
ray.exceptions.RayTaskError(AttributeError): ray::TemporaryActor.sample() (pid=51740, ip=128.232.69.20)
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 427, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 444, in ray._raylet.execute_task
  File "[...]/ray/python/ray/serialization.py", line 317, in deserialize_objects
    self._deserialize_object(data, metadata, object_id))
  File "[...]/ray/python/ray/serialization.py", line 257, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata)
  File "[...]/ray/python/ray/serialization.py", line 238, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "[...]/ray/python/ray/serialization.py", line 226, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "[...]/gnn-rl-coverage/world_super_2_sparse.py", line 11, in <module>
    from ray.rllib.env.multi_agent_env import MultiAgentEnv
  File "[...]/ray/python/ray/rllib/__init__.py", line 5, in <module>
    from ray.rllib.env.base_env import BaseEnv
  File "[...]/ray/python/ray/rllib/env/__init__.py", line 8, in <module>
    from ray.rllib.env.policy_client import PolicyClient
  File "[...]/ray/python/ray/rllib/env/policy_client.py", line 12, in <module>
    from ray.rllib.evaluation.rollout_worker import RolloutWorker
  File "[...]/ray/python/ray/rllib/evaluation/__init__.py", line 1, in <module>
    from ray.rllib.evaluation.episode import MultiAgentEpisode
  File "[...]/ray/python/ray/rllib/evaluation/episode.py", line 8, in <module>
    from ray.rllib.utils.space_utils import flatten_to_single_ndarray
  File "[...]/ray/python/ray/rllib/utils/__init__.py", line 17, in <module>
    from ray.tune.utils import merge_dicts, deep_update
  File "[...]/ray/python/ray/tune/__init__.py", line 2, in <module>
    from ray.tune.tune import run_experiments, run
  File "[...]/ray/python/ray/tune/tune.py", line 5, in <module>
    from ray.tune.analysis import ExperimentAnalysis
  File "[...]/ray/python/ray/tune/analysis/__init__.py", line 1, in <module>
    from ray.tune.analysis.experiment_analysis import ExperimentAnalysis, Analysis
  File "[...]/ray/python/ray/tune/analysis/experiment_analysis.py", line 6, in <module>
    import pandas as pd
  File "[...]/venv_ray_dev/lib/python3.6/site-packages/pandas/__init__.py", line 54, in <module>
    from pandas.core.api import (
  File "[...]/venv_ray_dev/lib/python3.6/site-packages/pandas/core/api.py", line 15, in <module>
    from pandas.core.arrays import Categorical
  File "[...]/venv_ray_dev/lib/python3.6/site-packages/pandas/core/arrays/__init__.py", line 1, in <module>
    from pandas.core.arrays.base import (
  File "[...]/venv_ray_dev/lib/python3.6/site-packages/pandas/core/arrays/base.py", line 29, in <module>
    from pandas.core.sorting import nargsort
  File "[...]/venv_ray_dev/lib/python3.6/site-packages/pandas/core/sorting.py", line 15, in <module>
    import pandas.core.algorithms as algorithms
AttributeError: module 'pandas' has no attribute 'core'

This happens after a long time; slowly but steadily, all the workers die.

@bllchmbrs
Contributor

bllchmbrs commented May 7, 2020

@janblumenkamp can you give more information about the code you're running, the size of the cluster, the machine, etc.? We want to be able to track this one down!

If you have something reproducible, that'd obviously be ideal.

@janblumenkamp
Contributor

Sure! I use a custom environment based on this example, but I am not importing pandas in my application. I am not sure whether I can reproduce this error with the given example, since I don't own a MuJoCo licence and I can't spend computation time on that right now :/
If it helps, this is my base script:

import numpy as np
import gym
from gym import spaces
from world import World
from world_super import World as SuperWorld
from world_super_2 import World as SuperWorld2
from world_super_2_sparse import World as SuperWorld2Sparse
import matplotlib.pyplot as plt
from functools import partial
import ray
from ray import tune
from ray.rllib.utils import try_import_torch
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env
from ray.tune import run, sample_from
from ray.tune.schedulers import PopulationBasedTraining
import random
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.misc import normc_initializer, valid_padding, SlimConv2d, SlimFC
from ray.rllib.utils.annotations import override
from ray.rllib.utils import try_import_torch
from torchsummary import summary
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

from multiagent_ppo_trainer import MultiPPOTrainer

from model_single_torch_graph import AdaptedVisionNetwork as AdaptedVisionNetworkTorchSingleGraph

def run_PPO():
    def explore(config):
        if config["num_sgd_iter"] < 1:
            config["num_sgd_iter"] = 1
        if config["gamma"] < 0.95:
            config["gamma"] = 0.95
        if config["gamma"] > 0.9999:
            config["gamma"] = 0.9999
        return config

    pbt = PopulationBasedTraining(
        time_attr="time_total_s",
        metric="episode_reward_mean",
        mode="max",
        perturbation_interval=120,
        resample_probability=0.25,
        # Specifies the mutations of these hyperparams
        hyperparam_mutations={
            "lr": lambda: random.uniform(1e-6, 2e-4),
            "clip_param": lambda: random.uniform(0.1, 0.4),
            "sgd_minibatch_size": lambda: random.randint(512, 2048),
            "rollout_fragment_length": lambda: random.randint(50, 150),
            "train_batch_size": lambda: random.randint(5000, 10000),
            "num_sgd_iter": lambda: random.randint(1, 14),
            "vf_loss_coeff": lambda: random.uniform(1e-6, 1.0),
            "gamma": lambda: random.uniform(0.95, 0.999)
        },
        custom_explore_fn=explore)

    env_config_train = {
        'world_shape': (16, 16),
        'state_size': 32,
        'collapse_state': False,
        'termination_no_new_coverage': 30,
        'max_episode_len': -1,
        'n_agents': 8,
        'n_obstacles': 30,
        'obstacle_max_size': 3,
        'communication_range': 6.0,
        'ensure_connectivity': False,
        'agents': {
            'coverage_radius': 1,
            'visibility_distance': 5,
            'map_update_radius': 100
        }
    }
    env_config_eval = env_config_train.copy()
    env_config_eval.update({
        'termination_no_new_coverage': -1,
        'max_episode_len': int((2*env_config_eval['world_shape'][0]*env_config_eval['world_shape'][1])/env_config_eval['n_agents']),
    })
    
    tune.run(
        MultiPPOTrainer,
        name="pbt_coverage_test_sparse_k3",
        scheduler=pbt,
        num_samples=30,
        checkpoint_freq=5,
        keep_checkpoints_num=1,
        config={
            "use_pytorch": True,
            "env": "superworld2_sparse",
            "lambda": 0.95,
            "kl_coeff": 0.5,
            "clip_rewards": True,
            "clip_param": 0.2,
            "vf_clip_param": 10.0,
            "vf_share_layers": True,
            "vf_loss_coeff": 1e-4,
            "entropy_coeff": 0.01,
            "train_batch_size": 7500,
            "rollout_fragment_length": 100,
            "sgd_minibatch_size": 1000,
            "num_sgd_iter": 8,
            "num_workers": 12,
            "num_envs_per_worker": 8,
            "lr": 1e-4,
            "gamma": 0.99,
            "batch_mode": "complete_episodes",
            "observation_filter": "NoFilter",
            "num_gpus": 1,
            "model": {
                "custom_model": "vis_torch_single_graph",
                "custom_options": {
                    "graph_layers": 1,
                    "graph_features": 256,
                    "graph_tabs": 3,
                    "graph_edge_features": 1,
                    "cnn_filters": [[16, [4, 4], 4], [32, [4, 4], 2], [64, [3, 3], 1]],
                    "cnn_compression": 256,
                }
            },
            "env_config": env_config_train,
        })
        
if __name__ == '__main__':
    ray.init(num_cpus=110, num_gpus=6)

    register_env("superworld2_sparse", lambda config: SuperWorld2Sparse(config))
    
    ModelCatalog.register_custom_model("vis_torch_single_graph", AdaptedVisionNetworkTorchSingleGraph)

    run_PPO()

I do not want to publish my whole training script right now, but if you think it would be helpful for debugging this, I'd be happy to send it to you. It takes a few hours until the errors occur, though.
Other than that, I compiled Ray myself from commit 2a2c157 on branch releases/0.8.5 (that's why the path for pandas is different from the one for Ray in the above stack trace) with Python 3.6.9.
I am running everything on an Ubuntu 18.04 VM with an Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (32 cores) and a Tesla P100.
Let me know if you need any more information!

@janblumenkamp
Contributor

janblumenkamp commented May 7, 2020

Here are also the logs of three failed trials:
failed.zip
This is not the first time I have had this issue with PBT, so if you have trouble reproducing it, perhaps it's worth looking into that specifically?

@richardliaw
Contributor

richardliaw commented May 7, 2020

Oh, this is nasty... one quick workaround is to make the pandas import a "soft import" (or get rid of this import in the first place).

In your fork, maybe move this code out:

from ray.tune.utils import merge_dicts, deep_update

into a common ray.util file?
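For illustration, a rough sketch of what such a "soft import" could look like (hypothetical code, not the actual Tune module; the _get_pandas helper is made up):

# instead of a top-level "import pandas as pd" in the analysis module
_pd = None

def _get_pandas():
    # import pandas lazily, only when an analysis feature actually needs it,
    # so that importing ray.tune from a deserialization path no longer pulls in pandas
    global _pd
    if _pd is None:
        import pandas
        _pd = pandas
    return _pd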


@stale bot added the stale label on Nov 12, 2020
@rkooo567
Contributor

@devin-petersohn Is it resolved?

@stale bot removed the stale label on Nov 13, 2020
@devin-petersohn
Member Author

We have a workaround that @simon-mo implemented.

I think the issue can be marked resolved from my perspective. There was also the issue @janblumenkamp ran into, which seems related.

@janblumenkamp
Contributor

I don't know whether the error persists; I haven't used PBT in a long time. I suggest closing the issue for now; if it occurs again, we can always re-open :)

@devin-petersohn
Member Author

SGTM @janblumenkamp

@vnlitvinov
Contributor

I'm really late to the party, but things like "doing an import in a parallel thread breaks" sound like incorrect locking to me.
Python has an import lock built in for exactly that purpose; is Ray somehow bypassing it somewhere?

@rkooo567
Contributor

rkooo567 commented May 7, 2021

modin-project/modin#2999

@rkooo567 reopened this on May 7, 2021
@rkooo567 added this to the Core Bugs milestone on May 7, 2021
@rkooo567 added the P1 (Issue that should be fixed within a few weeks) label on May 7, 2021