[dask] [python] lightgbm.dask hangs indefinitely after an error #4771
Comments
Thanks for the example @jameslamb. I couldn't reproduce the hang on the workers test; I didn't realize it was the consecutive runs that failed. I just ran this locally and looked at the worker logs and see this:
I'd say that's strange, to say the least.
Sure! I only had the idea of running the two consecutive tests after noticing @gjoseph92's writeup in dask/distributed#5480.
Oh, I just realized that `test_errors` raises that exception. I see now that it's not strange at all to see that in the worker logs, haha.
Adding a …
Yeah! Since we have evidence that this is a bug in `distributed`, I'd support a PR that adds that.
@jameslamb do you have any logs from the cluster you could share? Specifically, I'm interested in evidence of dask/distributed#5482, or some other error like it. But if you're reproducibly showing a deadlock with 2021.10.0 but not 2021.09.0, that's very helpful information.
Let me try running the repro from this issue's description with more verbose logging! To be honest, I just jumped to conclusions when I saw the combination of these factors.
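On the verbose-logging idea, here is a minimal sketch (not taken from this thread) of one way to surface more detailed scheduler/worker output; it assumes the standard `logging` module and `LocalCluster`'s `silence_logs` argument, and exact behavior may vary by `distributed` version.

```python
# Sketch: raise distributed's log level before reproducing the hang.
# Assumes LocalCluster's `silence_logs` argument; details may vary by version.
import logging

from dask.distributed import Client, LocalCluster

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("distributed").setLevel(logging.DEBUG)

if __name__ == "__main__":
    # Pass DEBUG so worker/scheduler log records are forwarded instead of silenced.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2,
                           silence_logs=logging.DEBUG)
    with Client(cluster) as client:
        # ... run the reproducer here ...
        print(client.get_worker_logs())  # logs captured on each worker
```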
@gjoseph92 the following reproduces what we're seeing in our tests:

```python
from dask.datasets import timeseries
from dask.distributed import Client, LocalCluster


def raise_exc(part):
    raise Exception


if __name__ == '__main__':
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)

    with Client(cluster) as client:
        ts = timeseries()
        ts = ts.map_partitions(raise_exc, meta={})
        try:
            ts.compute()
        except:
            pass

    with Client(cluster) as client:
        X = timeseries().compute()
        print(X.shape)
```
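With `dask`/`distributed` 2021.10.0 this script gets stuck in the second `Client` block; after downgrading to 2021.9.0 it completes and prints the shape (see the description below).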
@jmoralez thanks for the excellent reproducer! Turns out this was a separate issue: dask/distributed#5497. Your code is very helpful!
Description
Shortly after `dask=2021.10.0` and `distributed=2021.10.0` were uploaded to the Anaconda default channels (used in LightGBM's CI), LightGBM's CI job started failing with timeouts (#4769). Logs suggested that one of the unit tests on `lightgbm.dask` was hanging indefinitely. Specifically, the test after `test_errors`.
Reproducible example
Run inside a container with `conda` available.
Using `dask` and `distributed` v2021.10.0, after a run which produces an error on the cluster, the next test gets stuck indefinitely. Downgrading to `dask`/`distributed` 2021.9.0 fixes the issue.
Environment info
See commands in reproducible example.
Additional Comments
I suspect that `lightgbm`'s tests are being affected by the bug documented here: dask/distributed#5480.
I think this because timeout errors happen immediately after a test that produces an error, and as of #4159, all of LightGBM's Dask unit tests share the same cluster.