[Dask] Expected error randomly not raised in Dask test #4099

StrikerRUS · 2021-03-23T12:23:33Z

2021-03-23T11:32:00.3655481Z         error_msg = "has multiple Dask worker processes running on it"
2021-03-23T11:32:00.3656048Z         with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
2021-03-23T11:32:00.3656573Z >           dask_model3.fit(dX, dy, group=dg)
2021-03-23T11:32:00.3657297Z E           Failed: DID NOT RAISE <class 'lightgbm.basic.LightGBMError'>

LightGBM/tests/python_package_test/test_dask.py

Lines 1058 to 1060 in 77d54b3

    
           error_msg = "has multiple Dask worker processes running on it" 
        
           with pytest.raises(lgb.basic.LightGBMError, match=error_msg): 
        
               dask_model3.fit(dX, dy, group=dg)

Refer to #4068 (comment) and #4068 (comment) for full logs.

The text was updated successfully, but these errors were encountered:

jameslamb · 2021-03-23T15:17:44Z

hmmm interesting. In the logs mentioned in those comments, it looks like this is a different root cause from what was fixed in #4071. I think that' what's happening here is that the data is still all ending up on one worker somehow. This is possibly the same underlying problem as #4074, actually.

Error code 104 means "connection reset by peer" (link), which could occur in distributed training if one of the Dask workers dies and is restarted.

Similarly here, if one of the workers died before training started, then it's possible that Dask would have moved the training data back to the other worker, and that then dask_model3.fit() was fitting on only a single worker.

It's possible that one of the workers died because the two previous .fit() calls before this one left behind objects that caused it to run out of memory or caused other errors on the workers.

There's no reason that this test has to be in the same test case as the other network params tests. I just did that to try to minimize the total runtime of tests (number of times we call _create_data()). I think that moving this to its own standalone test case might reduce the risk of this specific failure.

StrikerRUS · 2021-03-23T16:39:09Z

It's possible that one of the workers died because the two previous .fit() calls before this one left behind objects that caused it to run out of memory or caused other errors on the workers.

client = <Client: 'tcp://127.0.0.1:36289' processes=2 threads=2, memory=16.70 GB>

I believe 16Gb should be enough for toy datasets we use for tests...

I think that moving this to its own standalone test case might reduce the risk of this specific failure.

Yeah, sure. But it doesn't fix underlying issue unfortunately.

I remember I asked this question before but didn't get a clear answer. Has Dask something like "global option for reproducibility"? Similarly to deterministic param in LightGBM. We use Dockers a lot to make out testing environments deterministic and don't change hardware settings for agents, so I'm quite curious why with all seeds fixed and same machine configs we get very different results and frequent failures with Dask.

jameslamb · 2021-03-23T18:01:12Z

so I'm quite curious why with all seeds fixed and same machine configs we get very different results and frequent failures with Dask

A couple points on this:

"with Dask" is a bit unfair. The Dask tests are the only automated tests that this project has had on distributed LightGBM training and I don't believe that all of the flakiness is just due to code in lightgbm.dask or its tests. Without dedicated tests on distributed training without Dask (Write tests for parallel code #3841), it's hard to know which problems are "Dask problems" and which are "LightGBM problems" (see [dask] Early stopping #3952 (review) for an example of what I think might be a "LightGBM problem").
Each test case for the Dask tests involves 4 processes: the main pytest process, a dask-scheduler process, and two dask-worker processes. These processes all compete with each other and with other processes running in the CI environment (like any Azure processes that are running in the background to communicate logs and statuses) for CPU time and memory, and so it's possible for them to randomly and occasionally fail for reasons like "this task took a little too long and triggered a timeout".

Has Dask something like "global option for reproducibility"? Similarly to deterministic param in LightGBM

This would be incredibly difficult for Dask or any distributed system to achieve. If you want to write code of the form "move this exact data to this exact worker and then run this exact task on this exact worker..." you can do it with Dask's low-level APIs, but at that point you're not really getting much benefit from Dask because you are doing all the work that its higher-level APIs are intended to abstract away.

Once you get into coordinating processes and not just threads within one process, it becomes much more difficult to predict the exact behavior of the system. LightGBM is able to offer a deterministic=True parameter because it completely controls the running code within a single process.

StrikerRUS · 2021-03-24T12:53:01Z

This would be incredibly difficult for Dask or any distributed system to achieve.

Absolutely agree with this for the case of "real world". But I thought with only two test workers and deterministic data partitioning algorithm (looks it is wrong for Dask) given the same dataset we don't have a lot of variants.

jameslamb · 2021-05-11T22:34:22Z

I haven't seen this one at all in the last month. I hope that #4132 was the fix for it.

I think this can be closed.

github-actions · 2023-08-23T14:38:17Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

StrikerRUS added bug dask labels Mar 23, 2021

StrikerRUS changed the title ~~Expected error randomly not raised in Dask test~~ [Dask] Expected error randomly not raised in Dask test Mar 23, 2021

StrikerRUS mentioned this issue Mar 23, 2021

[python-package] add type hints on Booster.set_network() #4068

Merged

This was referenced Mar 29, 2021

[dask] run one training task on each worker #4132

Merged

[dask] Random failures in Dask tests during teardown #3829

Closed

jameslamb closed this as completed May 11, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Dask] Expected error randomly not raised in Dask test #4099

[Dask] Expected error randomly not raised in Dask test #4099

StrikerRUS commented Mar 23, 2021

jameslamb commented Mar 23, 2021

StrikerRUS commented Mar 23, 2021

jameslamb commented Mar 23, 2021

StrikerRUS commented Mar 24, 2021

jameslamb commented May 11, 2021

github-actions bot commented Aug 23, 2023

[Dask] Expected error randomly not raised in Dask test #4099

[Dask] Expected error randomly not raised in Dask test #4099

Comments

StrikerRUS commented Mar 23, 2021

jameslamb commented Mar 23, 2021

StrikerRUS commented Mar 23, 2021

jameslamb commented Mar 23, 2021

StrikerRUS commented Mar 24, 2021

jameslamb commented May 11, 2021

github-actions bot commented Aug 23, 2023