[ci] Test on Azure Pipeline Timeouts on Linux #4769

Closed
shiyu1994 opened this issue Nov 4, 2021 · 10 comments · Fixed by #4770

Comments

@shiyu1994
Collaborator

Description

Today we found that the CI tests on Azure Pipelines time out on Linux machines. See, e.g.:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=11448&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf

@StrikerRUS
Collaborator

The same is true for the CUDA builds (also Linux) on the GitHub Actions self-hosted machine. For example, https://github.com/microsoft/LightGBM/runs/4101143551?check_suite_focus=true.

@StrikerRUS
Collaborator

StrikerRUS commented Nov 4, 2021

According to the logs with timestamps enabled, the cause is one of the Dask tests. We need to figure out which one...

2021-11-04T02:53:15.6285993Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-11-04T02:54:14.9709387Z ........................................................................ [ 26%]
2021-11-04T02:55:34.2062214Z ........................................................................ [ 37%]
2021-11-04T02:56:58.5557598Z ......s...............s...............s...............s................. [ 48%]
2021-11-04T03:48:21.2453087Z ...........................
2021-11-04T03:48:21.2583275Z ##[error]The operation was canceled.

or

2021-11-04T06:31:06.8378487Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-11-04T06:31:53.2817276Z ........................................................................ [ 26%]
2021-11-04T06:32:55.3978885Z ........................................................................ [ 37%]
2021-11-04T06:34:18.2781954Z ......s...............s...............s...............s................. [ 48%]
2021-11-04T07:26:37.3418892Z ##[error]The operation was canceled.
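The stall in the excerpts above can be located mechanically by comparing the timestamps of consecutive log lines. The following is only a minimal sketch, assuming every log line starts with a timestamp in the format shown above; the helper names `parse_ts` and `largest_gap` are made up for illustration and are not part of the LightGBM CI scripts.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading timestamp of a log line (whole-second precision)."""
    stamp = line.split(" ", 1)[0]
    # The logs carry 7 fractional digits, but datetime accepts at most 6,
    # so drop the fractional part together with the trailing "Z".
    return datetime.strptime(stamp.split(".")[0], "%Y-%m-%dT%H:%M:%S")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause between lines."""
    stamped = [(parse_ts(line), line) for line in lines if line.strip()]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


# First log excerpt from this comment, copied verbatim.
log = """\
2021-11-04T02:53:15.6285993Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-11-04T02:54:14.9709387Z ........................................................................ [ 26%]
2021-11-04T02:55:34.2062214Z ........................................................................ [ 37%]
2021-11-04T02:56:58.5557598Z ......s...............s...............s...............s................. [ 48%]
2021-11-04T03:48:21.2453087Z ...........................
2021-11-04T03:48:21.2583275Z ##[error]The operation was canceled.
""".splitlines()

gap, before, after = largest_gap(log)
print(gap)  # 0:51:23
```

On this excerpt the longest pause falls between the 48% line and the next batch of results, which matches the observation that one Dask test hangs until the job is canceled.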

@StrikerRUS
Collaborator

I made pytest more verbose so that it outputs one test result per line, which lets us see the timing of each test separately:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=11455&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf

@StrikerRUS
Collaborator

OK, it seems that the test causing the CI jobs to time out is the one after test_dask.py::test_errors. According to the

pytest ./tests -vv --collect-only

command (I used it because lightgbm-dask is not working on my local Windows machine),
it should be

<Function test_training_succeeds_even_if_some_workers_do_not_have_any_data[array-binary-classification]>
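The lookup described above (find the last test that reported a result, then take the next one in collection order) can be sketched as follows. This is a hedged illustration only: `base_name` and `next_collected` are hypothetical helpers, and the `collected` list is a hand-written stand-in for the real collection output, containing just the two test names mentioned in this comment.

```python
def base_name(node_id):
    """Strip the file part and any "[params]" suffix from a pytest node ID."""
    return node_id.split("::")[-1].split("[")[0]


def next_collected(node_ids, test_name):
    """Return the node ID collected right after the last ID named `test_name`."""
    matches = [i for i, node in enumerate(node_ids) if base_name(node) == test_name]
    if not matches or matches[-1] + 1 >= len(node_ids):
        return None  # test not found, or it was the last one collected
    return node_ids[matches[-1] + 1]


# Stand-in for the collection order; the real list is much longer.
collected = [
    "test_dask.py::test_errors",
    "test_dask.py::test_training_succeeds_even_if_some_workers_do_not_have_any_data"
    "[array-binary-classification]",
]

suspect = next_collected(collected, "test_errors")
print(suspect)
```

This relies on the tests running in collection order, which is an assumption; a plugin that reorders or parallelizes tests would break it.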

cc @jameslamb @jmoralez

@jameslamb
Collaborator

I'm really confused.

Looking at the logs from the job @shiyu1994 linked, I see dask and distributed 2021.10.0 getting installed.

    dask-2021.10.0             |     pyhd3eb1b0_0          19 KB
    dask-core-2021.10.0        |     pyhd3eb1b0_0         718 KB
    dbus-1.13.18               |       hb2f20db_0         504 KB
    distributed-2021.10.0      |   py39h06a4308_0         994 KB

That job was 11 hours ago...but according to Anaconda's website, dask-core and distributed 2021.10.0 were only uploaded around 40 minutes ago.

  • v2021.10.0 - 40 minutes ago
  • v2021.9.1 - 14 days ago

https://anaconda.org/anaconda/dask-core/files

https://anaconda.org/anaconda/distributed/files
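To check which Dask-related versions a CI job actually installed, the conda download table from the logs can be parsed directly. The snippet below is a rough sketch that assumes the `name-version | build size` layout seen in the excerpt above; `parse_conda_line` is a hypothetical helper, not a conda API.

```python
def parse_conda_line(line):
    """Split one conda download-table line into (name, version, build, size)."""
    spec, rest = (part.strip() for part in line.split("|"))
    # The version is the text after the last "-" ("dask-core-2021.10.0").
    name, version = spec.rsplit("-", 1)
    build, size = rest.split(None, 1)
    return name, version, build, size


# Lines copied from the job log quoted in this comment.
log_lines = [
    "    dask-2021.10.0             |     pyhd3eb1b0_0          19 KB",
    "    dask-core-2021.10.0        |     pyhd3eb1b0_0         718 KB",
    "    dbus-1.13.18               |       hb2f20db_0         504 KB",
    "    distributed-2021.10.0      |   py39h06a4308_0         994 KB",
]

watched = {"dask", "dask-core", "distributed"}
installed = {}
for line in log_lines:
    name, version, build, size = parse_conda_line(line)
    if name in watched:
        installed[name] = (version, build)

for name, (version, build) in sorted(installed.items()):
    print(f"{name}: {version} ({build})")
```

Recording the build string next to the version is useful here, because a re-uploaded package could keep its version number while the build changes.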

@StrikerRUS
Collaborator

@jameslamb

but according to Anaconda's website, dask-core and distributed 2021.10.0 were only uploaded around 40 minutes ago.

Maybe it was a resubmission with critical bug fixes?

@jameslamb
Collaborator

I would be really unhappy to learn that Anaconda's approach to pushing out bug fixes was to overwrite a previously-published package instead of releasing it with a new version. But I guess that is possible.

The discussion in dask/community#160 and ContinuumIO/anaconda-issues#12447 did hint at Anaconda maybe treating dask / distributed specially.

@jameslamb
Collaborator

For what it's worth, none of the repos at https://github.com/anaconda or https://github.com/ContinuumIO have received a commit in the last 18 hours, so I guess if Anaconda is making changes they aren't on GitHub, or they're in some other place I don't know about.

@jameslamb
Collaborator

I was able to reproduce this locally. Documented the issue at #4771.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot removed the blocking label Aug 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
3 participants