cpu (ubuntu-18.04, 3.9, oldest, stable) fails with segmentation fault #11184

Closed
akihironitta opened this issue Dec 20, 2021 · 10 comments · Fixed by #11217
Labels
bug Something isn't working ci Continuous Integration

Comments

@akihironitta
Contributor

akihironitta commented Dec 20, 2021

🐛 Bug

Observed that the cpu (ubuntu-18.04, 3.9, oldest, stable) job quite often fails with a segmentation fault after the actual tests have passed, as shown below:

=== 2218 passed, 532 skipped, 7 xfailed, 1470 warnings in 634.06s (0:10:34) ====
/home/runner/work/_temp/50029d2b-e6ee-470c-bd34-b5acf8834c42.sh: line 2:  3747 Segmentation fault      (core dumped) coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=50 --junitxml=junit/test-results-Linux-py3.9-oldest-stable.xml

To Reproduce

Expected behavior

No segmentation fault 🍏

Environment

  • PyTorch Lightning Version (e.g., 1.5.0):
  • PyTorch Version (e.g., 1.10):
  • Python version (e.g., 3.9):
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @carmocca @akihironitta @Borda

@akihironitta akihironitta added bug Something isn't working help wanted Open to be worked on ci Continuous Integration labels Dec 20, 2021
@carmocca
Contributor

Hi @akihironitta, thanks for writing this.

This has been failing for a while now. I don't know what the cause is, but I'd say it's either coverage or the destruction of a resource when testing finishes.

So I would try removing coverage to check or removing tests to see if it still segfaults.
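The comparison suggested above could be scripted roughly as follows. This is only a sketch: `classify_returncode` and `run_variant` are hypothetical helper names, and the pytest arguments are copied from the failing command in the log. It assumes a POSIX runner, where `subprocess` reports death by signal as a negative return code and a shell reports it as 128 plus the signal number.

```python
import signal
import subprocess
import sys


def classify_returncode(returncode: int) -> str:
    """Map a process return code to 'ok', 'segfault', or 'failed'."""
    segv = int(signal.SIGSEGV)
    # subprocess gives -SIGSEGV; a wrapping shell gives 128 + SIGSEGV.
    if returncode in (-segv, 128 + segv):
        return "segfault"
    return "ok" if returncode == 0 else "failed"


def run_variant(cmd: list[str]) -> str:
    """Run one command to completion and classify how it exited."""
    return classify_returncode(subprocess.run(cmd).returncode)


if __name__ == "__main__":
    # Same test selection as the CI job, with and without coverage.
    pytest_args = ["-m", "pytest", "pytorch_lightning", "tests", "-v"]
    variants = {
        "with coverage": [
            sys.executable, "-m", "coverage", "run",
            "--source", "pytorch_lightning", *pytest_args,
        ],
        "without coverage": [sys.executable, *pytest_args],
    }
    for name, cmd in variants.items():
        print(f"{name}: {run_variant(cmd)}")
```

Since the job is flaky, a single run of each variant would not prove much; repeating each one several times would give a more reliable signal.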

@carmocca
Contributor

Just saw some PRs with the job passing, so it is also flaky 😵

@akihironitta akihironitta self-assigned this Dec 26, 2021
@daniellepintz
Contributor

Is there any progress on resolving this issue? Can we just remove this job? I believe it is causing more harm than good by making our CI red :/

@carmocca
Contributor

carmocca commented Jan 6, 2022

I'll leave the decision to @akihironitta as he's the one trying to save us

@akihironitta
Contributor Author

Hi @daniellepintz, sorry for the delay. I'm now working on it with Carlos's help but not sure if it'll be resolved in a reasonably short period of time.

> Can we just remove this job?

IMHO, I'd keep it because it segfaults after the actual testing part of the job finishes, so we can still see the test results in the log (although the job itself is marked red). If we just remove the job, no one will be able to catch bugs or unexpected behaviour specific to this environment.

@Borda
Member

Borda commented Jan 6, 2022

btw, I would not try to install py3.9 on ubuntu 18.04, rather try ubuntu 20.04

@akihironitta
Contributor Author

@Borda I thought about that, too, and I'll definitely give it a try with a newer version of the OS, but why not install py3.9 on ubuntu18.04? Is there any known issue or something?

@Borda
Member

Borda commented Jan 6, 2022

> @Borda I thought about that, too, and I'll definitely give it a try with a newer version of the OS, but why not install py3.9 on ubuntu18.04? Is there any known issue or something?

I do not know of any, but why waste time on a very unlikely configuration 🐰

@guotuofeng
Contributor

Could this be a coverage bug on Ubuntu with python 3.9.9, like nedbat/coveragepy#1294? I tried locally on Ubuntu 18.04 with python 3.9.7, and coverage works fine.

Could we try other python versions, like 3.9.7?

@akihironitta
Contributor Author

Hi @guotuofeng, thanks for your support! It seems like the problem is pytest-dev/pytest#8841 in our case, as commented by @carmocca in the linked PR: #11217 (comment)
