fix dataloader exit terminate error #34501

Merged
merged 4 commits into from
Sep 16, 2021

@heavengate (Contributor) commented Jul 30, 2021

PR types

Bug fixes

PR changes

APIs

Describe

Fix the DataLoader error on exit: terminate called without an active exception.

In distributed launch mode, if a for loop breaks after iterating over only the first few batches and the program exits immediately (with no model layers processing), distributed launch calls terminate() to kill the main process on each device while the reader thread is still iterating to fill the blocking queue cache. This can cause the thread error terminate called without an active exception: terminate is a strong signal, so __del__ of DataLoader may not be called. To handle this, we add a global link to the last DataLoader instance so that __del__ can be called to clean up its resources.
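
The idea can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `PrefetchLoader`, `_last_loader`, and `cleanup_last_loader` are hypothetical names, the loader is a toy that prefetches into a bounded `queue.Queue`, and using `atexit` as the cleanup hook is an assumption made for the example. Note that `atexit` handlers do not run when a process is killed by a signal that Python does not handle, so a real fix for the terminate case may need more than this.

```python
import atexit
import queue
import threading

# Hypothetical module-level link to the most recently created loader.
_last_loader = None


class PrefetchLoader:
    """Toy loader: a background thread fills a bounded queue."""

    def __init__(self, data, capacity=2):
        global _last_loader
        self._data = list(data)
        self._queue = queue.Queue(maxsize=capacity)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._fill, daemon=True)
        self._thread.start()
        # Remember this instance so it can be cleaned up even if the
        # caller never finishes iterating and never releases it.
        _last_loader = self

    def _fill(self):
        for item in self._data:
            # Put with a timeout so the thread can notice the stop flag
            # instead of blocking forever on a full queue.
            while not self._stop.is_set():
                try:
                    self._queue.put(item, timeout=0.05)
                    break
                except queue.Full:
                    continue
            if self._stop.is_set():
                return

    def __iter__(self):
        for _ in range(len(self._data)):
            yield self._queue.get()

    def shutdown(self):
        # Ask the prefetch thread to stop, then wait for it to finish.
        self._stop.set()
        self._thread.join()


def cleanup_last_loader():
    """Stop the prefetch thread of the last loader, if there is one."""
    global _last_loader
    if _last_loader is not None:
        _last_loader.shutdown()
        _last_loader = None


# Assumption for this sketch: run the cleanup at normal interpreter exit.
atexit.register(cleanup_last_loader)
```

In this sketch, breaking out of `for batch in PrefetchLoader(range(100)):` after a few items leaves the prefetch thread running against the full queue; calling `cleanup_last_loader()` (or reaching normal interpreter exit) stops and joins it.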

testing script
https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/simultaneous_translation/stacl/train.py#L121

original error

terminate called without an active exception


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1627618860 (unix time) try "date -d @1627618860" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x77e) received by PID 1918 (TID 0x7f6e7bfff700) from PID 1918 ***]

INFO 2021-07-30 04:21:09,967 launch_utils.py:327] terminate all the procs
ERROR 2021-07-30 04:21:09,968 launch_utils.py:584] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-07-30 04:21:12,971 launch_utils.py:327] terminate all the procs

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

@chenwhql (Contributor) previously approved these changes Jul 30, 2021

LGTM

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Sep 2, 2021
@PaddlePaddle PaddlePaddle unlocked this conversation Sep 2, 2021
@heavengate heavengate merged commit e93c18a into PaddlePaddle:develop Sep 16, 2021
@heavengate heavengate deleted the fix_dataloader_exit_error branch September 16, 2021 09:52
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021
* fix DataLoader exit with SIGABRT/SIGSEGV. test=develop
3 participants