
DataLoader thread killed #3

Open
huk112739 opened this issue Jan 5, 2024 · 1 comment

Hello! I am experiencing some issues while using the DiffusionReg project and would appreciate your assistance and advice.

When running the training script, the DataLoader workers exit unexpectedly with segmentation faults. Specifically, during the data loading phase the program fails to proceed, and I receive the following error messages:
```
Data exists. Loading...
Data exists. Loading...
Data exists. Loading...
  0%|          | 0/1196 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
  0%|          | 0/1196 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 975787) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sun/PycharmProjects/DiffusionReg/train.py", line 162, in <module>
    main(opts)
  File "/home/sun/PycharmProjects/DiffusionReg/train.py", line 74, in main
    for i, data in enumerate(tqdm(train_loader, 0)):
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/sun/anaconda3/envs/diff/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 975787) exited unexpectedly
```

Setting `num_workers` in the DataLoader to 0 still causes this issue.


ricpruss commented Jul 28, 2024

Set your n_cores to 0. This is a common mistake when people with a GPU try to multithread data loading from the CPU.
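
For reference, a minimal sketch of what that looks like where the PyTorch DataLoader is constructed (the dataset and option names below are illustrative, not necessarily the ones train.py uses; in this repo the worker count may be exposed as n_cores):

```python
from torch.utils.data import DataLoader

# Illustrative names -- substitute the project's actual dataset/options.
# num_workers=0 loads batches in the main process, so no worker
# subprocesses are spawned and the "worker killed by signal" path is avoided.
train_loader = DataLoader(
    train_dataset,   # hypothetical dataset object
    batch_size=32,   # hypothetical batch size
    shuffle=True,
    num_workers=0,   # single-process loading; useful for isolating segfaults
)
```

If training runs with num_workers=0, the crash most likely originates in the worker processes (for example, shared-memory limits or a library that is not fork-safe) rather than in the model code.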
