Race condition when preparing pretrained model in distributed training #44

Closed
llidev opened this issue Nov 20, 2018 · 4 comments

llidev (Contributor) commented Nov 20, 2018

Hi,

I launched two processes per node to run run_classifier.py in distributed mode. However, I occasionally get the error below:

11/20/2018 09:31:48 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmpa25_y4es to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba

 93%|█████████▎| 381028352/407873900 [00:11<00:01, 14366075.22B/s]
 94%|█████████▍| 383812608/407873900 [00:11<00:01, 16210783.00B/s]
 95%|█████████▍| 386455552/407873900 [00:11<00:01, 16205260.89B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   creating metadata file for /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   removing temp file /tmp/tmpa25_y4es

 95%|█████████▌| 388946944/407873900 [00:11<00:01, 18097539.03B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpvxvnr8_1

 97%|█████████▋| 393660416/407873900 [00:11<00:00, 22199883.93B/s]
 98%|█████████▊| 399411200/407873900 [00:11<00:00, 27211860.00B/s]
 99%|█████████▉| 405128192/407873900 [00:11<00:00, 32287252.94B/s]
100%|██████████| 407873900/407873900 [00:11<00:00, 34098120.40B/s]
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmp5fcm4v8x to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
Traceback (most recent call last):
  File "examples/run_classifier.py", line 629, in <module>
    main()
  File "examples/run_classifier.py", line 485, in main
    model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/site-packages/pytorch_pretrained_bert-0.1.2-py3.6.egg/pytorch_pretrained_bert/modeling.py", line 495, in from_pretrained
    archive.extractall(tempdir)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2007, in extractall
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2049, in extract
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2119, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2168, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 248, in copyfileobj
    buf = src.read(bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

It looks like a race condition: two processes are simultaneously writing the model file to /root/.pytorch_pretrained_bert/.

Could you advise on a workaround? Thanks!

llidev (Contributor, Author) commented Nov 20, 2018

My current workaround is to set the env var PYTORCH_PRETRAINED_BERT_CACHE to a different path per process before importing pytorch_pretrained_bert. But I think the module itself should handle this properly.
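
For reference, here is a minimal sketch of this workaround. The environment variable name comes from the library itself; reading LOCAL_RANK from the environment is only an assumption about how the launcher exposes the process rank:

import os

# Point each process at its own cache directory *before* the library is
# imported, so concurrent downloads never write to the same cache entry.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # assumption: the launcher sets LOCAL_RANK
os.environ["PYTORCH_PRETRAINED_BERT_CACHE"] = "/tmp/bert_cache_rank{}".format(local_rank)

# Import only after the env var is set, since the cache path is read when the module loads.
from pytorch_pretrained_bert.modeling import BertForSequenceClassification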

thomwolf (Member) commented
I see, thanks for the feedback. I will find a way to make this better in the next release. I'm not sure we need to store the models gzipped anyway, since they mostly contain a torch dump, which is already compressed.

thomwolf (Member) commented
Ok, I've added a cache_dir option to from_pretrained on master, so a script can specify a different cache dir. I will release the updated version on pip today. Thanks for the feedback.
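
With that option, a script can give each process its own cache directory at the call site. A minimal sketch against the call from the traceback above (the per-rank path is illustrative, and args/label_list come from run_classifier.py's setup):

from pytorch_pretrained_bert.modeling import BertForSequenceClassification

# Each distributed process downloads/extracts into its own cache directory,
# avoiding the concurrent-write race on the shared default cache.
model = BertForSequenceClassification.from_pretrained(
    args.bert_model,   # e.g. "bert-base-uncased"
    len(label_list),   # number of labels, as in run_classifier.py
    cache_dir="/tmp/bert_cache_rank{}".format(args.local_rank),  # illustrative per-rank path
)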

llidev (Contributor, Author) commented Nov 27, 2018

Thanks for fixing this.

Since the way I use this repo is to add ./pytorch_pretrained_bert to PYTHONPATH, I think directly adding the following import to run_classifier.py and run_squad.py is more appropriate in my case:

from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE

which is included in my PR: #58
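
For illustration, a sketch of how that import can be combined with the cache_dir option so each worker gets its own subdirectory under the library's default cache (the "distributed_{rank}" naming is an assumption here, not necessarily what the PR does):

import os
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from pytorch_pretrained_bert.modeling import BertForSequenceClassification

# Derive a per-process cache dir under the default cache location, so each
# distributed worker extracts the pretrained archive into its own directory.
cache_dir = os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE),
                         "distributed_{}".format(args.local_rank))  # args.local_rank: launcher argument
model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list),
                                                      cache_dir=cache_dir)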
