Description
Context
While running the distributed/FSDP/T5_training.py example, I encountered an error when loading the wikihow dataset. I would like to know whether this is a bug or whether there is a way to resolve it.
- PyTorch version: 2.7.0a0+git49bdc41
- Operating System and version: Ubuntu 20.04
Your Environment
- Installed using source? [yes/no]: yes
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU (CUDA)
- Which example are you using: distributed/FSDP/T5_training.py
- Link to code or data to repro [if any]: N/A
- version: commit 1bef748
Expected Behavior
The wikihow dataset should load successfully with the following call:
dataset = load_dataset('wikihow', 'all', data_dir='data/')
Current Behavior
The script fails with a ConnectionError, indicating that the dataset could not be downloaded from the specified URL.
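To separate a network problem on the local machine from a file that is no longer hosted, the URL from the error message can be probed directly. The following is a minimal sketch using only the standard library; `url_status` and its `opener` parameter are hypothetical names introduced here for illustration, and it assumes the host answers HEAD requests.

```python
import urllib.error
import urllib.request

# URL copied from the ConnectionError message in the failure log.
WIKIHOW_URL = (
    "https://raw.githubusercontent.com/mahnazkoupaee/"
    "WikiHow-Dataset/master/all_train.txt"
)


def url_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for a HEAD request to `url`.

    Returns None when no HTTP response is received at all
    (DNS failure, refused connection, timeout).
    `opener` is a hypothetical hook so the function can be tested offline.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # No HTTP response was received.
        return None
```

Calling `url_status(WIKIHOW_URL)` and getting `None` would point to a connectivity problem, whereas a 404 would suggest the file has been moved or removed upstream.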
Possible Solution
The issue might be related to the URL used in the dataset script:
https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt
If that file is no longer available, an updated URL or an alternative data source could resolve the issue.
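One possible alternative data source is a local CSV copy of the dataset, read without going through the dataset script at all. The sketch below is only an illustration under stated assumptions: the function name `read_summarization_pairs` is hypothetical, and the column names `text` and `headline` are assumptions about the CSV layout that should be checked against the actual file header.

```python
import csv
from pathlib import Path

# Long articles may exceed the csv module's default field size limit.
csv.field_size_limit(10_000_000)


def read_summarization_pairs(csv_path, text_column="text",
                             summary_column="headline", limit=None):
    """Yield (text, summary) pairs from a local CSV file.

    The column names are assumptions; rows with an empty text or
    summary field are skipped.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        yielded = 0
        for row in reader:
            text = (row.get(text_column) or "").strip()
            summary = (row.get(summary_column) or "").strip()
            if not text or not summary:
                continue
            yield text, summary
            yielded += 1
            if limit is not None and yielded >= limit:
                break
```

For example, `list(read_summarization_pairs("data/wikihowAll.csv", limit=5))` would return the first five usable pairs, assuming download_dataset.sh places a file with that name under data/. Wiring these pairs into the training script's dataset class is left out here because it depends on how T5_training.py consumes the data.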
Steps to Reproduce
- Download the dataset and install the requirements:
sh download_dataset.sh
pip install -r requirements.txt
- Run the following command:
OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 T5_training.py
Failure Logs [if any]
Downloading and preparing dataset wikihow/all (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/tateiwa/.cache/huggingface/datasets/wikihow/all/1.2.0/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 222, in <module>
[rank0]: fsdp_main(args)
[rank0]: File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 90, in fsdp_main
[rank0]: dataset = load_dataset('wikihow', 'all', data_dir='data/')
...
[rank0]: ConnectionError: Couldn't reach https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt
The full log is as follows:
Downloading and preparing dataset wikihow/all (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/tateiwa/.cache/huggingface/datasets/wikihow/all/1.2.0/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 222, in <module>
[rank0]: fsdp_main(args)
[rank0]: File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 90, in fsdp_main
[rank0]: dataset = load_dataset('wikihow', 'all', data_dir='data/')
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/load.py", line 548, in load_dataset
[rank0]: builder_instance.download_and_prepare(
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/builder.py", line 462, in download_and_prepare
[rank0]: self._download_and_prepare(
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/builder.py", line 518, in _download_and_prepare
[rank0]: split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/datasets/wikihow/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683/wikihow.py", line 126, in _split_generators
[rank0]: dl_path = dl_manager.download_and_extract(_URLS)
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 220, in download_and_extract
[rank0]: return self.extract(self.download(url_or_urls))
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 155, in download
[rank0]: downloaded_path_or_paths = map_nested(
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 163, in map_nested
[rank0]: return {
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 164, in <dictcomp>
[rank0]: k: map_nested(
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 191, in map_nested
[rank0]: return function(data_struct)
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 156, in <lambda>
[rank0]: lambda url: cached_path(url, download_config=self._download_config,), url_or_urls,
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/file_utils.py", line 191, in cached_path
[rank0]: output_path = get_from_cache(
[rank0]: File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/file_utils.py", line 356, in get_from_cache
[rank0]: raise ConnectionError("Couldn't reach {}".format(url))
[rank0]: ConnectionError: Couldn't reach https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt
E0118 06:12:01.182000 1191202 torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 1191213) of binary: /home/tateiwa/pytorch_test/venv/bin/python
Traceback (most recent call last):
File "/home/tateiwa/pytorch_test/venv/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.7.0a0+git49bdc41', 'console_scripts', 'torchrun')())
File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
T5_training.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-18_06:12:01
host : snail03
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1191213)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Could you provide guidance on how to resolve this issue? Alternatively, if this is a bug, are there any workarounds or fixes available?
Thank you for your help in advance!