Error While Running distributed/FSDP/T5_training.py #1310

Open
nariaki3551 opened this issue Jan 18, 2025 · 0 comments

Context

While running the distributed/FSDP/T5_training.py example, I encountered an error when loading the wikihow dataset. I would like to know if this is a bug or if there is a way to resolve it.

  • PyTorch version: 2.7.0a0+git49bdc41
  • Operating System and version: Ubuntu 20.04

Your Environment

  • Installed using source? [yes/no]: yes
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU (CUDA)
  • Which example are you using: distributed/FSDP/T5_training.py
  • Link to code or data to repro [if any]: N/A
  • version: commit 1bef748

Expected Behavior

The wikihow dataset should be successfully loaded using the following command:

dataset = load_dataset('wikihow', 'all', data_dir='data/')

Current Behavior

The script fails with a ConnectionError, indicating that the dataset could not be downloaded from the specified URL.

Possible Solution

The issue might be related to the URL used in the dataset script:
https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt.
If the file is unavailable, an updated URL or alternative data source could resolve the issue.
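To tell a removed file apart from a local network problem, the URL from the error message can be probed directly. The following is a minimal sketch using only the Python standard library; the helper name `check_url` and its injectable `opener` parameter are illustrative choices, not part of the example or of the dataset library.

```python
import urllib.error
import urllib.request

# URL copied from the ConnectionError reported in this issue.
WIKIHOW_TRAIN_URL = (
    "https://raw.githubusercontent.com/mahnazkoupaee/"
    "WikiHow-Dataset/master/all_train.txt"
)


def check_url(url, opener=urllib.request.urlopen, timeout=10.0):
    """Send a HEAD request to `url` and return (reachable, detail).

    `check_url` is a hypothetical helper. `opener` defaults to
    urllib.request.urlopen and can be swapped out so the logic can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"network error: {err}"
```

Calling `check_url(WIKIHOW_TRAIN_URL)` from the affected machine would then show whether the server answers with an error status (suggesting the file was moved or deleted) or whether the request never reaches the server at all.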

Steps to Reproduce

  1. Download the dataset and install the requirements:
    sh download_dataset.sh
    pip install -r requirements.txt
  2. Run the following command:
    OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 T5_training.py

Failure Logs [if any]

Downloading and preparing dataset wikihow/all (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/tateiwa/.cache/huggingface/datasets/wikihow/all/1.2.0/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 222, in <module>
[rank0]:     fsdp_main(args)
[rank0]:   File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 90, in fsdp_main
[rank0]:     dataset = load_dataset('wikihow', 'all', data_dir='data/')
...
[rank0]: ConnectionError: Couldn't reach https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt
The full log is below.
Downloading and preparing dataset wikihow/all (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/tateiwa/.cache/huggingface/datasets/wikihow/all/1.2.0/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 222, in <module>
[rank0]:     fsdp_main(args)
[rank0]:   File "/home/tateiwa/pytorch_test/examples/distributed/FSDP/T5_training.py", line 90, in fsdp_main
[rank0]:     dataset = load_dataset('wikihow', 'all', data_dir='data/')
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/load.py", line 548, in load_dataset
[rank0]:     builder_instance.download_and_prepare(
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/builder.py", line 462, in download_and_prepare
[rank0]:     self._download_and_prepare(
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/builder.py", line 518, in _download_and_prepare
[rank0]:     split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/datasets/wikihow/cfb412ca2191fac028cae9a5a9a03ba21b08ff2b4bf46f8a0473d7303a3e3683/wikihow.py", line 126, in _split_generators
[rank0]:     dl_path = dl_manager.download_and_extract(_URLS)
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 220, in download_and_extract
[rank0]:     return self.extract(self.download(url_or_urls))
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 155, in download
[rank0]:     downloaded_path_or_paths = map_nested(
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 163, in map_nested
[rank0]:     return {
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 164, in <dictcomp>
[rank0]:     k: map_nested(
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/py_utils.py", line 191, in map_nested
[rank0]:     return function(data_struct)
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/download_manager.py", line 156, in <lambda>
[rank0]:     lambda url: cached_path(url, download_config=self._download_config,), url_or_urls,
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/file_utils.py", line 191, in cached_path
[rank0]:     output_path = get_from_cache(
[rank0]:   File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/nlp/utils/file_utils.py", line 356, in get_from_cache
[rank0]:     raise ConnectionError("Couldn't reach {}".format(url))
[rank0]: ConnectionError: Couldn't reach https://raw.githubusercontent.com/mahnazkoupaee/WikiHow-Dataset/master/all_train.txt
E0118 06:12:01.182000 1191202 torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 1191213) of binary: /home/tateiwa/pytorch_test/venv/bin/python
Traceback (most recent call last):
  File "/home/tateiwa/pytorch_test/venv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.7.0a0+git49bdc41', 'console_scripts', 'torchrun')())
  File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tateiwa/pytorch_test/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
T5_training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-18_06:12:01
  host      : snail03
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1191213)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could you provide guidance on how to resolve this issue? Alternatively, if this is a bug, are there any workarounds or fixes available?
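One possible workaround, sketched here under assumptions that have not been verified against the example, would be to bypass the remote download and read a local CSV directly. The sketch assumes that `download_dataset.sh` leaves a CSV file under `data/` and that it has `text` and `headline` columns; the file name, the column names, and the helper `read_local_wikihow` are all hypothetical and would need to be checked against the actual files.

```python
import csv
from pathlib import Path


def read_local_wikihow(csv_path, text_column="text", summary_column="headline", limit=None):
    """Yield (text, summary) pairs from a local CSV file.

    Hypothetical helper: the column names are assumptions about the
    downloaded file, not something confirmed by the example script.
    Rows with an empty text or summary are skipped.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        yielded = 0
        for row in reader:
            text = (row.get(text_column) or "").strip()
            summary = (row.get(summary_column) or "").strip()
            if not text or not summary:
                continue  # incomplete row, nothing to train on
            yield text, summary
            yielded += 1
            if limit is not None and yielded >= limit:
                break
```

For instance, `list(read_local_wikihow("data/wikihowAll.csv", limit=5))` would return the first five complete pairs, assuming that path exists. The pairs would still need to be tokenized and wrapped in whatever dataset class the training script expects.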

Thank you for your help!

nariaki3551 changed the title from "Issue: Error of Running distributed/FSDP/T5_training.py" to "Error While Running distributed/FSDP/T5_training.py" on Jan 18, 2025