num_nodes of DDPFullyShardedNativeStrategy is not set correctly when training with multiple nodes #17028

Closed
ShinoharaHare opened this issue Mar 10, 2023 · 2 comments · Fixed by #17438
Labels: bug, distributed, strategy: fsdp

@ShinoharaHare

ShinoharaHare commented Mar 10, 2023

Bug description

AcceleratorConnector changes a strategy's num_nodes by assigning a value to strategy._num_nodes:
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L839-L840

For DDPStrategy, it works fine.
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/ddp.py#L98
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/ddp.py#L120-L122

However, this does not work for DDPFullyShardedNativeStrategy, because it has no _num_nodes attribute.

https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/fully_sharded_native.py#L143
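
The mismatch described above can be sketched in plain Python without importing Lightning. The classes `DDPLikeStrategy` and `FSDPLikeStrategy` and the function `connect` are hypothetical stand-ins, not the library's code. The sketch also assumes that the FSDP strategy keeps its value in a public `num_nodes` attribute, which the report implies but does not show.

```python
class DDPLikeStrategy:
    """Hypothetical stand-in: stores the value in `_num_nodes` and exposes it via a property."""

    def __init__(self):
        self._num_nodes = 1

    @property
    def num_nodes(self):
        return self._num_nodes


class FSDPLikeStrategy:
    """Hypothetical stand-in: assumed to store the value in a public `num_nodes` attribute."""

    def __init__(self):
        self.num_nodes = 1


def connect(strategy, num_nodes):
    # Mimics the connector behaviour from the report: it writes to the private attribute.
    strategy._num_nodes = num_nodes


ddp = DDPLikeStrategy()
fsdp = FSDPLikeStrategy()
connect(ddp, 2)
connect(fsdp, 2)

print(ddp.num_nodes)   # 2: the property reads the attribute the connector wrote
print(fsdp.num_nodes)  # 1: the write created an unrelated `_num_nodes` attribute
```

In this sketch the assignment succeeds silently on both objects, which is why the problem would only surface later, when something reads `num_nodes`.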

How to reproduce the bug

No response

Error messages and logs

With replace_sampler_ddp=True, the DistributedSampler complains that the rank is invalid:

ValueError: Invalid rank 1, rank should be in the interval [0, 0]
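
A minimal sketch of how a stale num_nodes could lead to this message. It assumes the sampler's replica count is computed as `num_nodes * num_processes` and that the job ran one process per node on two nodes; that launch configuration is consistent with the interval [0, 0] but is not stated in the report. Both `sampler_kwargs` and `check_rank` are hypothetical helpers; `check_rank` only mirrors the range check that the error message describes.

```python
def sampler_kwargs(num_nodes, num_processes, global_rank):
    # Assumption: the replica count is the product of nodes and processes per node.
    return {"num_replicas": num_nodes * num_processes, "rank": global_rank}


def check_rank(rank, num_replicas):
    # Hypothetical helper mirroring the range check behind the reported error.
    if rank < 0 or rank >= num_replicas:
        raise ValueError(
            f"Invalid rank {rank}, rank should be in the interval [0, {num_replicas - 1}]"
        )


# Second node (global rank 1), but num_nodes was never updated from its default of 1.
stale = sampler_kwargs(num_nodes=1, num_processes=1, global_rank=1)
try:
    check_rank(**stale)
except ValueError as err:
    print(err)  # Invalid rank 1, rank should be in the interval [0, 0]

# With num_nodes set correctly, the same rank passes the check.
check_rank(**sampler_kwargs(num_nodes=2, num_processes=1, global_rank=1))
```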

Environment

Current environment
  • CUDA:
    - GPU:
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - available: True
    - version: 11.7
  • Lightning:
    - lightning: 1.9.4
    - lightning-cloud: 0.5.31
    - lightning-colossalai: 0.1.0.dev1
    - lightning-lite: 1.8.0
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 1.9.4
    - torch: 1.13.1
    - torchdistx: 0.2.0+cu116
    - torchmetrics: 0.11.1
  • Packages:
    - absl-py: 1.4.0
    - accelerate: 0.16.0
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - alabaster: 0.7.13
    - anyio: 3.6.2
    - appdirs: 1.4.4
    - argon2-cffi: 21.3.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.2.3
    - asttokens: 2.2.1
    - async-timeout: 4.0.2
    - attrs: 22.2.0
    - babel: 2.12.1
    - backcall: 0.2.0
    - bcrypt: 4.0.1
    - beautifulsoup4: 4.11.2
    - bleach: 6.0.0
    - blessed: 1.20.0
    - boto3: 1.26.84
    - botocore: 1.29.84
    - cachetools: 5.3.0
    - certifi: 2022.12.7
    - cffi: 1.15.1
    - cfgv: 3.3.1
    - chardet: 4.0.0
    - charset-normalizer: 3.0.1
    - click: 7.0
    - colorama: 0.4.4
    - colorclass: 2.2.2
    - colossalai: 0.2.4
    - comm: 0.1.2
    - contexttimer: 0.3.3
    - croniter: 1.3.8
    - cryptography: 39.0.1
    - dacite: 1.8.0
    - datasets: 2.10.1
    - dateutils: 0.6.12
    - debugpy: 1.6.6
    - decorator: 5.1.1
    - deepdiff: 6.2.3
    - deepspeed: 0.7.7
    - defusedxml: 0.7.1
    - dill: 0.3.6
    - distlib: 0.3.6
    - distro: 1.8.0
    - dnspython: 2.3.0
    - docker-pycreds: 0.4.0
    - docutils: 0.16
    - email-validator: 1.3.1
    - executing: 1.2.0
    - fabric: 3.0.0
    - fairscale: 0.4.13
    - fastapi: 0.88.0
    - fastjsonschema: 2.16.3
    - feedparser: 6.0.10
    - filelock: 3.9.0
    - fire: 0.5.0
    - flit-core: 3.6.0
    - fqdn: 1.5.1
    - frozenlist: 1.3.3
    - fsspec: 2023.1.0
    - gitdb: 4.0.10
    - gitpython: 3.1.31
    - google-auth: 2.16.2
    - google-auth-oauthlib: 0.4.6
    - grpcio: 1.51.3
    - h11: 0.14.0
    - hjson: 3.1.0
    - httpcore: 0.16.3
    - httptools: 0.5.0
    - httpx: 0.23.3
    - huggingface-hub: 0.13.0rc1
    - identify: 2.5.18
    - idna: 3.4
    - imagesize: 1.4.1
    - inquirer: 3.1.2
    - invoke: 2.0.0
    - ipykernel: 6.21.3
    - ipython: 8.11.0
    - ipython-genutils: 0.2.0
    - ipywidgets: 7.7.1
    - isoduration: 20.11.0
    - itsdangerous: 2.1.2
    - jedi: 0.18.2
    - jinja2: 3.1.2
    - jmespath: 1.0.1
    - jsonpointer: 2.3
    - jsonschema: 4.17.3
    - jupyter: 1.0.0
    - jupyter-client: 8.0.3
    - jupyter-console: 6.6.3
    - jupyter-core: 5.2.0
    - jupyter-events: 0.6.3
    - jupyter-server: 2.4.0
    - jupyter-server-terminals: 0.4.4
    - jupyterlab-pygments: 0.2.2
    - jupyterlab-widgets: 3.0.5
    - lightning: 1.9.4
    - lightning-cloud: 0.5.31
    - lightning-colossalai: 0.1.0.dev1
    - lightning-lite: 1.8.0
    - lightning-utilities: 0.8.0
    - llama: 0.0.0
    - loguru: 0.6.0
    - markdown: 3.4.1
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.2
    - matplotlib-inline: 0.1.6
    - mdit-py-plugins: 0.3.5
    - mdurl: 0.1.2
    - mistune: 2.0.5
    - multidict: 6.0.4
    - multiprocess: 0.70.14
    - myst-parser: 0.19.1
    - nbclassic: 0.5.3
    - nbclient: 0.7.2
    - nbconvert: 7.2.9
    - nbformat: 5.7.3
    - nest-asyncio: 1.5.6
    - netaddr: 0.8.0
    - ninja: 1.11.1
    - nodeenv: 1.7.0
    - notebook: 6.5.3
    - notebook-shim: 0.2.2
    - numpy: 1.24.2
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - oauthlib: 3.2.2
    - ordered-set: 4.1.0
    - orjson: 3.8.7
    - packaging: 23.0
    - pandas: 1.5.3
    - pandocfilters: 1.5.0
    - parallelformers: 1.2.7
    - paramiko: 3.0.0
    - parso: 0.8.3
    - pathtools: 0.1.2
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pip: 22.3.1
    - platformdirs: 3.1.0
    - pre-commit: 3.1.1
    - prometheus-client: 0.16.0
    - prompt-toolkit: 3.0.38
    - protobuf: 4.22.0
    - psutil: 5.9.4
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - py-cpuinfo: 9.0.0
    - pyarrow: 11.0.0
    - pyasn1: 0.4.8
    - pyasn1-modules: 0.2.8
    - pycparser: 2.21
    - pydantic: 1.10.5
    - pygments: 2.14.0
    - pyjwt: 2.6.0
    - pynacl: 1.5.0
    - pynvml: 11.5.0
    - pyrsistent: 0.19.3
    - python-dateutil: 2.8.2
    - python-dotenv: 1.0.0
    - python-editor: 1.0.4
    - python-json-logger: 2.0.7
    - python-magic: 0.4.27
    - python-multipart: 0.0.6
    - pytorch-lightning: 1.9.4
    - pytz: 2022.7.1
    - pyyaml: 6.0
    - pyzmq: 25.0.0
    - qtconsole: 5.4.0
    - qtpy: 2.3.0
    - readchar: 4.0.3
    - regex: 2022.10.31
    - requests: 2.28.2
    - requests-oauthlib: 1.3.1
    - responses: 0.18.0
    - rfc3339-validator: 0.1.4
    - rfc3986: 1.5.0
    - rfc3986-validator: 0.1.1
    - rich: 13.3.1
    - rsa: 4.7.2
    - s3cmd: 2.3.0
    - s3transfer: 0.6.0
    - send2trash: 1.8.0
    - sentencepiece: 0.1.97
    - sentry-sdk: 1.16.0
    - setproctitle: 1.3.2
    - setuptools: 65.6.3
    - sgmllib3k: 1.0.0
    - six: 1.16.0
    - smmap: 5.0.0
    - sniffio: 1.3.0
    - snowballstemmer: 2.2.0
    - soupsieve: 2.4
    - sphinx: 6.1.3
    - sphinx-click: 4.4.0
    - sphinxcontrib-applehelp: 1.0.4
    - sphinxcontrib-devhelp: 1.0.2
    - sphinxcontrib-htmlhelp: 2.0.1
    - sphinxcontrib-jsmath: 1.0.1
    - sphinxcontrib-qthelp: 1.0.3
    - sphinxcontrib-serializinghtml: 1.1.5
    - stack-data: 0.6.2
    - starlette: 0.22.0
    - starsessions: 1.3.0
    - tensor-parallel: 1.1.0
    - tensorboard: 2.12.0
    - tensorboard-data-server: 0.7.0
    - tensorboard-plugin-wit: 1.8.1
    - termcolor: 2.2.0
    - terminado: 0.17.1
    - terminaltables: 3.1.10
    - tinycss2: 1.2.1
    - tokenizers: 0.13.2
    - torch: 1.13.1
    - torchdistx: 0.2.0+cu116
    - torchmetrics: 0.11.1
    - tornado: 6.2
    - tqdm: 4.64.1
    - traitlets: 5.9.0
    - transformers: 4.26.1
    - tvllm: 0.0.0
    - twcc-cli: 0.6.0
    - typing-extensions: 4.5.0
    - ujson: 5.7.0
    - uri-template: 1.2.0
    - urllib3: 1.26.14
    - uvicorn: 0.20.0
    - uvloop: 0.17.0
    - virtualenv: 20.20.0
    - wandb: 0.13.11
    - watchfiles: 0.18.1
    - wcwidth: 0.2.6
    - webcolors: 1.12
    - webencodings: 0.5.1
    - websocket-client: 1.5.1
    - websockets: 10.4
    - werkzeug: 2.2.3
    - wheel: 0.38.4
    - widgetsnbextension: 3.6.2
    - xxhash: 3.2.0
    - yarl: 1.8.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.10.9
    - version: #1 SMP Tue Mar 31 23:36:51 UTC 2020

More info

No response

cc @awaelchli @carmocca

ShinoharaHare added the bug and needs triage labels Mar 10, 2023
carmocca added the distributed and strategy: fsdp labels and removed the needs triage label Mar 16, 2023
carmocca added this to the v1.9.x milestone Mar 16, 2023
awaelchli self-assigned this Mar 16, 2023
@ShinoharaHare
Author

I noticed that this issue is marked for v1.9.x, but FSDPStrategy in v2.x seems to have the same problem.
I have not tested it myself, but the code looks the same.
Should we open another issue for it?

@carmocca
Contributor

No need. We'll take care of it.
