num_nodes of DDPFullyShardedNativeStrategy is not set correctly when training with multiple nodes #17028
Labels: bug, distributed, strategy: fsdp
Bug description
`AcceleratorConnector` changes the `num_nodes` of a strategy by assigning a value to `strategy._num_nodes`.
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L839-L840
For `DDPStrategy`, it works fine.
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/ddp.py#L98
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/ddp.py#L120-L122
However, for `DDPFullyShardedNativeStrategy`, it won't work, because that strategy has no `_num_nodes` attribute.
https://github.com/Lightning-AI/lightning/blob/3bee81960a6f8979c8e1b5e747a17124feee652d/src/pytorch_lightning/strategies/fully_sharded_native.py#L143
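The mismatch can be illustrated without Lightning installed. The sketch below is not the library's code: `DDPLikeStrategy`, `FSDPLikeStrategy`, `connect`, and `world_size` are hypothetical stand-ins, and the world-size formula (`num_nodes * devices_per_node`) is an assumption made for illustration. It only shows why writing to `_num_nodes` has no visible effect on an object that reads a separate public `num_nodes` attribute.

```python
class DDPLikeStrategy:
    """Hypothetical stand-in: stores the value in ``_num_nodes`` and exposes it via a property."""

    def __init__(self) -> None:
        self._num_nodes = 1

    @property
    def num_nodes(self) -> int:
        return self._num_nodes


class FSDPLikeStrategy:
    """Hypothetical stand-in: only has a plain public ``num_nodes`` attribute."""

    def __init__(self) -> None:
        self.num_nodes = 1


def connect(strategy, num_nodes_flag: int) -> None:
    """Mimics the connector behaviour described above: assign the private attribute directly."""
    strategy._num_nodes = num_nodes_flag


def world_size(strategy, devices_per_node: int) -> int:
    # Assumption: the world size is derived from the public ``num_nodes``.
    return strategy.num_nodes * devices_per_node


ddp = DDPLikeStrategy()
fsdp = FSDPLikeStrategy()
connect(ddp, 2)   # two nodes requested
connect(fsdp, 2)

ddp_world = world_size(ddp, 8)    # the property reads the updated ``_num_nodes``
fsdp_world = world_size(fsdp, 8)  # public ``num_nodes`` is still 1

# With 2 nodes of 8 GPUs each, the highest global rank is 15.
highest_rank = 2 * 8 - 1
rank_ok_ddp = 0 <= highest_rank < ddp_world
rank_ok_fsdp = 0 <= highest_rank < fsdp_world

print(ddp_world, fsdp_world, rank_ok_ddp, rank_ok_fsdp)
```

In this sketch the FSDP-like object ends up with a stray `_num_nodes` of 2 while its `num_nodes` stays at 1, so any rank from the second node falls outside the computed world size. That is consistent with a sampler rejecting the rank as invalid.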
How to reproduce the bug
No response
Error messages and logs
With `replace_sampler_ddp=True`, the `DistributedSampler` will complain that the rank is invalid.
Environment
Current environment
- GPU:
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- available: True
- version: 11.7
- lightning: 1.9.4
- lightning-cloud: 0.5.31
- lightning-colossalai: 0.1.0.dev1
- lightning-lite: 1.8.0
- lightning-utilities: 0.8.0
- pytorch-lightning: 1.9.4
- torch: 1.13.1
- torchdistx: 0.2.0+cu116
- torchmetrics: 0.11.1
- absl-py: 1.4.0
- accelerate: 0.16.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- alabaster: 0.7.13
- anyio: 3.6.2
- appdirs: 1.4.4
- argon2-cffi: 21.3.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.2.3
- asttokens: 2.2.1
- async-timeout: 4.0.2
- attrs: 22.2.0
- babel: 2.12.1
- backcall: 0.2.0
- bcrypt: 4.0.1
- beautifulsoup4: 4.11.2
- bleach: 6.0.0
- blessed: 1.20.0
- boto3: 1.26.84
- botocore: 1.29.84
- cachetools: 5.3.0
- certifi: 2022.12.7
- cffi: 1.15.1
- cfgv: 3.3.1
- chardet: 4.0.0
- charset-normalizer: 3.0.1
- click: 7.0
- colorama: 0.4.4
- colorclass: 2.2.2
- colossalai: 0.2.4
- comm: 0.1.2
- contexttimer: 0.3.3
- croniter: 1.3.8
- cryptography: 39.0.1
- dacite: 1.8.0
- datasets: 2.10.1
- dateutils: 0.6.12
- debugpy: 1.6.6
- decorator: 5.1.1
- deepdiff: 6.2.3
- deepspeed: 0.7.7
- defusedxml: 0.7.1
- dill: 0.3.6
- distlib: 0.3.6
- distro: 1.8.0
- dnspython: 2.3.0
- docker-pycreds: 0.4.0
- docutils: 0.16
- email-validator: 1.3.1
- executing: 1.2.0
- fabric: 3.0.0
- fairscale: 0.4.13
- fastapi: 0.88.0
- fastjsonschema: 2.16.3
- feedparser: 6.0.10
- filelock: 3.9.0
- fire: 0.5.0
- flit-core: 3.6.0
- fqdn: 1.5.1
- frozenlist: 1.3.3
- fsspec: 2023.1.0
- gitdb: 4.0.10
- gitpython: 3.1.31
- google-auth: 2.16.2
- google-auth-oauthlib: 0.4.6
- grpcio: 1.51.3
- h11: 0.14.0
- hjson: 3.1.0
- httpcore: 0.16.3
- httptools: 0.5.0
- httpx: 0.23.3
- huggingface-hub: 0.13.0rc1
- identify: 2.5.18
- idna: 3.4
- imagesize: 1.4.1
- inquirer: 3.1.2
- invoke: 2.0.0
- ipykernel: 6.21.3
- ipython: 8.11.0
- ipython-genutils: 0.2.0
- ipywidgets: 7.7.1
- isoduration: 20.11.0
- itsdangerous: 2.1.2
- jedi: 0.18.2
- jinja2: 3.1.2
- jmespath: 1.0.1
- jsonpointer: 2.3
- jsonschema: 4.17.3
- jupyter: 1.0.0
- jupyter-client: 8.0.3
- jupyter-console: 6.6.3
- jupyter-core: 5.2.0
- jupyter-events: 0.6.3
- jupyter-server: 2.4.0
- jupyter-server-terminals: 0.4.4
- jupyterlab-pygments: 0.2.2
- jupyterlab-widgets: 3.0.5
- llama: 0.0.0
- loguru: 0.6.0
- markdown: 3.4.1
- markdown-it-py: 2.2.0
- markupsafe: 2.1.2
- matplotlib-inline: 0.1.6
- mdit-py-plugins: 0.3.5
- mdurl: 0.1.2
- mistune: 2.0.5
- multidict: 6.0.4
- multiprocess: 0.70.14
- myst-parser: 0.19.1
- nbclassic: 0.5.3
- nbclient: 0.7.2
- nbconvert: 7.2.9
- nbformat: 5.7.3
- nest-asyncio: 1.5.6
- netaddr: 0.8.0
- ninja: 1.11.1
- nodeenv: 1.7.0
- notebook: 6.5.3
- notebook-shim: 0.2.2
- numpy: 1.24.2
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- ordered-set: 4.1.0
- orjson: 3.8.7
- packaging: 23.0
- pandas: 1.5.3
- pandocfilters: 1.5.0
- parallelformers: 1.2.7
- paramiko: 3.0.0
- parso: 0.8.3
- pathtools: 0.1.2
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pip: 22.3.1
- platformdirs: 3.1.0
- pre-commit: 3.1.1
- prometheus-client: 0.16.0
- prompt-toolkit: 3.0.38
- protobuf: 4.22.0
- psutil: 5.9.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py-cpuinfo: 9.0.0
- pyarrow: 11.0.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.10.5
- pygments: 2.14.0
- pyjwt: 2.6.0
- pynacl: 1.5.0
- pynvml: 11.5.0
- pyrsistent: 0.19.3
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-json-logger: 2.0.7
- python-magic: 0.4.27
- python-multipart: 0.0.6
- pytz: 2022.7.1
- pyyaml: 6.0
- pyzmq: 25.0.0
- qtconsole: 5.4.0
- qtpy: 2.3.0
- readchar: 4.0.3
- regex: 2022.10.31
- requests: 2.28.2
- requests-oauthlib: 1.3.1
- responses: 0.18.0
- rfc3339-validator: 0.1.4
- rfc3986: 1.5.0
- rfc3986-validator: 0.1.1
- rich: 13.3.1
- rsa: 4.7.2
- s3cmd: 2.3.0
- s3transfer: 0.6.0
- send2trash: 1.8.0
- sentencepiece: 0.1.97
- sentry-sdk: 1.16.0
- setproctitle: 1.3.2
- setuptools: 65.6.3
- sgmllib3k: 1.0.0
- six: 1.16.0
- smmap: 5.0.0
- sniffio: 1.3.0
- snowballstemmer: 2.2.0
- soupsieve: 2.4
- sphinx: 6.1.3
- sphinx-click: 4.4.0
- sphinxcontrib-applehelp: 1.0.4
- sphinxcontrib-devhelp: 1.0.2
- sphinxcontrib-htmlhelp: 2.0.1
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-qthelp: 1.0.3
- sphinxcontrib-serializinghtml: 1.1.5
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- tensor-parallel: 1.1.0
- tensorboard: 2.12.0
- tensorboard-data-server: 0.7.0
- tensorboard-plugin-wit: 1.8.1
- termcolor: 2.2.0
- terminado: 0.17.1
- terminaltables: 3.1.10
- tinycss2: 1.2.1
- tokenizers: 0.13.2
- tornado: 6.2
- tqdm: 4.64.1
- traitlets: 5.9.0
- transformers: 4.26.1
- tvllm: 0.0.0
- twcc-cli: 0.6.0
- typing-extensions: 4.5.0
- ujson: 5.7.0
- uri-template: 1.2.0
- urllib3: 1.26.14
- uvicorn: 0.20.0
- uvloop: 0.17.0
- virtualenv: 20.20.0
- wandb: 0.13.11
- watchfiles: 0.18.1
- wcwidth: 0.2.6
- webcolors: 1.12
- webencodings: 0.5.1
- websocket-client: 1.5.1
- websockets: 10.4
- werkzeug: 2.2.3
- wheel: 0.38.4
- widgetsnbextension: 3.6.2
- xxhash: 3.2.0
- yarl: 1.8.2
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.9
- version: #1 SMP Tue Mar 31 23:36:51 UTC 2020
More info
No response
cc @awaelchli @carmocca