num_nodes of DDPFullyShardedNativeStrategy is not set correctly when training with multiple nodes #17028

ShinoharaHare opened this issue Mar 10, 2023 · 2 comments · Fixed by #17438
Labels: bug · distributed · strategy: fsdp


ShinoharaHare commented Mar 10, 2023

Bug description

AcceleratorConnector changes the num_nodes of a strategy by assigning a value to strategy._num_nodes.

For DDPStrategy, this works fine.

However, for DDPFullyShardedNativeStrategy it does not work, because that strategy has no _num_nodes attribute.
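
The behaviour described above can be illustrated with a minimal, self-contained sketch. The classes here are hypothetical stand-ins, not Lightning's actual implementations. They only assume that one strategy exposes num_nodes as a property backed by _num_nodes, while the other keeps num_nodes in a plain attribute and never reads _num_nodes.

```python
class PropertyBackedStrategy:
    """Hypothetical stand-in: num_nodes is a property that reads _num_nodes."""

    def __init__(self):
        self._num_nodes = 1

    @property
    def num_nodes(self):
        return self._num_nodes


class PlainAttributeStrategy:
    """Hypothetical stand-in: no _num_nodes; the value lives in a plain
    num_nodes attribute (an assumption made for illustration)."""

    def __init__(self):
        self.num_nodes = 1


def connector_sets_num_nodes(strategy, num_nodes):
    # Mirrors the assignment described in the issue: write to the private name.
    strategy._num_nodes = num_nodes


ddp_like = PropertyBackedStrategy()
fsdp_like = PlainAttributeStrategy()
connector_sets_num_nodes(ddp_like, 2)
connector_sets_num_nodes(fsdp_like, 2)

print(ddp_like.num_nodes)   # the property picks up the new value
print(fsdp_like.num_nodes)  # still the default; the write only created an unused attribute
```

In the second case Python silently creates a new _num_nodes attribute, so nothing fails at assignment time and the stale value only shows up later.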

How to reproduce the bug

Error messages and logs

With replace_sampler_ddp=True, the DistributedSampler raises an error because the rank is invalid:

ValueError: Invalid rank 1, rank should be in the interval [0, 0]
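
This message is consistent with a sampler being built with too few replicas. Below is a rough sketch, assuming the number of replicas is computed as num_nodes * num_processes and that any rank outside [0, num_replicas - 1] is rejected. The function check_rank is a hypothetical imitation of that validation, not the DistributedSampler source, and the two-node, one-process-per-node setup is chosen only for illustration.

```python
def check_rank(rank, num_nodes, num_processes):
    """Hypothetical imitation of the sampler's rank validation."""
    num_replicas = num_nodes * num_processes  # assumed formula
    if rank < 0 or rank >= num_replicas:
        raise ValueError(
            f"Invalid rank {rank}, rank should be in the interval [0, {num_replicas - 1}]"
        )
    return num_replicas


# Two nodes with one process each: global ranks are 0 and 1, both accepted.
check_rank(rank=1, num_nodes=2, num_processes=1)

# If num_nodes is left at its default of 1, rank 1 falls outside [0, 0].
try:
    check_rank(rank=1, num_nodes=1, num_processes=1)
except ValueError as err:
    print(err)  # Invalid rank 1, rank should be in the interval [0, 0]
```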


Current environment
  • CUDA:
    - GPU:
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
      - Tesla V100-SXM2-32GB
    - available: True
    - version: 11.7
  • Lightning:
    - lightning: 1.9.4
    - lightning-cloud: 0.5.31
    - lightning-colossalai: 0.1.0.dev1
    - lightning-lite: 1.8.0
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 1.9.4
    - torch: 1.13.1
    - torchdistx: 0.2.0+cu116
    - torchmetrics: 0.11.1
  • Packages:
    - absl-py: 1.4.0
    - accelerate: 0.16.0
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - alabaster: 0.7.13
    - anyio: 3.6.2
    - appdirs: 1.4.4
    - argon2-cffi: 21.3.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.2.3
    - asttokens: 2.2.1
    - async-timeout: 4.0.2
    - attrs: 22.2.0
    - babel: 2.12.1
    - backcall: 0.2.0
    - bcrypt: 4.0.1
    - beautifulsoup4: 4.11.2
    - bleach: 6.0.0
    - blessed: 1.20.0
    - boto3: 1.26.84
    - botocore: 1.29.84
    - cachetools: 5.3.0
    - certifi: 2022.12.7
    - cffi: 1.15.1
    - cfgv: 3.3.1
    - chardet: 4.0.0
    - charset-normalizer: 3.0.1
    - click: 7.0
    - colorama: 0.4.4
    - colorclass: 2.2.2
    - colossalai: 0.2.4
    - comm: 0.1.2
    - contexttimer: 0.3.3
    - croniter: 1.3.8
    - cryptography: 39.0.1
    - dacite: 1.8.0
    - datasets: 2.10.1
    - dateutils: 0.6.12
    - debugpy: 1.6.6
    - decorator: 5.1.1
    - deepdiff: 6.2.3
    - deepspeed: 0.7.7
    - defusedxml: 0.7.1
    - dill: 0.3.6
    - distlib: 0.3.6
    - distro: 1.8.0
    - dnspython: 2.3.0
    - docker-pycreds: 0.4.0
    - docutils: 0.16
    - email-validator: 1.3.1
    - executing: 1.2.0
    - fabric: 3.0.0
    - fairscale: 0.4.13
    - fastapi: 0.88.0
    - fastjsonschema: 2.16.3
    - feedparser: 6.0.10
    - filelock: 3.9.0
    - fire: 0.5.0
    - flit-core: 3.6.0
    - fqdn: 1.5.1
    - frozenlist: 1.3.3
    - fsspec: 2023.1.0
    - gitdb: 4.0.10
    - gitpython: 3.1.31
    - google-auth: 2.16.2
    - google-auth-oauthlib: 0.4.6
    - grpcio: 1.51.3
    - h11: 0.14.0
    - hjson: 3.1.0
    - httpcore: 0.16.3
    - httptools: 0.5.0
    - httpx: 0.23.3
    - huggingface-hub: 0.13.0rc1
    - identify: 2.5.18
    - idna: 3.4
    - imagesize: 1.4.1
    - inquirer: 3.1.2
    - invoke: 2.0.0
    - ipykernel: 6.21.3
    - ipython: 8.11.0
    - ipython-genutils: 0.2.0
    - ipywidgets: 7.7.1
    - isoduration: 20.11.0
    - itsdangerous: 2.1.2
    - jedi: 0.18.2
    - jinja2: 3.1.2
    - jmespath: 1.0.1
    - jsonpointer: 2.3
    - jsonschema: 4.17.3
    - jupyter: 1.0.0
    - jupyter-client: 8.0.3
    - jupyter-console: 6.6.3
    - jupyter-core: 5.2.0
    - jupyter-events: 0.6.3
    - jupyter-server: 2.4.0
    - jupyter-server-terminals: 0.4.4
    - jupyterlab-pygments: 0.2.2
    - jupyterlab-widgets: 3.0.5
    - lightning: 1.9.4
    - lightning-cloud: 0.5.31
    - lightning-colossalai: 0.1.0.dev1
    - lightning-lite: 1.8.0
    - lightning-utilities: 0.8.0
    - llama: 0.0.0
    - loguru: 0.6.0
    - markdown: 3.4.1
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.2
    - matplotlib-inline: 0.1.6
    - mdit-py-plugins: 0.3.5
    - mdurl: 0.1.2
    - mistune: 2.0.5
    - multidict: 6.0.4
    - multiprocess: 0.70.14
    - myst-parser: 0.19.1
    - nbclassic: 0.5.3
    - nbclient: 0.7.2
    - nbconvert: 7.2.9
    - nbformat: 5.7.3
    - nest-asyncio: 1.5.6
    - netaddr: 0.8.0
    - ninja: 1.11.1
    - nodeenv: 1.7.0
    - notebook: 6.5.3
    - notebook-shim: 0.2.2
    - numpy: 1.24.2
    - nvidia-cublas-cu11:
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11:
    - oauthlib: 3.2.2
    - ordered-set: 4.1.0
    - orjson: 3.8.7
    - packaging: 23.0
    - pandas: 1.5.3
    - pandocfilters: 1.5.0
    - parallelformers: 1.2.7
    - paramiko: 3.0.0
    - parso: 0.8.3
    - pathtools: 0.1.2
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pip: 22.3.1
    - platformdirs: 3.1.0
    - pre-commit: 3.1.1
    - prometheus-client: 0.16.0
    - prompt-toolkit: 3.0.38
    - protobuf: 4.22.0
    - psutil: 5.9.4
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - py-cpuinfo: 9.0.0
    - pyarrow: 11.0.0
    - pyasn1: 0.4.8
    - pyasn1-modules: 0.2.8
    - pycparser: 2.21
    - pydantic: 1.10.5
    - pygments: 2.14.0
    - pyjwt: 2.6.0
    - pynacl: 1.5.0
    - pynvml: 11.5.0
    - pyrsistent: 0.19.3
    - python-dateutil: 2.8.2
    - python-dotenv: 1.0.0
    - python-editor: 1.0.4
    - python-json-logger: 2.0.7
    - python-magic: 0.4.27
    - python-multipart: 0.0.6
    - pytorch-lightning: 1.9.4
    - pytz: 2022.7.1
    - pyyaml: 6.0
    - pyzmq: 25.0.0
    - qtconsole: 5.4.0
    - qtpy: 2.3.0
    - readchar: 4.0.3
    - regex: 2022.10.31
    - requests: 2.28.2
    - requests-oauthlib: 1.3.1
    - responses: 0.18.0
    - rfc3339-validator: 0.1.4
    - rfc3986: 1.5.0
    - rfc3986-validator: 0.1.1
    - rich: 13.3.1
    - rsa: 4.7.2
    - s3cmd: 2.3.0
    - s3transfer: 0.6.0
    - send2trash: 1.8.0
    - sentencepiece: 0.1.97
    - sentry-sdk: 1.16.0
    - setproctitle: 1.3.2
    - setuptools: 65.6.3
    - sgmllib3k: 1.0.0
    - six: 1.16.0
    - smmap: 5.0.0
    - sniffio: 1.3.0
    - snowballstemmer: 2.2.0
    - soupsieve: 2.4
    - sphinx: 6.1.3
    - sphinx-click: 4.4.0
    - sphinxcontrib-applehelp: 1.0.4
    - sphinxcontrib-devhelp: 1.0.2
    - sphinxcontrib-htmlhelp: 2.0.1
    - sphinxcontrib-jsmath: 1.0.1
    - sphinxcontrib-qthelp: 1.0.3
    - sphinxcontrib-serializinghtml: 1.1.5
    - stack-data: 0.6.2
    - starlette: 0.22.0
    - starsessions: 1.3.0
    - tensor-parallel: 1.1.0
    - tensorboard: 2.12.0
    - tensorboard-data-server: 0.7.0
    - tensorboard-plugin-wit: 1.8.1
    - termcolor: 2.2.0
    - terminado: 0.17.1
    - terminaltables: 3.1.10
    - tinycss2: 1.2.1
    - tokenizers: 0.13.2
    - torch: 1.13.1
    - torchdistx: 0.2.0+cu116
    - torchmetrics: 0.11.1
    - tornado: 6.2
    - tqdm: 4.64.1
    - traitlets: 5.9.0
    - transformers: 4.26.1
    - tvllm: 0.0.0
    - twcc-cli: 0.6.0
    - typing-extensions: 4.5.0
    - ujson: 5.7.0
    - uri-template: 1.2.0
    - urllib3: 1.26.14
    - uvicorn: 0.20.0
    - uvloop: 0.17.0
    - virtualenv: 20.20.0
    - wandb: 0.13.11
    - watchfiles: 0.18.1
    - wcwidth: 0.2.6
    - webcolors: 1.12
    - webencodings: 0.5.1
    - websocket-client: 1.5.1
    - websockets: 10.4
    - werkzeug: 2.2.3
    - wheel: 0.38.4
    - widgetsnbextension: 3.6.2
    - xxhash: 3.2.0
    - yarl: 1.8.2
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.10.9
    - version: #1 SMP Tue Mar 31 23:36:51 UTC 2020

More info

cc @awaelchli @carmocca

I notice that this issue is marked for v1.9.x, but FSDPStrategy in v2.x seems to have the same problem.
I have not tested it myself, but the code looks the same.
Should we open another issue for it?

No need. We'll take care of it

