Improve overflow handling in ZeRO #6976

Merged: loadams merged 99 commits into master from olruwase/ds_5241 on Jun 9, 2025

@tjruwase (Contributor) commented Jan 28, 2025

Fix #5241: Improve overflow handling

  • ZeRO 1
  • ZeRO 2
  • ZeRO 3
  • BF16Optimizer

Enable pydantic configuration for mixed precision

  • bf16
  • fp16

@tjruwase (Contributor, Author)

@delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

@delock (Collaborator) commented Feb 5, 2025

> @delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

@tjruwase thanks! Our engineer is looking into it.

@sayakpaul

Any ETA on this for merge?

@tjruwase (Contributor, Author) commented Jun 6, 2025

> Any ETA on this for merge?

Since CI now looks fine, this should be merged by 06/13/25. Thanks for your patience.

@loadams loadams enabled auto-merge (squash) June 9, 2025 16:39
@loadams loadams merged commit e440506 into master Jun 9, 2025
12 checks passed
@loadams loadams deleted the olruwase/ds_5241 branch June 9, 2025 17:30
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025
Fix deepspeedai#5241: Improve overflow handling
- [x] ZeRO 1
- [x] ZeRO 2
- [ ] ZeRO 3
- [ ] BF16Optimizer

Enable pydantic configuration for mixed precision
- [x] bf16
- [x] fp16

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Fabio Geraci <118277438+fabiosanger@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Antlera pushed a commit to Antlera/DeepSpeed that referenced this pull request Jun 27, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025

Development

Successfully merging this pull request may close these issues.

[BUG] Zero2 offload overflow