
This result was obtained when training Megatron-LM from the examples.
Here is the DeepSpeed configuration:
{ "train_batch_size": 32, "train_micro_batch_size_per_gpu":1, "steps_per_print":10, "prescale_gradients":false, "gradient_clipping":1.0, "wall_clock_breakdown":false, "fp16": { "enabled":true, "loss_scale":0 }, "zero_optimization": { "stage":1 }, "gradient_predivide_factor": 1, "zero_allow_untested_optimizer": true }
We use the Apex LAMB optimizer, which is why `zero_allow_untested_optimizer` is set to `true`: DeepSpeed requires that flag when a client optimizer it has not validated is used with ZeRO.
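For reference, a minimal sketch of how this config and the Apex LAMB optimizer can be wired together via `deepspeed.initialize` (this is not the actual Megatron-LM training script; the model, learning rate, and `ds_config.json` filename are placeholders):

```python
import torch
import deepspeed
from apex.optimizers import FusedLAMB

# Stand-in for the actual Megatron-LM model.
model = torch.nn.Linear(1024, 1024)

# Apex LAMB; the learning rate here is illustrative only.
optimizer = FusedLAMB(model.parameters(), lr=1e-4)

# Passing a client optimizer that ZeRO has not validated (LAMB) is what
# "zero_allow_untested_optimizer": true in the config permits.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",  # the JSON above, saved to a file
)
```

With this config, ZeRO stage 1 partitions the LAMB optimizer states across the data-parallel ranks, and `"loss_scale": 0` under `fp16` selects dynamic loss scaling.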