Description
Trying out the CIFAR-10 example, it appears that some arguments don't take effect because they are being overwritten somewhere.
I tried changing the number of epochs with the --epochs flag, but the cifar10_deepspeed.py script appears to hard-code 2 epochs:
for epoch in range(2): # loop over the dataset multiple times
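For reference, a minimal sketch of what honoring the flag might look like (a hypothetical fix, assuming the script parses its flags with argparse; the `--epochs` flag name is the one from this report, and `add_argument` here is an illustrative helper, not necessarily the script's actual function):

```python
import argparse

def add_argument(argv=None):
    """Hypothetical argument parser; the real script may differ."""
    parser = argparse.ArgumentParser(description="CIFAR-10 DeepSpeed example")
    # Default of 2 preserves the current hard-coded behavior.
    parser.add_argument("--epochs", type=int, default=2,
                        help="number of training epochs")
    # parse_known_args so DeepSpeed's own launcher flags pass through.
    args, _ = parser.parse_known_args(argv)
    return args

args = add_argument()  # reads sys.argv in the real script

for epoch in range(args.epochs):  # instead of the hard-coded range(2)
    pass  # training loop body goes here
```

Running with `--epochs 5` would then actually train for 5 epochs instead of silently ignoring the flag.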
I also tried changing the learning rate to 0.0005 in ds_config.json, and it seems to get picked up in some parts but overwritten in others. For example, I see:
worker-0: [2020-10-01 22:41:49,395] [INFO] [config.py:624:print] optimizer_params ............. {'lr': 0.0005, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
which suggests the change was picked up. But when training actually runs, it always logs:
worker-0: [2020-10-01 22:42:57,190] [INFO] [logging.py:60:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
which suggests the lr change was not applied (it also stays at 0.001 throughout, which suggests no lr_warmup is happening either). I haven't tracked down where in the script the learning rate gets overwritten, but it does seem to be happening.
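One place worth checking (a guess, not a confirmed diagnosis): if ds_config.json also defines an lr scheduler, the scheduler's own parameters (e.g. a max-lr setting) could silently take precedence over the optimizer's configured lr. A toy illustration of that hypothesis in plain Python (no DeepSpeed import; the config keys and the `effective_lr` helper are hypothetical, chosen only to show how a scheduler ceiling of 0.001 would reproduce the constant lr=[0.001] in the log despite the optimizer saying 0.0005):

```python
# Hypothetical config mirroring the mismatch described above: the
# optimizer asks for 0.0005, but a scheduler carries its own lr ceiling.
config = {
    "optimizer": {"type": "Adam", "params": {"lr": 0.0005}},
    "scheduler": {"type": "WarmupLR",
                  "params": {"warmup_min_lr": 0,
                             "warmup_max_lr": 0.001,
                             "warmup_num_steps": 1000}},
}

def effective_lr(step, cfg):
    """Toy model of the hypothesis: after warmup, the scheduler pins the
    lr to its own warmup_max_lr, regardless of the optimizer's lr."""
    sch = cfg["scheduler"]["params"]
    if step >= sch["warmup_num_steps"]:
        return sch["warmup_max_lr"]
    frac = step / sch["warmup_num_steps"]
    return sch["warmup_min_lr"] + frac * (sch["warmup_max_lr"] - sch["warmup_min_lr"])

print(effective_lr(18000, config))  # 0.001, matching the logged value
```

Under this (unverified) hypothesis, the optimizer_params line would correctly report 0.0005 at init while training logs the scheduler-driven 0.001, which is exactly the pattern in the two log lines above.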