Question about the final test_acc in the CIFAR10 experiment #16
Hi, I'm not sure exactly which experimental results you're trying to reproduce; I'm not sure we had any results in the paper that use mode=fedavg, num_clients=200, num_workers=10, local_batch_size=1. Could you tell me which results you're trying to reproduce so I can tell you what hparams to use? Thanks.

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu
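For readers following along: --error_type virtual and --virtual_momentum 0.9 mean the server, rather than each client, keeps the momentum and error-feedback state. Below is a minimal dense stand-in for that server step, with top-k extraction in place of the sketch-domain version the repo actually uses (the function and variable names are illustrative, not the repo's API):

```python
import torch

def server_step(update, momentum_buf, error_buf, lr, rho=0.9, k=50000):
    # "Virtual" momentum: the server, not each client, keeps the momentum state.
    momentum_buf.mul_(rho).add_(update)
    # "Virtual" error feedback: accumulate everything not yet applied to the model.
    error_buf.add_(lr * momentum_buf)
    # Extract the heavy coordinates (dense top-k stands in for unsketching here).
    _, idx = error_buf.abs().topk(k)
    step = torch.zeros_like(error_buf)
    step[idx] = error_buf[idx]
    error_buf[idx] = 0  # only the coordinates actually applied are cleared
    return step  # the caller subtracts this from the model weights

# Usage sketch, sized to the grad size reported in the log below:
dim = 6568640
momentum_buf, error_buf = torch.zeros(dim), torch.zeros(dim)
step = server_step(torch.randn(dim), momentum_buf, error_buf, lr=0.4)
```

The key design choice is that coordinates applied to the model are cleared from the error buffer, while everything else carries over to the next round instead of being thrown away.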
Thanks very much! I will try it now.

I used the hparams you gave, but I got these results:

Can you change num_cols -> 500000?

I will try, thanks!

The result shows some improvement. If I want further improvement, should I try num_rows -> 5?

try this
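For context, num_rows and num_cols set the dimensions of the count sketch that compresses the gradients in sketch mode: more columns mean fewer hash collisions per bucket, while more rows make the median estimate more robust. A toy illustration of the data structure, assuming simple random hashing (the repo's sketching code does this far more efficiently on GPU):

```python
import torch

class TinyCountSketch:
    """Toy count sketch: num_rows hash rows x num_cols buckets, with random signs."""
    def __init__(self, dim, num_rows, num_cols, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(num_cols, (num_rows, dim), generator=g)
        self.signs = (torch.randint(0, 2, (num_rows, dim), generator=g) * 2 - 1).float()
        self.table = torch.zeros(num_rows, num_cols)

    def accumulate(self, vec):
        # each row adds the signed vector into its buckets; collisions add noise
        for r in range(self.table.shape[0]):
            self.table[r].index_add_(0, self.buckets[r], self.signs[r] * vec)

    def query(self, i):
        # median across rows of the signed bucket values estimates vec[i]
        rows = torch.arange(self.table.shape[0])
        return (self.signs[rows, i] * self.table[rows, self.buckets[rows, i]]).median()

vec = torch.zeros(1000); vec[7] = 5.0           # one heavy coordinate
cs = TinyCountSketch(dim=1000, num_rows=5, num_cols=200)
cs.accumulate(vec)
print(cs.query(7))                              # ~5.0, up to collision noise
```

This is why bumping num_cols helps: at 500,000 columns there is roughly one bucket per 13 coordinates of the 6,568,640-parameter gradient reported in the log below, so heavy coordinates collide much less often, whereas extra rows mainly stabilize the median.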
It seems the error is occurring because the labels we are passing in are just integers denoting the class label, and for some reason the CUDA kernel doesn't work with ints? That's pretty weird. What are your torch and CUDA versions? Can you print out the types of the inputs in the backward pass? Can you try just casting the label to a torch data type?
My CUDA version is 12.0 and my torch version is 2.1.1 |
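A minimal sketch of the suggested cast, assuming the label arrives as a plain Python int and the loss is standard cross-entropy (the tensor names here are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
logits = torch.randn(1, 10, device=device, requires_grad=True)  # one CIFAR10 example, 10 classes
label = 3  # plain Python int class label

# Cast to a LongTensor before it reaches the loss / CUDA kernel:
# F.cross_entropy expects integer class indices as torch.long.
target = torch.tensor([label], dtype=torch.long, device=device)

loss = F.cross_entropy(logits, target)
loss.backward()  # the backward pass now sees torch tensors, not raw Python ints
```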
Hi, I tried to reproduce the experiment results in the paper. I am using the following command:

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu

The accuracy doesn't seem correct; could you help me figure out what's going wrong? The logs are:
MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch lr train_time train_loss train_acc test_loss test_acc total_time
1 0.0800 25.2243 2.3028 0.1038 2.3012 0.1405 30.4418
2 0.1600 23.8377 2.3025 0.1017 2.2936 0.1460 57.5426
3 0.2400 24.0562 2.2886 0.1157 2.2449 0.1507 84.8176
4 0.3200 23.0985 2.2479 0.1461 2.1887 0.1535 111.1938
5 0.4000 22.0071 2.2901 0.0944 2.2941 0.0930 136.4487
6 0.3789 21.9321 2.3150 0.1301 3.3015 0.0997 161.6546
7 0.3579 21.9460 2.3782 0.1078 2.2771 0.1324 186.8818
8 0.3368 21.8156 2.2793 0.1264 2.2281 0.1360 211.9292
9 0.3158 21.6892 2.2410 0.1775 2.2307 0.1417 236.9210
10 0.2947 21.9432 2.2989 0.1024 2.2831 0.1175 262.0983
11 0.2737 21.9095 2.2511 0.1332 2.1657 0.1901 287.2876
12 0.2526 27.3621 2.1729 0.1771 2.1231 0.1734 321.2075
13 0.2316 37.6449 2.1274 0.1580 2.1067 0.2008 365.1934
14 0.2105 32.6825 2.3116 0.1308 2.0721 0.2026 401.1535
15 0.1895 22.3018 2.1435 0.1707 2.0014 0.2332 426.7760
16 0.1684 30.7159 2.0729 0.1982 2.1173 0.2312 460.7642
17 0.1474 22.4368 2.1110 0.2006 2.0027 0.2580 489.7420
18 0.1263 39.1600 2.0538 0.1897 2.0412 0.2377 535.3520
19 0.1053 38.9138 2.0614 0.2156 2.0193 0.2655 580.7346
20 0.0842 21.9821 1.9763 0.2441 2.0301 0.2679 605.9769
21 0.0632 32.8850 1.9892 0.2655 2.0524 0.2711 645.6084
22 0.0421 38.2427 1.9478 0.2627 1.8612 0.3094 690.2626
23 0.0211 38.3396 1.9010 0.2778 1.8869 0.2993 735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24 0.0000 33.1543 1.9016 0.2929 1.8394 0.3032 771.5566
done training
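For reference, the lr column above traces a triangular schedule: it ramps linearly up to lr_scale = 0.4 at pivot_epoch = 5, then decays linearly to 0 at num_epochs = 24, which is why the run ends with WARNING: LR is 0. A minimal sketch that reproduces the column (the function name is illustrative, not the repo's API):

```python
import numpy as np

def triangular_lr(epoch, lr_scale=0.4, pivot_epoch=5, num_epochs=24):
    """Piecewise-linear LR: 0 -> lr_scale at pivot_epoch, then back to 0 at num_epochs."""
    return float(np.interp(epoch, [0, pivot_epoch, num_epochs], [0.0, lr_scale, 0.0]))

# Reproduces the logged column: 0.08, 0.16, ..., 0.40, 0.3789, ..., 0.0
print([round(triangular_lr(e), 4) for e in range(1, 25)])
```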