This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

Questions about training and inference configuration #17

Closed
chufengt opened this issue Mar 30, 2022 · 6 comments

Comments

@chufengt

chufengt commented Mar 30, 2022

Hi,

Thanks for open-sourcing such great work. I have some questions when using this code:

  1. Does the test_lseg.py script support multi-GPU inference? With a single GPU, inference on ade20k takes about 2-3 hours.
  2. I evaluated the provided demo_e200.ckpt on ade20k and got pixAcc 0.8078 and mIoU 0.3207. Is that correct? It seems lower than the values reported in the paper.
  3. I trained a model on ade20k (the same config as train.sh, backbone vit_l16_384) with 8 * V100 and found it needs ~90 hours for 240 epochs. Is this reasonable? It seems much longer than what you mentioned in Training configuration #7.
  4. When using this code on other datasets such as Cityscapes, what changes should I make? The only difference I found is get_labels() in lseg_module.py. Have you evaluated mIoU on Cityscapes?

Thanks in advance.

@Boyiliee
Collaborator

Boyiliee commented Apr 2, 2022

Hi @chufengt,

Thanks for your interest in LSeg!

  1. Since the inference time is acceptable for testing, we currently don't support multi-GPU inference.
  2. There might be some misunderstanding: the demo model is only for qualitative trials on the fly. For the experiments and ablation studies, we use different settings. Please take a detailed look at Section 5.1 and the experimental setup part of the paper.
  3. Same as 2, please strictly follow the settings in the paper. For the ablation studies such as the results in 5.1, we train LSeg with DPT and a smaller ViT-B/32 backbone together with the CLIP ViT-B/32 text encoder on the ADE20K dataset. You can follow the training and testing instructions in the README; the primary change is to set --backbone clip_vitb32_384. You can check the details via this link and Difference between the settings for demo and those in your paper #13.
  4. You should add your label files to https://github.com/isl-org/lang-seg/tree/main/label_files and change the get_labels function to control how your label file is parsed (see the sketch below). For the quantitative results in the paper, we did not evaluate mIoU on Cityscapes.
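
A minimal sketch of what an extended get_labels could look like (the file name, the one-class-per-line format, and the parsing below are assumptions for illustration, not the repository's actual conventions):

# Hypothetical extension of get_labels() in lseg_module.py for a new dataset.
# Assumes a plain-text label file with one class name per line, e.g.
# label_files/cityscapes_classes.txt containing "road", "sidewalk", "building", ...
def get_labels(dataset):
    path = f"label_files/{dataset}_classes.txt"  # hypothetical file name
    with open(path, "r") as f:
        labels = [line.strip() for line in f if line.strip()]
    return labels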

Hope this helps.

@Boyiliee Boyiliee closed this as completed Apr 2, 2022
@chufengt
Author

chufengt commented Apr 7, 2022

Hi, @Boyiliee,

Thanks for your reply. It really helps. I have some extra questions:

  1. For the training time mentioned above, I noticed that you said '1-2 days for ade20k' in Training configuration #7. Was that measured with vit_b32 or vit_l16? I'm not sure whether the ~90 h training time for vit_l16 on ade20k is reasonable. The config is the same as train.sh.
  2. Does this code support multi-node (e.g., 8*2 GPUs) training? (See the sketch after this list for what I mean.)
  3. When I tried to train LSeg on Cityscapes, I got an 'out of CUDA memory' error with a crop size of 768 (line 31 in lseg_module.py), but 480 works. The backbone is vit_l16 and I use 8 * 32 GB V100s. Is this expected?
  4. For Cityscapes, I got mIoU ≈ 60% with the vit_l16 backbone; the other configs are the same as train.sh. This seems much lower than the SoTA results on semantic segmentation. Can you give me some suggestions on how to improve the results?
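
For question 2, to clarify what I mean by multi-node: something like the following PyTorch Lightning-style launch (a generic sketch under the assumption that the training loop is Lightning-based; the exact Trainer arguments used by train.sh are not shown here):

import pytorch_lightning as pl

# Generic multi-node sketch, not the repository's actual entry point:
# the same script is launched once per node (with MASTER_ADDR/NODE_RANK set),
# and the Trainer is told how many nodes and GPUs per node to expect.
trainer = pl.Trainer(
    gpus=8,             # GPUs per node
    num_nodes=2,        # 2 nodes x 8 GPUs = 16 GPUs in total
    accelerator="ddp",  # older Lightning API; newer versions use strategy="ddp"
    max_epochs=240,
)
# trainer.fit(lseg_module, datamodule=data)  # hypothetical names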

Thanks again.

@chufengt
Author

chufengt commented Apr 9, 2022

Another quick question.

In test_lseg.py:

scales = (
    [0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25]
    if "citys" in args.dataset
    else [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
)

Could you give some references for how these scales were selected?
I'm not very familiar with semantic segmentation, but I found that HRNet uses different scales: https://github.com/HRNet/HRNet-Semantic-Segmentation

Performance on the Cityscapes dataset. The models are trained and tested with the input size of 512x1024 and 1024x2048 respectively. If multi-scale testing is used, we adopt scales: 0.5,0.75,1.0,1.25,1.5,1.75.

Performance on the ADE20K dataset. The models are trained and tested with the input size of 520x520. If multi-scale testing is used, we adopt scales: 0.5,0.75,1.0,1.25,1.5,1.75,2.0 (the same as EncNet, DANet etc.).
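
For context, my understanding of multi-scale testing is roughly the following (a generic PyTorch sketch, not this repository's exact implementation): run inference at each scale, resize the logits back to the original resolution, and average them.

import torch.nn.functional as F

def multi_scale_predict(model, image, scales):
    # image: (1, 3, H, W) tensor; model returns per-pixel class logits (1, C, h, w)
    _, _, H, W = image.shape
    summed = None
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        logits = model(resized)
        logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
        summed = logits if summed is None else summed + logits
    return summed / len(scales)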

@Boyiliee
Collaborator

We don't conduct experiments on Cityscapes. For semantic segmentation, we strictly follow the settings of DPT: https://github.com/isl-org/DPT. Please see that repository for more details. Hope this helps!

@chufengt
Author

Hi, @Boyiliee,

Thanks for your reply.

It seems that DPT did not release the training code or the detailed settings for semantic segmentation.

  1. How about the training time for ade20k mentioned above? Is it reasonable?
  2. Which scale range did you use for the ade20k evaluation? Is it [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]?

Thanks again.

@TB5z035

TB5z035 commented May 15, 2022


Hi!

Thanks for your work; it's really impressive. I would suggest documenting the 4th point above (about adding label files) in the README, and also raising an error or warning when args.dataset is not ade20k, since the dataset choice is hardcoded in the LSegModule class. This could save a few hours for anyone who wants to use your codebase on other datasets; e.g., something along the lines of the sketch below.
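
A minimal illustration of such a guard (hypothetical function and file names, not the repository's actual code):

import os

# Illustrative check, assuming each supported dataset ships a label file in label_files/:
# fail fast instead of silently falling back to the hardcoded ade20k choice.
def check_dataset_supported(dataset, label_dir="label_files"):
    candidates = [f for f in os.listdir(label_dir) if f.startswith(dataset)]
    if not candidates:
        raise ValueError(
            f"No label file for dataset '{dataset}' found in {label_dir}/; "
            "add one and extend get_labels() before training or evaluation."
        )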

Thanks again!
