Inference Time Issue #32

Closed

HYUNJS opened this issue Apr 17, 2024 · 2 comments

Comments


HYUNJS commented Apr 17, 2024

Appreciate your efforts in maintaining this project!

When I ran zero-shot VQA inference (generating results) on the MSRVTT dataset, it took 28 hours to finish (using 4 A5000 GPUs). I understand this is caused by the large number of video-question pairs (~70K), but have you addressed this by implementing a better dataloader? Or did you experiment on a small subset during development?

Also, a minor question: why is the zero2 setting used for fine-tuning and zero3 for pre-training? This is the reverse of LLaVA's setting, which uses zero2 for pre-training and zero3 for fine-tuning.

And may I ask about the memory consumption when fine-tuning the 7B model? Even a batch size of 1 does not fit on 4 A100 40GB GPUs. If you fine-tuned with LoRA, could you share the configuration you used (e.g., lora_r, lora_alpha, etc., and whether the same learning rate was used for the mm_projector)?

Thanks!


jpthu17 commented Apr 23, 2024

  • If GPU utilization is low during inference, the bottleneck is video loading. You can try partitioning the MSRVTT dataset into n subsets and processing them in parallel, allocating one GPU to each subset (see the partitioning sketch at the end of this comment). I suspect this will significantly accelerate inference.
  • We opted not to use zero3 due to a bug in DeepSpeed: training hangs when the processes are trained on different numbers of batches. With zero3 our code hangs very easily, because the lengths of images and videos vary greatly.
  • For training with LoRA, we use LLaVA's default settings (a rough mapping to a PEFT config is sketched after this list), that is
    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_weight_path: str = ""
    lora_bias: str = "none"
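
The listing above gives the raw hyperparameter values; as a minimal sketch, and only assuming a Hugging Face/PEFT-based training setup like LLaVA's, they would map onto a LoraConfig roughly as follows (target module names depend on the backbone and are omitted here):

    # Hedged sketch, not the repository's exact code.
    from peft import LoraConfig

    lora_config = LoraConfig(
        r=64,                # lora_r
        lora_alpha=16,       # lora_alpha
        lora_dropout=0.05,   # lora_dropout
        bias="none",         # lora_bias
        task_type="CAUSAL_LM",
        # target_modules=[...]  # the LLM's linear projection layers; backbone-specific
    )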
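
To illustrate the per-GPU partitioning suggested in the first bullet, here is a minimal sketch. The annotation path, script name, and flags (run_vqa_inference.py, --question-file, --answers-file) are placeholders rather than the repository's actual entry point; adapt them to your evaluation script.

    import json
    import math
    import os
    import subprocess

    NUM_GPUS = 4
    ANNOTATION_FILE = "msrvtt_qa_test.json"  # placeholder path to the ~70K QA pairs

    with open(ANNOTATION_FILE) as f:
        samples = json.load(f)

    # Split the QA pairs into one shard per GPU.
    shard_size = math.ceil(len(samples) / NUM_GPUS)
    procs = []
    for rank in range(NUM_GPUS):
        shard_path = f"msrvtt_qa_test.shard{rank}.json"
        with open(shard_path, "w") as f:
            json.dump(samples[rank * shard_size:(rank + 1) * shard_size], f)

        # Pin each worker to its own GPU and run the evaluation script on its shard.
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(rank)}
        procs.append(subprocess.Popen(
            ["python", "run_vqa_inference.py",
             "--question-file", shard_path,
             "--answers-file", f"answers.shard{rank}.jsonl"],
            env=env,
        ))

    for p in procs:
        p.wait()
    # Concatenate answers.shard*.jsonl afterwards to obtain the full result set.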


HYUNJS commented Apr 26, 2024

I see. Thank you for your answer!

HYUNJS closed this as completed Apr 26, 2024