Can't hear the audio #112
Comments
Thanks for your attention! Currently, our audio branch mainly focuses on understanding audio events and does not yet include speech recognition, so the model cannot identify the specific content of what a speaker says. Also, to run inference for audio-visual tasks, you should switch to the audio_visual branch (https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual) and clone the repository from there.
Thank you for your response. I have a few more questions.
First question: I have some video data that I want to fine-tune on, and in va_joint.sh I use --data_path ${DATA_DIR}/stage3_video_audio.json,${DATA_DIR}/stage2_audio_subset_new.json,${DATA_DIR}/stage2_video_subset.json. How should I design this? My understanding is that stage3_video_audio.json and stage2_audio_subset_new.json use the same set of videos, while ${DATA_DIR}/stage2_video_subset.json uses the audio from the videos.
Second question: I want to further train using VideoLLaMA2.1-7B-AV. How should I modify va_joint.sh? Additionally, what should I pay attention to during this process?
Is it possible to see the prompts you used in your paper? Looking forward to your response, and thank you again!
Thank you again for your response. Can I use only Thank you for taking the time to answer my question amidst your busy schedule!
You can fine-tune the model using only stage3_video_audio.json, like this: --data_path ${DATA_DIR}/stage3_video_audio.json. You can also pass --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-AV to continue training from VideoLLaMA2.1-7B-AV.
Thank you very much for your response. When executing bash scripts/custom/va_joint.sh, I encountered the following error:
# Environment Variables
ARG_WORLD_SIZE=${1:-1}
# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
echo "WORLD_SIZE: $WORLD_SIZE"
# Training Arguments
GLOBAL_BATCH_SIZE=128
# Log Arguments
export TRANSFORMERS_OFFLINE=1
I solved this by adding
Thank you for your response. I would like to ask what size GPU you used to get it running. I used 8 A100-40G GPUs, but I keep getting the following error:
[2024-10-27 17:04:36,988] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 6 (pid: 2829472) of binary: /opt/conda/envs/Videollama2/bin/python
In addition, I made the following adjustment:
GLOBAL_BATCH_SIZE=32
But it still cannot train properly.
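A note on reading that log line: a negative exit code means the worker was terminated by a signal, and 9 is SIGKILL. On Linux this is often the kernel's out-of-memory killer reacting to exhausted host RAM rather than a CUDA out-of-memory error, although that cause is an assumption here and not something the log confirms. The helper below is a hypothetical illustration (`describe_exitcode` is not part of VideoLLaMA2 or PyTorch) of how to decode such a code:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a worker exit code into a short description.

    Hypothetical helper for illustration only.
    """
    if exitcode < 0:
        # A negative code means the process was killed by signal number -exitcode.
        sig = signal.Signals(-exitcode)
        return f"killed by {sig.name}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-9))
```

If the signal is SIGKILL, checking host memory usage during data loading and model sharding may be a more useful next step than lowering the GPU batch size alone.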
I encountered the same issue. Does further fine-tuning of the
Has your problem been solved? I'm also having issues with OOM.
No... Without an official response and unsure how to solve it, I have temporarily put it aside.
I used 6 A100-80G GPUs and set the local batch size to 2, and it worked...
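For anyone adjusting these numbers, here is a minimal sketch of how the batch-size settings relate, assuming the script derives gradient accumulation steps as GLOBAL_BATCH_SIZE / (WORLD_SIZE × NPROC_PER_NODE × local batch size). That formula and the function name `grad_accum_steps` are assumptions for illustration; check them against your copy of va_joint.sh.

```python
def grad_accum_steps(global_batch: int, num_nodes: int,
                     gpus_per_node: int, local_batch: int) -> int:
    """Return the accumulation steps needed to reach global_batch.

    Assumes global_batch = num_nodes * gpus_per_node * local_batch * steps.
    """
    per_step = num_nodes * gpus_per_node * local_batch
    if global_batch % per_step != 0:
        # Flag combinations that cannot reach the requested global batch exactly.
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {per_step}"
        )
    return global_batch // per_step


# One node with 8 GPUs, local batch size 2, global batch size 32.
print(grad_accum_steps(32, 1, 8, 2))
```

Under this assumption, 6 GPUs with a local batch size of 2 do not divide a global batch size of 128 evenly, so the effective global batch may differ from the configured value if the script rounds the result.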
How did you solve this problem?
Hello, how did you modify the va_joint.sh script to fine-tune the AV model?
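import sys
sys.path.append('./')  # make the local videollama2 package importable
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    # Skip default weight initialization before loading the checkpoint
    disable_torch_init()


if __name__ == "__main__":
    inference()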
The output is: The person in the video spoke a few words, but they were not audible.
I input a video with sound, but it seems the model didn't pick it up. Is it because the audio branch isn't functioning properly? Also, I changed "mm_audio_tower" in VideoLLaMA2.1-7B-AV/config.json to the provided BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt. Is this the correct place to make the change? Thanks for your reply!
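One quick sanity check for the config edit described above is to confirm that the "mm_audio_tower" entry resolves to a file that actually exists. The sketch below only verifies the path; it does not confirm how VideoLLaMA2 loads the audio tower, and `check_audio_tower` is a hypothetical helper name, not part of the repository.

```python
import json
from pathlib import Path


def check_audio_tower(config_path: str) -> str:
    """Report whether the mm_audio_tower entry in config.json points to a file."""
    path = Path(config_path)
    if not path.is_file():
        return f"config not found: {config_path}"
    config = json.loads(path.read_text())
    tower = config.get("mm_audio_tower")
    if tower is None:
        return "mm_audio_tower is not set"
    if not Path(tower).is_file():
        return f"mm_audio_tower points to a missing file: {tower}"
    return f"mm_audio_tower found: {tower}"


if __name__ == "__main__":
    # The path is taken from the question above; adjust it to your local checkout.
    print(check_audio_tower("VideoLLaMA2.1-7B-AV/config.json"))
```

A relative checkpoint path is resolved against the current working directory here, so run the check from the same directory you launch inference from.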