Instructions for Training on HowTo100M

Please note that downloading all HowTo100M videos requires a large amount of storage. The features alone (2D/3D visual features, wav files, and spectrograms) use over 20 TB on our servers. The raw audio files take up the most space, about 15 TB. The raw visual frames would use additional space.

  1. Please follow the instructions in the README to set up the data folder. HowTo100M_1166_videopaths.txt contains the list of videos we used during training.

  2. Then download the HowTo100M videos (we downloaded them directly from YouTube - see the Appendix of our paper for more details) and extract the visual features (2D and 3D) using the following script https://github.com/antoine77340/video_feature_extractor. We stored the audio wav files separately.

  3. We re-sampled the audio wav files to 16 kHz and pre-computed spectrograms. We used the code in audio_to_spectrograms.py to save the spectrograms in npz format (in half-precision, to use less space). Please make sure your audio uses the same 16 kHz sampling rate.

  4. Our code assumes that all video features are in the same directory (for example, /data/parsed_videos/video_category/video_subfolder/video_id{_2d_npz,_3d.npz,.wav}). Please generate a csv/txt file that lists the paths to these features (the file is called HowTo100M_1166_videopaths.txt in our code and can be specified with --train_csv). Continuing the example above, each entry in this file is the part of the path after /data/parsed_videos/, without the file suffix. For example, our file looks like this:

path
hobbies_and_crafts/video_8/8N4mGHodjss
food_and_entertaining/video_j/jEyaTUyJ-rs
...
  5. Our code assumes that the audio is stored in a directory structure that matches that of the visual features. You can optionally specify a different root directory for the audio files using --features_path_audio.

  6. Then, the following commands will train the models. Please replace the values of --features_path='/your/path/to/features/here' and --features_path_audio='/your/path/to/audio_features/here' with the root directory of the features listed in your csv file. Continuing the example above, this would be --features_path='/data/parsed_videos/'.
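The path file described in step 4 can be generated by scanning the features root instead of being written by hand. The following is a minimal sketch that is not part of this repository: the function name `build_path_list` and its `marker_suffix` parameter are hypothetical, and it assumes every video has a file ending in `_3d.npz`, as in the example layout.

```python
import os


def build_path_list(features_root, out_file, marker_suffix="_3d.npz"):
    """Write a path file with one suffix-free, root-relative entry per video.

    marker_suffix is the file suffix used to detect a video. "_3d.npz" is
    an assumption taken from the example layout; adjust it if yours differs.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(features_root):
        for name in filenames:
            if name.endswith(marker_suffix):
                # Strip the suffix so only the video id remains.
                video_id = name[: -len(marker_suffix)]
                rel = os.path.relpath(os.path.join(dirpath, video_id), features_root)
                entries.append(rel.replace(os.sep, "/"))
    entries.sort()
    with open(out_file, "w") as f:
        f.write("path\n")  # header line, as in the example file
        f.writelines(entry + "\n" for entry in entries)
    return entries
```

For instance, `build_path_list("/data/parsed_videos/", "my_videopaths.txt")` would write a file that can then be passed with --train_csv (the output file name here is only an illustration).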

Train AVLnet on HowTo100M:

python train.py --num_thread_reader=74 --epochs=30 --batch_size=128 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --lr=0.001 --apex_level=1 --checkpoint_dir=model/AVLnet --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'

Train AVLnet-Text-Tri on HowTo100M (note that, unlike the AVLnet-Text-Fused command, this command does not pass --tri_modal_fuse=1):

python train.py --num_thread_reader=74 --epochs=15 --batch_size=256 --n_pair=32 --embd_dim=6144 --howto_audio_frames=800 --min_time=8.0 --random_audio_windows=0 --lr=0.00025 --tri_modal=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text_Tri --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'

Train AVLnet-Text-Fused on HowTo100M:

python train.py --num_thread_reader=74 --epochs=15 --batch_size=64 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --min_time=10.0 --random_audio_windows=0 --lr=0.0001 --tri_modal=1 --tri_modal_fuse=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
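Before launching any of the commands above, it can help to confirm that every entry in the path file has its feature files on disk and that the audio really is at 16 kHz. The sketch below is not part of this repository. The function names are hypothetical, the 2D suffix `_2d.npz` is an assumption (adjust `visual_suffixes` to match your files), and the sampling-rate check assumes PCM wav files that the standard-library `wave` module can read.

```python
import csv
import os
import wave


def find_missing(train_csv, features_path, features_path_audio=None,
                 visual_suffixes=("_2d.npz", "_3d.npz"), audio_suffix=".wav"):
    """Return the expected feature files that do not exist on disk."""
    # Audio defaults to the same root as the visual features.
    audio_root = features_path_audio or features_path
    missing = []
    with open(train_csv, newline="") as f:
        for row in csv.DictReader(f):  # expects a "path" header column
            rel = row["path"]
            for suffix in visual_suffixes:
                candidate = os.path.join(features_path, rel + suffix)
                if not os.path.exists(candidate):
                    missing.append(candidate)
            candidate = os.path.join(audio_root, rel + audio_suffix)
            if not os.path.exists(candidate):
                missing.append(candidate)
    return missing


def wav_sample_rate(wav_path):
    """Read the sampling rate from the header of a PCM wav file."""
    with wave.open(wav_path, "rb") as w:
        return w.getframerate()
```

An empty list from `find_missing` means all listed files were found; any wav file for which `wav_sample_rate` does not return 16000 would need to be re-sampled first.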

Notes

  • We used 2 V100 GPUs, each with 32 GB of memory, to train AVLnet and AVLnet-Text-Fused, and 4 V100 GPUs to train AVLnet-Text-Tri.
  • --num_thread_reader=74 specifies 74 data loading workers. We found this value worked the fastest on our machines with 80 CPU threads. This should be adjusted based on your machine.
  • If you do not have enough GPU memory, you can reduce some or all of --batch_size, --n_pair, and --howto_audio_frames; however, your performance may differ from ours.
  • The model weights will be saved every epoch to the directory specified by --checkpoint_dir. You can resume training from a checkpoint with the --pretrain_path flag (make sure to adjust the remaining number of epochs accordingly).
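As an illustration of the last note, the sketch below builds a resume command from an original argument list. It is not part of this repository: the function `resume_command` and the checkpoint file name in the usage example are hypothetical, and it assumes that --epochs should be set to the number of epochs still to run.

```python
def resume_command(base_args, checkpoint_file, total_epochs, completed_epochs):
    """Build a train.py command that resumes from a saved checkpoint."""
    remaining = total_epochs - completed_epochs
    if remaining <= 0:
        raise ValueError("training already reached the planned number of epochs")
    # Drop any earlier --epochs / --pretrain_path values before re-adding them.
    kept = [a for a in base_args
            if not a.startswith(("--epochs=", "--pretrain_path="))]
    return ["python", "train.py", *kept,
            f"--epochs={remaining}", f"--pretrain_path={checkpoint_file}"]
```

For example, after 12 of 30 planned epochs, `resume_command(["--epochs=30", "--batch_size=128"], "model/AVLnet/checkpoint.pth", 30, 12)` yields a command with `--epochs=18`; the checkpoint path shown is a placeholder for whichever file was written to your --checkpoint_dir.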