Please note that downloading all HowTo100M videos requires a large amount of storage. The features alone (2D/3D visual features, wav files, and spectrograms) use over 20 TB on our servers. The raw audio files take up the most space, about 15 TB. The raw visual frames would require additional space.
- Please check the instructions in the README to set up the data folder. `HowTo100M_1166_videopaths.txt` contains the list of videos we used during training.
- Then download the HowTo100M videos (we downloaded them directly from YouTube; see the Appendix of our paper for more details) and extract the 2D and 3D visual features using the script at https://github.com/antoine77340/video_feature_extractor. We stored the audio wav files separately.
- We re-sampled the audio wav files to 16 kHz and pre-computed spectrograms. We used the code in `audio_to_spectrograms.py` to save the spectrograms in npz format (we also used half precision to use less space). Please mind the sampling rate. A minimal sketch of this step follows.
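For reference, here is a rough sketch of that preprocessing step. The exact spectrogram type and parameters are defined in `audio_to_spectrograms.py` and may differ; the `librosa` calls and the log-mel representation below are illustrative assumptions, and only the 16 kHz resampling and the half-precision npz output follow the description above.

```python
# Minimal sketch of the audio preprocessing described above. The spectrogram
# type and parameters are illustrative assumptions; see
# audio_to_spectrograms.py for the settings actually used.
import librosa
import numpy as np

def wav_to_spectrogram_npz(wav_path, npz_path, sr=16000):
    # librosa resamples to the requested rate (16 kHz) on load
    audio, _ = librosa.load(wav_path, sr=sr)
    # a log-mel spectrogram, as one common choice of representation
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)
    log_mel = librosa.power_to_db(mel)
    # half precision halves the on-disk footprint
    np.savez(npz_path, spectrogram=log_mel.astype(np.float16))
```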
- Our code assumes that all video features are in the same directory, e.g. `/data/parsed_videos/video_category/video_subfolder/video_id{_2d.npz,_3d.npz,.wav}`. Please generate a csv/txt file that lists the paths to these features (the file is called `HowTo100M_1166_videopaths.txt` in our code and can be specified with `--train_csv`). Continuing the example above, this file would list everything after `/data/parsed_videos/`. For example, our file looks like this (a sketch of a script that generates it follows):

```
path
hobbies_and_crafts/video_8/8N4mGHodjss
food_and_entertaining/video_j/jEyaTUyJ-rs
...
```
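One way to generate this file is sketched below. The root directory is a hypothetical example, and keying on the `_2d.npz` files to enumerate clips is just one convenient choice, not the procedure we necessarily used.

```python
# Sketch: build HowTo100M_1166_videopaths.txt by walking the feature root.
# The root path here is a hypothetical example; enumerating the *_2d.npz
# files and stripping the suffix is one simple way to list each clip once.
from pathlib import Path

root = Path("/data/parsed_videos")
paths = sorted(
    str(p.relative_to(root))[: -len("_2d.npz")]
    for p in root.glob("*/*/*_2d.npz")
)
with open("HowTo100M_1166_videopaths.txt", "w") as f:
    f.write("path\n")          # header line, matching the example above
    f.writelines(p + "\n" for p in paths)
```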
- Our code assumes that the audio is stored in a directory structure that matches that of the visual features. You can optionally specify a different root directory for the audio files with `--features_path_audio`. The snippet below illustrates how an entry from the path file combines with the two roots.
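For illustration only: the roots below are hypothetical examples, and the variable names mirror the `--features_path` / `--features_path_audio` flags rather than the repo's internal loader code.

```python
# How one entry from HowTo100M_1166_videopaths.txt maps to feature files
# under the two roots. Both root paths are hypothetical examples.
import os

rel = "hobbies_and_crafts/video_8/8N4mGHodjss"   # one line from the path file
features_path = "/data/parsed_videos"            # --features_path
features_path_audio = "/data/parsed_audio"       # --features_path_audio (optional)

video_2d = os.path.join(features_path, rel + "_2d.npz")
video_3d = os.path.join(features_path, rel + "_3d.npz")
audio_wav = os.path.join(features_path_audio, rel + ".wav")
print(video_2d, video_3d, audio_wav, sep="\n")
```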
- Then, the following commands will train the models. Please set `--features_path='/your/path/to/features/here'` and `--features_path_audio='/your/path/to/audio_features/here'` to the root directory of the features in your csv file. Continuing the example above, this would be `--features_path='/data/parsed_videos/'`.
Train AVLnet on HowTo100M:

```
python train.py --num_thread_reader=74 --epochs=30 --batch_size=128 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --lr=0.001 --apex_level=1 --checkpoint_dir=model/AVLnet --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
Train AVLnet-Text-Tri on HowTo100M (note that `--tri_modal_fuse=1` is removed in this command):

```
python train.py --num_thread_reader=74 --epochs=15 --batch_size=256 --n_pair=32 --embd_dim=6144 --howto_audio_frames=800 --min_time=8.0 --random_audio_windows=0 --lr=0.00025 --tri_modal=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text_Tri --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
Train AVLnet-Text-Fused on HowTo100M:

```
python train.py --num_thread_reader=74 --epochs=15 --batch_size=64 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --min_time=10.0 --random_audio_windows=0 --lr=0.0001 --tri_modal=1 --tri_modal_fuse=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
- We used 2 V100 GPUs, each with 32 GB of memory, to train AVLnet and AVLnet-Text-Fused, and 4 V100 GPUs to train AVLnet-Text-Tri.
- `--num_thread_reader=74` specifies 74 data-loading workers. We found this value worked the fastest on our machines with 80 CPU threads; it should be adjusted based on your machine.
- If you do not have enough GPU memory, you can reduce some or all of `--batch_size`, `--n_pair`, and `--howto_audio_frames`; however, your performance may differ from ours.
- The model weights will be saved every epoch to the directory specified by `--checkpoint_dir`. You can resume training from a checkpoint with the `--pretrain_path` flag (make sure to adjust the remaining number of epochs accordingly).