Please note that downloading all HowTo100M videos requires a large amount of storage. The features alone (2D/3D visual features, wav files, and spectrograms) use over 20 TB on our servers. The raw audio files take up the most space, about 15 TB. The raw visual frames would require additional space.
- Please check the instructions in the README to set up the data folder. `HowTo100M_1166_videopaths.txt` contains the list of videos we used during training.
- Then download the HowTo100M videos (we downloaded them directly from YouTube; see the Appendix of our paper for more details) and extract the 2D and 3D visual features using the script at https://github.com/antoine77340/video_feature_extractor. We stored the audio wav files separately.
- We re-sampled the audio wav files to 16 kHz and pre-computed spectrograms. We used the code in `audio_to_spectrograms.py` to save the spectrograms in npz format (we also used half precision to use less space). Please mind the sampling rate. A minimal sketch of this step follows.
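For reference, here is a rough sketch of that preprocessing step. The exact spectrogram type and parameters are defined in `audio_to_spectrograms.py` and may differ; the `librosa` calls and the log-mel representation below are illustrative assumptions, and only the 16 kHz resampling and the half-precision npz output follow the description above.

```python
# Minimal sketch of the audio preprocessing described above. The spectrogram
# type and parameters are illustrative assumptions; see
# audio_to_spectrograms.py for the settings actually used.
import librosa
import numpy as np

def wav_to_spectrogram_npz(wav_path, npz_path, sr=16000):
    # librosa resamples to the requested rate (16 kHz) on load
    audio, _ = librosa.load(wav_path, sr=sr)
    # a log-mel spectrogram, as one common choice of representation
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)
    log_mel = librosa.power_to_db(mel)
    # half precision halves the on-disk footprint
    np.savez(npz_path, spectrogram=log_mel.astype(np.float16))
```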
- Our code assumes that all video features are in the same directory, e.g. `/data/parsed_videos/video_category/video_subfolder/video_id{_2d.npz,_3d.npz,.wav}`. Please generate a csv/txt file that lists the paths to these features (the file is called `HowTo100M_1166_videopaths.txt` in our code and can be specified with `--train_csv`). Continuing the example above, this file would list everything after `/data/parsed_videos/`. For example, our file looks like this (a sketch of a script that generates it follows):

```
path
hobbies_and_crafts/video_8/8N4mGHodjss
food_and_entertaining/video_j/jEyaTUyJ-rs
...
```
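One way to generate this file is sketched below. The root directory is a hypothetical example, and keying on the `_2d.npz` files to enumerate clips is just one convenient choice, not the procedure we necessarily used.

```python
# Sketch: build HowTo100M_1166_videopaths.txt by walking the feature root.
# The root path here is a hypothetical example; enumerating the *_2d.npz
# files and stripping the suffix is one simple way to list each clip once.
from pathlib import Path

root = Path("/data/parsed_videos")
paths = sorted(
    str(p.relative_to(root))[: -len("_2d.npz")]
    for p in root.glob("*/*/*_2d.npz")
)
with open("HowTo100M_1166_videopaths.txt", "w") as f:
    f.write("path\n")          # header line, matching the example above
    f.writelines(p + "\n" for p in paths)
```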
- Our code assumes that the audio is stored in a directory structure that matches that of the visual features. You can optionally specify a different root directory for the audio files with `--features_path_audio`. The snippet below illustrates how an entry from the path file combines with the two roots.
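For illustration only: the roots below are hypothetical examples, and the variable names mirror the `--features_path` / `--features_path_audio` flags rather than the repo's internal loader code.

```python
# How one entry from HowTo100M_1166_videopaths.txt maps to feature files
# under the two roots. Both root paths are hypothetical examples.
import os

rel = "hobbies_and_crafts/video_8/8N4mGHodjss"   # one line from the path file
features_path = "/data/parsed_videos"            # --features_path
features_path_audio = "/data/parsed_audio"       # --features_path_audio (optional)

video_2d = os.path.join(features_path, rel + "_2d.npz")
video_3d = os.path.join(features_path, rel + "_3d.npz")
audio_wav = os.path.join(features_path_audio, rel + ".wav")
print(video_2d, video_3d, audio_wav, sep="\n")
```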
- Then, the following commands will train the models. Please set `--features_path='/your/path/to/features/here'` and `--features_path_audio='/your/path/to/audio_features/here'` to the root directory of the features in your csv file. Continuing the example above, this would be `--features_path='/data/parsed_videos/'`.
Train AVLnet on HowTo100M:

```
python train.py --num_thread_reader=74 --epochs=30 --batch_size=128 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --lr=0.001 --apex_level=1 --checkpoint_dir=model/AVLnet --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
Train AVLnet-Text-Tri on HowTo100M (note that `--tri_modal_fuse=1` is removed in this command):

```
python train.py --num_thread_reader=74 --epochs=15 --batch_size=256 --n_pair=32 --embd_dim=6144 --howto_audio_frames=800 --min_time=8.0 --random_audio_windows=0 --lr=0.00025 --tri_modal=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text_Tri --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
Train AVLnet-Text-Fused on HowTo100M:

```
python train.py --num_thread_reader=74 --epochs=15 --batch_size=64 --n_pair=32 --embd_dim=4096 --howto_audio_frames=1000 --min_time=10.0 --random_audio_windows=0 --lr=0.0001 --tri_modal=1 --tri_modal_fuse=1 --apex_level=1 --checkpoint_dir=model/AVLnet_Text --features_path='/your/path/to/features/here' --features_path_audio='/your/path/to/audio_features/here'
```
- We used 2 V100 GPUs, each with 32 GB of memory, to train AVLnet and AVLnet-Text-Fused, and 4 V100 GPUs to train AVLnet-Text-Tri.
- `--num_thread_reader=74` specifies 74 data-loading workers. We found this value worked the fastest on our machines with 80 CPU threads; it should be adjusted based on your machine.
- If you do not have enough GPU memory, you can reduce some or all of `--batch_size`, `--n_pair`, and `--howto_audio_frames`; however, your performance may differ from ours.
- The model weights will be saved every epoch to the directory specified by `--checkpoint_dir`. You can resume training from a checkpoint with the `--pretrain_path` flag (make sure to adjust the remaining number of epochs accordingly).