In this repo, we demonstrate that the FVD implementation from the StyleGAN-V paper is equivalent to the original one when the videos are already loaded into memory and resized to the required resolution. The main difference between our FVD evaluation protocol and the one in the original paper is that we strictly specify how data should be processed, how clips should be sampled, etc.
Disclaimer: this repo is only for verifying that our PyTorch FVD implementation is identical to the TensorFlow one. If you want to compute FVD on your own videos, please use the src/scripts/calc_metrics_for_dataset.py script in the main repo.
The problem with the original implementation is that it does not handle:
- data processing: in which format videos are stored (JPG/PNG directories of frames, MP4, etc.), how frames are resized, normalized, etc.
- clip sampling strategy: how clips are selected (from the beginning of the video or randomly, at which framerate, how many clips per video, etc.)
- sample size: how many fake and real videos should be used
That's why every project computes FVD in its own way, which leads to a lot of discrepancies. In StyleGAN-V, we seek to establish a unified evaluation pipeline.
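To illustrate how much freedom the clip sampling strategy alone leaves, here is a minimal sketch of two common choices (first frames vs. a random offset). The function name, parameters and defaults are hypothetical and are not taken from the StyleGAN-V code:

```python
import random


def sample_clip_indices(num_frames, clip_len=16, stride=1, strategy="first", rng=None):
    """Return the frame indices of one clip (hypothetical helper, for illustration)."""
    span = (clip_len - 1) * stride + 1  # number of source frames the clip covers
    if num_frames < span:
        raise ValueError("video is too short for the requested clip")
    if strategy == "first":
        start = 0  # always take the clip from the beginning of the video
    elif strategy == "random":
        if rng is None:
            rng = random.Random()
        start = rng.randint(0, num_frames - span)  # both bounds are inclusive
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [start + i * stride for i in range(clip_len)]


first_clip = sample_clip_indices(num_frames=64, strategy="first")
random_clip = sample_clip_indices(
    num_frames=64, stride=2, strategy="random", rng=random.Random(0)
)
print(first_clip)
print(random_clip)
```

Two projects that differ only in `strategy` or `stride` feed different frames to the feature extractor, so their FVD scores are not directly comparable.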
Also, the original snippet is implemented in TensorFlow v1, whose final release was published on January 6, 2021 (i.e. more than a year ago) and which has not been updated since then: https://github.com/tensorflow/tensorflow/releases/tag/v1.15.5
We provide two comparisons:
- Comparison between tf.hub's I3D model and our torchscript port, to demonstrate that our port is a precise copy (up to numerical precision) of the tf.hub one.
- Comparison between the FVD metrics themselves. It is done by generating two dummy datasets of 256 videos each with two different random seeds.
We put all the dependencies used into requirements.txt.
You can install them by running:
pip install -r requirements.txt
To compare the models between each other (in terms of L2 distance of their output), run:
python compare_models.py
In our case, it gives the output:
L_2 difference is 0.00026316816207043225
This means that both models perform equivalent operations (note that even two equivalent convolutional layers in TF and PyTorch would produce slightly different outputs due to numerical precision).
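For intuition about what such a number measures, here is a minimal NumPy sketch of an L2 difference between two sets of model outputs. It is not the code of compare_models.py: the function name, the array shape and the simulated noise are assumptions made for illustration.

```python
import numpy as np


def l2_difference(feats_a, feats_b):
    """L2 norm of the element-wise difference between two output arrays."""
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("outputs must have the same shape")
    # For a 2D input, the default norm is the Frobenius norm,
    # i.e. the L2 norm of the flattened difference.
    return float(np.linalg.norm(a - b))


rng = np.random.RandomState(0)
# Hypothetical batch of 8 output vectors with 400 entries each.
feats = rng.randn(8, 400).astype(np.float32)
# Simulate tiny float32-level deviations such as those between two frameworks.
noisy = feats + np.float32(1e-6) * rng.randn(8, 400).astype(np.float32)

print("L_2 difference is", l2_difference(feats, noisy))
```

A value that is tiny relative to the magnitude of the outputs indicates that the two models differ only by floating-point rounding.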
To compare the metrics between each other, run:
python compare_models.py
On our machine, this gives the output:
[FVD scores] Theirs: 10.13808536529541. Ours: 10.138084766713924
So, the difference is about 6e-7, which is negligible.
Note: computing FVD with TensorFlow's implementation might take time since it uses the exact square root. In our case, we use a very accurate approximate square root.
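For context on where the square root appears: FVD is the Fréchet distance between two Gaussians fitted to the features of real and fake videos, and it contains the term Tr((Σ_r Σ_f)^(1/2)). Below is a minimal NumPy sketch of that distance on dummy features. It is neither the TensorFlow code nor our implementation; the function name, the feature dimension and the eigenvalue-based trace are assumptions chosen to keep the example self-contained.

```python
import numpy as np


def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets (illustrative sketch)."""
    mu_r = feats_real.mean(axis=0)
    mu_f = feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    # The trace of the square root of sigma_r @ sigma_f equals the sum of the
    # square roots of its eigenvalues, which are real and non-negative for
    # covariance matrices. Clipping guards against tiny negative rounding errors.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_f) - 2.0 * trace_sqrt)


# Two dummy feature sets of 256 samples each, drawn with different random seeds.
# The feature dimension (8) and the mean shift (0.5) are arbitrary choices.
feats_a = np.random.RandomState(0).randn(256, 8)
feats_b = np.random.RandomState(1).randn(256, 8) + 0.5

print("same set:      ", frechet_distance(feats_a, feats_a))
print("different sets:", frechet_distance(feats_a, feats_b))
```

The distance of a feature set to itself is zero up to rounding, while the shifted set yields a clearly positive value.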