@scotts commented on Oct 18, 2025

Adds a benchmark comparing the runtime performance of:

  • Applying a decoder-native transform in TorchCodec.
  • Decoding an unchanged frame with TorchCodec and applying the equivalent TorchVision v2 transform.

Initially, I wanted to extend the existing benchmark_decoders_library.py, but it got too awkward for a variety of reasons. Instead, I took inspiration from what @NicolasHug implemented for benchmark_audio_decoders.py.
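
For context, a minimal sketch of the baseline path being timed, assuming the TorchVision side works roughly like this (decode unchanged frames with TorchCodec, then resize with TorchVision v2; the decoder-native path instead requests the resize from the decoder itself, and the exact setup lives in benchmark_transforms.py):

import torch
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

# Sketch of the TorchVision v2 baseline: decode unchanged frames, then resize.
# The file name and the 1% sampling are illustrative, not the benchmark's exact values.
decoder = VideoDecoder("mandelbrot.mp4")
num_frames = decoder.metadata.num_frames
indices = torch.linspace(0, num_frames - 1, steps=num_frames // 100).round().long().tolist()

frames = decoder.get_frames_at(indices)            # batched decode, uint8 FrameBatch
resized = v2.Resize(size=(540, 960))(frames.data)  # equivalent TorchVision v2 transform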

For the results below, I'm using a video generated with:

ffmpeg -y -f lavfi -i "mandelbrot=s=1920x1080" -t 120 -c:v libopenh264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot.mp4

This produces a video with the metadata:
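
(In the metadata dump below, `mandelbrot` is presumed to be a TorchCodec VideoDecoder opened on the generated file, e.g.:)

from torchcodec.decoders import VideoDecoder

# Hypothetical setup for the >>> snippet below.
mandelbrot = VideoDecoder("mandelbrot.mp4")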

>>> mandelbrot.metadata
VideoStreamMetadata:
  duration_seconds_from_header: 120.0
  begin_stream_seconds_from_header: 0.0
  bit_rate: 9846643.0
  codec: h264
  stream_index: 0
  begin_stream_seconds_from_content: None
  end_stream_seconds_from_content: None
  width: 1920
  height: 1080
  num_frames_from_header: 7200
  num_frames_from_content: None
  average_fps_from_header: 60.0
  pixel_aspect_ratio: 1
  duration_seconds: 120.0
  begin_stream_seconds: 0
  end_stream_seconds: 120.0
  num_frames: 7200
  average_fps: 60.0

Raw results from running the benchmark:

[scottas@devvm24339 torchcodec] time python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 3127.76, mean = 3130.26 +- 67.88, min = 3052.72, max = 3207.07 - in ms
decoder_native_resize((540, 960))             med = 2859.82, mean = 2846.57 +- 48.54, min = 2796.28, max = 2910.01 - in ms

torchvision_crop((540, 960), 270, 480)        med = 2918.80, mean = 2914.02 +- 38.80, min = 2866.04, max = 2963.62 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 2868.21, mean = 2877.06 +- 65.62, min = 2789.38, max = 2956.73 - in ms

torchvision_resize((270, 480))                med = 3130.23, mean = 3119.50 +- 35.71, min = 3065.44, max = 3161.76 - in ms
decoder_native_resize((270, 480))             med = 2892.15, mean = 2884.34 +- 45.83, min = 2823.83, max = 2941.37 - in ms

torchvision_crop((270, 480), 405, 720)        med = 3021.95, mean = 3025.29 +- 54.71, min = 2960.59, max = 3111.12 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 3012.73, mean = 3002.23 +- 41.48, min = 2935.82, max = 3046.63 - in ms

torchvision_resize((135, 240))                med = 3115.35, mean = 3127.73 +- 40.55, min = 3097.86, max = 3199.07 - in ms
decoder_native_resize((135, 240))             med = 2998.91, mean = 2984.52 +- 51.88, min = 2926.61, max = 3052.63 - in ms

torchvision_crop((135, 240), 472, 840)        med = 2927.12, mean = 2961.23 +- 58.39, min = 2905.33, max = 3032.24 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 2887.55, mean = 2889.95 +- 26.76, min = 2854.67, max = 2929.89 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 4010.83, mean = 4035.35 +- 92.37, min = 3914.23, max = 4143.24 - in ms
decoder_native_resize((540, 960))             med = 3653.65, mean = 3633.58 +- 51.07, min = 3548.28, max = 3674.40 - in ms

torchvision_crop((540, 960), 270, 480)        med = 3583.87, mean = 3580.71 +- 18.79, min = 3551.56, max = 3603.99 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 3629.21, mean = 3642.67 +- 98.56, min = 3502.33, max = 3754.71 - in ms

torchvision_resize((270, 480))                med = 3903.53, mean = 3884.51 +- 78.43, min = 3755.74, max = 3969.27 - in ms
decoder_native_resize((270, 480))             med = 3492.23, mean = 3468.99 +- 48.85, min = 3386.13, max = 3504.75 - in ms

torchvision_crop((270, 480), 405, 720)        med = 3634.28, mean = 3640.66 +- 27.02, min = 3604.47, max = 3671.69 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 3577.07, mean = 3597.95 +- 98.52, min = 3476.51, max = 3722.25 - in ms

torchvision_resize((135, 240))                med = 3819.38, mean = 3855.92 +- 86.06, min = 3768.17, max = 3989.23 - in ms
decoder_native_resize((135, 240))             med = 3609.79, mean = 3590.03 +- 86.38, min = 3468.03, max = 3674.19 - in ms

torchvision_crop((135, 240), 472, 840)        med = 3680.36, mean = 3674.32 +- 57.59, min = 3589.65, max = 3738.58 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 3652.02, mean = 3623.59 +- 74.67, min = 3502.72, max = 3688.00 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 6893.08, mean = 6930.55 +- 103.00, min = 6814.68, max = 7072.31 - in ms
decoder_native_resize((540, 960))             med = 5037.04, mean = 5035.49 +- 27.90, min = 5003.21, max = 5068.74 - in ms

torchvision_crop((540, 960), 270, 480)        med = 5031.53, mean = 5076.52 +- 77.11, min = 5020.56, max = 5195.28 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 4395.75, mean = 4438.98 +- 114.75, min = 4332.14, max = 4593.07 - in ms

torchvision_resize((270, 480))                med = 6250.84, mean = 6257.17 +- 87.29, min = 6142.59, max = 6378.98 - in ms
decoder_native_resize((270, 480))             med = 4417.45, mean = 4526.08 +- 227.19, min = 4328.54, max = 4795.60 - in ms

torchvision_crop((270, 480), 405, 720)        med = 4967.48, mean = 4973.30 +- 54.71, min = 4896.89, max = 5038.94 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 4375.91, mean = 4389.28 +- 48.45, min = 4355.25, max = 4474.55 - in ms

torchvision_resize((135, 240))                med = 6218.58, mean = 6225.40 +- 62.10, min = 6168.38, max = 6329.95 - in ms
decoder_native_resize((135, 240))             med = 4351.39, mean = 4385.69 +- 58.66, min = 4337.63, max = 4453.15 - in ms

torchvision_crop((135, 240), 472, 840)        med = 4968.77, mean = 4991.85 +- 61.76, min = 4922.92, max = 5075.04 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 4397.08, mean = 4392.68 +- 52.71, min = 4336.45, max = 4470.58 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 9193.76, mean = 9161.19 +- 183.73, min = 8925.95, max = 9335.10 - in ms
decoder_native_resize((540, 960))             med = 5452.89, mean = 5438.05 +- 54.90, min = 5347.82, max = 5492.74 - in ms

torchvision_crop((540, 960), 270, 480)        med = 5482.16, mean = 5484.13 +- 36.88, min = 5431.92, max = 5535.65 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 4708.52, mean = 4731.11 +- 57.57, min = 4692.12, max = 4829.76 - in ms

torchvision_resize((270, 480))                med = 8153.42, mean = 8129.92 +- 72.84, min = 8014.58, max = 8210.66 - in ms
decoder_native_resize((270, 480))             med = 4802.11, mean = 4835.30 +- 143.86, min = 4682.88, max = 5073.51 - in ms

torchvision_crop((270, 480), 405, 720)        med = 5440.87, mean = 5455.93 +- 131.09, min = 5323.52, max = 5663.85 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 4592.39, mean = 4617.36 +- 72.49, min = 4550.59, max = 4740.84 - in ms

torchvision_resize((135, 240))                med = 7760.04, mean = 7791.98 +- 113.75, min = 7696.63, max = 7988.92 - in ms
decoder_native_resize((135, 240))             med = 4803.85, mean = 4793.58 +- 89.88, min = 4695.10, max = 4916.40 - in ms

torchvision_crop((135, 240), 472, 840)        med = 5454.57, mean = 5446.10 +- 106.46, min = 5308.42, max = 5599.37 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 4611.31, mean = 4614.23 +- 20.13, min = 4591.76, max = 4637.14 - in ms

Some conclusions:

  1. The benefit of decoder-native transforms grows with the number of frames decoded from a video.
  2. When the number of sampled frames is small (a few tens), the performance difference is minor.
  3. The relative improvement appears stable across output dimensions.
  4. Resize sees a larger improvement than crop.
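
For reference, a minimal sketch of how per-configuration statistics like those above (median, mean, std, min, max over --num-exp runs) could be computed; the actual benchmark script may structure this differently:

import statistics
import time

# Hypothetical timing helper mirroring the med/mean/+-/min/max columns above (in ms).
def bench(fn, num_exp=5):
    times_ms = []
    for _ in range(num_exp):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "med": statistics.median(times_ms),
        "mean": statistics.mean(times_ms),
        "std": statistics.stdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }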

A review comment was left on this snippet from the benchmark script:
input_height = metadata.height
input_width = metadata.width
fraction_of_total_frames_to_sample = [0.005, 0.01, 0.05, 0.1]
fraction_of_input_dimensions = [0.5, 0.25, 0.125]
Non-blocking note: it's worth benchmarking upsampling too (fractions > 1) to get a more complete view. Models typically expect one given size, but images in a dataset can be either larger (and need to be downsampled) or smaller (and need to be upsampled).
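
For example, a hypothetical extension of the sweep shown above:

# Fractions above 1 upsample the 1920x1080 input; values are illustrative.
fraction_of_input_dimensions = [2.0, 1.5, 0.5, 0.25, 0.125]
output_sizes = [(int(1080 * f), int(1920 * f)) for f in fraction_of_input_dimensions]
# -> [(2160, 3840), (1620, 2880), (540, 960), (270, 480), (135, 240)]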
