@scotts commented on Oct 18, 2025

Adds a benchmark comparing the runtime performance of:

  • Applying a decoder-native transform in TorchCodec.
  • Decoding an unchanged frame with TorchCodec and applying the equivalent TorchVision v2 transform.

Initially, I wanted to extend the existing benchmark_decoders_library.py, but it got too awkward for a variety of reasons. Instead, I took inspiration from what @NicolasHug implemented for benchmark_audio_decoders.py.
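
For context, a minimal sketch of the baseline path being timed, assuming the TorchVision side works roughly like this (decode unchanged frames with TorchCodec, then resize with TorchVision v2; the decoder-native path instead requests the resize from the decoder itself, and the exact setup lives in benchmark_transforms.py):

import torch
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

# Sketch of the TorchVision v2 baseline: decode unchanged frames, then resize.
# The file name and the 1% sampling are illustrative, not the benchmark's exact values.
decoder = VideoDecoder("mandelbrot.mp4")
num_frames = decoder.metadata.num_frames
indices = torch.linspace(0, num_frames - 1, steps=num_frames // 100).round().long().tolist()

frames = decoder.get_frames_at(indices)            # batched decode, uint8 FrameBatch
resized = v2.Resize(size=(540, 960))(frames.data)  # equivalent TorchVision v2 transform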

For the results below, I'm using a video generated with:

ffmpeg -y -f lavfi -i "mandelbrot=s=1920x1080" -t 120 -c:v libopenh264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot.mp4

This produces a video with the metadata:
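
(In the metadata dump below, `mandelbrot` is presumed to be a TorchCodec VideoDecoder opened on the generated file, e.g.:)

from torchcodec.decoders import VideoDecoder

# Hypothetical setup for the >>> snippet below.
mandelbrot = VideoDecoder("mandelbrot.mp4")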

>>> mandelbrot.metadata
VideoStreamMetadata:
  duration_seconds_from_header: 120.0
  begin_stream_seconds_from_header: 0.0
  bit_rate: 9846643.0
  codec: h264
  stream_index: 0
  begin_stream_seconds_from_content: None
  end_stream_seconds_from_content: None
  width: 1920
  height: 1080
  num_frames_from_header: 7200
  num_frames_from_content: None
  average_fps_from_header: 60.0
  pixel_aspect_ratio: 1
  duration_seconds: 120.0
  begin_stream_seconds: 0
  end_stream_seconds: 120.0
  num_frames: 7200
  average_fps: 60.0

Raw results from running the benchmark:

[scottas@devvm24339 torchcodec] time python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 3127.76, mean = 3130.26 +- 67.88, min = 3052.72, max = 3207.07 - in ms
decoder_native_resize((540, 960))             med = 2859.82, mean = 2846.57 +- 48.54, min = 2796.28, max = 2910.01 - in ms

torchvision_crop((540, 960), 270, 480)        med = 2918.80, mean = 2914.02 +- 38.80, min = 2866.04, max = 2963.62 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 2868.21, mean = 2877.06 +- 65.62, min = 2789.38, max = 2956.73 - in ms

torchvision_resize((270, 480))                med = 3130.23, mean = 3119.50 +- 35.71, min = 3065.44, max = 3161.76 - in ms
decoder_native_resize((270, 480))             med = 2892.15, mean = 2884.34 +- 45.83, min = 2823.83, max = 2941.37 - in ms

torchvision_crop((270, 480), 405, 720)        med = 3021.95, mean = 3025.29 +- 54.71, min = 2960.59, max = 3111.12 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 3012.73, mean = 3002.23 +- 41.48, min = 2935.82, max = 3046.63 - in ms

torchvision_resize((135, 240))                med = 3115.35, mean = 3127.73 +- 40.55, min = 3097.86, max = 3199.07 - in ms
decoder_native_resize((135, 240))             med = 2998.91, mean = 2984.52 +- 51.88, min = 2926.61, max = 3052.63 - in ms

torchvision_crop((135, 240), 472, 840)        med = 2927.12, mean = 2961.23 +- 58.39, min = 2905.33, max = 3032.24 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 2887.55, mean = 2889.95 +- 26.76, min = 2854.67, max = 2929.89 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 4010.83, mean = 4035.35 +- 92.37, min = 3914.23, max = 4143.24 - in ms
decoder_native_resize((540, 960))             med = 3653.65, mean = 3633.58 +- 51.07, min = 3548.28, max = 3674.40 - in ms

torchvision_crop((540, 960), 270, 480)        med = 3583.87, mean = 3580.71 +- 18.79, min = 3551.56, max = 3603.99 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 3629.21, mean = 3642.67 +- 98.56, min = 3502.33, max = 3754.71 - in ms

torchvision_resize((270, 480))                med = 3903.53, mean = 3884.51 +- 78.43, min = 3755.74, max = 3969.27 - in ms
decoder_native_resize((270, 480))             med = 3492.23, mean = 3468.99 +- 48.85, min = 3386.13, max = 3504.75 - in ms

torchvision_crop((270, 480), 405, 720)        med = 3634.28, mean = 3640.66 +- 27.02, min = 3604.47, max = 3671.69 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 3577.07, mean = 3597.95 +- 98.52, min = 3476.51, max = 3722.25 - in ms

torchvision_resize((135, 240))                med = 3819.38, mean = 3855.92 +- 86.06, min = 3768.17, max = 3989.23 - in ms
decoder_native_resize((135, 240))             med = 3609.79, mean = 3590.03 +- 86.38, min = 3468.03, max = 3674.19 - in ms

torchvision_crop((135, 240), 472, 840)        med = 3680.36, mean = 3674.32 +- 57.59, min = 3589.65, max = 3738.58 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 3652.02, mean = 3623.59 +- 74.67, min = 3502.72, max = 3688.00 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 6893.08, mean = 6930.55 +- 103.00, min = 6814.68, max = 7072.31 - in ms
decoder_native_resize((540, 960))             med = 5037.04, mean = 5035.49 +- 27.90, min = 5003.21, max = 5068.74 - in ms

torchvision_crop((540, 960), 270, 480)        med = 5031.53, mean = 5076.52 +- 77.11, min = 5020.56, max = 5195.28 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 4395.75, mean = 4438.98 +- 114.75, min = 4332.14, max = 4593.07 - in ms

torchvision_resize((270, 480))                med = 6250.84, mean = 6257.17 +- 87.29, min = 6142.59, max = 6378.98 - in ms
decoder_native_resize((270, 480))             med = 4417.45, mean = 4526.08 +- 227.19, min = 4328.54, max = 4795.60 - in ms

torchvision_crop((270, 480), 405, 720)        med = 4967.48, mean = 4973.30 +- 54.71, min = 4896.89, max = 5038.94 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 4375.91, mean = 4389.28 +- 48.45, min = 4355.25, max = 4474.55 - in ms

torchvision_resize((135, 240))                med = 6218.58, mean = 6225.40 +- 62.10, min = 6168.38, max = 6329.95 - in ms
decoder_native_resize((135, 240))             med = 4351.39, mean = 4385.69 +- 58.66, min = 4337.63, max = 4453.15 - in ms

torchvision_crop((135, 240), 472, 840)        med = 4968.77, mean = 4991.85 +- 61.76, min = 4922.92, max = 5075.04 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 4397.08, mean = 4392.68 +- 52.71, min = 4336.45, max = 4470.58 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 9193.76, mean = 9161.19 +- 183.73, min = 8925.95, max = 9335.10 - in ms
decoder_native_resize((540, 960))             med = 5452.89, mean = 5438.05 +- 54.90, min = 5347.82, max = 5492.74 - in ms

torchvision_crop((540, 960), 270, 480)        med = 5482.16, mean = 5484.13 +- 36.88, min = 5431.92, max = 5535.65 - in ms
decoder_native_crop((540, 960), 270, 480)     med = 4708.52, mean = 4731.11 +- 57.57, min = 4692.12, max = 4829.76 - in ms

torchvision_resize((270, 480))                med = 8153.42, mean = 8129.92 +- 72.84, min = 8014.58, max = 8210.66 - in ms
decoder_native_resize((270, 480))             med = 4802.11, mean = 4835.30 +- 143.86, min = 4682.88, max = 5073.51 - in ms

torchvision_crop((270, 480), 405, 720)        med = 5440.87, mean = 5455.93 +- 131.09, min = 5323.52, max = 5663.85 - in ms
decoder_native_crop((270, 480), 405, 720)     med = 4592.39, mean = 4617.36 +- 72.49, min = 4550.59, max = 4740.84 - in ms

torchvision_resize((135, 240))                med = 7760.04, mean = 7791.98 +- 113.75, min = 7696.63, max = 7988.92 - in ms
decoder_native_resize((135, 240))             med = 4803.85, mean = 4793.58 +- 89.88, min = 4695.10, max = 4916.40 - in ms

torchvision_crop((135, 240), 472, 840)        med = 5454.57, mean = 5446.10 +- 106.46, min = 5308.42, max = 5599.37 - in ms
decoder_native_crop((135, 240), 472, 840)     med = 4611.31, mean = 4614.23 +- 20.13, min = 4591.76, max = 4637.14 - in ms

Some conclusions:

  1. The benefit of decoder-native transforms grows with the number of frames decoded from a video.
  2. When the number of sampled frames is small (a few tens), the performance difference is minor.
  3. The relative improvement appears stable across output dimensions.
  4. Resize sees a larger improvement than crop.
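
For reference, a minimal sketch of how per-configuration statistics like those above (median, mean, std, min, max over --num-exp runs) could be computed; the actual benchmark script may structure this differently:

import statistics
import time

# Hypothetical timing helper mirroring the med/mean/+-/min/max columns above (in ms).
def bench(fn, num_exp=5):
    times_ms = []
    for _ in range(num_exp):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "med": statistics.median(times_ms),
        "mean": statistics.mean(times_ms),
        "std": statistics.stdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }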

A review comment was left on this snippet from the benchmark script:
input_height = metadata.height
input_width = metadata.width
fraction_of_total_frames_to_sample = [0.005, 0.01, 0.05, 0.1]
fraction_of_input_dimensions = [0.5, 0.25, 0.125]
Non-blocking note: it's worth benchmarking upsampling too (fractions > 1) to get a more complete view. Models typically expect one given size, but images in a dataset can be either larger (and need to be downsampled) or smaller (and need to be upsampled).
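
For example, a hypothetical extension of the sweep shown above:

# Fractions above 1 upsample the 1920x1080 input; values are illustrative.
fraction_of_input_dimensions = [2.0, 1.5, 0.5, 0.25, 0.125]
output_sizes = [(int(1080 * f), int(1920 * f)) for f in fraction_of_input_dimensions]
# -> [(2160, 3840), (1620, 2880), (540, 960), (270, 480), (135, 240)]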
