
Conversation

@weiji14 weiji14 commented Nov 11, 2025

Benchmark reading the LZW-compressed GeoTIFF into CUDA GPU memory via DLPack, using cog3pio's CudaCogReader, which uses bindings to the nvTIFF library.

Steps

  1. Install nvTIFF, then patch the nvtiff.h header using:
    sed --in-place "s/memLimit=0/memLimit/g" /usr/include/nvtiff.h
    sed --in-place "s/stream=0/stream/g" /usr/include/nvtiff.h
    sed --in-place "s/nvtiffTagDataType type/enum nvtiffTagDataType type/g" /usr/include/nvtiff.h
  2. Run `cargo bench --features cuda`

Results

Ran on my laptop with an NVIDIA RTX A2000 8GB Laptop GPU. CPU benchmarks ran on a 12th Gen Intel® Core™ i5-12600HX with 16 threads. Note that the CPU benches include the host-to-device memory copy too, so they should be slightly slower than in #3.

    Finished `bench` profile [optimized] target(s) in 2.40s
     Running benches/read_geotiff.rs (target/release/deps/read_geotiff-3ec348ddc581de8f)

Warning: It is not recommended to reduce nresamples below 1000.
Benchmarking read_geotiff/0_nvTIFF_GPU/Sentinel-2 TCI: Warming up for 1.0000 ms
...
read_geotiff/0_nvTIFF_GPU/Sentinel-2 TCI
                        time:   [164.30 ms 164.68 ms 164.71 ms]
                        thrpt:  [1.6527 GB/s 1.6529 GB/s 1.6567 GB/s]
                 change:
                        time:   [+0.4804% +0.7430% +0.8175%] (p = 1.00 > 0.05)
                        thrpt:  [-0.8109% -0.7375% -0.4781%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking read_geotiff/1.1_gdal_CPU_threads=16/Sentinel-2 TCI: Warming up for 1.0000 ms
Warning: Unable to complete 30 samples in 2.0s. You may wish to increase target time to 8.1s, or reduce sample count to 10.
read_geotiff/1.1_gdal_CPU_threads=16/Sentinel-2 TCI
                        time:   [277.16 ms 277.23 ms 277.95 ms]
                        thrpt:  [979.32 MB/s 981.88 MB/s 982.12 MB/s]
                 change:
                        time:   [-2.8661% -2.0534% -1.6517%] (p = 0.00 < 0.05)
                        thrpt:  [+1.6794% +2.0964% +2.9507%]
                        Performance has improved.
Found 4 outliers among 30 measurements (13.33%)
  1 (3.33%) low mild
  2 (6.67%) high mild
  1 (3.33%) high severe
Benchmarking read_geotiff/2.1_async-tiff_CPU_threads=16/Sentinel-2 TCI: Warming up for 1.0000 ms
Warning: Unable to complete 30 samples in 2.0s. You may wish to increase target time to 15.3s, or reduce sample count to 10.
read_geotiff/2.1_async-tiff_CPU_threads=16/Sentinel-2 TCI
                        time:   [472.70 ms 473.92 ms 475.37 ms]
                        thrpt:  [572.62 MB/s 574.37 MB/s 575.85 MB/s]
                 change:
                        time:   [-2.4830% -2.2161% -1.3742%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3934% +2.2663% +2.5463%]
                        Performance has improved.
Found 3 outliers among 30 measurements (10.00%)
  2 (6.67%) high mild
  1 (3.33%) high severe

nvTIFF is nice and fast. LiberTIFF is holding up nicely, with negligible (<10 ms) overhead from the CUDA transfer. Not sure what's going on with async-tiff; the host-to-device overhead seems larger, almost 140 ms (maybe some copying is happening in the Bytes -> u8 -> DLPack conversion?).
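
The copying suspicion above can be sketched with std-only stand-ins (hypothetical types, not async-tiff's actual code): collecting an iterator of bytes allocates a fresh buffer and copies every byte, whereas a borrowed slice view is zero-copy, so a DLPack capsule built over the original buffer's pointer would avoid the extra pass.

```rust
fn main() {
    // Stand-in for a large decoded buffer (e.g. what a `Bytes` would hold).
    let decoded: Vec<u8> = vec![42u8; 1 << 20];

    // Copying path: collecting the iterator allocates a new buffer.
    let copied: Vec<u8> = decoded.iter().copied().collect();
    assert_ne!(decoded.as_ptr(), copied.as_ptr()); // different allocation

    // Zero-copy path: a slice view borrows the same memory.
    let view: &[u8] = &decoded;
    assert_eq!(decoded.as_ptr(), view.as_ptr()); // same allocation
}
```
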

TODO

  • Initial implementation
  • Add host to device copy code on LiberTIFF and async-tiff benchmarks, gated by 'cuda' feature flag
  • Extra documentation

Specifically, a version with the experimental `CudaCogReader` struct that can read a GeoTIFF into CUDA memory via DLPack. Also added the `bytes` dependency, and upgraded `dlpark` to a newer version with the CudaView impl.
@weiji14 weiji14 self-assigned this Nov 11, 2025
@codspeed-hq

codspeed-hq bot commented Nov 11, 2025

CodSpeed Performance Report

Merging #4 will degrade performances by 23.05%

Comparing bench/cudacogreader (81202a8) with main (43809ea)

Summary

❌ 1 regression
✅ 1 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| `2_async-tiff_CPU_threads=4[Sentinel-2 TCI]` | 12.4 s | 16.1 s | -23.05% |

When the 'cuda' feature flag is enabled, copy decoded bytes from host (CPU) to device (GPU) to allow a fair comparison with the nvTIFF benchmark, where the data resides in CUDA memory. Well, not exactly fair since nvTIFF is winning, but this is still needed.

Note that async-tiff's decoded byte length seems longer than expected, not sure why... Added some extra docs and links to the main README.md too.
No need for separate `.flat_map` and `.map`; apparently `Bytes` can be iterated over as `u8` directly. Still need to figure out whether there's a more efficient way of multi-threaded decoding to raw bytes though.
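
The simplification in that commit can be sketched with `Vec<Vec<u8>>` standing in for the per-tile `Bytes` buffers (stand-in types, not async-tiff's exact code): flattening directly yields the same bytes as the two-step `.flat_map` + `.map`.

```rust
fn main() {
    // Stand-in for per-tile decoded buffers (the real code holds `Bytes`).
    let tiles: Vec<Vec<u8>> = vec![vec![1, 2, 3], vec![4, 5, 6]];

    // Before: flat_map over each buffer, then map to copy out each byte.
    let a: Vec<u8> = tiles.iter().flat_map(|t| t.iter().map(|&b| b)).collect();

    // After: each buffer already iterates over u8, so flatten directly.
    let b: Vec<u8> = tiles.into_iter().flatten().collect();

    assert_eq!(a, b);
}
```
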
@weiji14 weiji14 marked this pull request as ready for review November 12, 2025 04:49
@weiji14
Copy link
Owner Author

weiji14 commented Nov 12, 2025

Might do some extra optimizations later if I have time, merging for now.

@weiji14 weiji14 merged commit 14d9b27 into main Nov 12, 2025
2 of 3 checks passed
@weiji14 weiji14 deleted the bench/cudacogreader branch November 12, 2025 04:50