
Conversation

@weiji14 weiji14 commented Nov 11, 2025

Benchmark reading the LZW-compressed GeoTIFF into CUDA GPU memory via DLPack, using cog3pio's CudaCogReader, which uses bindings to the nvTIFF library.

Steps

  1. Install nvTIFF, then patch the nvtiff.h header using:
    sed --in-place "s/memLimit=0/memLimit/g" /usr/include/nvtiff.h
    sed --in-place "s/stream=0/stream/g" /usr/include/nvtiff.h
    sed --in-place "s/nvtiffTagDataType type/enum nvtiffTagDataType type/g" /usr/include/nvtiff.h
  2. Run `cargo bench --features cuda`

Results

Ran on my laptop with an NVIDIA RTX A2000 8GB Laptop GPU. CPU benchmarks ran on a 12th Gen Intel® Core™ i5-12600HX with 16 threads. Note that the CPU benches include the host-to-device memory copy too, so they should be slightly slower than in #3.

    Finished `bench` profile [optimized] target(s) in 2.40s
     Running benches/read_geotiff.rs (target/release/deps/read_geotiff-3ec348ddc581de8f)

Warning: It is not recommended to reduce nresamples below 1000.
Benchmarking read_geotiff/0_nvTIFF_GPU/Sentinel-2 TCI: Warming up for 1.0000 ms
...
read_geotiff/0_nvTIFF_GPU/Sentinel-2 TCI
                        time:   [164.30 ms 164.68 ms 164.71 ms]
                        thrpt:  [1.6527 GB/s 1.6529 GB/s 1.6567 GB/s]
                 change:
                        time:   [+0.4804% +0.7430% +0.8175%] (p = 1.00 > 0.05)
                        thrpt:  [-0.8109% -0.7375% -0.4781%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking read_geotiff/1.1_gdal_CPU_threads=16/Sentinel-2 TCI: Warming up for 1.0000 ms
Warning: Unable to complete 30 samples in 2.0s. You may wish to increase target time to 8.1s, or reduce sample count to 10.
read_geotiff/1.1_gdal_CPU_threads=16/Sentinel-2 TCI
                        time:   [277.16 ms 277.23 ms 277.95 ms]
                        thrpt:  [979.32 MB/s 981.88 MB/s 982.12 MB/s]
                 change:
                        time:   [-2.8661% -2.0534% -1.6517%] (p = 0.00 < 0.05)
                        thrpt:  [+1.6794% +2.0964% +2.9507%]
                        Performance has improved.
Found 4 outliers among 30 measurements (13.33%)
  1 (3.33%) low mild
  2 (6.67%) high mild
  1 (3.33%) high severe
Benchmarking read_geotiff/2.1_async-tiff_CPU_threads=16/Sentinel-2 TCI: Warming up for 1.0000 ms
Warning: Unable to complete 30 samples in 2.0s. You may wish to increase target time to 15.3s, or reduce sample count to 10.
read_geotiff/2.1_async-tiff_CPU_threads=16/Sentinel-2 TCI
                        time:   [472.70 ms 473.92 ms 475.37 ms]
                        thrpt:  [572.62 MB/s 574.37 MB/s 575.85 MB/s]
                 change:
                        time:   [-2.4830% -2.2161% -1.3742%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3934% +2.2663% +2.5463%]
                        Performance has improved.
Found 3 outliers among 30 measurements (10.00%)
  2 (6.67%) high mild
  1 (3.33%) high severe

nvTIFF is nice and fast. LiberTIFF is holding up nicely, with negligible (<10 ms) overhead from the CUDA transfer. Not sure what's going on with async-tiff; the host-to-device overhead seems larger, almost 140 ms (maybe some copying is happening in the Bytes -> u8 -> DLPack conversion?).
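
The copying suspicion above can be sketched with std-only stand-ins (hypothetical types, not async-tiff's actual code): collecting an iterator of bytes allocates a fresh buffer and copies every byte, whereas a borrowed slice view is zero-copy, so a DLPack capsule built over the original buffer's pointer would avoid the extra pass.

```rust
fn main() {
    // Stand-in for a large decoded buffer (e.g. what a `Bytes` would hold).
    let decoded: Vec<u8> = vec![42u8; 1 << 20];

    // Copying path: collecting the iterator allocates a new buffer.
    let copied: Vec<u8> = decoded.iter().copied().collect();
    assert_ne!(decoded.as_ptr(), copied.as_ptr()); // different allocation

    // Zero-copy path: a slice view borrows the same memory.
    let view: &[u8] = &decoded;
    assert_eq!(decoded.as_ptr(), view.as_ptr()); // same allocation
}
```
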

TODO

  • Initial implementation
  • Add host to device copy code on LiberTIFF and async-tiff benchmarks, gated by 'cuda' feature flag
  • Extra documentation

Specifically, a version with the experimental `CudaCogReader` struct that can read a GeoTIFF into CUDA memory via DLPack. Also added the `bytes` dependency, and upgraded `dlpark` to a newer version with the CudaView impl.
@weiji14 weiji14 self-assigned this Nov 11, 2025
@codspeed-hq

codspeed-hq bot commented Nov 11, 2025

CodSpeed Performance Report

Merging #4 will degrade performances by 23.05%

Comparing bench/cudacogreader (81202a8) with main (43809ea)

Summary

❌ 1 regression
✅ 1 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| `2_async-tiff_CPU_threads=4[Sentinel-2 TCI]` | 12.4 s | 16.1 s | -23.05% |

When the 'cuda' feature flag is enabled, copy decoded bytes from host (CPU) to device (GPU) to allow a fair comparison with the nvTIFF benchmark, where the data resides in CUDA memory. Well, not exactly fair since nvTIFF is winning, but this is still needed.

Note that async-tiff's decoded byte length seems longer than expected, not sure why... Added some extra docs and links to the main README.md too.
No need for separate `.flat_map` and `.map`; apparently `Bytes` can be iterated over as `u8` directly. Still need to figure out whether there's a more efficient way of multi-threaded decoding to raw bytes though.
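
The simplification in that commit can be sketched with `Vec<Vec<u8>>` standing in for the per-tile `Bytes` buffers (stand-in types, not async-tiff's exact code): flattening directly yields the same bytes as the two-step `.flat_map` + `.map`.

```rust
fn main() {
    // Stand-in for per-tile decoded buffers (the real code holds `Bytes`).
    let tiles: Vec<Vec<u8>> = vec![vec![1, 2, 3], vec![4, 5, 6]];

    // Before: flat_map over each buffer, then map to copy out each byte.
    let a: Vec<u8> = tiles.iter().flat_map(|t| t.iter().map(|&b| b)).collect();

    // After: each buffer already iterates over u8, so flatten directly.
    let b: Vec<u8> = tiles.into_iter().flatten().collect();

    assert_eq!(a, b);
}
```
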
@weiji14 weiji14 marked this pull request as ready for review November 12, 2025 04:49
@weiji14
Copy link
Owner Author

weiji14 commented Nov 12, 2025

Might do some extra optimizations later if I have time, merging for now.

@weiji14 weiji14 merged commit 14d9b27 into main Nov 12, 2025
2 of 3 checks passed
@weiji14 weiji14 deleted the bench/cudacogreader branch November 12, 2025 04:50