Explore timing/speed issues when using tiles that match block size vs our own spec #82
I'm curious to see what the difference in speed is for breaking a raster into different tile sizes. Like if we have a big raster and break it into 10 tiles, but the blocksize is more like 4 tiles, is there any appreciable difference between using 10 tiles vs 4?

Comments
I'm unfamiliar with the term "blocksize" - can you elaborate? I assume there is a "sweet spot" because there is going to be overhead for making the tiles and for doing computations on them (marshaling/unmarshaling, etc.). I also assume it will depend on the number of workers and RAM. Providing some kind of rough recommendation to users would be great, though. Doing some benchmarking in a vignette or pkgdown article might be a good way to do this.
My understanding of blocksize is that it's like the default tile size. When you start to read a raster from file into memory, if your query falls within a single block, only that block is read. But if, for example, you had a really wide raster and wanted to read just the first 2 rows but all columns, then (depending on how blocksize is specified) you could end up reading in all blocks just to get a few rows out. Like you said, it's a bit tricky to find the sweet spot, since it depends on RAM/workers/read-write speed, and probably other things. I believe this is what the zarr project https://zarr.dev/ is about - handling that kind of metadata properly.

I like your idea of providing some recommendations to users - one thing that could be handy would be producing a schema/plot from https://github.com/hypertidy/grout that shows your raster and the blocksize that is currently specified. This could give you a general sense of how many blocks might be needed, or whether you might want to use some integer multiple of the blocksize. Or maybe there are just two blocks - it's not always easy to know!

Totally agree that some benchmarking in a vignette or article would be a nice way to demonstrate this to users. My inkling is that it might only really matter for large rasters, but I think we'll only really know once we write it down. It's also slightly complicated by the fact that GDAL will cache your read when you initially read in a set of blocks. So if it takes, say, 20 seconds to do a first wide pass reading just a few rows from all blocks, subsequent reads in the same session will be 10x (or more) faster because those blocks are now in memory.
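To make that "wide, shallow read" point concrete, here's a minimal sketch using the terra package - the file name `big_raster.tif` is hypothetical - that inspects a file's internal blocksize and times a read of just the first couple of rows:

```r
# Minimal sketch, assuming the 'terra' package and a hypothetical
# large file "big_raster.tif".
library(terra)

r <- rast("big_raster.tif")   # hypothetical large raster
fileBlocksize(r)              # rows/cols of the file's internal blocks

# Reading just the first 2 rows (all columns) still touches every
# block that intersects them - for some layouts, that's a block from
# every column of blocks across the whole raster.
readStart(r)
system.time(
  vals <- readValues(r, row = 1, nrows = 2)
)
readStop(r)
```

Because of the GDAL block cache mentioned above, running that read a second time in the same session will be much faster, so timings like this should come from a fresh session.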
I'll try to work on a vignette specifically about tiling and dynamic branching soon, and explore some benchmarking in it.
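As a rough starting point for that benchmarking, something like the sketch below could compare a 10-tile split against a 4-tile split. It assumes the terra and bench packages, the same hypothetical `big_raster.tif`, and a made-up `read_in_tiles()` helper that splits the extent into an n_x-by-n_y grid and forces a read of each piece:

```r
# Rough benchmarking sketch - assumes 'terra' and 'bench'; the file name
# and the read_in_tiles() helper are made up for illustration.
library(terra)
library(bench)

# Crop the raster into an n_x-by-n_y grid of extents and force a read
# of each tile's values.
read_in_tiles <- function(r, n_x, n_y) {
  e <- ext(r)
  xs <- seq(xmin(e), xmax(e), length.out = n_x + 1)
  ys <- seq(ymin(e), ymax(e), length.out = n_y + 1)
  for (i in seq_len(n_x)) {
    for (j in seq_len(n_y)) {
      tile <- ext(xs[i], xs[i + 1], ys[j], ys[j + 1])
      invisible(values(crop(r, tile)))  # force the read into memory
    }
  }
}

r <- rast("big_raster.tif")  # hypothetical large raster

# NB: GDAL's block cache (see above) means later iterations hit a warm
# cache, so treat these numbers as indicative only.
bench::mark(
  tiles_10 = read_in_tiles(r, 5, 2),
  tiles_4  = read_in_tiles(r, 2, 2),
  check = FALSE
)
```

Whether a split that matches the file's blocksize wins by enough to matter is exactly the kind of thing the vignette could show.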
Blocked by #99