Explore timing/speed issues when using tiles that match block size vs our own spec #82
I'm curious to see what the difference in speed is for breaking a raster into different tile sizes. Like if we have a big raster and break it into 10 tiles, but the blocksize is more like 4 tiles, is there any appreciable difference between using 10 tiles vs 4?

Comments
I'm unfamiliar with the term "blocksize" - can you elaborate? I assume there is a "sweet spot" because there is going to be overhead for making the tiles and for doing computations on them (marshaling/unmarshaling, etc.). I also assume it will depend on the number of workers and RAM. Providing some kind of rough recommendation to users would be great, though. Doing some benchmarking in a vignette or pkgdown article might be a good way to do this.
My understanding of blocksize is that it's like the default tile size. When you start to read a raster from file into memory, if your query falls within a single block, only that block is read. But if, for example, you had a really wide raster and wanted to read just the first 2 rows but all columns, then (depending on how blocksize is specified) you could end up reading in all blocks just to get a few rows out. Like you said, it's a bit tricky to find the sweet spot, since it depends on RAM/workers/read-write speed, and probably other things. I believe this is what the zarr project https://zarr.dev/ is about - handling that kind of metadata properly.

I like your idea of providing some recommendations to users - one thing that could be handy would be producing a schema/plot from https://github.com/hypertidy/grout that shows your raster and the blocksize that is currently specified. This could give you a general sense of how many blocks might be needed, or whether you might want to use some integer multiple of the blocksize. Or maybe there are just two blocks - it's not always easy to know!

Totally agree that some benchmarking in a vignette or article would be a nice way to demonstrate this to users. My inkling is that it might only really matter for large rasters, but I think we'll only really know once we write it down. It's also slightly complicated by the fact that GDAL will cache your read when you initially read in a set of blocks. So if it takes, say, 20 seconds to do a first wide pass reading just a few rows from all blocks, subsequent reads in the same session will be 10x (or more) faster because those blocks are now in memory.
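To make that "wide, shallow read" point concrete, here's a minimal sketch using the terra package - the file name `big_raster.tif` is hypothetical - that inspects a file's internal blocksize and times a read of just the first couple of rows:

```r
# Minimal sketch, assuming the 'terra' package and a hypothetical
# large file "big_raster.tif".
library(terra)

r <- rast("big_raster.tif")   # hypothetical large raster
fileBlocksize(r)              # rows/cols of the file's internal blocks

# Reading just the first 2 rows (all columns) still touches every
# block that intersects them - for some layouts, that's a block from
# every column of blocks across the whole raster.
readStart(r)
system.time(
  vals <- readValues(r, row = 1, nrows = 2)
)
readStop(r)
```

Because of the GDAL block cache mentioned above, running that read a second time in the same session will be much faster, so timings like this should come from a fresh session.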
I'll try to work on a vignette specifically about tiling and dynamic branching soon, and explore some benchmarking in it.
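As a rough starting point for that benchmarking, something like the sketch below could compare a 10-tile split against a 4-tile split. It assumes the terra and bench packages, the same hypothetical `big_raster.tif`, and a made-up `read_in_tiles()` helper that splits the extent into an n_x-by-n_y grid and forces a read of each piece:

```r
# Rough benchmarking sketch - assumes 'terra' and 'bench'; the file name
# and the read_in_tiles() helper are made up for illustration.
library(terra)
library(bench)

# Crop the raster into an n_x-by-n_y grid of extents and force a read
# of each tile's values.
read_in_tiles <- function(r, n_x, n_y) {
  e <- ext(r)
  xs <- seq(xmin(e), xmax(e), length.out = n_x + 1)
  ys <- seq(ymin(e), ymax(e), length.out = n_y + 1)
  for (i in seq_len(n_x)) {
    for (j in seq_len(n_y)) {
      tile <- ext(xs[i], xs[i + 1], ys[j], ys[j + 1])
      invisible(values(crop(r, tile)))  # force the read into memory
    }
  }
}

r <- rast("big_raster.tif")  # hypothetical large raster

# NB: GDAL's block cache (see above) means later iterations hit a warm
# cache, so treat these numbers as indicative only.
bench::mark(
  tiles_10 = read_in_tiles(r, 5, 2),
  tiles_4  = read_in_tiles(r, 2, 2),
  check = FALSE
)
```

Whether a split that matches the file's blocksize wins by enough to matter is exactly the kind of thing the vignette could show.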
Blocked by #99