speeding up windowed writes #439
Comments
This is definitely an interesting use case. I am curious why you need to use I think in your scenario, being able to have a lock on a per machine basis would be your best solution. That way each machine can write to the file on its drive safely. Not sure if such a lock exists though.
Well, I'm still testing, but I was premature to say it could do arbitrarily large arrays. Unfortunately the GDAL COG driver seems to ignore GDAL_CACHEMAX and crashes when it runs out of memory. On the other hand, the GTiff driver works fine with large arrays, and with 512x512 internal blocks it's not hard to turn the result into a COG afterwards. I just wrote a compressed 47GB file on a machine with only 32GB of RAM without the issues that the COG driver had. With the modifications in the OP, windowed writes seem to be faster than non-windowed ones; however, with dask I find it hard to isolate everything going on in the background, so I can't be sure the difference is real. I've got to admit I've not looked at the lock option much, as I can't see how it can help me if the workers can't all see the same file, unless I'm writing a whole collection of separate files, in which case I would probably look at using zarr in an S3 bucket instead.
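The comment above does not say how the GTiff was turned into a COG, so the following is only a sketch of one way the two-step route could look. It assumes `rioxarray` forwards extra keyword arguments to the rasterio profile, that `rasterio.shutil.copy` is available, and that the installed GDAL includes the COG driver. The helper names `tiled_gtiff_options` and `gtiff_to_cog`, the `deflate` compression, and `BIGTIFF="YES"` are illustrative choices, not taken from the thread.

```python
def tiled_gtiff_options(block_size=512, compress="deflate"):
    """Creation options for a tiled GTiff (hypothetical helper).

    The 512x512 block size mirrors the comment above; compress and
    BIGTIFF are assumptions added for large compressed outputs.
    """
    return {
        "driver": "GTiff",
        "tiled": True,
        "blockxsize": block_size,
        "blockysize": block_size,
        "compress": compress,
        "BIGTIFF": "YES",
    }


def gtiff_to_cog(src_path, dst_path, compress="deflate"):
    """Copy an existing GTiff into a COG (hypothetical helper).

    rasterio is imported lazily so the options helper above can be
    used without it. Requires a GDAL build with the COG driver.
    """
    from rasterio.shutil import copy as rio_copy

    rio_copy(src_path, dst_path, driver="COG", compress=compress)


# Possible usage, assuming `da` is a dask-backed DataArray:
#   da.rio.to_raster("tmp.tif", windowed=True, **tiled_gtiff_options())
#   gtiff_to_cog("tmp.tif", "out_cog.tif")
```

The conversion step reads the finished GTiff from disk, so it needs the temporary file to fit on local storage alongside the output.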
With #410, that issue arises because, when writing with a lock, the file is written in windows in a separate session for each chunk, on this line: rioxarray/rioxarray/raster_writer.py, Line 195 in 199c051
I'm guessing the COG overview creation is only triggered when the file is first populated, not on subsequent modification when it is opened as r+. I don't see the issue that was shown in #410 when doing windowed writes to a COG in a single thread.
https://gdal.org/drivers/raster/cog.html I am wondering if it works with dask if you use:
I've been exploring some refinements
That makes sense to me as a solution for writing dask chunks to disk when processing using multiple machines that don't share the same file system. I think this would be a good mechanism for writing to a raster using dask when a lock isn't provided.
I've been writing COGs using the windowed option:
da.rio.to_raster(...... driver="COG",windowed=True)
It works fine but is rather slow.
My array is on a set of distributed dask workers that don't share a file system, so as far as I can tell using a lock isn't an option.
It looks to me like it is slow because the windows being written are very small, as they match the COG internal tiling.
I did a bit of testing and, for instance, if you use extra options such as:
the writing is much faster, approaching the speed of the non-windowed write, and makes better use of the available RAM and CPU for compression. Obviously it's no longer a COG though.
However, if I change the raster_writer code with the following, I get a COG and I get fast writes, and as far as I've tested it works for arbitrarily large arrays.
Basically I'm decoupling the block_windows used for writing from the internal block_windows of the COG. Obviously it is done in a slightly hacky way, and it would need to be refined so that the windows are suited to the RAM available on the system.
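The modified raster_writer code itself is not reproduced here, so the following is only a sketch of the general idea: generate write windows that are whole multiples of the internal block size, and pick the multiple from a memory budget. The helper names, the `(col_off, row_off, width, height)` tuple layout, and the budget heuristic are all assumptions for illustration, not the code from this issue.

```python
import math
from typing import Iterator, Tuple

# (col_off, row_off, width, height); plain tuples keep the sketch
# free of dependencies. A rasterio Window takes the same four values.
Window = Tuple[int, int, int, int]


def write_windows(width: int, height: int, block_size: int = 512,
                  blocks_per_window: int = 8) -> Iterator[Window]:
    """Yield write windows aligned to the internal block grid.

    Each window spans blocks_per_window internal blocks per side,
    so write windows are decoupled from the file's own tiling.
    Edge windows are clipped to the raster extent.
    """
    step = block_size * blocks_per_window
    for row_off in range(0, height, step):
        win_h = min(step, height - row_off)
        for col_off in range(0, width, step):
            win_w = min(step, width - col_off)
            yield (col_off, row_off, win_w, win_h)


def blocks_per_window_for_budget(budget_bytes: int, bands: int,
                                 itemsize: int,
                                 block_size: int = 512) -> int:
    """Largest square window (in blocks per side) under a byte budget.

    Hypothetical heuristic: counts only the uncompressed array held
    for one window, not compression buffers or dask overhead.
    """
    bytes_per_block = bands * itemsize * block_size * block_size
    return max(1, math.isqrt(budget_bytes // bytes_per_block))
```

For a 10240x10240 raster, windows matching 512x512 blocks give 400 writes, while `blocks_per_window=8` gives 9, which is the kind of reduction the post describes.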
I think that, if refined enough, there is no reason not to make this the standard way of writing (no need for a windowed=False option). It would also be interesting to see how it performs compared to using a lock.
Thanks