Generalization of compression for spatial targets with GDAL #37

Open
njtierney opened this issue Mar 16, 2024 · 10 comments

@njtierney
Owner

njtierney commented Mar 16, 2024

Just pulling this from #4 as I'm not sure we captured this as an issue?

Generalization of the "multiple file target compression" GDAL /vsizip/ approach to all backends and formats that support it

From @brownag

The /vsizip/ GDAL virtual file system functionality used in format_shapefile() is an example of something that can be generalized further with a focus on generic GDAL data source paths. I think the idea of being able to compress files that are in the target store (and keep them compressed) is attractive for spatial data which can be quite large--even if targets are not comprised of multiple files.

Since GDAL can read from the compressed target store efficiently, you get the benefit of a smaller file size footprint while also being able to read the file without fully extracting it.
Also should consider some of the other archive file formats/virtual file system types, and providing interfaces in R to produce them, e.g. /vsigzip/ or /vsitar/ analogs to /vsizip/ + utils::zip().
Even without creating specific compressed archive files, there should be robust tools available for controlling the GDAL file compression options, supported by many drivers, that are used to write target objects.
The ZIP approach is useful for GeoTIFF files where category information is stored in the .tif.aux.xml sidecar file. Convenience methods for terra SpatRaster objects could automatically store a target as a ZIP file (and give warnings about target naming) if the input SpatRaster is categorical and output format is GeoTIFF.
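
A minimal sketch of the bundling idea above, written in Python with only the standard library so it runs without GDAL installed. The helper name `bundle_as_vsizip` and the `roads.*` file names are hypothetical and are not part of geotargets; the only GDAL-specific piece is the `/vsizip/<archive>/<member>` path convention, and the placeholder files written in the demo are not valid shapefiles.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_as_vsizip(files, archive, main_file):
    """Zip `files` into `archive`; return a /vsizip/ path pointing at `main_file`."""
    files = [Path(f) for f in files]
    archive = Path(archive)
    names = [f.name for f in files]
    main_name = Path(main_file).name
    if main_name not in names:
        raise ValueError(f"{main_name} is not among the files being bundled")
    # Store members flat (no directories) so the path inside the archive
    # is just the file name.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f, name in zip(files, names):
            zf.write(f, arcname=name)
    # GDAL convention: /vsizip/<path to archive>/<path inside archive>.
    return f"/vsizip/{archive.as_posix()}/{main_name}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        parts = []
        for ext in ("shp", "shx", "dbf", "prj"):
            part = tmp / f"roads.{ext}"
            part.write_bytes(b"placeholder")  # not real shapefile content
            parts.append(part)
        print(bundle_as_vsizip(parts, tmp / "roads.zip", "roads.shp"))
```

The returned string is what would be handed to a GDAL-based reader in place of a plain file path. Whether that reader also picks up sidecar files from inside the archive depends on the GDAL version and driver.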

@brownag
Contributor

brownag commented Mar 16, 2024

Thanks for splitting this out, I wanted to make one after closing #4 but didn't get to it yet.

I have been tinkering with some implementations for this issue and will have a draft PR in the not too distant future.

@Aariq
Collaborator

Aariq commented Apr 17, 2024

Is there a reason to not just zip outputs of all GDAL drivers, even ones that are a single file? Are there downsides to using /vsizip/? E.g. is it not available in some instances?

@mdsumner

mdsumner commented Apr 18, 2024

Having the extra zip layer is a bit weird for formats that are both single-file and include internal compression. And, there's the zip layer to read through so it's less efficient. Note that GDAL added SOZip capability, which cloud-i-fied storing file/s within zip and made it very fast (not all zips will be as efficient). I don't think you'd want logic to determine if a GeoTIFF is not compressed to pivot on, even that has some explosion of option combinations. I think these kinds of choices are out of scope for this project (but very keen to discuss).

Its support is GDAL and build dependent, so on CRAN you are at the behest currently of the Windows maintainer's efforts, mostly guided by Roger Bivand in the past, and similarly for Mac, and then the binary installers that align to linux builds. That's probably a good level to track to specify 1) version/s and 2) capabilities to make some boundaries.

There are a lot of other subtleties too, because files like GeoTIFF and GeoPackage could have sidecar files (that's how GDAL supports categorical rasters via Raster Attribute Tables, RAT, for GeoTIFF for example), and there are controls about whether sidecar files are searched for at URLs and directories ... so, apologies, all I can think of are details and complications. I think generally it's not a good idea to add a zip or any other layer unless you really need to; it's better to move to and advise modern formats (GeoTIFF, (Geo)Parquet, FlatGeobuf, Zarr) - but if you need to, the zip container can be a good solution (bundle up one or many shapefiles, or MapInfo files, or CSVs or many other options). Note there is also virtual file system support for gzip, tar, Azure, AWS, Google storage, on and on, so I tend to suggest staying as close to what GDAL can do without adding layers (but that's not a straightforward topic without putting some pretty tight boundaries on the scope).
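
To make the zip, tar and gzip handlers mentioned above concrete, here is a rough Python sketch (standard library only, no GDAL calls) that picks a virtual file system prefix from an archive's file extension and otherwise leaves the path alone, in line with the "don't add a layer unless you need to" advice. The function name `vsi_path` and the extension table are assumptions made for illustration; the `/vsizip/`, `/vsitar/` and `/vsigzip/` prefixes are GDAL's documented handlers, while the cloud handlers are left out.

```python
# Longer suffixes come first so ".tar.gz" is matched before ".gz".
VSI_PREFIX_BY_SUFFIX = (
    (".tar.gz", "/vsitar/"),
    (".tgz", "/vsitar/"),
    (".tar", "/vsitar/"),
    (".zip", "/vsizip/"),
    (".gz", "/vsigzip/"),
)


def vsi_path(path, member=None):
    """Return a GDAL-style data source path for `path`.

    `member` is the file inside a zip or tar archive. A bare .gz wraps a
    single file, so it takes no member. Non-archive paths pass through.
    """
    path = str(path)
    lowered = path.lower()
    for suffix, prefix in VSI_PREFIX_BY_SUFFIX:
        if lowered.endswith(suffix):
            if prefix == "/vsigzip/":
                if member is not None:
                    raise ValueError("a .gz file holds one file; no member expected")
                return prefix + path
            if member is None:
                raise ValueError("zip and tar archives need a member file name")
            return f"{prefix}{path}/{member}"
    if member is not None:
        raise ValueError("member given, but path is not a recognised archive")
    return path  # no archive layer: use the file directly


if __name__ == "__main__":
    print(vsi_path("data/shapes.zip", "roads.shp"))
    print(vsi_path("data/bundle.tar.gz", "sites.csv"))
    print(vsi_path("data/points.csv.gz"))
    print(vsi_path("data/dem.tif"))
```

This only builds strings. Whether a given handler is usable still depends on the GDAL version and build, as noted above.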

@Aariq
Collaborator

Aariq commented Apr 18, 2024

Oof, yeah I can foresee us having to do a lot of thinking around this and it might be best for geotargets to be opinionated and only allow/recommend certain file formats that we are confident will work with targets with different GDAL versions and OSs. E.g., just today I noticed that the "COG" driver produces sidecar aux.xml files on my university HPC but not when the same code is run locally (different GDAL versions).

@Aariq
Collaborator

Aariq commented Nov 20, 2024

I think this has been solved by #109, yeah?

@njtierney
Owner Author

I think so? One of the things I wanted to help implement was using things like vsizip, but my understanding is that that is something that is on the user and their managing of GDAL versions, since not all vsizip stuff is available with older versions of GDAL?

@Aariq
Collaborator

Aariq commented Nov 22, 2024

I didn't end up using /vsizip/ in #109 because it doesn't solve the missing metadata problem. This might be a bug in terra that eventually gets fixed (rspatial/terra#1629), but it might not be fixable since the aux.json is a terra thing, not a GDAL thing.

@brownag
Contributor

brownag commented Nov 22, 2024

The original suggestion is a bit different than the current implemented solution, but the core of this issue was aspirational... I think that we can close this, but it may be worth revisiting in the future.

With the handling of GDAL options we now have, users should be able to control aspects of compression that they couldn't when this discussion originated.

To actually have /vsizip/ work, and SOZip storage for compatible formats, for both write and read, we would need either an update to terra (as @Aariq pointed out in rspatial/terra#1629) or to bring in {gdalraster} or similar to add files to SOZip manually. As mentioned, I think the ability to write using /vsizip/ using terra is not fully developed, or at least the {terra} convention of aux files is not currently set up to work with it.

In the future, {gdalraster} (#48) could be used to assemble SOZip files--this would allow for sidecar files to be included (it appears they are not stored using /vsizip/ to write at this time, need to investigate), and drivers that do not support direct write of SOZip to be supported.

Originally posted by @brownag in #62

@Aariq
Collaborator

Aariq commented Nov 22, 2024

@brownag would you mind adding some more details to #48? It would be good to have a record of what features of gdalraster in particular would be useful. Like, is it an additional package to support with tar_gdalraster() or is it something we can use under the hood to do read/write better? Sounds like it's maybe both?

@brownag
Contributor

brownag commented Nov 22, 2024

@Aariq Yes, I can do that. I think you are right that it is both of those things.
Will add some clarifying context and examples to #48
