Creating a lightweight version of gdal dependencies #722

Closed
RenaudLN opened this issue Jan 23, 2023 · 46 comments

@RenaudLN

Currently the libgdal feedstock includes a large number of dependencies that are not strictly required to use GDAL (e.g. poppler, postgres, ...). This bloats image sizes every time we want to install a GDAL-related package like geopandas.

Would it be possible to create a lightweight version of the feedstock?

Maybe something like

  - name: libgdal-lite
    script: install_lib.sh  # [unix], probably needs to be updated?
    script: install_lib.bat  # [win], probably needs to be updated?
    build:
      run_exports:
        # no idea, going with minor pin
        - {{ pin_subpackage('libgdal', max_pin='x.x') }}
    requirements:
      build:
        - cmake
        # ranlib used at install time
        - {{ compiler('c') }}
        # libstdc++ is needed in requirements/run
        - {{ compiler('cxx') }}
        - pkg-config  # [not win]
        - make  # [unix]
        - ninja  # [win]
        - sysroot_linux-64 2.17  # [linux64]
      host:
        - blosc
        - geos
        - icu
        - json-c  # [not win]
        - libcurl
        - libdeflate
        - libiconv
        - libkml
        - libnetcdf
        - libpng
        - libpq
        - libspatialite
        - libtiff
        - libuuid  # [linux]
        - libwebp-base
        - libxml2
        - lz4-c
        - proj
        - zlib
        - zstd
      run:
        - json-c  # [not win]
        - libpq
        - libspatialite
        - libuuid  # [linux]
        - libwebp-base
        - proj
        - zstd
    test:
      files:
        - test_data
        - run_test.bat  # probably needs to be updated?
        - run_test.sh  # probably needs to be updated?
    about:
      summary: The Geospatial Data Abstraction Library (GDAL)
      license: MIT
      license_file: LICENSE.TXT
@ocefpaf
Member

ocefpaf commented Jan 25, 2023

I considered this in the past. I'm not sure what the implications are of mixing the variants and/or making them incompatible with one another. For example, if we build fiona with the lite version but install the full, will it work as expected? Maybe if all the symbols are there, but someone looking for a codec in the full version may be frustrated? Or will it work? I don't know the answer to all these questions.

One could create lite versions of all the packages downstream but that would be quite confusing to the users.

Maybe we can reduce the size of the end package with some binary stripping and removing/moving some files that are not usually used, like docs, etc.

@rouault
Contributor

rouault commented Jan 25, 2023

For example, if we build fiona with the lite version but install the full, will it work as expected?

Normally, yes, as the API of a lite or a full build is the same. Well, to be more exact: it is almost the same. The only difference is in the GDALRegister_XXXX() or RegisterOGRXXXX() methods of each driver XXXX. So that could be an issue if some GDAL user would explicitly call a GDALRegister_XXXX()/RegisterOGRXXXX() that would be in the full package but not the lite one, but I bet > 99% of GDAL users just call GDALAllRegister(). Individual registration of drivers has never been promoted as a good practice in GDAL documentation, and the doc points at GDALAllRegister() instead.

Typically, searching in Debian sources, the only code that explicitly registers the GTiff driver (https://codesearch.debian.net/search?q=GDALRegister_GTiff&perpkg=1) or the Shapefile driver (https://codesearch.debian.net/search?q=RegisterOGRShape&perpkg=1) is GDAL itself.
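
As a rough illustration of the point above: code that relies on GDALAllRegister() and then looks drivers up by name keeps working against a lite build, it simply finds fewer drivers. The sketch below reports which drivers from a wanted set are absent. The helper names `missing_drivers` and `registered_driver_names` are hypothetical (not part of GDAL), and the second helper assumes the GDAL Python bindings (`osgeo`) are installed.

```python
def missing_drivers(required, available):
    """Return the required driver short names absent from `available`, in order."""
    # Compare case-insensitively, since driver names are matched that way.
    available_upper = {name.upper() for name in available}
    return [name for name in required if name.upper() not in available_upper]


def registered_driver_names():
    """List the short names of all drivers registered in this GDAL build."""
    # Imported lazily so missing_drivers() stays usable without GDAL.
    from osgeo import gdal

    gdal.AllRegister()  # safe to call again if drivers are already registered
    return [gdal.GetDriver(i).ShortName for i in range(gdal.GetDriverCount())]


if __name__ == "__main__":
    try:
        names = registered_driver_names()
    except ImportError:
        print("GDAL Python bindings are not installed")
    else:
        # Example wanted set; adjust to the formats a workflow actually needs.
        print(missing_drivers(["GTiff", "ESRI Shapefile", "PDF"], names))
```

An empty list means every wanted driver is present in the installed build.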

That said, the definition of a minimum/lite version of GDAL is going to be difficult to agree on. Someone with vector-only workflows will have a very different idea from someone with raster-only workflows.

An alternative would be to have a smaller libgdal and additional libgdal-XXXXX as we have done for Arrow/Parquet. But I should point out that doing a plugin approach for too many plugins has consequences: in my development build, I build with all drivers as plugins (for drivers that support being built as plugin) to detect issues specific to plugin building, and this significantly increases the GDALAllRegister() time to ~ 200 ms (instead of ~ 10 ms for an all-drivers-in-libgdal approach). Of course that's a bit extreme. For < 10 plugins, the perf should still be reasonable. Another downside of the plugin approach is that users must remember to install them...

@gillins
Contributor

gillins commented Jan 25, 2023

An alternative would be to have a smaller libgdal and additional libgdal-XXXXX as we have done for Arrow/Parquet.

I like your idea of having all non-core drivers in separate packages. Then we could have various metapackages that include different subsets (i.e. all raster drivers, all vector drivers, common drivers, all drivers, minimal drivers etc). Then we wouldn't need to install all drivers each time we want to build a package that uses GDAL... But yes, it won't be obvious to a user who is trying to read their file that they have the wrong driver metapackage installed.

@RenaudLN
Author

That said, the definition of a minimum/lite version of GDAL is going to be difficult to agree on. Someone with vector-only workflows will have a very different idea from someone with raster-only workflows.

How about a version that has the drivers to read vectors and rasters, but without the graphing/printing libraries? Also maybe without postgres, not sure what GDAL uses this for.

Another downside of the plugin approach is that users must remember to install them...

My idea was more to keep the existing libgdal so people would have them by default but if you need the lightweight version there is a way to not install stuff like poppler and jpeg libraries (poppler+poppler-data alone is 53MB)

@PostholerCom

How about a version that has the drivers to read vectors and rasters, but without the graphing/printing libraries? Also maybe without postgres, not sure what GDAL uses this for.

GDAL interaction with PostgreSQL/PostGIS is paramount. I couldn't imagine a build without it. No PG, no ETL. Any lite build should support it.

@rouault
Contributor

rouault commented Jan 28, 2023

Just seeing that Alpine Linux has packaged GDAL 3.6.2 with a number of drivers as plugins in extra packages.
Click on Subpackages at the right of https://pkgs.alpinelinux.org/package/edge/community/x86_64/gdal to see the list.
Not that I always find the split super relevant, as some of the packages, like Carto, Elastic, etc. depend only on dependencies of the core libgdal and thus are very small (~ 100 kB each), so they would be much better as builtin.
And they've decided to make the PostgreSQL vector & raster drivers as plugins.
Nobody agrees on what a minimum GDAL is :-)

@olivier-lacroix

Hi all, I understand that everyone will have a different idea of what a minimal gdal is depending on their workflow. However, maybe we could take a non-controversial first step to make gdal more modular,

  • using the existing approach (as for arrow/parquet) as proposed by @rouault and @gillins, for consistency
  • allowing us to get some feedback on how that could work for users, should a decision to go further be made

I think poppler may be a good candidate for that:

  • poppler and its dependencies are pretty big
  • I expect the usage of it to be fairly limited, unless I am missing something

What do you guys think?

@gillins
Contributor

gillins commented Feb 1, 2023

So this would mean that gdal wouldn't support PDFs 'out of the box'? Would this be confusing for users? I don't use PDFs with gdal personally, but would be interested to hear from anyone who does...

@rouault
Contributor

rouault commented Feb 1, 2023

I don't use PDFs with gdal personally, but would be interested to hear from anyone who does...

QGIS GeoPDF functionality relies on the GDAL PDF driver

@olivier-lacroix

@gillins yes, it would mean that users who need the GDAL PDF driver would have to install libgdal-poppler, in the same way users who need the arrow/parquet functionalities currently have to install libgdal-arrow-parquet.

QGIS could choose to depend on libgdal-poppler as a core dependency to keep the functionality in by default.

And yes, users / downstream packages may be impacted by any split of GDAL functionality into separate packages. I still think it is worth it in the long term. And for poppler, the number of impacted users may be rather limited.

@jorisvandenbossche
Member

If we want, we can also avoid a change for existing users by making the core libgdal a package named libgdal-core, and the existing libgdal could then be a meta-package that just depends on libgdal-core, libgdal-poppler, ..., to ensure that people currently relying on libgdal as a dependency keep getting the same feature set.

Of course that also limits the usefulness of the change, as initially everyone will still use the meta package with everything, but it allows for packages depending on gdal to gradually move to depending on libgdal-core, leaving it to the end user to add specific gdal drivers to their requirements.

(in any case, I am a big +1 on the idea of having a smaller core package!)
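
The meta-package idea can be sketched as a dependency closure: anything that keeps depending on libgdal still pulls in every driver package, while anything that switches to libgdal-core does not. The table below is a hypothetical, heavily simplified stand-in for the real recipes (only the names libgdal, libgdal-core and libgdal-poppler come from this discussion), and `closure` is an illustrative helper, not a conda API.

```python
def closure(package, depends):
    """Return every package pulled in by `package`, itself included."""
    seen = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Packages without an entry are treated as having no dependencies.
        stack.extend(depends.get(name, []))
    return seen


# Hypothetical dependency table, NOT the real feedstock contents.
DEPENDS = {
    "libgdal": ["libgdal-core", "libgdal-poppler"],  # meta-package
    "libgdal-core": ["proj", "geos"],
    "libgdal-poppler": ["libgdal-core", "poppler"],
    "fiona": ["libgdal"],  # a downstream package that has not migrated yet
}

if __name__ == "__main__":
    print(sorted(closure("fiona", DEPENDS)))         # still includes poppler
    print(sorted(closure("libgdal-core", DEPENDS)))  # core only
```

Under these assumptions, existing dependents see no change, and the saving only appears once a package (or a user) asks for libgdal-core directly.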

@ocefpaf
Member

ocefpaf commented Feb 3, 2023

Just to reinforce what @jorisvandenbossche said above. If we do this, in conda-forge, it has to be that way to avoid breakages. While that limits the impact, users who want a smaller gdal will know what to look for, while current users who are OK with the current package won't feel the change.

@gillins
Contributor

gillins commented Feb 6, 2023

+1 from me about not breaking existing users, but being able to select a smaller gdal. How do we progress this @conda-forge/gdal ? Do we have a vote?

@olivier-lacroix

Sounds good @jorisvandenbossche. This also means that libgdal-core can be more minimal this way!

@xylar
Contributor

xylar commented Feb 6, 2023

Do we have a vote?

I would prefer to make sure we have consensus than to have a vote.

I suggest we put forward @jorisvandenbossche's suggestion in #722 (comment) as a way to proceed. It feels like we have a pretty good consensus on that option. If you like that idea, give it a thumbs up (if you haven't already). If you have concerns, give it a thumbs-down for now and comment about what your objection would be. Hopefully, we can address it and you'll become a thumbs up.

There are remaining questions about which packages go in libgdal-core. I suggest whoever has time and energy to get an initial PR going gets the first pass at what goes in. Then, maintainers and anyone else interested can review.

How does that sound?

@kmuehlbauer
Contributor

That sounds like a plan @xylar, thanks!

@rouault has already expressed some insight. But I'd very much rely on his expertise regarding what might be split out and what is better kept within (also with regard to performance).

@akrherz
Contributor

akrherz commented Feb 6, 2023

FWIW, I am -0 on all this unless we can start quantifying the disk space savings we hope to achieve here. Is poppler the dependency we are hoping to avoid or are there others that consume significant space? Saving 10s of MB of space is not really all that exciting for the amount of work and downstream pain this change will cause.

@PostholerCom

unless we can start quantifying the disk space savings

Likewise. What's the benefit, or is this just an exercise? I'm of the mind of adding more packages and making the default install even more robust, i.e. hdf4, hdf5, geoparquet, latest CGAL, SFCGAL, GEOS, etc.

@olivier-lacroix

@akrherz @PostholerCom, not sure what pain you are referring to? Downstream could choose to continue to depend on gdal / libgdal, which would not change anything as per @jorisvandenbossche's proposal.

Also, unless I am assessing this incorrectly (I may well be, plus maybe some deps of poppler would be required anyway), I am afraid we are not talking about a few kb.

❯ micromamba env create -n gdaltest python==3.10
...
❯ du -hs micromamba/envs/gdaltest/
173M    micromamba/envs/gdaltest/

❯ micromamba install poppler -n gdaltest
...
❯ du -hs micromamba/envs/gdaltest/
598M    micromamba/envs/gdaltest/

❯ micromamba install gdal -n gdaltest
...
❯ du -hs micromamba/envs/gdaltest/
852M    micromamba/envs/gdaltest/
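
To make the increments explicit, here is a small sketch that turns the three `du -hs` readings above into per-step growth. The numbers are copied from that output (in the units `du` reports); the helper name `deltas` is just for illustration. Note that this measures poppler installed on its own, so some of its dependencies may overlap with what gdal would need anyway.

```python
# Environment sizes as reported by `du -hs` above, in M.
SIZES_M = [("python 3.10 only", 173), ("+ poppler", 598), ("+ gdal", 852)]


def deltas(sizes):
    """Return (label, growth) for each step after the first one."""
    return [
        (label, size - previous)
        for (_, previous), (label, size) in zip(sizes, sizes[1:])
    ]


if __name__ == "__main__":
    for label, growth in deltas(SIZES_M):
        print(f"{label}: {growth}M")
```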

@akrherz
Contributor

akrherz commented Feb 7, 2023

Thanks @olivier-lacroix for quantifying the impact of poppler. The pain is that the original message denoted geopandas, so in order to take advantage of any savings a lite package could provide, the dependency tree from geopandas, which I believe is geopandas -> fiona -> gdal would need to be updated. In this case, fiona. So there's pain here!

@rouault
Contributor

rouault commented Feb 7, 2023

I do believe that Fiona's author would bless linking a lightweight GDAL (as Fiona binary wheels use a quite minimal GDAL), especially since the PDF driver is of little practical use for Fiona / vector (well, there are some GeoPDFs with features in them, but that's quite an edge use case).
I'm somewhere close to a 0 vote regarding whether a libgdal-core package is a good idea.

@theroggy

theroggy commented Mar 16, 2023

Another advantage of a libgdal-core version would be that creating conda environments should speed up significantly. Especially on Windows this would likely be quite a significant difference.

E.g. for my CI tests on Windows, creating and cleaning up the conda environment using mamba takes 12 minutes for an environment with gdal.
https://github.com/geofileops/geofileops/actions/runs/4435354175/jobs/7782410186

For an environment without gdal it takes 5 minutes:
https://github.com/theroggy/pygeoops/actions/runs/4432623545/jobs/7776857699

@rouault
Contributor

rouault commented Nov 14, 2023

But I should point out that doing a plugin approach for too many plugins has consequences: in my development build, I build with all drivers as plugins (for drivers that support being built as plugin) to detect issues specific to plugin building, and this significantly increases the GDALAllRegister() time to ~ 200 ms (instead of ~ 10 ms for an all-drivers-in-libgdal approach). Of course that's a bit extreme. For < 10 plugins, the perf should still be reasonable. Another downside of the plugin approach is that users must remember to install them...

This is going to be addressed in GDAL 3.9 per OSGeo/gdal#8648 + OSGeo/gdal#8695

@theroggy

theroggy commented Jun 8, 2024

Just a heads-up that GDAL 3.9 has been released now. Or is GDAL 3.9.1 recommended for this due to OSGeo/gdal#10096?

@rouault
Contributor

rouault commented Jun 8, 2024

OSGeo/gdal#10096 should have moderate consequences on pure-Conda-forge builds. The effect of this PR is more to allow (again) someone to use libgdal from conda-forge, and build a driver as a plugin (typically a proprietary one) against that libgdal that wasn't aware of that driver at build time.

There are ongoing discussions about potentially more modular builds of GDAL in conda-forge, but that might require changes in conda-build or using features we don't use yet (like the ability to dispatch build artifacts into multiple output packages). CC @hobu

@rouault
Contributor

rouault commented Jun 8, 2024

it would mean that users who need the GDAL PDF driver would have to install libgdal-poppler

Speaking of Poppler, another motivation for a libgdal-pdf package (with the Poppler backend) is that Poppler is GPL licensed.

@hobu
Contributor

hobu commented Jun 10, 2024

OSGeo/gdal#10096 should have moderate consequences on pure-Conda-forge builds. The effect of this PR is more to allow (again) someone to use libgdal from conda-forge, and build a driver as a plugin (typically a proprietary one) against that libgdal that wasn't aware of that driver at build time.

I confirm the above patch works for me in that situation in relation to our efforts with MrSID and Oracle plugins as described in #936.

I also would very much like to see a libgdal in Conda Forge that is dependent on only GDAL's "core" dependencies – GEOS, PROJ, libgeotiff, libtiff, lerc, libjpeg, sqlite, libcurl, openssl, libdeflate, xz, zlib, zstd, lz4-c, and libwebp-base. The idea is that any driver that depends on some kind of external library that is more than a compression library would be built as a plugin, not built into the base library. In that case, users would manually install or refer to plugins in recipes, such as gdal-openjpeg, gdal-postgresql, or gdal-poppler.

Not only would this speed up solve time, it would help reduce the amount of rerendering churn that GDAL and its downstream users have to endure.

@rouault
Contributor

rouault commented Jun 10, 2024

it would help reduce the amount of rerendering churn that GDAL and its downstream users have to endure.

are we sure that if we have multiple output packages for the same feedstock (let's say A=libgdal and B=gdal-poppler packages), and that only a dependency of B is updated (poppler), only B gets rebuilt?

@ocefpaf
Member

ocefpaf commented Jun 10, 2024

are we sure that if we have multiple output packages for the same feedstock (let's say A=libgdal and B=gdal-poppler packages), and that only a dependency of B is updated (poppler), only B gets rebuilt?

Nope. Both will get rebuilt.

@hobu
Contributor

hobu commented Jun 10, 2024

Nope. Both will get rebuilt.

😦

Why? What is the point of multiple outputs then?

@ocefpaf
Member

ocefpaf commented Jun 10, 2024

Why? What is the point of multiple outputs then?

It helps downstream packages to get something specific, or to split different licenses, and has other benefits that are mostly for downstream use. However, for gdal itself, if any of the gdal dependencies get updated, the whole feedstock will get rebuilt.

@isuruf
Member

isuruf commented Jun 18, 2024

We can have libgdal-openjpeg, libgdal-postgresql etc.
How do we want the split to happen?

  1. libgdal to mean the core version and libgdal-all = libgdal + libgdal-openjpeg + libgdal-postgresql
  2. libgdal-core to mean the core version and libgdal = libgdal-core + libgdal-openjpeg + libgdal-postgresql

The first one is not backwards compatible, while the second one is.

@ocefpaf
Member

ocefpaf commented Jun 18, 2024

We usually do the second option to avoid breakages.

@rouault
Contributor

rouault commented Jun 18, 2024

We usually do the second option to avoid breakages.

There's an interesting question regarding the "gdal" package (the Python bindings): should they depend on "libgdal-core" (most logical choice), or still "libgdal" (but that means that users couldn't use the GDAL Python bindings without installing all drivers, which could be undesirable). It seems difficult here to be fully backwards compatible

@xylar
Contributor

xylar commented Jun 18, 2024

We control what gdal depends on. We can change it to depend on libgdal-core without any concerns about backwards compatibility that I can see.

@rouault
Contributor

rouault commented Jun 18, 2024

We can change it to depend on libgdal-core without any concerns about backwards compatibility that I can see.

My point was more about external users that install the "gdal" package and expect all GDAL drivers (except "libgdal-arrow-parquet") to be installed too.
I'm not against "gdal" just depending on libgdal-core. Just pointing out side effects.

@xylar
Contributor

xylar commented Jun 18, 2024

I see. I think that's hopefully a relatively minor inconvenience but I get your point.

@ocefpaf
Member

ocefpaf commented Jun 19, 2024

There's an interesting question regarding the "gdal" package (the Python bindings): should they depend on "libgdal-core" (most logical choice), or still "libgdal" (but that means that users couldn't use the GDAL Python bindings without installing all drivers, which could be undesirable). It seems difficult here to be fully backwards compatible

I guess we need to weigh the pros and cons here: backward compatibility by making gdal depend on libgdal, or make it lightweight and break backward compatibility.

I don't have a horse in this race b/c I'm "in" the know and I can easily fix my workflows. Breaking changes are annoying, but sometimes they are the opportunity we have to fix long-standing issues/annoyances. Maybe the compromise would be to patch gdal to tell the user to install driver x, y, z when it fails? Not sure how sustainable (or hard) that would be to implement and maintain.

@rouault
Contributor

rouault commented Jun 19, 2024

Maybe the compromise would be to patch gdal to tell the user to install driver x, y, z when it fails?

That's exactly a mechanism now available in core GDAL since https://gdal.org/development/rfc/rfc96_deferred_plugin_loading.html and currently used by libgdal-arrow-parquet:

-DOGR_DRIVER_ARROW_PLUGIN_INSTALLATION_MESSAGE="You may install it with 'conda install -c conda-forge libgdal-arrow-parquet'" \

@ocefpaf
Member

ocefpaf commented Jun 19, 2024

Wow, that is awesome. I'm inclined for a breaking change then b/c it is super easy for the user to fix with that error message.

@isuruf
Member

isuruf commented Jul 2, 2024

I've split off the following packages in #948.

  • libgdal-arrow-parquet (arrow dependency)
  • libgdal-jp2openjpeg (openjpeg dependency)
  • libgdal-pdf (poppler dependency)
  • libgdal-postgisraster (postgresql dependency)
  • libgdal-pg (postgresql dependency)
  • libgdal-fits (cfitsio dependency)
  • libgdal-xls (freexl dependency)
  • libgdal-grib (libaec dependency)
  • libgdal-kea (kealib dependency)
  • libgdal-tiledb (tiledb dependency)
  • libgdal-netcdf (libnetcdf dependency)
  • libgdal-hdf4 (hdf4 dependency)
  • libgdal-hdf5 (hdf5 dependency)

Any other packages that we should split?
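
For illustration, a minimal sketch of how a user might assemble an install command for the core package plus only the plugins they need. The package names are a subset taken from the list above, and the `conda install -c conda-forge ...` form mirrors the installation message quoted earlier in this thread; the helper name `conda_install_command` is hypothetical.

```python
import shlex

# Subset of the split listed above: plugin package -> heavy dependency it carries.
PLUGIN_DEPENDENCY = {
    "libgdal-arrow-parquet": "arrow",
    "libgdal-pdf": "poppler",
    "libgdal-pg": "postgresql",
    "libgdal-hdf5": "hdf5",
}


def conda_install_command(plugins, channel="conda-forge"):
    """Build a `conda install` command for libgdal-core plus chosen plugins."""
    unknown = [name for name in plugins if name not in PLUGIN_DEPENDENCY]
    if unknown:
        raise ValueError(f"unknown plugin package(s): {', '.join(unknown)}")
    packages = ["libgdal-core", *plugins]
    # shlex.join quotes arguments safely for a POSIX shell.
    return shlex.join(["conda", "install", "-c", channel, *packages])


if __name__ == "__main__":
    print(conda_install_command(["libgdal-pdf", "libgdal-pg"]))
```

A vector workflow that talks to PostgreSQL would then request only libgdal-pg and skip poppler entirely.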

@isuruf
Member

isuruf commented Jul 4, 2024

Number of deps went down from 113 to 57.

Here's the dep list on macOS:

blosc                     1.21.6               h5499902_0    conda-forge
bzip2                     1.0.8                h93a5062_5    conda-forge
c-ares                    1.28.1               h93a5062_0    conda-forge
ca-certificates           2024.7.4             hf0a4a13_0    conda-forge
freexl                    2.0.0                hfbad9fb_0    conda-forge
geos                      3.12.1               h965bd2d_0    conda-forge
geotiff                   1.7.3                h7e5fb84_1    conda-forge
giflib                    5.2.2                h93a5062_0    conda-forge
hdf4                      4.2.15               h2ee6834_7    conda-forge
hdf5                      1.14.3          nompi_hec07895_105    conda-forge
icu                       73.2                 hc8870d7_0    conda-forge
json-c                    0.17                 h40ed0f5_0    conda-forge
krb5                      1.21.3               h237132a_0    conda-forge
lerc                      4.0.0                h9a09cb3_0    conda-forge
libaec                    1.1.3                hebf3989_0    conda-forge
libarchive                3.7.4                h83d404f_0    conda-forge
libboost-headers          1.85.0               hce30654_2    conda-forge
libcurl                   8.8.0                h7b6f9a7_1    conda-forge
libcxx                    17.0.6               h0812c0d_3    conda-forge
libdeflate                1.20                 h93a5062_0    conda-forge
libedit                   3.1.20191231         hc8eb9b7_2    conda-forge
libev                     4.33                 h93a5062_2    conda-forge
libexpat                  2.6.2                hebf3989_0    conda-forge
libgdal-core              3.9.1                h7d70149_2    local
libgfortran               5.0.0           13_2_0_hd922786_3    conda-forge
libgfortran5              13.2.0               hf226fd6_3    conda-forge
libiconv                  1.17                 h0d3ecfb_2    conda-forge
libjpeg-turbo             3.0.0                hb547adb_1    conda-forge
libkml                    1.3.0             h1eb4d9f_1018    conda-forge
libnetcdf                 4.9.2           nompi_he469be0_114    conda-forge
libnghttp2                1.58.0               ha4dd798_1    conda-forge
libpng                    1.6.43               h091b4b1_0    conda-forge
librttopo                 1.1.0               hc8f776e_15    conda-forge
libspatialite             5.1.0                h64db68f_7    conda-forge
libsqlite                 3.46.0               hfb93653_0    conda-forge
libssh2                   1.11.0               h7a5bd25_0    conda-forge
libtiff                   4.6.0                h07db509_3    conda-forge
libwebp-base              1.4.0                h93a5062_0    conda-forge
libxml2                   2.12.7               ha661575_1    conda-forge
libzip                    1.10.1               ha0bc3c6_3    conda-forge
libzlib                   1.3.1                hfb2fe0b_1    conda-forge
llvm-openmp               18.1.8               hde57baf_0    conda-forge
lz4-c                     1.9.4                hb7217d7_0    conda-forge
lzo                       2.10              h93a5062_1001    conda-forge
minizip                   4.0.7                h27ee973_0    conda-forge
ncurses                   6.5                  hb89a1cb_0    conda-forge
openssl                   3.3.1                hfb2fe0b_1    conda-forge
pcre2                     10.44                h297a79d_0    conda-forge
proj                      9.4.1                hfb94cee_0    conda-forge
readline                  8.2                  h92ec313_1    conda-forge
snappy                    1.2.1                hd02b534_0    conda-forge
sqlite                    3.46.0               h5838104_0    conda-forge
uriparser                 0.9.8                h00cdb27_0    conda-forge
xerces-c                  3.2.5                hf393695_0    conda-forge
xz                        5.2.6                h57fd34a_0    conda-forge
zlib                      1.3.1                hfb2fe0b_1    conda-forge
zstd                      1.5.6                hb46c0d2_0    conda-forge

We can remove a bit more.

@olivier-lacroix

This is awesome! llvm-openmp may be a good candidate due to its size (more than 50 MB) on Linux?

@isuruf
Member

isuruf commented Jul 4, 2024

@olivier-lacroix, that's a dependency only on macOS

@olivier-lacroix

Ah great @isuruf ! I am looking forward to this! Thanks a lot for your work on it :-)

@akrherz
Contributor

akrherz commented Jul 12, 2024

#948 is now merged to main, I think we can close this as completed.
