Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read-only OGR ADBC (Arrow Database Connectivity) driver #11003

Merged
merged 4 commits into from
Oct 29, 2024

Conversation

rouault
Copy link
Member

@rouault rouault commented Oct 13, 2024

Cf https://arrow.apache.org/adbc/current/index.html for what ADBC is.

Depends on the adbc-driver-manager library.

The driver is read-only, and there is no support for spatial data currently.

Beyond official ADBC drivers (adbc-driver-sqlite, adbc-driver-postgresql, adbc-driver-snowflake, adbc-driver-bigquery,
etc.), it can also be used to read Parquet or DuckDB datasets using libduckdb, if libduckdb is installed and can be loaded through dynamic shared library opening.

@rouault rouault added this to the 3.11.0 milestone Oct 13, 2024
@rouault rouault mentioned this pull request Oct 13, 2024
@coveralls
Copy link
Collaborator

coveralls commented Oct 14, 2024

Coverage Status

coverage: 69.415% (-0.03%) from 69.445%
when pulling c8df459 on rouault:adbc
into c044df7 on OSGeo:master.

Copy link
Contributor

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing!

With respect to geometry support, we're in the process of thinking through how to mark Geometry while pushing the minimum geometry awareness into the general-purpose drivers (that's GDAL's job, after all!): apache/arrow-adbc#2098 . I think we will be able to mark Snowflake geometry as some extension type denoting GeoJSON, PostGIS as something denoting EWKB, and BigQuery as geoarrow.wkt (with spherical edges). Getting DuckDB to mark their output has been a feature request for a while but I think it's generally thought of as a good idea: duckdb/duckdb_spatial#153 .

I am not sure at what point I'll be able to get there, but I have wanted to do the reverse as well (expose OGR as an ADBC driver). I am in the process of finishing up the documentation for a framework to make it a bit easier: apache/arrow-adbc#2186


gdal_standard_includes(ogr_ADBC)
target_include_directories(ogr_ADBC PRIVATE $<TARGET_PROPERTY:ogrsf_generic,SOURCE_DIR>)
gdal_target_link_libraries(ogr_ADBC PRIVATE AdbcDriverManager::adbc_driver_manager_shared)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK, but I think you can also vendor c/driver_manager/adbc_driver_manager.cc/h if that is easier for packaging/makes it more likely to get this into a default build (this is what the R package does since we can't use CMake).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit hesitant on that:

  • that would mean extra code to maintain in GDAL
  • When we vendor things, we make an effort to rename their symbols, which is an extra layer of maintainance on GDAL side. If adbc_driver_manager code itself has some provision for a #define ADBC_DRIVER_MANAGER_NAMESPACE that could allow to prefix its symbols with a user specified prefix, that would save that effort on GDAL side.
  • what about the Windows side of things? Initially, I tried to use Conda to test on Windows on our CI, but it failed, and then I realized that
    https://github.com/conda-forge/arrow-adbc-split-feedstock/blob/6f2af5a9b0de9cbdf895ec0ab1e516c306abac67/recipe/meta.yaml#L31 disabled Windows packaging. Is it a Conda thing, or is there's something more fundamental in adbc_driver_manager itself that prevents it from working on Windows?
  • if we vendor, what about the stability of the interface between adbc_driver_manager and ADBC drivers themselves? I mean if we forget to update the vendored adbc_driver_manager, will that affect its working with ADBC drivers that would have been built against a more recent external ADBC interface?
  • with respect to my above symbol renaming, do ADBC drivers link with adbc_driver_manager itself? If so, then symbol renaming is not possible, and vendoring would be a bit risky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TL;DR is that this is something we need to make possible, test, and document from our side: we want people to federate to data sources using ADBC exactly like you're doing here and there won't be widely available linux packages anytime soon.

Perhaps a safer alternative, if slightly more verbose, would be to use just the driver loader part of the driver manager:

https://github.com/apache/arrow-adbc/blob/cd24bc096612feeef31bff2a9271616cc94e1634/c/include/arrow-adbc/adbc_driver_manager.h#L34-L67

...and call the driver callbacks directly:

https://github.com/apache/arrow-adbc/blob/cd24bc096612feeef31bff2a9271616cc94e1634/c/driver/framework/base_driver_test.cc#L78-L145

One of the things we may do soon is make a header-only C++ wrapper that supports that process.

that would mean extra code to maintain in GDAL

If it worked using FetchContent to an official ADBC release would that be sufficient? I don't think there will be a widely available "system" ADBC for quite some time that will let R users in particular benefit from this (since most of them choose not to use conda!).

with respect to my above symbol renaming, do ADBC drivers link with adbc_driver_manager itself? If so, then symbol renaming is not possible, and vendoring would be a bit risky.

ADBC drivers definitely do not use the AdbcXXX symbols, but some (or all?) of them export them (I think this is confusing and I wish they didn't). Loading a driver using the init function and calling the driver callbacks definitely doesn't depend on the AdbcXXX symbols at all.

what about the Windows side of things?

The driver manager definitely works on Windows (it's bundled into the Python package that builds on MSVC and vendored into the R package that builds using mingw). I'm not sure of the linking details of Python + conda or why it is not packaged for conda windows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there won't be widely available linux packages anytime soon.

IMHO, that's something to consider in the top of the priority list for success of ADBC rather than investing in all other workarounds.

If it worked using FetchContent to an official ADBC release would that be sufficient?

not really, at least for GDAL as packaged by Linux distributions. I believe some build systems disable network after they have downloaded the tarball and installed all dependencies to make sure it is fully reproducible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! It's not a workaround in the sense that it's documented behaviour:

https://github.com/apache/arrow-adbc/blob/cd24bc096612feeef31bff2a9271616cc94e1634/c/include/arrow-adbc/adbc.h#L930-L935

...it's just documented behaviour that we haven't made as easy to use as the driver manager's AdbcXX symbols yet. With the ability to inject AdbcLoadDriver, this would mean you don't necessarily need the driver manager at all (i.e., you would only need adbc.h, which is intended to be vendored in this way and defines an ABI with very intentional forward/backward compatibility guarantees).

Comment on lines +174 to +213
const char *pszADBCDriverName =
CSLFetchNameValue(poOpenInfo->papszOpenOptions, "ADBC_DRIVER");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If at all possible, in R it would help to be able to initialize via the init function address:

https://github.com/apache/arrow-adbc/blob/cd24bc096612feeef31bff2a9271616cc94e1634/c/include/arrow-adbc/adbc_driver_manager.h#L69-L79

...because in R this is how we pass around driver information (including DuckDB and Snowflake). In Python I believe the driver packages we ship always have a shared library + entrypoint (although I would have to look up if this is easily accessible to help something like GDAL load the driver). This can, of course, be handled some other time, but if there's any way to squeeze in supporting a driver installed using pip install adbc_driver_snowflake or install.packages("adbcsnowflake"), this is how our documentation encourages users of those environments to get the driver.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If at all possible, in R it would help to be able to initialize via the init function address:

Hum, would you suggest passing a function pointer as an hexadecimal address as a string like "0xDEADBEEF" ? I believe that's technically undefined behaviour in C/C++ for function pointers, also that will work in most cases. However I remember some time ago to have experimenting with the CHERI architecture and pointers are kind of "fat pointers" and can't be "instanciated" from their address. And there's a security issue in allowing to pass this. I could do that, but protected with a configuration option to enable it explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point! I was thinking a base 10 integer address string but an opt-in configuration option won't help R or Rust users here. Would a C or C++ API hook to inject AdbcLoadDriver or similar be secure? What I'm hoping for here is the ability for GDAL to use a driver installed via install.packages(), cargo, or pip since the drivers are seeing quite a bit of action at the moment and getting the latest version via system is difficult or impossible to do in some environments. With the ability to inject AdbcLoadDriver, something like sf could provide that by hooking up to the adbcdrivermanager package (or pyogrio could do it by hooking up to adbc_driver_manager).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a C or C++ API hook to inject AdbcLoadDriver or similar be secure?

definitely. The slightly annoying part is that setting the hook should be done in GDAL core (as the ADBC driver may be a plugin, that cannot be directly in it), but that's not much a big deal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a new public header gdal_adbc.h per 8d86043 that can be used to call a void GDALSetAdbcDriverInitFunc(AdbcDriverInitFunc init_func) to define the entry point.
This is only compile tested on my side. Actual functional testing would be appreciated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look today/this evening and report back!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No pressure on any of this (all ADBC is great!), but I put together an example of what I had in mind. This turned out to be more like "inject AdbcLoadDriver" than just one init function. Basically, if the embedding application can have some flexibility about how to load the drivers it can use ones already statically linked in, or updated versions of them installed via pip or some other mechanism. Using the "load" mechanism frees GDAL from the driver manager completely since then it can just use the callbacks.

Experiment in GDAL: rouault/gdal@adbc...paleolimbot:gdal:adbc-suggestions

Changes to sf: r-spatial/sf@main...paleolimbot:adbc-experiments

Changes to the R driver manager: apache/arrow-adbc@main...paleolimbot:r-adbcdrivermanager-init

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot Cool! I've merged your changes with minor adaptations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

m_apoLayers.emplace_back(std::move(poLayer));
}
}
else if (bIsDuckDB || bIsSQLite3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I've used in 637405b
However I've noticed what I believe is a discrepency between the API documentation and driver implementations regarding the depth. I would expect ADBC_OBJECT_DEPTH_DB_SCHEMAS to be enough to get the table list since its doc mentions "Return metadata on catalogs, schemas, and table", but I actually need to go one level deeper (ADBC_OBJECT_DEPTH_TABLES) to get the table list. At least with SQLite, DuckDB and PostgreSQL. With ADBC_OBJECT_DEPTH_DB_SCHEMAS, db_schema_tables is null or an empty list.

ogr/ogrsf_frmts/generic/ogrlayerarrow.h Show resolved Hide resolved
autotest/ogr/ogr_adbc.py Outdated Show resolved Hide resolved
autotest/ogr/ogr_adbc.py Show resolved Hide resolved
doc/source/drivers/vector/adbc.rst Outdated Show resolved Hide resolved
doc/source/drivers/vector/adbc.rst Outdated Show resolved Hide resolved
doc/source/drivers/vector/adbc.rst Outdated Show resolved Hide resolved
{
m_poLayerDefn = new OGRFeatureDefn(pszName);
m_poLayerDefn->SetGeomType(wkbNone);
m_poLayerDefn->Reference();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR: I was wondering if it will makes sense to eventually switch to a std::shared_ptr for feature definition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that we discussed with @dbaston for the scope of his future works, but besides the enormous effort to change that within OSGeo/GDAL, there's an obvious backward compatiiblity issue for external C++ users. Could very well be a 4.0 topic

ogr/ogrsf_frmts/adbc/ogradbcdataset.cpp Show resolved Hide resolved
ogr/ogrsf_frmts/adbc/ogradbcdataset.cpp Show resolved Hide resolved
const char *pszTableName = poFeature->GetFieldAsString(1);
const std::string osStatement = CPLSPrintf(
"SELECT * FROM \"%s\".\"%s\"",
CPLString(pszNamespace).replaceAll("\"", "\"\"").c_str(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have an SQL quoting function available somewhere this is when it could be of use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added :

/** Replace all occurences of ch by it repeated twice.
 * Typically used for SQL string literal or identifier escaping.
 */
std::string CPL_DLL OGRDuplicateCharacter(const std::string &osStr, char ch);

doc/source/drivers/vector/adbc.rst Outdated Show resolved Hide resolved
Comment on lines +174 to +213
const char *pszADBCDriverName =
CSLFetchNameValue(poOpenInfo->papszOpenOptions, "ADBC_DRIVER");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

rouault and others added 4 commits October 29, 2024 01:49
Cf https://arrow.apache.org/adbc/current/index.html for what ADBC is.

Depends on the adbc-driver-manager library.

The driver is read-only, and there is no support for spatial data currently.

Beyond official ADBC drivers (adbc-driver-sqlite,
adbc-driver-postgresql, adbc-driver-snowflake, adbc-driver-bigquery,
etc.), it can also be used to read Parquet or DuckDB datasets using libduckdb, if
libduckdb is installed and can be loaded through dynamic shared library opening.
Also no longer make AdbcDriverManager a requirement
@rouault
Copy link
Member Author

rouault commented Oct 29, 2024

Merging

@rouault rouault merged commit 103580a into OSGeo:master Oct 29, 2024
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants