
feat(python/adbc_driver_manager): export handles and ingest data through python Arrow PyCapsule interface #1346

Merged
merged 10 commits into apache:main on Dec 13, 2023

Conversation

jorisvandenbossche (Member) commented Dec 6, 2023

Addresses #70

This PR adds the PyCapsule dunder methods to the Handle classes of the low-level interface, which already enables using the low-level interface with the capsule protocol and without pyarrow.

Secondly, the places that accept data (e.g. ingest/bind) now also accept objects that implement these dunders, in addition to the hardcoded support for pyarrow.

@github-actions github-actions bot added this to the ADBC Libraries 0.9.0 milestone Dec 6, 2023
Comment on lines +360 to +362
memcpy(allocated, &self.schema, sizeof(CArrowSchema))
self.schema.release = NULL
return capsule
Member Author

We are "moving" the schema here, while in nanoarrow I opted for a hard copy of the schema (using nanoarrow's ArrowSchemaDeepCopy).

But I think the only advantage of a hard copy is that you can consume the object multiple times (or, in the case of nanoarrow-python, that the nanoarrow Schema object is still valid and inspectable after it has been converted to e.g. a pyarrow.Schema).
For ADBC, I think the use case will much more often be "receive the handle and convert it directly, once", given that the Handle object itself isn't useful on its own (in contrast to nanoarrow.Schema), so moving here is probably fine?
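The move the snippet performs can be sketched with a toy stand-in for the C ArrowSchema struct (all names here are illustrative, not ADBC code): a move copies the struct and then NULLs the source's release callback, so only the destination owns the data and the source can no longer be consumed, whereas a deep copy would leave both usable.

```python
class ToySchema:
    """Toy stand-in for a C ArrowSchema; a non-None release means 'owns data'."""

    def __init__(self, release=None):
        self.release = release


def move_schema(src):
    # Equivalent of: memcpy(dst, src, sizeof(ArrowSchema)); src->release = NULL
    dst = ToySchema(release=src.release)
    src.release = None  # src is now empty; consuming it again would be invalid
    return dst
```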

Member


I think moving makes sense here.

Comment on lines 381 to 393
def __arrow_c_array__(self, requested_schema=None) -> object:
    """Consume this object to get a PyCapsule."""
    if requested_schema is not None:
        raise NotImplementedError("requested_schema")

    cdef CArrowArray* allocated = <CArrowArray*> malloc(sizeof(CArrowArray))
    allocated.release = NULL
    capsule = PyCapsule_New(
        <void*>allocated, "arrow_array", pycapsule_array_deleter,
    )
    memcpy(allocated, &self.array, sizeof(CArrowArray))
    self.array.release = NULL
    return capsule
Member Author


This is actually not used at the moment: I think none of the ADBC APIs return a bare ArrowArray (only an ArrowSchema or ArrowArrayStream). This handle is currently only used internally for ingesting data (bind).

So I could also remove __arrow_c_array__, given it is unused. If we required pyarrow >= 14, I could probably remove this class entirely, because then we could use the capsule interface for ingesting data.

Member Author


And realizing now, this implementation is actually also wrong: __arrow_c_array__ needs to return two capsules, one for the ArrowArray and one for the ArrowSchema, while this handle only has the array. So I don't think we can add this dunder here.
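For reference, the Arrow PyCapsule interface specifies that __arrow_c_array__ returns a two-tuple of capsules named "arrow_schema" and "arrow_array". A stdlib-only sketch of that shape (dummy non-NULL pointers stand in for the real C structs; nothing here is ADBC code):

```python
import ctypes

PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

# Dummy storage so the capsules hold a non-NULL pointer; a real
# implementation would export malloc'ed ArrowSchema/ArrowArray structs
# and install a destructor that calls their release callbacks.
_dummy = ctypes.create_string_buffer(8)


def toy_arrow_c_array(requested_schema=None):
    """Illustrative only: return the (schema, array) capsule pair."""
    schema_capsule = PyCapsule_New(ctypes.addressof(_dummy), b"arrow_schema", None)
    array_capsule = PyCapsule_New(ctypes.addressof(_dummy), b"arrow_array", None)
    return schema_capsule, array_capsule
```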

@@ -25,8 +25,8 @@ requires-python = ">=3.9"
 dynamic = ["version"]
 
 [project.optional-dependencies]
-dbapi = ["pandas", "pyarrow>=8.0.0"]
-test = ["duckdb", "pandas", "pyarrow>=8.0.0", "pytest"]
+dbapi = ["pandas", "pyarrow>=14.0.1"]
Member Author

Are we OK with bumping this requirement? (I don't know which existing users of the Python ADBC packages might be affected.)

Member


I figured we should bump it just because our official guidance is to upgrade.

Contributor


Not a huge deal now, but if usage grows over time this could become a pain point for pandas.

Member


We can remove it. I'm just not sure if we can express in the requirements that you need the fixed package if you're on pyarrow < 14.0.1.

Member Author


I reverted this change to the minimum requirement. This PR doesn't strictly need it, so let's keep the discussion about bumping the minimum requirement separate.



@pytest.mark.sqlite
def test_pycapsule(sqlite):
Member Author


I further expanded the specific test that David started. In addition (at least if we require pyarrow >= 14 for testing), I could also update the other tests above to do the import/export using capsules instead of the current calls to _import_from_c; that would automatically give some more test coverage as well.
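One way such tests can verify the export without depending on pyarrow at all is to check the capsule name directly via the C API. This helper is a hypothetical sketch, not the PR's actual test code:

```python
import ctypes

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]


def is_schema_capsule(obj):
    """Return True if obj.__arrow_c_schema__() yields an 'arrow_schema' capsule."""
    capsule = obj.__arrow_c_schema__()
    return PyCapsule_IsValid(capsule, b"arrow_schema") == 1
```

With pyarrow >= 14 available, a stronger round-trip check is of course to import the object directly (e.g. via pyarrow.schema(obj)).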

Member

@lidavidm left a comment

Thanks! Just one small question


Member

@lidavidm left a comment

Thanks!

@lidavidm lidavidm merged commit 9544887 into apache:main Dec 13, 2023
48 of 49 checks passed
@jorisvandenbossche jorisvandenbossche deleted the capsules branch December 13, 2023 20:25
@jorisvandenbossche jorisvandenbossche changed the title feat(python/adbc_driver_manager): export handles through python Arrow Capsule interface feat(python/adbc_driver_manager): export handles and ingest data through python Arrow Capsule interface Jan 19, 2024
@jorisvandenbossche jorisvandenbossche changed the title feat(python/adbc_driver_manager): export handles and ingest data through python Arrow Capsule interface feat(python/adbc_driver_manager): export handles and ingest data through python Arrow PyCapsule interface Jan 19, 2024