Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Arrow - Velox conversion support #4450

Closed

Conversation

sanjibansg
Copy link
Contributor

This PR introduces PyVelox functions for the conversion of Arrow Arrays to Velox Vectors and vice-versa.

@facebook-github-bot
Copy link
Contributor

Hi @sanjibansg!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@netlify
Copy link

netlify bot commented Mar 29, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 579010b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/64820985835d670008c86fa4

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 29, 2023
@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@pedroerp pedroerp requested a review from kgpai March 29, 2023 16:20
Comment on lines 431 to 434
[](uintptr_t arrowArrayPtr, uintptr_t arrowSchemaPtr) {
auto arrowArray = reinterpret_cast<ArrowArray*>(arrowArrayPtr);
auto arrowSchema = reinterpret_cast<ArrowSchema*>(arrowSchemaPtr);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this function is written such that you pass the pointers as arguments, but alternatively, the function could accept a pyarrow.Array object, and then the pointers could be initialized here and the _export_to_c method could be called here? That way, the user doesn't need to call cffi themselves, and what now takes

        c_schema = ffi.new("struct ArrowSchema*")
        schema_ptr = int(ffi.cast("uintptr_t", c_schema))

        c_array = ffi.new("struct ArrowArray*")
        array_ptr = int(ffi.cast("uintptr_t", c_array))

        arr = pa.array([1, 2, 3], type=pa.int32())
        arr._export_to_c(array_ptr, schema_ptr)

        velox_vector = pv.import_from_arrow(array_ptr, schema_ptr)

could be reduced to

        velox_vector = pv.import_from_arrow(arr)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my initial thought. But, I guess that would make PyArrow a dependency here, which I am not sure whether it will be a good approach.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to make it only an optional dependency, I think. Like try to import it, and if not available, raise an informative error?
Long term, we should also make it possible to pass Arrow data without relying on pyarrow (using the C Data Interface, xref apache/arrow#34031). But short term I think assuming you get a pyarrow.Array is the easiest and most useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make it a proper dependency if thats the way to go. What would be concerns with making pyarrow a proper dependency for pyvelox ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be concerns with making pyarrow a proper dependency for pyvelox ?

Just being an additional dependency that you need to have installed (and each additional dependency can give some installation problem), while it is not strictly needed for other functionality of PyVelox.
Now, since we won't require a development version of pyarrow, that's shouldn't give much problems.

Copy link
Contributor Author

@sanjibansg sanjibansg Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the conversion function accepts a py::object and calls the method _export_to_c internally. Does this approach look good?

cc: @jorisvandenbossche @kgpai

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanjibansg thanks for the update. Yes, that looks good to me! In addition, we could also do the same for the export_to_arrow function, so that the two methods are more each other counterparts?

@assignUser assignUser force-pushed the arrow-velox-conversion branch from b695e6e to 9f88745 Compare April 4, 2023 18:32
@assignUser
Copy link
Collaborator

assignUser commented Apr 4, 2023

I rebased to fix the failing builds due to the arrow link change. The wheel workflow works now and make python-test should also install the extras_require.

Copy link
Contributor

@kgpai kgpai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Sanjiban,

Some questions.

Makefile Outdated
@@ -155,7 +155,8 @@ python-clean:
DEBUG=1 ${PYTHON_EXECUTABLE} setup.py clean

python-build:
DEBUG=1 CMAKE_BUILD_PARALLEL_LEVEL=4 ${PYTHON_EXECUTABLE} setup.py develop
DEBUG=1 CMAKE_BUILD_PARALLEL_LEVEL=4 ${PYTHON_EXECUTABLE} -m pip install -e .$(extras)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why the pip install ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can use the tests extras and also because it is the "correct" way to do it afaik

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For another link, see the warning note here: https://setuptools.pypa.io/en/latest/deprecated/commands.html

pyvelox/pyvelox.h Outdated Show resolved Hide resolved
pyvelox/pyvelox.h Outdated Show resolved Hide resolved
auto arrowArray = new ArrowArray();
std::shared_ptr<facebook::velox::memory::MemoryPool> pool_{
facebook::velox::memory::getDefaultMemoryPool()};
facebook::velox::exportToArrow(inputVector, *arrowArray, pool_.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like this isnt 0 copy , atleast for flat vectors it ought to be possible to have velox to arrow be 0 copy , right ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atleast for flat vectors it ought to be possible to have velox to arrow be 0 copy , right ?

Yes, that should be fully zero-copy. It's only for those vector types that don't have an exact 1:1 mapping with arrow types that additional data might need to be allocated. (eg StringView)

pyvelox/pyvelox.h Outdated Show resolved Hide resolved
Comment on lines 431 to 434
[](uintptr_t arrowArrayPtr, uintptr_t arrowSchemaPtr) {
auto arrowArray = reinterpret_cast<ArrowArray*>(arrowArrayPtr);
auto arrowSchema = reinterpret_cast<ArrowSchema*>(arrowSchemaPtr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make it a proper dependency if thats the way to go. What would be concerns with making pyarrow a proper dependency for pyvelox ?

pyvelox/pyvelox.h Outdated Show resolved Hide resolved

def test_export_to_arrow(self):
vector = pv.from_list([1, 2, 3])
vector_ptr = pv.export_to_arrow(vector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common to have c style names like _ptr etc ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking about using the _ptr here, or the snake_case style used? For the _ptr usage, I can use better names, it was just to indicate that the integer value returned from the method actually refers to a memory address.

@sanjibansg sanjibansg force-pushed the arrow-velox-conversion branch from 4772c39 to e8720e6 Compare April 25, 2023 07:31
@sanjibansg sanjibansg requested a review from kgpai May 2, 2023 04:10
@sanjibansg sanjibansg force-pushed the arrow-velox-conversion branch from e8720e6 to 3ffd670 Compare May 2, 2023 05:12
Copy link
Contributor

@kgpai kgpai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Sanjiban. This looks very good. Some minor questions .

[](py::object inputArrowArray) {
auto arrowArray = std::make_unique<ArrowArray>();
auto arrowSchema = std::make_unique<ArrowSchema>();
inputArrowArray.attr("_export_to_c")(reinterpret_cast<uintptr_t>(arrowArray.get()), reinterpret_cast<uintptr_t>(arrowSchema.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be some sort of validation that this is a valid arrow object and this attr is present , and otherwise a helpful error message ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer term, I would like that we add some sort of python protocol (eg __arrow_c_data__()) in the Arrow ecosystem, so that any object that uses arrow-compatible memory can be consumed here.
But on the short term, checking that the input is indeed an pyarrow Array is indeed a good idea (just checking that the attribute is present, and if not raising an informative error, is probably sufficient)

inputArrowArray.attr("_export_to_c")(reinterpret_cast<uintptr_t>(arrowArray.get()), reinterpret_cast<uintptr_t>(arrowSchema.get()));
std::shared_ptr<facebook::velox::memory::MemoryPool> pool_{
facebook::velox::memory::addDefaultLeafMemoryPool()};
return importFromArrowAsOwner(*arrowSchema, *arrowArray, pool_.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why not just use the default memory pool ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I am using the memory pool from the static instance of the PyVeloxContext. I believe it correctly manages the memory, and cleans it after usage.

@@ -254,3 +255,35 @@ def test_append(self):

with self.assertRaises(TypeError):
ints2.append(strs2)

def test_export_to_arrow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is great, can we add export / import tests for all the types we support currently though ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly good to add some more tests, but just to mention that the actual conversion itself (the C++ code) is also already tested at https://github.com/facebookincubator/velox/blob/main/velox/vector/arrow/tests/ArrowBridgeArrayTest.cpp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche , nevertheless would be good to add tests for atleast the primitive types to make sure there are no inadvertent casts across types , since its going via Cpp to Python etc. Also PyVelox doesnt support all the types thats Velox supports yet :) .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the test cases for integers, floats, and strings. Should I add some other test cases as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, while testing multiple test cases, I noticed, that the memory pool instance was not handled correctly earlier. So, I had to move the Instance struct to a separate header file, does this approach look good, or can there be a better way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the latest vectorsaver changes also moved it to a different file. Please merge from latest.

Copy link
Contributor

@kgpai kgpai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @sanjibansg . Can you rebase against latest main and It should be good to merge after that.

@@ -254,3 +255,35 @@ def test_append(self):

with self.assertRaises(TypeError):
ints2.append(strs2)

def test_export_to_arrow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the latest vectorsaver changes also moved it to a different file. Please merge from latest.

@sanjibansg sanjibansg force-pushed the arrow-velox-conversion branch 2 times, most recently from 259598f to 70d85f9 Compare May 31, 2023 07:20
@sanjibansg sanjibansg requested a review from kgpai May 31, 2023 07:23
@sanjibansg
Copy link
Contributor Author

Looks good @sanjibansg . Can you rebase against latest main and It should be good to merge after that.

Rebased it to main, and did the format-check. Thanks!

@kgpai
Copy link
Contributor

kgpai commented Jun 1, 2023

@sanjibansg Can you look into the failing PyVelox build ?

@sanjibansg sanjibansg force-pushed the arrow-velox-conversion branch from b4cd95a to 87e5f2b Compare June 3, 2023 02:38
@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kgpai merged this pull request in 8e95993.

@conbench-facebook
Copy link

Conbench analyzed the 1 benchmark run on commit 8e959935.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@sighingnow
Copy link

Hi, I would like to know when will the package on pypi https://pypi.org/project/pyvelox/ to be updated to include this important change?

Thanks!

@kgpai
Copy link
Contributor

kgpai commented Jul 14, 2023

@sighingnow I have a job here to publish the new python package : https://github.com/facebookincubator/velox/actions/runs/5554843861 .

@kgpai
Copy link
Contributor

kgpai commented Jul 14, 2023

The packages are published @sighingnow , you can get it from pypi.

@sanjibansg sanjibansg deleted the arrow-velox-conversion branch July 14, 2023 18:18
@sighingnow
Copy link

The packages are published @sighingnow , you can get it from pypi.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants