Add Arrow - Velox conversion support #4450
Hi @sanjibansg! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
pyvelox/pyvelox.h
Outdated
[](uintptr_t arrowArrayPtr, uintptr_t arrowSchemaPtr) {
  auto arrowArray = reinterpret_cast<ArrowArray*>(arrowArrayPtr);
  auto arrowSchema = reinterpret_cast<ArrowSchema*>(arrowSchemaPtr);
Currently, this function takes the pointers as arguments. Alternatively, the function could accept a pyarrow.Array object, initialize the pointers here, and call the _export_to_c method itself? That way, the user doesn't need to call cffi themselves, and what now takes
c_schema = ffi.new("struct ArrowSchema*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
c_array = ffi.new("struct ArrowArray*")
array_ptr = int(ffi.cast("uintptr_t", c_array))
arr = pa.array([1, 2, 3], type=pa.int32())
arr._export_to_c(array_ptr, schema_ptr)
velox_vector = pv.import_from_arrow(array_ptr, schema_ptr)
could be reduced to
velox_vector = pv.import_from_arrow(arr)
?
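To illustrate the suggestion above, here is a minimal sketch of a helper that allocates the two C structs itself and lets the array object fill them in, so the caller never touches raw pointers. It uses ctypes from the standard library rather than cffi. The struct layouts follow the Arrow C Data Interface; the helper name `export_to_c_structs` is hypothetical and is not part of PyVelox or pyarrow. The only pyarrow behavior assumed is that `_export_to_c` accepts two integer addresses, as in the snippet above.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Field layout of struct ArrowSchema (Arrow C Data Interface)."""

    _fields_ = [
        ("format", ctypes.c_char_p),
        ("name", ctypes.c_char_p),
        ("metadata", ctypes.c_char_p),
        ("flags", ctypes.c_int64),
        ("n_children", ctypes.c_int64),
        ("children", ctypes.c_void_p),
        ("dictionary", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


class ArrowArray(ctypes.Structure):
    """Field layout of struct ArrowArray (Arrow C Data Interface)."""

    _fields_ = [
        ("length", ctypes.c_int64),
        ("null_count", ctypes.c_int64),
        ("offset", ctypes.c_int64),
        ("n_buffers", ctypes.c_int64),
        ("n_children", ctypes.c_int64),
        ("buffers", ctypes.c_void_p),
        ("children", ctypes.c_void_p),
        ("dictionary", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


def export_to_c_structs(arr):
    """Hypothetical helper: allocate both structs and let `arr` fill them.

    `arr` is assumed to expose `_export_to_c(array_ptr, schema_ptr)`,
    taking the two struct addresses as plain integers.
    """
    c_array = ArrowArray()
    c_schema = ArrowSchema()
    arr._export_to_c(ctypes.addressof(c_array), ctypes.addressof(c_schema))
    return c_array, c_schema
```

In a real binding the filled structs would be handed straight to the importer, which then becomes responsible for invoking their release callbacks; this sketch leaves that part out.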
Yes, that was my initial thought. But I guess that would make PyArrow a dependency here, and I am not sure that is a good approach.
It should be possible to make it only an optional dependency, I think. For example, try to import it and, if it is not available, raise an informative error?
Long term, we should also make it possible to pass Arrow data without relying on pyarrow (using the C Data Interface, xref apache/arrow#34031). But short term I think assuming you get a pyarrow.Array is the easiest and most useful.
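The optional-dependency idea can be sketched roughly as follows. This is only an illustration of the pattern, not the code in this PR; the helper name `import_optional` and the wording of the error message are made up for the example.

```python
import importlib


def import_optional(module_name, feature):
    """Import `module_name`, or raise an ImportError that names the feature.

    Hypothetical helper: it defers the import until the feature is used,
    so the package itself still installs and imports without the extra.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for {feature}; "
            f"install it with `pip install {module_name}`"
        ) from exc
```

A conversion function would then call something like `pa = import_optional("pyarrow", "Arrow conversion")` at the point where it first needs pyarrow.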
We can make it a proper dependency if that's the way to go. What would be the concerns with making pyarrow a proper dependency for pyvelox?
What would be the concerns with making pyarrow a proper dependency for pyvelox?
It would just be an additional dependency that you need to have installed (and each additional dependency can cause installation problems), while it is not strictly needed for the other functionality of PyVelox.
Now, since we won't require a development version of pyarrow, that shouldn't cause many problems.
Currently, the conversion function accepts a py::object and calls the method _export_to_c internally. Does this approach look good?
@sanjibansg thanks for the update. Yes, that looks good to me! In addition, we could do the same for the export_to_arrow function, so that the two methods are closer counterparts of each other?
b695e6e to 9f88745
I rebased to fix the failing builds due to the arrow link change. The wheel workflow works now and |
Thanks Sanjiban,
Some questions.
Makefile
Outdated
@@ -155,7 +155,8 @@ python-clean:
 	DEBUG=1 ${PYTHON_EXECUTABLE} setup.py clean

 python-build:
-	DEBUG=1 CMAKE_BUILD_PARALLEL_LEVEL=4 ${PYTHON_EXECUTABLE} setup.py develop
+	DEBUG=1 CMAKE_BUILD_PARALLEL_LEVEL=4 ${PYTHON_EXECUTABLE} -m pip install -e .$(extras)
Curious, why the pip install?
So that we can use the tests extras, and also because it is the "correct" way to do it, AFAIK.
For another link, see the warning note here: https://setuptools.pypa.io/en/latest/deprecated/commands.html
pyvelox/pyvelox.h
Outdated
auto arrowArray = new ArrowArray();
std::shared_ptr<facebook::velox::memory::MemoryPool> pool_{
    facebook::velox::memory::getDefaultMemoryPool()};
facebook::velox::exportToArrow(inputVector, *arrowArray, pool_.get());
Also, it looks like this isn't zero-copy. At least for flat vectors, it ought to be possible for the Velox-to-Arrow conversion to be zero-copy, right?
At least for flat vectors, it ought to be possible for the Velox-to-Arrow conversion to be zero-copy, right?
Yes, that should be fully zero-copy. Additional data might need to be allocated only for those vector types that don't have an exact 1:1 mapping with Arrow types (e.g., StringView).
pyvelox/test/test_vector.py
Outdated
def test_export_to_arrow(self):
    vector = pv.from_list([1, 2, 3])
    vector_ptr = pv.export_to_arrow(vector)
Is it common to have C-style names like _ptr, etc.?
Are you asking about using the _ptr here, or the snake_case style used? For the _ptr usage, I can use better names; it was just to indicate that the integer value returned from the method actually refers to a memory address.
4772c39 to e8720e6
e8720e6 to 3ffd670
Thanks Sanjiban. This looks very good. Some minor questions.
pyvelox/conversion.cpp
Outdated
[](py::object inputArrowArray) {
  auto arrowArray = std::make_unique<ArrowArray>();
  auto arrowSchema = std::make_unique<ArrowSchema>();
  inputArrowArray.attr("_export_to_c")(
      reinterpret_cast<uintptr_t>(arrowArray.get()),
      reinterpret_cast<uintptr_t>(arrowSchema.get()));
Should there be some sort of validation that this is a valid Arrow object and that this attr is present, with a helpful error message otherwise?
Longer term, I would like us to add some sort of Python protocol (e.g. __arrow_c_data__()) in the Arrow ecosystem, so that any object that uses Arrow-compatible memory can be consumed here.
But in the short term, checking that the input is a pyarrow Array is indeed a good idea (just checking that the attribute is present, and raising an informative error if not, is probably sufficient).
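The attribute check described above might look like this when written in Python. The binding in this PR is C++ (pybind11), so this is only a sketch of the logic; the function name `get_arrow_exporter` and the error text are hypothetical.

```python
def get_arrow_exporter(obj):
    """Return obj._export_to_c, or raise TypeError with a helpful message.

    Hypothetical helper: it only checks that the attribute exists and is
    callable, which is the minimal validation suggested in the review.
    """
    exporter = getattr(obj, "_export_to_c", None)
    if not callable(exporter):
        raise TypeError(
            "expected a pyarrow.Array (an object with an _export_to_c "
            f"method), got {type(obj).__name__}"
        )
    return exporter
```

Passing a plain list, for instance, raises a TypeError that names the offending type instead of failing with a bare AttributeError deep inside the binding.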
pyvelox/conversion.cpp
Outdated
inputArrowArray.attr("_export_to_c")(
    reinterpret_cast<uintptr_t>(arrowArray.get()),
    reinterpret_cast<uintptr_t>(arrowSchema.get()));
std::shared_ptr<facebook::velox::memory::MemoryPool> pool_{
    facebook::velox::memory::addDefaultLeafMemoryPool()};
return importFromArrowAsOwner(*arrowSchema, *arrowArray, pool_.get());
Curious, why not just use the default memory pool?
Currently, I am using the memory pool from the static instance of the PyVeloxContext. I believe it correctly manages the memory, and cleans it up after usage.
@@ -254,3 +255,35 @@ def test_append(self):

        with self.assertRaises(TypeError):
            ints2.append(strs2)

    def test_export_to_arrow(self):
Thanks, this is great. Can we add export/import tests for all the types we currently support, though?
It's certainly good to add some more tests, but just to mention that the actual conversion itself (the C++ code) is also already tested at https://github.com/facebookincubator/velox/blob/main/velox/vector/arrow/tests/ArrowBridgeArrayTest.cpp
Thanks @jorisvandenbossche, nevertheless it would be good to add tests for at least the primitive types to make sure there are no inadvertent casts across types, since the data goes from C++ to Python and back. Also, PyVelox doesn't yet support all the types that Velox supports :).
I have added the test cases for integers, floats, and strings. Should I add some other test cases as well?
Also, while running multiple test cases, I noticed that the memory pool instance was not handled correctly earlier, so I had to move the Instance struct to a separate header file. Does this approach look good, or is there a better way?
Yes, I think the latest vectorsaver changes also moved it to a different file. Please merge from latest.
Looks good @sanjibansg. Can you rebase against the latest main? It should be good to merge after that.
259598f to 70d85f9
Rebased it to main, and did the format-check. Thanks!
@sanjibansg Can you look into the failing PyVelox build?
…ot memory address
b4cd95a to 87e5f2b
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Hi, I would like to know when the package on PyPI (https://pypi.org/project/pyvelox/) will be updated to include this important change. Thanks!
@sighingnow I have a job here to publish the new Python package: https://github.com/facebookincubator/velox/actions/runs/5554843861.
The packages are published, @sighingnow; you can get them from PyPI.
Thank you!
This PR introduces PyVelox functions for the conversion of Arrow Arrays to Velox Vectors and vice-versa.