Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: memoryview of Apache Arrow buffers (#9) #11

Conversation

maartenbreddels
Copy link

@maartenbreddels maartenbreddels commented Sep 11, 2020

This tests the issue mentioned in #9, which gives me


    def test_arrow():
        import pyarrow as pa
        ar = pa.array(['foo', 'bar'])
        blake = blake3()
>       blake.update(memoryview(ar.buffers()[1]))
E       BufferError: Incompatible type as buffer

tests/test_blake3.py:98: BufferError

@oconnor663
Copy link
Owner

The failure seems to be at this line, with PyBuffer::get returning an error. I'm not very familiar with the details of Python's buffer protocol, or with whatever quirks there might be in how PyO3 wraps it. I'm also unfamiliar with PyArrow. But as far as I can tell, this is either a bug in PyO3, a bug in PyArrow, or expected behavior?

@oconnor663
Copy link
Owner

It looks like the memoryview part might be a red herring. I get the same error if I try to use the buffers object directly.

@oconnor663
Copy link
Owner

Ok, putting some debugging into PyO3 clarifies this. The issue is that the PyArrow buffer has a signed char type. You can see that like this:

>>> import pyarrow
>>> ar = pyarrow.array(["foo", "bar"])
>>> memoryview(ar.buffers()[1]).format
'b'

Here's the table of format specifiers. Compare that to regular unsigned char format of a regular bytes object:

>>> memoryview(b"foobar").format
'B'

That leads to a fix. If I replace u8 with i8 in the PyBuffer::get call, then the test passes. (But of course everything else breaks.) Perhaps what we should do is to try to get the buffer as both u8 and i8, and use whichever one succeeds? This assumes that we're ok with pointer casting i8's to u8's...maybe that would have a weird result on hypothetical non-two's-complement machines?

@maartenbreddels
Copy link
Author

I think it will be a good approach, in the end, it's just bytes that will be processed, the type information will probably be lost (in the blake library) and nobody should be making any memory copies.

@oconnor663
Copy link
Owner

I've pushed beb4e32, which includes a fix and some tests, so I'll close out this PR. Thanks for pointing me to this!

@oconnor663 oconnor663 closed this Sep 12, 2020
@maartenbreddels
Copy link
Author

maartenbreddels commented Sep 12, 2020 via email

oconnor663 pushed a commit that referenced this pull request Sep 12, 2020
Changes since 0.1.6:
- Add support for hashing buffers of signed chars. This is uncommon,
  but it can come up with e.g. PyArrow. See
  #9 and
  #11.
@oconnor663
Copy link
Owner

Just pushed version 0.1.7 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants