Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow format string for format_str in _dtype_from_vaexdtype() #54

Closed
AlenkaF opened this issue Aug 25, 2021 · 6 comments
Closed

Arrow format string for format_str in _dtype_from_vaexdtype() #54

AlenkaF opened this issue Aug 25, 2021 · 6 comments
Labels
bug Something isn't working interchange-protocol

Comments

@AlenkaF
Copy link

AlenkaF commented Aug 25, 2021

The code now uses NumPy format strings, while the docs for Column.dtype specify it must use the format string from the Apache Arrow C Data Interface (similar but slightly different). So we need a utility to map NumPy to Arrow format here.

Example - should say 'b' not |b1':

df = pd.DataFrame({"A": [True, False, False, True]})

>>> df.__dataframe__().get_column_by_name('A').dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')

Source:
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface
https://numpy.org/doc/stable/reference/generated/numpy.dtype.itemsize.html
https://numpy.org/doc/stable/reference/generated/numpy.dtype.byteorder.html

@rgommers
Copy link
Member

@honno could you please verify that https://github.com/data-apis/dataframe-interchange-tests has a test checking that the format strings used in each dataframe library are indeed in Arrow format now?

xref #62, which has a basic test that may be reusable in case you don't yet have one.

@honno
Copy link
Member

honno commented Jul 28, 2022

could you please verify that https://github.com/data-apis/dataframe-interchange-tests has a test checking that the format strings used in each dataframe library are indeed in Arrow format now?

I had left it as a TODO, but just now updated test_dtype to test that the format strings are in Arrow format (for the most part—will explore testing some trickier scenarios later). I found vaex and cuDF were returning the NumPy-style format strings and filed issues respectively.

@rgommers
Copy link
Member

Thanks @honno. I think we are good here then. I double checked for Pandas, and that looks as expected:

>>> import pandas as pd
>>> pd.__version__
'1.5.0rc0'
>>> df = pd.DataFrame({"A": [True, False, False, True]})
>>> df.__dataframe__().get_column_by_name('A').dtype
(<DtypeKind.BOOL: 20>, 8, 'b', '|')

Let's close this. Thanks @AlenkaF and @honno

@jorisvandenbossche
Copy link
Member

When demoing the dataframe protocol at EuroScipy, I actually ran into this, seeing that Vaex is returning a wrong value:

import vaex
df_vaex = vaex.example()[:1000]
df_obj = df_vaex.__dataframe__()
col_obj = df_obj.get_column_by_name("x")
col_obj.get_buffers()

gives

{'data': (VaexBuffer({'bufsize': 4000, 'ptr': 139788622914344, 'device': 'CPU'}),
  (<_DtypeKind.FLOAT: 2>, 32, '<f4', '<')),
 'validity': None,
 'offsets': None}

And so the third entry in the dtype tuple "<f4" is wrong, it should be "f" instead (it seems this is just the numpy descriptor).

So while this might now be tested in dataframe-interchange-tests, is this failing or skipped there? And are there issues in the libraries to keep track of this?

@honno
Copy link
Member

honno commented Sep 13, 2022

So while this might now be tested in dataframe-interchange-tests, is this failing or skipped there? And are there issues in the libraries to keep track of this?

Ayup, issue filed in vaexio/vaex#2139, and it's failed in the test suite (well skipped due to flakiness issues).

@jorisvandenbossche
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working interchange-protocol
Projects
None yet
Development

No branches or pull requests

4 participants