WIP: ENH Add int[pyarrow] dtype #46972

Closed
wants to merge 3 commits into from

Conversation

mroeschke
Member

@jbrockmendel
Member

didn't like the parametrized-dtype idea?

@mroeschke
Member Author

mroeschke commented May 9, 2022

didn't like the parametrized-dtype idea?

Hmm I was mainly mimicking your structure for the nullable numeric dtypes. What's the parameterized-dtype route?

EDIT: Oh do you mean like

pd.array([1, 2], dtype="Int64")
pd.array([1, 2], dtype="Int64[pandas]")  # same as above
pd.array([1, 2], dtype="Int64[pyarrow]")


...

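class IntegerDtype(...):

    @classmethod
    def construct_from_string(cls, string):
        # "Int64" and "Int64[pandas]" both mean the existing masked storage
        if string in ("Int64", "Int64[pandas]"):
            return cls(storage="pandas")
        if string == "Int64[pyarrow]":
            return cls(storage="pyarrow")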
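
The string handling proposed above could be generalized so that it is not hard-coded per dtype name. The following is only a minimal sketch: `parse_dtype_string` and its regular expression are hypothetical names invented for illustration and are not part of pandas.

```python
import re

# Hypothetical pattern: a dtype name, optionally followed by "[storage]".
_DTYPE_PATTERN = re.compile(r"^(?P<name>\w+)(?:\[(?P<storage>\w+)\])?$")


def parse_dtype_string(string, default_storage="pandas"):
    """Split a string such as "Int64[pyarrow]" into (name, storage)."""
    match = _DTYPE_PATTERN.match(string)
    if match is None:
        raise TypeError(f"Cannot parse dtype string {string!r}")
    # A bare name such as "Int64" falls back to the default storage.
    storage = match.group("storage") or default_storage
    if storage not in ("pandas", "pyarrow"):
        raise TypeError(f"Unknown storage {storage!r} in {string!r}")
    return match.group("name"), storage
```

With this sketch, `"Int64"` and `"Int64[pandas]"` resolve to the same pair, matching the "same as above" comment in the example.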

@jbrockmendel
Member

The suggestion is to have a single PyarrowDtype class and instances of PyarrowDtype[int64] instead of a bunch of FooPyarrowDtype classes.

If it were up to me we'd move to having just NullableDtype[foo] instead of FooNullableDtype as well. (and just MaskedArray instead of BooleanArray/IntegerArray/FloatingArray, kinda xref #43002)
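
To illustrate the "one class, many instances" idea without depending on pandas or pyarrow, here is a minimal sketch. `ParametrizedDtype` and its fields are hypothetical stand-ins, not the actual pandas design; the point is only that `int64` and `float32` become instances of the same class rather than separate classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParametrizedDtype:
    """Hypothetical single dtype class, parametrized by the underlying type name."""

    type_name: str
    storage: str = "pyarrow"

    @property
    def name(self) -> str:
        # String form mirrors the "int64[pyarrow]" spelling used in this thread.
        return f"{self.type_name}[{self.storage}]"

    @property
    def is_numeric(self) -> bool:
        # Behavior is decided by the parameter, not by a subclass.
        return self.type_name.startswith(("int", "uint", "float"))


int64 = ParametrizedDtype("int64")
float32 = ParametrizedDtype("float32")
string = ParametrizedDtype("string")
```

Because the dataclass is frozen, instances compare equal by parameter and are hashable, so `ParametrizedDtype("int64") == int64` holds without a dedicated class per type.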

@mroeschke
Member Author

Hmm okay I think I am starting to understand. So to confirm with relevant examples:

import pyarrow as pa
import pyarrow.compute as pc

from pandas.core.dtypes.base import StorageExtensionDtype


class ArrowDtype(StorageExtensionDtype):

    def __init__(self, storage, pa_dtype):
        # validate
        self.storage = storage
        self.pa_dtype = pa_dtype

    @property
    def _is_numeric(self):
        # pyarrow's predicate for float types is is_floating
        return pa.types.is_integer(self.pa_dtype) or pa.types.is_floating(self.pa_dtype)


class ArrowExtensionArray:

    def __init__(self, values, pa_dtype):
        # validate
        self._data = values
        self._dtype = ArrowDtype("pyarrow", pa_dtype)

    @property
    def dtype(self) -> ArrowDtype:
        return self._dtype

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy: bool = False):
        # classmethod: there is no instance yet, so dispatch on the dtype argument
        if dtype._is_numeric:
            from pandas.core.tools.numeric import to_numeric

            scalars = to_numeric(strings, errors="raise")
        else:
            # something else
            raise NotImplementedError
        return cls._from_sequence(scalars, dtype=dtype, copy=copy)

    def mean(self):
        if self.dtype._is_numeric:
            # call pyarrow compute mean
            return pc.mean(self._data)
        else:
            # raise TypeError?
            raise TypeError(f"mean is not supported for dtype {self.dtype.pa_dtype}")
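
The dispatch pattern in `mean` can be tried out without pandas or pyarrow. The following is a rough, dependency-free sketch: `SketchDtype` and `SketchArray` are hypothetical names, a plain Python list stands in for the pyarrow storage, and `statistics.fmean` stands in for the pyarrow compute call.

```python
import statistics


class SketchDtype:
    """Hypothetical dtype carrying a type 'kind' instead of a pyarrow type."""

    def __init__(self, kind):
        self.kind = kind

    @property
    def is_numeric(self):
        return self.kind in ("integer", "floating")


class SketchArray:
    """Hypothetical array: one class whose reductions branch on the dtype."""

    def __init__(self, values, dtype):
        self._data = list(values)
        self._dtype = dtype

    @property
    def dtype(self):
        return self._dtype

    def mean(self):
        # Reject non-numeric dtypes up front rather than failing mid-computation.
        if not self.dtype.is_numeric:
            raise TypeError(f"mean is not supported for kind {self.dtype.kind!r}")
        return statistics.fmean(self._data)


numbers = SketchArray([1, 2, 3], SketchDtype("integer"))
words = SketchArray(["a", "b"], SketchDtype("string"))
```

Here `numbers.mean()` returns a float, while `words.mean()` raises `TypeError`, which is one possible answer to the "raise TypeError?" question above.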

@jbrockmendel
Member

that's pretty much what I had in mind, yah

@gsheni gsheni mentioned this pull request May 15, 2022
3 tasks
@gsheni
Contributor

gsheni commented May 15, 2022

@mroeschke I tried to do a similar approach for the floating[arrow] dtype, based on your discussion:

@mroeschke
Member Author

Closing in favor of #47034

@mroeschke mroeschke closed this May 16, 2022
@mroeschke mroeschke deleted the enh/integeraarrowarray branch May 16, 2022 18:54