GH-45175: [Python] Honor the strings_to_categorical keyword in to_pandas for string view type #45176

Merged

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Jan 5, 2025

Rationale for this change

Currently this keyword works for string or large string:

>>> import pyarrow as pa
>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.string())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    category
dtype: object
>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.large_string())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    category
dtype: object

but not for string view:

>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.string_view())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    object
dtype: object

For consistency, I think the keyword should apply to string view columns as well.

From https://github.com/apache/arrow/pull/44195/files#r1901831460

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. With strings_to_categorical=True, a column of string_view type is now converted to a pandas Categorical.

github-actions bot commented Jan 5, 2025

⚠️ GitHub issue #45175 has been automatically assigned in GitHub to PR creator.

Member

@raulcd raulcd left a comment

Thanks @jorisvandenbossche
I've learnt today that Table.__getitem__ returns a ChunkedArray.
I am going to merge this. Were you expecting this to go into 19.0.0? cc @amoeba

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Jan 7, 2025
@raulcd raulcd merged commit 2c5ae51 into apache:main Jan 7, 2025
14 checks passed
@raulcd raulcd removed the awaiting merge label Jan 7, 2025
amoeba pushed a commit that referenced this pull request Jan 7, 2025
…das for string view type (#45176)

### Rationale for this change

Currently this keyword works for string or large string:

```python
>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.string())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    category
dtype: object
>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.large_string())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    category
dtype: object
```

but not for string view:

```python
>>> table = pa.table({"col": pa.array(["a", "b", "a"], pa.string_view())})
>>> table.to_pandas(strings_to_categorical=True).dtypes
col    object
dtype: object
```

For consistency, I think the keyword should apply to string view columns as well.

From https://github.com/apache/arrow/pull/44195/files#r1901831460

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes. With `strings_to_categorical=True`, a column of string_view type is now converted to a pandas Categorical.

* GitHub Issue: #45175

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the strings-to-catorical-string-view branch January 9, 2025 16:41
amoeba pushed a commit that referenced this pull request Jan 11, 2025
…das for string view type (#45176)


After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 2c5ae51.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
