API: Allow other na values in StringArray Constructor #45168

Merged: 18 commits merged into pandas-dev:main on Jan 17, 2022

Conversation

lithomas1
Member

@lithomas1 lithomas1 commented Jan 3, 2022

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Around a 20-30% perf regression compared to master on the added benchmarks. It is still faster than 1.3.x, so the regression is not user-visible if we merge this for 1.4.0.
This is really just a quick and dirty solution to unblock @jbrockmendel.
The best solution would be to go through ensure_string_array (which is what _from_sequence does), but that would take a lot of effort to avoid perf regressions (it seems like ensure_string_array does not type its arr arg and also takes in EA inputs? We should probably restrict it to numpy arrays, as suggested by a code comment in the stub file).

I'll open an issue as a follow-up.
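
For readers following along, here is a minimal sketch of the kind of in-place conversion the description refers to. It is not the code from this PR: the `_NAType` class, the `NA` sentinel, and the `convert_nans_to_na` helper are hypothetical stand-ins that need only NumPy, and the sketch assumes the input has already been validated to hold only strings or NA-like values.

```python
import numpy as np


class _NAType:
    """Hypothetical stand-in for pd.NA so the sketch needs only NumPy."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def convert_nans_to_na(arr):
    """Replace every non-string entry of an object array with NA, in place.

    Assumes the caller has already validated that each entry is either
    a str or an NA-like value (None, np.nan, NA).
    """
    # np.ndindex yields one index tuple per element, for any ndim.
    for idx in np.ndindex(arr.shape):
        if not isinstance(arr[idx], str):
            arr[idx] = NA


values = np.array(["a", None, np.nan, "b"], dtype=object)
convert_nans_to_na(values)
print(values)
```

A pure-Python loop like this is far slower than a Cython version; it only illustrates the semantics (strings are kept, everything else becomes the NA sentinel).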

@jbrockmendel
Member

it seems like ensure_string_array does not type its arr arg and also takes in EA inputs

Yeah, that's a pattern we should avoid. IIRC there is a TODO in the code about that.

for j in range(n):
    val = arr[i][j]
    if not isinstance(val, str):
        result[i][j] = <object>C_NA
Member

I expect it's cheaper to index with result[i, j]? Or maybe just do res_i = result[i] on L694 and do res_i[k] = here? Similar for the arr[i][j] above.

Member Author

Went with the first option. Not too concerned about perf for 2D arrays given that this is a short-term solution.
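
The three indexing styles discussed in this thread can be compared in a short NumPy sketch. It runs in plain Python rather than Cython, so it shows only that the styles are equivalent in effect, not how fast they are; the `NA` placeholder and the array contents are assumptions made for illustration.

```python
import numpy as np

NA = object()  # hypothetical placeholder for the NA sentinel

arr = np.array([["a", None], [np.nan, "b"]], dtype=object)
chained = arr.copy()
direct = arr.copy()
row_cached = arr.copy()

n_rows, n_cols = arr.shape
for i in range(n_rows):
    res_i = row_cached[i]  # row view fetched once per row
    for j in range(n_cols):
        if not isinstance(arr[i, j], str):
            chained[i][j] = NA  # builds a temporary row view, then indexes it
            direct[i, j] = NA   # one indexing operation on the 2D array
            res_i[j] = NA       # writes through the cached row view

# All three variants end up with the same contents.
same = chained.tolist() == direct.tolist() == row_cached.tolist()
print(same)
```

The chained form `result[i][j]` creates an intermediate row object on every access, which is the overhead the review comment is pointing at. The tuple form and the cached-row form both avoid repeating that step inside the inner loop.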

@jreback jreback added the ExtensionArray, Missing-data, and Strings labels Jan 3, 2022
@lithomas1 lithomas1 requested review from jreback and removed request for jreback January 7, 2022 03:28
@lithomas1
Member Author

gentle ping @jreback @jbrockmendel.

# Check to see if need to convert Na values to pd.NA
if self._ndarray.ndim > 2:
    # Ravel if ndims > 2 b/c no cythonized version available
    lib.convert_nans_to_NA(self._ndarray.ravel("K"))
Member

Isn't the ravel unnecessary, since convert_nans_to_NA supports 2D?

Member Author

@lithomas1 lithomas1 Jan 12, 2022

This is for > 2D (e.g. 3D stuff). IIUC, pandas doesn't support > 2D, but the fancy indexing tests that check for an error in the 3D case expect, for some reason, that a 3D StringArray is created successfully and that the error is only raised later, in the Series constructor.

Previously failing tests here: https://github.com/pandas-dev/pandas/runs/4697417795?check_suite_focus=true

Contributor

can we just always ravel and then reshape?

Contributor

for all ndim

Member

can we just always ravel and then reshape?

risks making copies
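
A small NumPy sketch of the copy concern: `ravel("K")` returns a view when the array's memory layout is contiguous, but has to copy when it is not, in which case in-place writes to the raveled array never reach the original. The array shapes and variable names below are illustrative assumptions, not taken from the PR.

```python
import numpy as np

base = np.full((2, 3, 4), np.nan, dtype=object)

# Contiguous input: ravel("K") can return a view onto the same memory.
flat_view = base.ravel("K")
view_shares = np.shares_memory(base, flat_view)

# Non-contiguous input (every other slice along axis 1): ravel must copy.
strided = base[:, ::2, :]
flat_copy = strided.ravel("K")
copy_shares = np.shares_memory(strided, flat_copy)

# A write into the copy is lost: it does not show up in strided or base.
flat_copy[0] = "lost"
write_lost = base[0, 0, 0] != "lost"

print(view_shares, copy_shares, write_lost)
```

So an unconditional ravel-then-reshape would behave differently depending on the input's layout, which is one reason to restrict the ravel to the cases that actually need it.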

Contributor

OK, I don't love the ndim > 2 check here at all (I get your reasoning), but since this will be refactored it would be good to simplify.

@lithomas1
Member Author

@github-actions pre-commit.

@pandas-dev pandas-dev deleted a comment from pep8speaks Jan 12, 2022
@lithomas1 lithomas1 closed this Jan 12, 2022
@lithomas1 lithomas1 reopened this Jan 12, 2022
@pep8speaks

pep8speaks commented Jan 15, 2022

Hello @lithomas1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-15 15:58:47 UTC

@lithomas1
Member Author

@github-actions pre-commit.

@lithomas1 lithomas1 closed this Jan 15, 2022
@lithomas1 lithomas1 reopened this Jan 15, 2022
@jreback jreback added this to the 1.5 milestone Jan 17, 2022
@jreback jreback merged commit e6a20bd into pandas-dev:main Jan 17, 2022
@jreback
Contributor

jreback commented Jan 17, 2022

thanks @lithomas1

4 participants