-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Allow other na values in StringArray Constructor #45168
Conversation
yah thats a pattern we should avoid. IIRC there is a TODO in the code about that. |
pandas/_libs/lib.pyx
Outdated
for j in range(n): | ||
val = arr[i][j] | ||
if not isinstance(val, str): | ||
result[i][j] = <object>C_NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i expect its cheaper to index with result[i, j]
? or maybe just do res_i = result[i]
on L694 on do res_i[k] =
here? similar for the arr[i][j]
above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with the first option. Not too concerned about perf for 2D arrays given that this is a short-term solution.
gentle ping @jreback @jbrockmendel. |
# Check to see if need to convert Na values to pd.NA | ||
if self._ndarray.ndim > 2: | ||
# Ravel if ndims > 2 b/c no cythonized version available | ||
lib.convert_nans_to_NA(self._ndarray.ravel("K")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnt the ravel unnecessary bc convert_nans_to_NA supports 2D?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for > 2D (e.g. 3d stuff). IIUC, pandas doesn't support > 2D but the fancy indexing tests that check for an error in the 3D case, expect that a 3D StringArray creates correctly for some reason only to raise later in the series constructor.
Previously failing tests here: https://github.com/pandas-dev/pandas/runs/4697417795?check_suite_focus=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just always ravel and then reshape?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for all ndim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just always ravel and then reshape?
risks making copies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk, don't luv the ndim > 3 here at all (i get your reasoning) but as this will be refactored would be good to simplify.
@github-actions pre-commit. |
Hello @lithomas1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-01-15 15:58:47 UTC |
@github-actions pre-commit. |
thanks @lithomas1 |
around 20-30% perf regression compared to master on added benchmarks. Still faster than 1.3.x so the regression is not user visible if we merge this for 1.4.0.
really just a quick and dirty solution to unblock @jbrockmendel.
The best solution would be to go through ensure_string_array(which is what _from_sequence does), but that would take a lot of effort to avoid perf regressions(it seems like ensure_string_array does not type its
arr
arg and also takes in EA inputs? we should probably restrict to numpy arrays as suggested by a code comment in the stub file).I'll open an issue as a follow-up.