-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST (string dtype): clean-up xpasssing tests with future string dtype #59323
TST (string dtype): clean-up xpasssing tests with future string dtype #59323
Conversation
b3b65a3
to
dd174ef
Compare
def test_unique_bad_unicode(index_or_series): | ||
# regression test for #34550 | ||
uval = "\ud83d" # smiley emoji | ||
|
||
obj = index_or_series([uval] * 2) | ||
obj = index_or_series([uval] * 2, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the solution here to add dtype=object
? Shouldn't this just work naturally with the inferred string type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work with a string dtype because this test is about "bad unicode". And an actual string dtype cannot represent invalid unicode (at least when using pyarrow under the hood. I assume that our object-dtype based one will be able to hold it).
To keep the spirit of the test (ensure our unique implementation can work with bad unicode in object dtype), I made it explicitly used object dtype.
See also the "Invalid unicode input" section in #59328 (that issue I started yesterday to start record breaking changes / things that are no longer supported with the string dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - also makes me think how we can leverage a BinaryDtype in the future, though that is a different topic for a different day
@@ -360,7 +361,7 @@ def test_info_memory_usage(): | |||
df = DataFrame(data) | |||
df.columns = dtypes | |||
|
|||
df_with_object_index = DataFrame({"a": [1]}, index=["foo"]) | |||
df_with_object_index = DataFrame({"a": [1]}, index=Index(["foo"], dtype=object)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment on all of these - might be overlooking something simple but unsure why dtype=object
is the solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because what we are testing here is that if you have object dtype, the full memory usage is not known (and you get this "+"):
In [1]: df = DataFrame({"a": ["a", "b"]})
In [2]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 2 non-null object
dtypes: object(1)
memory usage: 148.0+ bytes
In [3]: pd.options.future.infer_string = True
In [4]: df = DataFrame({"a": ["a", "b"]})
In [5]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 2 non-null string
dtypes: string(1)
memory usage: 150.0 bytes
So 148.0+ bytes
vs 150.0 bytes
.
Of course I could also update the expected result to be accurate instead of an estimate, but that would 1) complicate the test (since we still have to account for both current and future behaviour), and 2) we still need to test the case of object dtype explicitly anyway.
We should probably add a test specifically for string dtype, though, where we can assert that if you have a proper string dtype the memory is now always the full number and not a lower estimate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool - yea would be nice to add a test for exact memory representation with the StringDtype + pyarrow, though can be done separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
The failures with python-dev / numpy-dev seem unrelated. |
Follow-up PR on #59320, doing some superficial clean-up of the tests (largely removing some xfails that were actually passing in the meantime)
xref #54792