-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST (string dtype): clean-up xpasssing tests with future string dtype #59323
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ | |
from pandas import ( | ||
CategoricalIndex, | ||
DataFrame, | ||
Index, | ||
MultiIndex, | ||
Series, | ||
date_range, | ||
|
@@ -360,7 +361,7 @@ def test_info_memory_usage(): | |
df = DataFrame(data) | ||
df.columns = dtypes | ||
|
||
df_with_object_index = DataFrame({"a": [1]}, index=["foo"]) | ||
df_with_object_index = DataFrame({"a": [1]}, index=Index(["foo"], dtype=object)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar comment on all of these - might be overlooking something simple but unsure why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because what we are testing here is that if you have object dtype, the full memory usage is not known (and you get this "+"):
So Of course I could also update the expected result to be accurate instead of an estimate, but that would 1) complicate the test (since we still have to account for both current and future behaviour), and 2) we still need to test the case of object dtype explicitly anyway. We should probably add a test specifically for string dtype, though, where we can assert that if you have a proper string dtype the memory is now always the full number and not a lower estimate. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Cool - yea would be nice to add a test for exact memory representation with the StringDtype + pyarrow, though can be done separately |
||
df_with_object_index.info(buf=buf, memory_usage=True) | ||
res = buf.getvalue().splitlines() | ||
assert re.match(r"memory usage: [^+]+\+", res[-1]) | ||
|
@@ -398,25 +399,25 @@ def test_info_memory_usage(): | |
|
||
@pytest.mark.skipif(PYPY, reason="on PyPy deep=True doesn't change result") | ||
def test_info_memory_usage_deep_not_pypy(): | ||
df_with_object_index = DataFrame({"a": [1]}, index=["foo"]) | ||
df_with_object_index = DataFrame({"a": [1]}, index=Index(["foo"], dtype=object)) | ||
assert ( | ||
df_with_object_index.memory_usage(index=True, deep=True).sum() | ||
> df_with_object_index.memory_usage(index=True).sum() | ||
) | ||
|
||
df_object = DataFrame({"a": ["a"]}) | ||
df_object = DataFrame({"a": Series(["a"], dtype=object)}) | ||
assert df_object.memory_usage(deep=True).sum() > df_object.memory_usage().sum() | ||
|
||
|
||
@pytest.mark.xfail(not PYPY, reason="on PyPy deep=True does not change result") | ||
def test_info_memory_usage_deep_pypy(): | ||
df_with_object_index = DataFrame({"a": [1]}, index=["foo"]) | ||
df_with_object_index = DataFrame({"a": [1]}, index=Index(["foo"], dtype=object)) | ||
assert ( | ||
df_with_object_index.memory_usage(index=True, deep=True).sum() | ||
== df_with_object_index.memory_usage(index=True).sum() | ||
) | ||
|
||
df_object = DataFrame({"a": ["a"]}) | ||
df_object = DataFrame({"a": Series(["a"], dtype=object)}) | ||
assert df_object.memory_usage(deep=True).sum() == df_object.memory_usage().sum() | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the solution here to add
dtype=object
? Shouldn't this just work naturally with the inferred string type?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work with a string dtype because this test is about "bad unicode". And an actual string dtype cannot represent invalid unicode (at least when using pyarrow under the hood. I assume that our object-dtype based one will be able to hold it).
To keep the spirit of the test (ensure our unique implementation can work with bad unicode in object dtype), I made it explicitly used object dtype.
See also the "Invalid unicode input" section in #59328 (that issue I started yesterday to start record breaking changes / things that are no longer supported with the string dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - also makes me think how we can leverage a BinaryDtype in the future, though that is a different topic for a different day