-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: coerce pd.wide_to_long suffixes to ints #17628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: coerce pd.wide_to_long suffixes to ints #17628
Conversation
pandas/tests/reshape/test_reshape.py
Outdated
@@ -991,3 +991,54 @@ def test_non_unique_idvars(self): | |||
}) | |||
with pytest.raises(ValueError): | |||
wide_to_long(df, ['A_A', 'B_B'], i='x', j='colname') | |||
|
|||
def test_cast_j_int(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example where the cast fails? e.g. columns are ['A_1', 'A_foo']...
pandas/core/reshape/reshape.py
Outdated
|
||
def melt_stub(df, stub, i, j, value_vars, sep): | ||
newdf = melt(df, id_vars=i, value_vars=value_vars, | ||
value_name=stub.rstrip(sep), var_name=j) | ||
newdf[j] = Categorical(newdf[j]) | ||
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "") | ||
newdf[j] = newdf[j].astype('int', errors='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cast to int64 (int is platform specific)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a whatsnew note (in api breaking)
pandas/core/reshape/reshape.py
Outdated
@@ -852,7 +852,7 @@ def lreshape(data, groups, dropna=True, label=None): | |||
|
|||
|
|||
def wide_to_long(df, stubnames, i, j, sep="", suffix='\d+'): | |||
r""" | |||
""" | |||
Wide panel to long format. Less flexible but more user-friendly than melt. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a node that the casting will occur (with a versionadded tag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a one line explanation, followed by the versionadded tag. Not sure if thats correct
pandas/tests/reshape/test_reshape.py
Outdated
@@ -991,3 +991,54 @@ def test_non_unique_idvars(self): | |||
}) | |||
with pytest.raises(ValueError): | |||
wide_to_long(df, ['A_A', 'B_B'], i='x', j='colname') | |||
|
|||
def test_cast_j_int(self): | |||
df = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue as a comment
pandas/tests/reshape/test_reshape.py
Outdated
'Pirates of the Caribbean', | ||
'Spectre', | ||
'Avatar', | ||
'Pirates of the Caribbean', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use
result =
expected =
tm.assert_frame_equal(result, expected)
Codecov Report
@@ Coverage Diff @@
## master #17628 +/- ##
=========================================
Coverage ? 91.58%
=========================================
Files ? 153
Lines ? 51275
Branches ? 0
=========================================
Hits ? 46961
Misses ? 4314
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/reshape/reshape.py
Outdated
|
||
def melt_stub(df, stub, i, j, value_vars, sep): | ||
newdf = melt(df, id_vars=i, value_vars=value_vars, | ||
value_name=stub.rstrip(sep), var_name=j) | ||
newdf[j] = Categorical(newdf[j]) | ||
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "") | ||
|
||
# GH17627 Cast numerics suffixes to int/float | ||
newdf[j] = newdf[j].astype('int64', errors='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use to_numeric(..., errors='ignore')
and you will get what you need here. This is a soft conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Changed.
Hello @tdpetrou! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 10, 2017 at 16:20 Hours UTC |
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -434,6 +434,7 @@ Other API Changes | |||
- :class:`Period` is now immutable, and will now raise an ``AttributeError`` when a user tries to assign a new value to the ``ordinal`` or ``freq`` attributes (:issue:`17116`). | |||
- :func:`to_datetime` when passed a tz-aware ``origin=`` kwarg will now raise a more informative ``ValueError`` rather than a ``TypeError`` (:issue:`16842`) | |||
- Renamed non-functional ``index`` to ``index_col`` in :func:`read_stata` to improve API consistency (:issue:`16342`) | |||
- :func:`wide_to_long` previously kept interger-only suffixes as ``object`` dtype. Now they are casted to ``int64`` if possible (:issue:`17627`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/interger/integer/. Now they are casted to a numeric dtype if possible.
pandas/core/reshape/reshape.py
Outdated
@@ -1347,4 +1353,4 @@ def make_axis_dummies(frame, axis='minor', transform=None): | |||
values = np.eye(len(items), dtype=float) | |||
values = values.take(labels, axis=0) | |||
|
|||
return DataFrame(values, columns=items, index=frame.index) | |||
return DataFrame(values, columns=items, index=frame.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a return at the end
pandas/core/reshape/reshape.py
Outdated
@@ -9,11 +9,12 @@ | |||
|
|||
from pandas.core.dtypes.common import ( | |||
_ensure_platform_int, | |||
is_list_like, is_bool_dtype, | |||
is_list_like, is_bool_dtype, is_object_dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
pandas/tests/reshape/test_reshape.py
Outdated
i='A', j='colname', suffix='.+', sep='_') | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_float_suffix(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to use parametrize here to avoid adding lots of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also use fixtures if appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, not sure where I can use it here. All DataFrames are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
can you rebase. @TomAugspurger @jorisvandenbossche comments. |
edited to add: I did the following and then squashed all the commits into 1 and change a word in the commit message First time rebasing. I did...
Resolved conflicts then did
Gave some message about forgetting to do
Which did something but I don't know if this is what you wanted. |
it looks rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +0 on this. I think it's the preferable outcome, but I'm unsure about breaking API. Overall, it's probably OK to skip a keyword / deprecation cycle.
Perhaps make the API breakage slightly more obvious in the release notes by creating a mini-section with code examples detailing the old and new behavior.
maybe add |
@jreback Unless there is a different |
needs a rebase. this is orthogonal to #17677 ? |
Yes, this is mutually exclusive. I think pd.wide_to_long can be rewritten too. Its quite slow. |
ok, rebase and move whatsnew to 0.22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny comments. ping on green.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -1172,4 +1172,3 @@ Other | |||
^^^^^ | |||
- Bug where some inplace operators were not being wrapped and produced a copy when invoked (:issue:`12962`) | |||
- Bug in :func:`eval` where the ``inplace`` parameter was being incorrectly handled (:issue:`16732`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls revert this file
pandas/core/reshape/melt.py
Outdated
regex = "^{stub}{sep}{suffix}".format( | ||
stub=re.escape(stub), sep=re.escape(sep), suffix=suffix) | ||
return df.filter(regex=regex).columns.tolist() | ||
regex = '^{0}{1}{2}$'.format(re.escape(stub), re.escape(sep), suffix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slightly prefer the names in formatting
@jreback I attempted to revert whatsnew 0.21. Not sure if I did it correctly. I changed the names in the format as well. |
pandas/core/reshape/melt.py
Outdated
@@ -197,6 +198,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'): | |||
|
|||
.. versionadded:: 0.20.0 | |||
|
|||
When all suffixes are numeric, they are cast to int64/float64. | |||
|
|||
.. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to 0.22
pandas/core/reshape/melt.py
Outdated
@@ -335,22 +340,25 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'): | |||
----- | |||
All extra variables are left untouched. This simply uses | |||
`pandas.melt` under the hood, but is hard-coded to "do the right thing" | |||
in a typicaly case. | |||
in a typical case. | |||
""" | |||
def get_var_names(df, stub, sep, suffix): | |||
regex = "^{stub}{sep}{suffix}".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use re.compile
here
pandas/tests/reshape/test_reshape.py
Outdated
@@ -764,12 +764,12 @@ def test_simple(self): | |||
exp_data = {"X": x.tolist() + x.tolist(), | |||
"A": ['a', 'b', 'c', 'd', 'e', 'f'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need to move to test_melt.py
as well.
this should be after #18428, you can rebase on top (of that one after you have pushed) |
@jreback changes made. Should I add examples to whatsnew? |
@@ -199,6 +200,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'): | |||
|
|||
.. versionadded:: 0.20.0 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add an example here (in the doc-string); I think all the other examples are resulting as strings anyhow (could also modify an example to avoid making this longer). you can check docs in reshape.rst to see if anything needs updating.
@jreback Since all the examples had integers as suffixes, I added one with strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small changes & you have a lint issue. ping on green.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -127,6 +127,7 @@ Other API Changes | |||
- :func:`pandas.DataFrame.merge` no longer casts a ``float`` column to ``object`` when merging on ``int`` and ``float`` columns (:issue:`16572`) | |||
- The default NA value for :class:`UInt64Index` has changed from 0 to ``NaN``, which impacts methods that mask with NA, such as ``UInt64Index.where()`` (:issue:`18398`) | |||
- Building pandas for development now requires ``cython >= 0.24`` (:issue:`18613`) | |||
- :func:`wide_to_long` previously kept suffixes as ``object`` dtype. Now they are cast to numeric if possible (:issue:`17627`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same numeric-like suffixes
pandas/core/reshape/melt.py
Outdated
""" | ||
def get_var_names(df, stub, sep, suffix): | ||
regex = "^{stub}{sep}{suffix}".format( | ||
regex = '^{stub}{sep}{suffix}$'.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use r
here
can you rebase |
@jreback Got lost in rebase hell. Will self-flagellate 10 times if I did this wrong. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got lost in rebase hell.
Can you do it once more? (will try to merge as soon as possible to prevent new conflicts)
BTW, if you have multiple commits, I personally find merging master into the branch easier
pandas/core/reshape/melt.py
Outdated
@@ -199,6 +200,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'): | |||
|
|||
.. versionadded:: 0.20.0 | |||
|
|||
When all suffixes are numeric, they are cast to int64/float64. | |||
|
|||
.. versionadded:: 0.22.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use versionchanged instead of versionadded and format it this like:
.. versionchanged:: 0.22.0
When all suffixes are numeric, they are cast to int64/float64.
(see http://www.sphinx-doc.org/en/stable/markup/para.html#directive-versionadded)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just pushed this change. It looks a little weird now though because there is a versionadded directly above it.
...
suffix : str, default '\\d+'
A regular expression capturing the wanted suffixes. '\\d+' captures
numeric suffixes. Suffixes with no numbers could be specified with the
negated character class '\\D+'. You can also further disambiguate
suffixes, for example, if your wide variables are of the form
Aone, Btwo,.., and you have an unrelated column Arating, you can
ignore the last one by specifying `suffix='(!?one|two)'`
.. versionadded:: 0.20.0
.. versionchanged:: 0.22.0
When all suffixes are numeric, they are cast to int64/float64.
edit: oops didn't rebase... will do so now.
Thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diff
I also cleaned up the finding of the var_names and substituted in some list comprehensions.