Skip to content

BUG: coerce pd.wide_to_long suffixes to ints #17628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 10, 2017
Merged

BUG: coerce pd.wide_to_long suffixes to ints #17628

merged 4 commits into from
Dec 10, 2017

Conversation

tdpetrou
Copy link
Contributor

@tdpetrou tdpetrou commented Sep 22, 2017

I also cleaned up the finding of the var_names and substituted in some list comprehensions.

@@ -991,3 +991,54 @@ def test_non_unique_idvars(self):
})
with pytest.raises(ValueError):
wide_to_long(df, ['A_A', 'B_B'], i='x', j='colname')

def test_cast_j_int(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example where the cast fails? e.g. columns are ['A_1', 'A_foo']...


def melt_stub(df, stub, i, j, value_vars, sep):
newdf = melt(df, id_vars=i, value_vars=value_vars,
value_name=stub.rstrip(sep), var_name=j)
newdf[j] = Categorical(newdf[j])
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "")
newdf[j] = newdf[j].astype('int', errors='ignore')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cast to int64 (int is platform specific)

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 22, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a whatsnew note (in api breaking)

@@ -852,7 +852,7 @@ def lreshape(data, groups, dropna=True, label=None):


def wide_to_long(df, stubnames, i, j, sep="", suffix='\d+'):
r"""
"""
Wide panel to long format. Less flexible but more user-friendly than melt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a node that the casting will occur (with a versionadded tag)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a one line explanation, followed by the versionadded tag. Not sure if thats correct

@@ -991,3 +991,54 @@ def test_non_unique_idvars(self):
})
with pytest.raises(ValueError):
wide_to_long(df, ['A_A', 'B_B'], i='x', j='colname')

def test_cast_j_int(self):
df = pd.DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue as a comment

'Pirates of the Caribbean',
'Spectre',
'Avatar',
'Pirates of the Caribbean',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use

result =
expected = 
tm.assert_frame_equal(result, expected)

@codecov
Copy link

codecov bot commented Sep 22, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@1355df6). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #17628   +/-   ##
=========================================
  Coverage          ?   91.58%           
=========================================
  Files             ?      153           
  Lines             ?    51275           
  Branches          ?        0           
=========================================
  Hits              ?    46961           
  Misses            ?     4314           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.45% <100%> (?)
#single 40.68% <14.28%> (?)
Impacted Files Coverage Δ
pandas/core/reshape/melt.py 97.24% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1355df6...b9d3e62. Read the comment docs.


def melt_stub(df, stub, i, j, value_vars, sep):
newdf = melt(df, id_vars=i, value_vars=value_vars,
value_name=stub.rstrip(sep), var_name=j)
newdf[j] = Categorical(newdf[j])
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "")

# GH17627 Cast numerics suffixes to int/float
newdf[j] = newdf[j].astype('int64', errors='ignore')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use to_numeric(..., errors='ignore') and you will get what you need here. This is a soft conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Changed.

@pep8speaks
Copy link

pep8speaks commented Sep 22, 2017

Hello @tdpetrou! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 10, 2017 at 16:20 Hours UTC

@@ -434,6 +434,7 @@ Other API Changes
- :class:`Period` is now immutable, and will now raise an ``AttributeError`` when a user tries to assign a new value to the ``ordinal`` or ``freq`` attributes (:issue:`17116`).
- :func:`to_datetime` when passed a tz-aware ``origin=`` kwarg will now raise a more informative ``ValueError`` rather than a ``TypeError`` (:issue:`16842`)
- Renamed non-functional ``index`` to ``index_col`` in :func:`read_stata` to improve API consistency (:issue:`16342`)
- :func:`wide_to_long` previously kept interger-only suffixes as ``object`` dtype. Now they are casted to ``int64`` if possible (:issue:`17627`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/interger/integer/. Now they are casted to a numeric dtype if possible.

@@ -1347,4 +1353,4 @@ def make_axis_dummies(frame, axis='minor', transform=None):
values = np.eye(len(items), dtype=float)
values = values.take(labels, axis=0)

return DataFrame(values, columns=items, index=frame.index)
return DataFrame(values, columns=items, index=frame.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a return at the end

@@ -9,11 +9,12 @@

from pandas.core.dtypes.common import (
_ensure_platform_int,
is_list_like, is_bool_dtype,
is_list_like, is_bool_dtype, is_object_dtype,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

i='A', j='colname', suffix='.+', sep='_')
tm.assert_frame_equal(result, expected)

def test_float_suffix(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try to use parametrize here to avoid adding lots of code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also use fixtures if appropriate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, not sure where I can use it here. All DataFrames are different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@jreback jreback added this to the 0.21.0 milestone Sep 23, 2017
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

can you rebase.

@TomAugspurger @jorisvandenbossche comments.

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Sep 23, 2017

edited to add: I did the following and then squashed all the commits into 1 and change a word in the commit message

First time rebasing. I did...

git fetch upstream
git rebase upstream/master

Resolved conflicts then did

git add <files>
git rebase --continue

Gave some message about forgetting to do git add which i just did. Then i did

git rebase --skip
git push origin wide-to-long-int -f

Which did something but I don't know if this is what you wanted.

@jreback
Copy link
Contributor

jreback commented Sep 24, 2017

it looks rebased

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +0 on this. I think it's the preferable outcome, but I'm unsure about breaking API. Overall, it's probably OK to skip a keyword / deprecation cycle.

Perhaps make the API breakage slightly more obvious in the release notes by creating a mini-section with code examples detailing the old and new behavior.

@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

maybe add downcast=True (as the default), consistent with other functions that we have.

@jreback jreback removed this from the 0.21.0 milestone Sep 28, 2017
@tdpetrou
Copy link
Contributor Author

@jreback Unless there is a different to_numeric, there is no option for downcast=True. And why would you want to downcast it to something other than the default 64 bit types?

@tdpetrou tdpetrou mentioned this pull request Oct 30, 2017
2 tasks
@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

needs a rebase. this is orthogonal to #17677 ?

@tdpetrou
Copy link
Contributor Author

Yes, this is mutually exclusive. I think pd.wide_to_long can be rewritten too. Its quite slow.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2017

ok, rebase and move whatsnew to 0.22

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny comments. ping on green.

@@ -1172,4 +1172,3 @@ Other
^^^^^
- Bug where some inplace operators were not being wrapped and produced a copy when invoked (:issue:`12962`)
- Bug in :func:`eval` where the ``inplace`` parameter was being incorrectly handled (:issue:`16732`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls revert this file

regex = "^{stub}{sep}{suffix}".format(
stub=re.escape(stub), sep=re.escape(sep), suffix=suffix)
return df.filter(regex=regex).columns.tolist()
regex = '^{0}{1}{2}$'.format(re.escape(stub), re.escape(sep), suffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly prefer the names in formatting

@tdpetrou
Copy link
Contributor Author

@jreback I attempted to revert whatsnew 0.21. Not sure if I did it correctly. I changed the names in the format as well.

@@ -197,6 +198,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'):

.. versionadded:: 0.20.0

When all suffixes are numeric, they are cast to int64/float64.

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to 0.22

@@ -335,22 +340,25 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'):
-----
All extra variables are left untouched. This simply uses
`pandas.melt` under the hood, but is hard-coded to "do the right thing"
in a typicaly case.
in a typical case.
"""
def get_var_names(df, stub, sep, suffix):
regex = "^{stub}{sep}{suffix}".format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use re.compile here

@@ -764,12 +764,12 @@ def test_simple(self):
exp_data = {"X": x.tolist() + x.tolist(),
"A": ['a', 'b', 'c', 'd', 'e', 'f'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need to move to test_melt.py as well.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2017

this should be after #18428, you can rebase on top (of that one after you have pushed)

@tdpetrou
Copy link
Contributor Author

@jreback changes made. Should I add examples to whatsnew?

@@ -199,6 +200,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'):

.. versionadded:: 0.20.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add an example here (in the doc-string); I think all the other examples are resulting as strings anyhow (could also modify an example to avoid making this longer). you can check docs in reshape.rst to see if anything needs updating.

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Dec 6, 2017

@jreback Since all the examples had integers as suffixes, I added one with strings.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small changes & you have a lint issue. ping on green.

@@ -127,6 +127,7 @@ Other API Changes
- :func:`pandas.DataFrame.merge` no longer casts a ``float`` column to ``object`` when merging on ``int`` and ``float`` columns (:issue:`16572`)
- The default NA value for :class:`UInt64Index` has changed from 0 to ``NaN``, which impacts methods that mask with NA, such as ``UInt64Index.where()`` (:issue:`18398`)
- Building pandas for development now requires ``cython >= 0.24`` (:issue:`18613`)
- :func:`wide_to_long` previously kept suffixes as ``object`` dtype. Now they are cast to numeric if possible (:issue:`17627`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same numeric-like suffixes

"""
def get_var_names(df, stub, sep, suffix):
regex = "^{stub}{sep}{suffix}".format(
regex = '^{stub}{sep}{suffix}$'.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use r here

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Dec 6, 2017

@jreback

@jreback
Copy link
Contributor

jreback commented Dec 8, 2017

can you rebase

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Dec 8, 2017

@jreback Got lost in rebase hell. Will self-flagellate 10 times if I did this wrong.

@jorisvandenbossche jorisvandenbossche added this to the 0.22.0 milestone Dec 10, 2017
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got lost in rebase hell.

Can you do it once more? (will try to merge as soon as possible to prevent new conflicts)

BTW, if you have multiple commits, I personally find merging master into the branch easier

@@ -199,6 +200,10 @@ def wide_to_long(df, stubnames, i, j, sep="", suffix=r'\d+'):

.. versionadded:: 0.20.0

When all suffixes are numeric, they are cast to int64/float64.

.. versionadded:: 0.22.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use versionchanged instead of versionadded and format it this like:

.. versionchanged:: 0.22.0
    When all suffixes are numeric, they are cast to int64/float64.

(see http://www.sphinx-doc.org/en/stable/markup/para.html#directive-versionadded)

Copy link
Contributor Author

@tdpetrou tdpetrou Dec 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I just pushed this change. It looks a little weird now though because there is a versionadded directly above it.

...
suffix : str, default '\\d+'
        A regular expression capturing the wanted suffixes. '\\d+' captures
        numeric suffixes. Suffixes with no numbers could be specified with the
        negated character class '\\D+'. You can also further disambiguate
        suffixes, for example, if your wide variables are of the form
        Aone, Btwo,.., and you have an unrelated column Arating, you can
        ignore the last one by specifying `suffix='(!?one|two)'`

        .. versionadded:: 0.20.0

        .. versionchanged:: 0.22.0
            When all suffixes are numeric, they are cast to int64/float64.

edit: oops didn't rebase... will do so now.

@tdpetrou
Copy link
Contributor Author

@jorisvandenbossche

@jorisvandenbossche jorisvandenbossche merged commit a259b64 into pandas-dev:master Dec 10, 2017
@jorisvandenbossche
Copy link
Member

Thanks!

@tdpetrou tdpetrou deleted the wide-to-long-int branch December 10, 2017 21:51
jreback added a commit to jreback/pandas that referenced this pull request Dec 11, 2017
jreback added a commit that referenced this pull request Dec 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wide_to_long does not convert integer suffixes to int
5 participants