-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: MultiIndex.from_frame #23141
ENH: MultiIndex.from_frame #23141
Conversation
Hello @ArtinSarraf! Thanks for updating the PR.
Comment last updated on November 19, 2018 at 04:25 Hours UTC |
huh? pls show a real usecase for this in the future pls open an issue for discussion first |
Sorry should have tagged this issue opened ~2 months ago. The issue contains an example as well. |
Here are some more concrete examples of how these can be used as well. Adding metadata.
Filtering
The examples here are extremely simplified, but reflect how my coworkers and I use pandas for working with multiindices. |
Codecov Report
@@ Coverage Diff @@
## master #23141 +/- ##
==========================================
+ Coverage 92.2% 92.2% +<.01%
==========================================
Files 162 162
Lines 51700 51709 +9
==========================================
+ Hits 47670 47679 +9
Misses 4030 4030
Continue to review full report at Codecov.
|
you can do the same thing if everything is in long form.
What you are proposing is wide from manipulation, which though a nice idea, and in this particular case makes sense. Is not generaly purpose, nor easy for people to do (and requires sorting guarantees). So why should this be added? |
One of the nice features of multiindexes is that they act as meta-dataframes for the associated numeric data, allowing a frame to be much more concise and memory efficient. Long form manipulations require multiple pivots (which can be time consuming) and can require significant extra memory. The relationship between multiindexes and a standard dataframe can also be demonstrated through this following example. One can substitute a multiindex (and I have seen this done before) with a separate metaframe like so:
I believe that this, along with the fact that to_frame already exists shows that from_frame is a reasonable and logical complementary method to be added as a way to instantiate multiindexes. Regarding the sorting guarantees of to_frame, I think that this is something that any user would reasonably expect, considering the similarities between a mi and a df. Additionally, not only does it keep manipulations, that might involve using it, consistent, but all other related casting methods (e.g. tolist) ensure the sorting of the resultant type matches that of the original multiindex. At the very least I think adding a parameter to to_frame for preserving the original sorting would make sense. It would be very easy to add to the existing implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so need some discussions
and of course would need much testing - having some examples would acceptance possibikity
I would be happy to contribute to any further discussions. Is there a more appropriate forum for that? Also I will add tests to the PR this week. Are there any types of examples in particular that would help move along the addition of this feature? Thanks. |
some usecases, and showing how its possible now, but how the 'new' method makes it easeir. |
Here is a demo to show the benefits of the from_frame method. Thanks for taking the time to look.
Setup df
Adding meta data
Quick PrototypingUse this df for the example below
Formatting MultiIndexUse this formatted df for remainder of demo
Complex Filtering/Slicing
Keep only counties with > 1 city
Admittedly the 2 remaining examples below can be done without from_frame as well, like so:
Get CT + NJ without Greenwich
Get all cities that don't start with "Bron"
Custom SortingThis is nice for formatting report output Given the following custom ordering
|
Is there a way to rerun the Azure pipeline tests? The tests only failed for “Windows py36_np14” environment and the failed tests don’t seem to be related to my commits at all. I’m guessing it was a blip in the testing framework. |
Please see additions to demo (Formatting and Custom Sorting): Also any ideas on the tests? I don't see anything in the logs that point to any changes I've made. |
@jreback. Any thoughts on the examples provided above? |
@pandas-dev/pandas-core if anyone has thoughts here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some issues with the docstrings. Also would be nice to add See Also
sections. If you can take a look at the documentation for the docstrings that would be great.
pandas/core/indexes/multi.py
Outdated
Parameters | ||
---------- | ||
df : pd.DataFrame | ||
DataFrame to be converted to MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you finish with a period.
pandas/core/indexes/multi.py
Outdated
---------- | ||
df : pd.DataFrame | ||
DataFrame to be converted to MultiIndex | ||
squeeze : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the default.
pandas/core/indexes/multi.py
Outdated
@classmethod | ||
def from_frame(cls, df, squeeze=True): | ||
""" | ||
Make a MultiIndex from a dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish with period. Run ./scripts/validate_docstrings.py pandas.MultiIndex.from_frame
to make sure the docstring follows all the standards we validate.
Use DataFrame
instead of dataframe.
pandas/core/indexes/multi.py
Outdated
|
||
Returns | ||
------- | ||
index : MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for giving a name, just leave the type. Add a description in the next line (indented).
pandas/core/indexes/multi.py
Outdated
-------- | ||
>>> df = pd.DataFrame([[0, u'green'], [0, u'purple'], [1, u'green'], | ||
[1, u'purple'], [2, u'green'], [2, u'purple']], | ||
columns=[u'number', u'color']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is missing the ...
for multine commands. Also show the content of df
after creating it.
pandas/core/indexes/multi.py
Outdated
""" | ||
Squeeze a single level multiindex to be a regular Index instane. If | ||
the MultiIndex is more than a single level, return a copy of the | ||
MultiIndex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run validate_docstrings.py
for this docstring, and also read the contributing docstring documentation too. There are several issues.
…h the Pandas docstring guide
@datapythonista. Changes have been made and pushed. I've also added from_frame to the "See Also" of the other 3 contructor methods. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some doc comments. can you update advanced.rst (see where we use .from_arrays and add an example using .from_frame). ping on green.
@pytest.mark.parametrize('names_in,names_out', [ | ||
(None, [('L1', 'x'), ('L2', 'y')]), | ||
(['x', 'y'], ['x', 'y']), | ||
('bad_input', ValueError("Names should be list-like for a MultiIndex")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you split this to 2 tests, with 1 the working cases, and 1 the error cases, easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
looks pretty good to me, just some doc comments remain. @toobaz @datapythonista can you have a look and comment or approve. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small doc-comments. ping on green.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -378,6 +378,7 @@ Backwards incompatible API changes | |||
- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`) | |||
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`) | |||
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`) | |||
- The column order of the resultant ``DataFrame`` from ``MultiIndex.to_frame()`` is now guaranteed to match the ``MultiIndex.names`` order. (:issue:`22420`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use a :meth:
ref for MultiIndex.to_frame()
and :attr:
for .names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
pandas/core/indexes/multi.py
Outdated
3 1 jolly | ||
4 2 joy | ||
5 2 joy | ||
>>> pd.MultiIndex.from_frame(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a blank line between cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
pandas/core/indexes/multi.py
Outdated
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2]], | ||
names=['will_be', 'used']) | ||
|
||
>>> df = pd.DataFrame([['ahc', 'iam'], ['ahc', 'wim'], ['boh', 'amg'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a 1-line expln here (I think the first one is self-explanatorY)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
np.array([[1, 2], [3, 4], [5, 6]]), | ||
27 | ||
]) | ||
def test_from_frame_non_frame(non_frame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to test_from_frame_error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
# GH 22420 | ||
mi = pd.MultiIndex.from_arrays([ | ||
pd.date_range('19910905', periods=6, tz='US/Eastern'), | ||
[1, 1, 1, 2, 2, 2], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a repeated test of the above, if so, then not necessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was at the suggestion of @TomAugspurger
#23141 (comment)
@jreback - any reason why I would get this linting error in pandas-dev.pandas tests ?
Nothing else accompanying it (I fixed the other ones that were there before). The local pep8 checks don't show anything, and the branch is up to date with master. Same thing is happening on this PR (#23538) |
To see the error you need to click into the log. For many of them it's easy to make them red, and they are highlighted, but in this case it's an "error" with the ordering of the imports:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, the docstring has some format errors and can be improved, but good work.
pandas/core/indexes/multi.py
Outdated
|
||
Parameters | ||
---------- | ||
df : pd.DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df : pd.DataFrame | |
df : DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/indexes/multi.py
Outdated
---------- | ||
df : pd.DataFrame | ||
DataFrame to be converted to MultiIndex. | ||
sortorder : int or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sortorder : int or None | |
sortorder : int, optional |
And please explain what it means to not be provided (None
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/indexes/multi.py
Outdated
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2]], | ||
names=['a', 'b']) | ||
|
||
# Use explicit names, instead of column names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better as a paragraph than as a code comment (remove the #
and leave a blank line after this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
pandas/core/indexes/multi.py
Outdated
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2]], | ||
names=['X', 'Y']) | ||
|
||
See Also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes before the Examples section. ./scripts/validate_docstrings.py pandas.MultiIndex.from_frame
should report it as an error. Make sure nothing is reported by the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this docstring and all the other constructor methods docstrings (since I had to modify them to update the See Alsos). I also fixed the pd.MultiIndex docstring to the best of my abilities (since I had to make some small modifications to that too). However, there were still some issues with the MultiIndex docstring:
2 Errors found:
Parameters {_set_identity, name, dtype} not documented
Unknown parameters {labels}
Any ideas on how I should address these? Looks like labels is there to serve as a deprecation reminder.
@jreback - tests are clean. |
@ArtinSarraf one more merge master and looks good. |
@datapythonista @toobaz if you have further comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for all the fixes to docstrings @ArtinSarraf
@jreback - master has been merged and tests are green. |
Looks good to me! |
thanks @ArtinSarraf |
git diff upstream/master -u -- "*.py" | flake8 --diff
This pull request is to add the from_frame method of creating multiindexes. Along with this feature the helper method "squeeze" has been added for squeezing single level multiindexes to a standard index (analogous to df.squeeze).
Additionally the to_frame method was updated to guarantee that the order of the labels is preserved when converting a multiindex to a dataframe. Currently this cannot be guaranteed in Python 2.7 due to the use of a dictionary for creating the frame. With this change from_frame and to_frame are perfectly complementary.