BUG, ENH: Add support for parsing duplicate columns #12935
Conversation
you missed my point. don't use a dictionary. use a list for the data and the column names; then you don't need all of these crazy gymnastics.
I did try using lists. I could not get it to work. The current mechanism is very reliant on using the names and labels in the
@gfyoung you have added a lot of code which is not idiomatic and is quite complicated. Refactoring should reduce the code, not increase it dramatically.
@jreback : A closer inspection of my changes will indicate that I did not make any major changes, except for moving two functions into a class and adding some references to

Also, to reiterate what I said above, two

In short, it would be more useful to receive specific comments as to what I can do to simplify the code and make it more idiomatic instead of blanket comments like the previous ones.
df = DataFrame(col_dict, columns=columns, index=index)
if not mappings:
we NEVER use inplace
I'll make the change, but just for reference, why? And if so, why is it an option?
it's an option because some people want to write non-idiomatic code. You won't need rename anyhow; see below.
Until I can see / achieve otherwise, renaming like what you did below does not suffice for more complex renaming setups, and keeping track of those changes will require "gymnastics" as well.
I already showed the way to do this below
there is nothing wrong with a loop over the columns! It is very, very tiny compared to all other things. The point is the code will be a lot simpler.
I agree, but at this point, there are no obvious simplifications like there were with tokenizer.c, for example. At that point, I prefer correctness, and right now, I don't see how your list of lists will hold up against the cases I proposed (which are in the test suite, FYI).

Also, it's probably not going to be a single loop over the columns, as I explained above, because the specifications can force you to double back. Secondly, that's not the performance hit I am worried about. The performance hit will come from when you have to do all of the name changes and find yourself having to call list.index and list.remove time and time again to get the appropriate element in your names and data lists.
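For concreteness, here is a tiny, purely illustrative snippet of the ambiguity being described; the names are made up and this is not parser code:

```python
names = ['foo', 'bar', 'baz', 'foo', 'lol']
data = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]  # one list of values per column

# list.index always returns the FIRST match, so once names contain
# duplicates there is no way to reach the second 'foo' by name alone.
print(names.index('foo'))   # 0, even if the 'foo' at position 3 was meant

# list.remove has the same problem: it silently drops the first duplicate,
# and the parallel data list then has to be patched up by hand.
names.remove('foo')
del data[0]
print(names)                # ['bar', 'baz', 'foo', 'lol']
```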
why would you have to do that? I am not going to accept a marginal change that obfuscates things more. yes there are a lot of options, but a refactor is needed.
Explain to me how your list of lists would correctly and efficiently handle this case:
>>> data = """1,2,3,4,5
... 6,7,8,9,10"""
>>> read_csv(StringIO(data), names=['foo', 'bar', 'baz', 'foo', 'lol'],
...          parse_dates={'foo_date': [0, 3], 'lol_bar_lol': ['bar', 'lol'],
...                       'foo_lol': [3, 4]})
In [44]: l = ['foo', 'bar', 'baz', 'foo', 'lol']
In [45]: Index(l).get_loc('bar')
Out[45]: 1
In [46]: Index(l).get_loc('lol')
Out[46]: 4
In [47]: d = {i: k for i, k in enumerate(l)}
In [48]: d[0]
Out[48]: 'foo'
In [49]: d[3]
Out[49]: 'foo'
I would do it this way.
then if you need to assign different names
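The comment appears to be cut off here; a minimal sketch of what the positional approach above seems to amount to (illustrative variable names, not actual parser code) might be:

```python
import pandas as pd

names = ['foo', 'bar', 'baz', 'foo', 'lol']
columns = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]  # data kept per position

# Track names positionally, never by value, so duplicates are harmless.
pos_to_name = {i: name for i, name in enumerate(names)}

# A rename (e.g. after combining date columns) is just a positional update.
pos_to_name[3] = 'foo_date'

# Name assignment happens only at the final construction step.
df = pd.DataFrame({i: col for i, col in enumerate(columns)})
df.columns = [pos_to_name[i] for i in range(len(columns))]
print(df.columns.tolist())  # ['foo', 'bar', 'baz', 'foo_date', 'lol']
```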
This might need some profiling, not for speed (as it conceptually does the same thing as construction), but for interim memory considerations.
How would this setup handle
@gfyoung pls try to avoid moving around blocks of code for cleaning (unless it's germane); rather, do that separately. It's very hard to see what you are actually changing. This is completely orthogonal to date parsing.
@gfyoung you are missing the point. It is orthogonal to construction of the final product. Yes, of course it needs to deal with names, but these are defined in a list-like structure that it can easily access. The point is there is a somewhat clean separation now: name assignment doesn't happen until the final step (though it can be modified in column combinations). Don't try to link things that are not necessary.
As evidenced by the current code, even under the assumption of unique columns, keeping track of the naming changes is not "simple" to do and is quite intertwined with what happens during processing so that the final construction step is correct. In your comment, abstracting away the naming work that goes on behind "name assignment doesn't happen till the final step (though it can be modified in column combinations)" is misleading, IMO.
Force-pushed from 3ac50d7 to 9591741.
msg = 'is not supported'

for engine in ('c', 'python'):
    with tm.assertRaisesRegexp(ValueError, msg):
oh it's here. But does this depend on whether names is passed and/or whether there are dupes?
No. As of now, it is an unsupported feature because it fails with a duplicate header (or names) when you set mangle_dupe_cols=False, so it is treated like any other unsupported feature in parser.py - we just don't allow it, period.
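For reference, a hedged sketch of the behavior being described, using the error message quoted later in this thread (the mangle_dupe_cols keyword has since been removed in newer pandas, so this reflects the versions this PR targets):

```python
from io import StringIO
import pandas as pd

data = 'a,b,a\n0,1,2\n3,4,5'

# Disabling mangling is rejected outright rather than silently
# overwriting the data of duplicate columns.
try:
    pd.read_csv(StringIO(data), mangle_dupe_cols=False)
except ValueError as err:
    print(err)  # e.g. "Setting mangle_dupe_cols=False is not supported yet"
```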
@@ -120,7 +120,8 @@ header : int or list of ints, default ``'infer'``
   rather than the first line of the file.
 names : array-like, default ``None``
   List of column names to use. If file contains no header row, then you should
-  explicitly pass ``header=None``.
+  explicitly pass ``header=None``. If this list contains any duplicates, then
+  you should ensure that ``mangle_dupe_cols=True``.
what will a user think about this? maybe say that duplicate names are not allowed unless mangle_dupe_cols=True (which is the default)?
Fair enough. Done.
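A short illustration of the default behavior the docstring now describes; the exact output is hedged, since the mangled column names (and whether duplicate names are allowed at all) vary across pandas versions:

```python
from io import StringIO
import pandas as pd

data = '1,2,3,4,5\n6,7,8,9,10'

# With the default mangle_dupe_cols=True, the second 'foo' is renamed
# (to something like 'foo.1') instead of overwriting the first one.
df = pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz', 'foo', 'lol'])
print(df.columns.tolist())  # e.g. ['foo', 'bar', 'baz', 'foo.1', 'lol']
```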
@gfyoung some minor comments
Force-pushed from b44886f to 55481ff.
@jreback : Made all of the requested changes, and Travis is giving the green light. Ready to merge if there is nothing else.
Deduplicates the 'names' parameter by default if there are duplicate names. Also raises when 'mangle_dupe_cols' is False to prevent data overwrite. Closes pandas-dev#7160. Closes pandas-dev#9424.
thanks @gfyoung nice PR!
as usual, pls review built docs & issue a follow up if needed for any corrections.
Can you please merge this pull request? This is exactly the issue that I ran into, and when I tried to use the arg it raised the error "ValueError: Setting mangle_dupe_cols=False is not supported yet".
@NickWoodhams : This PR actually got merged (@jreback closed it by committing my changes to

If you would like to take a stab at it, you are more than welcome!
@gfyoung Thank you for the thoughtful response. So I realized I actually had things mixed up a bit. I was unaware that mangle_dupe_cols actually meant that it did not overwrite the column names, because I was having an issue with missing rows... but I realized it was actually due to the to_dict() method -- specifically to_dict('index').

Given a dataframe like

I would end up with just MW-1 and MW-2 in my indexed dict and lost half the data. It would look like:

The way I found around it was by adding the {

Hopefully understanding my use case helps!
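To make the partially truncated use case above concrete, here is a hedged reconstruction; the well IDs and values are invented, and the cumcount workaround is only one possible fix, not necessarily the one actually used:

```python
import pandas as pd

# A frame whose index contains duplicate labels, e.g. repeated well IDs.
df = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0, 4.0]},
    index=['MW-1', 'MW-2', 'MW-1', 'MW-2'],
)

# to_dict('index') keys the result by index label, so duplicate labels
# collapse to a single row each: in the pandas of that era half the data
# silently disappeared (newer versions raise a ValueError instead).

# One possible workaround: make the index unique before converting,
# for example by appending a per-label counter.
counts = df.groupby(level=0).cumcount()
df = df.set_index(pd.Index([f'{label}-{n}' for label, n in zip(df.index, counts)]))
print(df.to_dict('index'))
# {'MW-1-0': {'value': 1.0}, 'MW-2-0': {'value': 2.0},
#  'MW-1-1': {'value': 3.0}, 'MW-2-1': {'value': 4.0}}
```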
@NickWoodhams : We'll definitely take that into consideration, though I'm a little concerned about performance (especially if you have multiple duplicate columns), but thanks for letting me know!
Introduces mappings and reverse_map attributes to the parser in pandas.io.parsers that allow it to differentiate between duplicate columns that may be present in a file.

Closes #7160.
Closes #9424.