
BUG, ENH: Add support for parsing duplicate columns #12935


Closed
wants to merge 1 commit into from

Conversation

gfyoung
Member

@gfyoung gfyoung commented Apr 20, 2016

Introduces mappings and reverse_map attributes to the parser in pandas.io.parsers that allow it to differentiate between duplicate columns that may be present in a file.

Closes #7160.
Closes #9424.

@gfyoung gfyoung changed the title BUG, ENG: Add support for parsing duplicate columns BUG, ENH: Add support for parsing duplicate columns Apr 20, 2016
@jreback
Contributor

jreback commented Apr 20, 2016

you missed my point. don't use a dictionary. use a list for the data and the column names; then you don't need all of these crazy gymnastics.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

I did try using lists. I could not get it to work. The current mechanism relies heavily on the names and labels in the dict object to generate the data and create the DataFrame at the end, something two list objects are neither efficient nor well-suited for. Gymnastics, maybe, but it does manage to resolve the duplicate column names in a parsing system that is entirely unique-column oriented.

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung you have added a lot of code which is not idiomatic and is quite complicated. Refactoring should reduce the code, not increase it dramatically.

@jreback jreback added the IO CSV read_csv, to_csv label Apr 20, 2016
@gfyoung
Member Author

gfyoung commented Apr 20, 2016

@jreback : A closer inspection of my changes will show that I did not make any major changes beyond moving two functions into a class and adding some references to mapping and reverse_map to enable unique references to column names, even when there are duplicates.

Also, to reiterate what I said above, two list objects did not work for me. Not only could I not get them to pass all of the parsing tests, but they are slow, as you now have to use list indexing instead of dict look-ups to access and manipulate labels. You also have to spend time re-arranging the lists at the end to pass them into the DataFrame properly, since AFAICT it only accepts rows, not columns, when passing in a list of lists. So yes, there may be some "gymnastics" to get it to work. However, my changes maintain the overall structure of the current parsing system, which, the duplicate-columns issue aside, I think works fairly well and efficiently.

In short, it would be more useful to receive specific comments on what I can do to simplify the code and make it more idiomatic instead of blanket comments like the previous ones.
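To illustrate the lookup concern with a toy sketch (the `(name, position)` key scheme below is purely illustrative, not the PR's actual `mapping` implementation):

```python
# Toy sketch of the lookup trade-off between parallel lists and a dict
# keyed by unique labels. The (name, position) keys are hypothetical.
names = ['foo', 'bar', 'baz', 'foo', 'lol']

# With parallel lists, list.index is O(n) and always returns the FIRST
# match, so the second 'foo' cannot be addressed by name alone.
assert names.index('foo') == 0

# With unique keys in a dict, every duplicate stays addressable in O(1).
mapping = {(name, i): i for i, name in enumerate(names)}
assert mapping[('foo', 0)] == 0
assert mapping[('foo', 3)] == 3
```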


df = DataFrame(col_dict, columns=columns, index=index)
if not mappings:
Contributor

we NEVER use inplace

Member Author

I'll make the change, but just for reference, why? And if so, why is it an option?

Contributor

it's an option because some people want to write non idiomatic code

you won't need rename anyhow see below

Member Author

Until I can see / achieve otherwise, renaming like what you did below does not suffice for more complex renaming setups, and keeping track of those changes will require "gymnastics" as well.

Contributor

I already showed the way to do this below

Contributor

there is nothing wrong with a loop over the columns! it is very, very tiny compared to all other things. The point is the code will be a lot simpler.

Member Author

@gfyoung gfyoung Apr 22, 2016

I agree, but at this point, there are no obvious simplifications like there were with tokenizer.c, for example. At that point, I prefer correctness, and right now, I don't see how your list of lists will hold up against the cases I proposed (which are in the test suite, FYI).

Also, it's probably not going to be a single loop over the columns, as I explained above, because the specifications can force you to double back. Secondly, that's not the performance hit I am worried about. The performance hit will come when you have to do all of the name changes and find yourself calling list.index and list.remove time and again to get the appropriate element in your names and data lists.

Contributor

why would you have to do that? I am not going to accept a marginal change that obfuscates things more. yes there are a lot of options, but a refactor is needed.

Member Author

@gfyoung gfyoung Apr 22, 2016

Explain to me how your list of lists would correctly and efficiently handle this case:

>>> data = """1,2,3,4,5
6,7,8,9,10"""
>>> read_csv(StringIO(data), names=['foo', 'bar', 'baz', 'foo', 'lol'],
             parse_dates={'foo_date': [0, 3], 'lol_bar_lol': ['bar', 'lol'],
                          'foo_lol': [3, 4]})

Contributor

In [44]: l = ['foo', 'bar', 'baz', 'foo', 'lol']

In [45]: Index(l).get_loc('bar')
Out[45]: 1

In [46]: Index(l).get_loc('lol')
Out[46]: 4

In [47]: d = { i:k for i, k in enumerate(l)}

In [48]: d[0]
Out[48]: 'foo'

In [49]: d[3]
Out[49]: 'foo'
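One caveat with the Index-based lookup sketched here (a hedged aside, assuming current pandas behavior): get_loc on a label that is duplicated does not return a single position, which bears directly on the duplicate-column cases in dispute:

```python
import numpy as np
import pandas as pd

l = ['foo', 'bar', 'baz', 'foo', 'lol']
loc = pd.Index(l).get_loc('foo')

# For a non-monotonic, duplicated label, get_loc returns a boolean mask
# covering every occurrence rather than a single integer position.
assert isinstance(loc, np.ndarray)
assert list(np.nonzero(loc)[0]) == [0, 3]
```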

@jreback
Contributor

jreback commented Apr 20, 2016

I would do it this way.

In [15]: arr = [np.array([1,2,3]),np.array([1.0,1.1,2.0]),np.array(['foo','bar','az']),[4,5,6]]

In [16]: pd.concat(map(Series,arr),axis=1,copy=False)
Out[16]: 
   0    1    2  3
0  1  1.0  foo  4
1  2  1.1  bar  5
2  3  2.0   az  6

In [17]: pd.concat(map(Series,arr),axis=1).dtypes
Out[17]: 
0      int64
1    float64
2     object
3      int64
dtype: object

then if you need to assign different names

In [23]: result.columns = list('AAAA')

In [24]: result
Out[24]: 
   A    A    A  A
0  1  1.0  foo  4
1  2  1.1  bar  5
2  3  2.0   az  6

This might need some profiling, not for speed (as it conceptually does the same thing as construction), but for interim memory considerations.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

How would this set up handle parse_dates for example, where you could be dealing with either str names or numerical indices?

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung pls try to avoid moving around blocks of code for cleaning (unless it's germane); rather, do that separately. It's very hard to see what you are actually changing.

This is completely orthogonal to date parsing.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

  1. Noted about moving blocks of code (to do it in a separate commit). I moved it because I wanted to use mapping and reverse_map in the code, and I saw no reason to pass in additional parameters (they are private functions, after all). It also nicely simplifies the signature.

  2. I disagree about the orthogonality. What happens if parse_dates needs to use a column whose name happens to be a duplicate?

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung you are missing the point.

it is orthogonal to construction of the final product. yes of course it needs to deal with names, but these are defined in a list-like structure that it can easily access. The point is there is a somewhat clean separation now:

  • tokenize (maybe include usecols)
  • dtype inference
  • possible column combination (e.g. from parse_dates)
  • construction

name assignment doesn't happen till the final step (though it can be modified in column combinations)

don't try to link things that are not necessary

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

As evidenced by the current code, even under the assumption of unique columns, keeping track of the naming changes is not "simple" to do; it is quite intertwined with what happens during processing so that the final construction step is correct. In your comment, abstracting away the naming work that goes on behind "name assignment doesn't happen till the final step (though it can be modified in column combinations)" is misleading IMO.

@gfyoung gfyoung force-pushed the dupe-col-names branch 14 times, most recently from 3ac50d7 to 9591741 Compare April 27, 2016 23:41
msg = 'is not supported'

for engine in ('c', 'python'):
    with tm.assertRaisesRegexp(ValueError, msg):
Contributor

oh, it's here. But does this depend on whether names is passed and/or are dupes?

Member Author

No. As of now, it is an unsupported feature because it fails with a duplicate header (or names) when you set mangle_dupe_cols=False, so it is treated like any other unsupported feature in parser.py: we just don't allow it, period.
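For reference, the supported default path can be seen in a small sketch (current pandas behavior; the '.1' positional suffix is the documented mangling scheme):

```python
import io
import pandas as pd

# With the default settings, duplicate header names are deduplicated by
# appending a positional suffix instead of silently overwriting data.
df = pd.read_csv(io.StringIO("a,a,b\n1,2,3"))
assert list(df.columns) == ['a', 'a.1', 'b']
assert df['a.1'].iloc[0] == 2
```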

@@ -120,7 +120,8 @@ header : int or list of ints, default ``'infer'``
     rather than the first line of the file.
 names : array-like, default ``None``
     List of column names to use. If file contains no header row, then you should
-    explicitly pass ``header=None``.
+    explicitly pass ``header=None``. If this list contains any duplicates, then
+    you should ensure that ``mangle_dupe_cols=True``.
Contributor

what will a user think about this? maybe say that duplicate names are not allowed unless mangle_dupe_cols=True (which is the default)?

Member Author

Fair enough. Done.

@jreback
Contributor

jreback commented May 19, 2016

@gfyoung some minor comments

@gfyoung gfyoung force-pushed the dupe-col-names branch 5 times, most recently from b44886f to 55481ff Compare May 22, 2016 20:11
@gfyoung
Member Author

gfyoung commented May 22, 2016

@jreback : Made all of the requested changes, and Travis is giving the green light. Ready to merge if there is nothing else.

Deduplicates the 'names' parameter by default if
there are duplicate names. Also raises when
'mangle_dupe_cols' is False to prevent data overwrite.

Closes pandas-devgh-7160.
Closes pandas-devgh-9424.
@jreback
Contributor

jreback commented May 23, 2016

thanks @gfyoung nice PR!

@jreback
Contributor

jreback commented May 23, 2016

as usual, pls review built docs & issue a follow up if needed for any corrections.

@NickWoodhams

Can you please merge this pull request? This is exactly the issue I ran into, and when I tried to use the arg it raised "ValueError: Setting mangle_dupe_cols=False is not supported yet".

@gfyoung
Member Author

gfyoung commented Jun 21, 2017

@NickWoodhams : This PR actually got merged (@jreback closed it by committing my changes to master), but we deliberately disabled support for mangle_dupe_cols=False because it was vulnerable to data overwriting. I tried a couple of ways of implementing support for it, but unfortunately we were not able to reach consensus on an acceptable implementation.

If you would like to take a stab at it, you are more than welcome!

@NickWoodhams

NickWoodhams commented Jun 22, 2017

@gfyoung Thank you for the thoughtful response. So I realized I actually had things mixed up a bit.

I was unaware that mangle_dupe_cols actually meant that it does not overwrite the column names. I was having an issue with missing rows, but I realized it was actually due to the to_dict() method, specifically to_dict('index').

Given a dataframe like
ID    Value
MW-1  49.3
MW-1  39.9
MW-2  12.2
MW-2  11.5

I would end up with just MW-1 and MW-2 in my indexed dict and lose half the data. It would look like:
{
MW-1: { value: 39.9 },
MW-2: { value: 11.5 }
}

The way I found around it was to add the reset_index() method to my data frame: df.reset_index().to_dict('index') prevented the overwrite, since it was no longer indexed by that column, and I was all set!

{
0: { MW-1: { value: 49.3 } },
1: { MW-1: { value: 39.9 } },
2: { MW-2: { value: 12.2 } },
3: { MW-2: { value: 11.5 } },
}

Hopefully understanding my use case helps!
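A runnable sketch of that workaround, with the data reconstructed from the example above:

```python
import io
import pandas as pd

data = "ID,Value\nMW-1,49.3\nMW-1,39.9\nMW-2,12.2\nMW-2,11.5"
df = pd.read_csv(io.StringIO(data), index_col='ID')

# With the duplicated 'ID' values moved back into a column, the default
# RangeIndex is unique and to_dict('index') keeps every row.
result = df.reset_index().to_dict('index')
assert len(result) == 4
assert result[0] == {'ID': 'MW-1', 'Value': 49.3}
assert result[3] == {'ID': 'MW-2', 'Value': 11.5}
```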

@gfyoung
Member Author

gfyoung commented Jun 22, 2017

@NickWoodhams : We'll definitely take that into consideration, though I'm a little concerned about performance (especially if you have multiple duplicate columns). Thanks for letting me know!

Labels
Bug IO CSV read_csv, to_csv

6 participants