
BUG, ENH: Add support for parsing duplicate columns #12935


Closed
wants to merge 1 commit into from

Conversation

gfyoung
Member

@gfyoung gfyoung commented Apr 20, 2016

Introduces mappings and reverse_map attributes to the parser in pandas.io.parsers that allow it to differentiate between duplicate columns that may be present in a file.

Closes #7160.
Closes #9424.

@gfyoung gfyoung changed the title BUG, ENG: Add support for parsing duplicate columns BUG, ENH: Add support for parsing duplicate columns Apr 20, 2016
@jreback
Contributor

jreback commented Apr 20, 2016

you missed my point. don't use a dictionary. use a list for the data and the column names; then you don't need all of these crazy gymnastics.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

I did try using lists. I could not get it to work. The current mechanism relies heavily on the names and labels in the dict object to generate the data and create the DataFrame at the end, something two list objects are neither efficient nor well-suited for. Gymnastics, maybe, but it does manage to resolve the duplicate column names in a parsing system that is entirely unique-column oriented.

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung you have added a lot of code which is not idiomatic and is quite complicated. Refactoring should reduce the code, not increase it dramatically.

@jreback jreback added the IO CSV read_csv, to_csv label Apr 20, 2016
@gfyoung
Member Author

gfyoung commented Apr 20, 2016

@jreback : A closer inspection of my changes will show that I did not make any major changes beyond moving two functions into a class and adding some references to mapping and reverse_map to enable unique references to column names, even when there are duplicates.

Also, to reiterate what I said above, two list objects did not work for me. Not only could I not get them to pass all of the parsing tests, but they are slow, as you now have to use list indexing instead of dict look-ups to access and manipulate labels. You also have to spend time re-arranging the lists at the end to pass them into the DataFrame properly, since AFAICT it only accepts rows, not columns, when passing in a list of lists. So yes, there may be some "gymnastics" to get it to work. However, my changes maintain the overall structure of the current parsing system, which, the duplicate-columns issue aside, I think works fairly well and efficiently.

In short, it would be more useful to receive specific comments on what I can do to simplify the code and make it more idiomatic instead of blanket comments like the previous ones.
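To illustrate the lookup concern with a toy sketch (the `(name, position)` key scheme below is purely illustrative, not the PR's actual `mapping` implementation):

```python
# Toy sketch of the lookup trade-off between parallel lists and a dict
# keyed by unique labels. The (name, position) keys are hypothetical.
names = ['foo', 'bar', 'baz', 'foo', 'lol']

# With parallel lists, list.index is O(n) and always returns the FIRST
# match, so the second 'foo' cannot be addressed by name alone.
assert names.index('foo') == 0

# With unique keys in a dict, every duplicate stays addressable in O(1).
mapping = {(name, i): i for i, name in enumerate(names)}
assert mapping[('foo', 0)] == 0
assert mapping[('foo', 3)] == 3
```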


df = DataFrame(col_dict, columns=columns, index=index)
if not mappings:
Contributor

we NEVER use inplace

Member Author

I'll make the change, but just for reference, why? And if so, why is it an option?

Contributor

it's an option because some people want to write non idiomatic code

you won't need rename anyhow see below

Member Author

Until I can see / achieve otherwise, renaming like what you did below does not suffice for more complex renaming setups, and keeping track of those changes will require "gymnastics" as well.

Contributor

I already showed the way to do this below

Contributor

there is nothing wrong with a loop over the columns! it is very, very tiny compared to all other things. The point is the code will be a lot simpler.

Member Author

@gfyoung gfyoung Apr 22, 2016

I agree, but at this point, there are no obvious simplifications like there were with tokenizer.c, for example. At that point, I prefer correctness, and right now, I don't see how your list of lists will hold up against the cases I proposed (which are in the test suite, FYI).

Also, it's probably not going to be a single loop over the columns, as I explained above, because the specifications can force you to double back. Secondly, that's not the performance hit I am worried about. The performance hit will come when you have to do all of the name changes and find yourself calling list.index and list.remove time and again to get the appropriate element in your names and data lists.

Contributor

why would you have to do that? I am not going to accept a marginal change that obfuscates things more. yes there are a lot of options, but a refactor is needed.

Member Author

@gfyoung gfyoung Apr 22, 2016

Explain to me how your list of lists would correctly and efficiently handle this case:

>>> data = """1,2,3,4,5
6,7,8,9,10"""
>>> read_csv(StringIO(data), names=['foo', 'bar', 'baz', 'foo', 'lol'],
             parse_dates={'foo_date': [0, 3], 'lol_bar_lol': ['bar', 'lol'],
                          'foo_lol': [3, 4]})

Contributor

In [44]: l = ['foo', 'bar', 'baz', 'foo', 'lol']

In [45]: Index(l).get_loc('bar')
Out[45]: 1

In [46]: Index(l).get_loc('lol')
Out[46]: 4

In [47]: d = { i:k for i, k in enumerate(l)}

In [48]: d[0]
Out[48]: 'foo'

In [49]: d[3]
Out[49]: 'foo'
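One caveat with the Index-based lookup sketched here (a hedged aside, assuming current pandas behavior): get_loc on a label that is duplicated does not return a single position, which bears directly on the duplicate-column cases in dispute:

```python
import numpy as np
import pandas as pd

l = ['foo', 'bar', 'baz', 'foo', 'lol']
loc = pd.Index(l).get_loc('foo')

# For a non-monotonic, duplicated label, get_loc returns a boolean mask
# covering every occurrence rather than a single integer position.
assert isinstance(loc, np.ndarray)
assert list(np.nonzero(loc)[0]) == [0, 3]
```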

@jreback
Contributor

jreback commented Apr 20, 2016

I would do it this way.

In [15]: arr = [np.array([1,2,3]),np.array([1.0,1.1,2.0]),np.array(['foo','bar','az']),[4,5,6]]

In [16]: pd.concat(map(Series,arr),axis=1,copy=False)
Out[16]: 
   0    1    2  3
0  1  1.0  foo  4
1  2  1.1  bar  5
2  3  2.0   az  6

In [17]: pd.concat(map(Series,arr),axis=1).dtypes
Out[17]: 
0      int64
1    float64
2     object
3      int64
dtype: object

then if you need to assign different names

In [23]: result.columns = list('AAAA')

In [24]: result
Out[24]: 
   A    A    A  A
0  1  1.0  foo  4
1  2  1.1  bar  5
2  3  2.0   az  6

This might need some profiling, not for speed (as it conceptually does the same thing as construction), but for interim memory considerations.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

How would this set up handle parse_dates for example, where you could be dealing with either str names or numerical indices?

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung pls try to avoid moving around blocks of code for cleaning (unless it's germane); rather, do that separately. It's very hard to see what you are actually changing.

This is completely orthogonal to date parsing.

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

  1. Noted about moving blocks of code (to do it in a separate commit). I moved it because I wanted to use mapping and reverse_map in the code, and I saw no reason to pass in additional parameters (they are private functions, after all). It also nicely simplifies the signature.

  2. I disagree about the orthogonality. What happens if parse_dates needs to use a column whose name happens to be a duplicate?

@jreback
Contributor

jreback commented Apr 20, 2016

@gfyoung you are missing the point.

it is orthogonal to construction of the final product. yes of course it needs to deal with names, but these are defined in a list-like structure that it can easily access. The point is there is a somewhat clean separation now:

  • tokenize (maybe include usecols)
  • dtype inference
  • possible column combination (e.g. from parse_dates)
  • construction

name assignment doesn't happen till the final step (though it can be modified in column combinations)

don't try to link things that are not necessary

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

As evidenced by the current code, even under the assumption of unique columns, keeping track of the naming changes is not "simple" to do; it is quite intertwined with what happens during processing so that the final construction step is correct. In your comment, abstracting away the naming work that goes on behind "name assignment doesn't happen till the final step (though it can be modified in column combinations)" is misleading IMO.

@gfyoung gfyoung force-pushed the dupe-col-names branch 14 times, most recently from 3ac50d7 to 9591741 Compare April 27, 2016 23:41
msg = 'is not supported'

for engine in ('c', 'python'):
    with tm.assertRaisesRegexp(ValueError, msg):
Contributor

oh, it's here. But does this depend on whether names is passed and/or are dupes?

Member Author

No. As of now, it is an unsupported feature because it fails with a duplicate header (or names) when you set mangle_dupe_cols=False, so it is treated like any other unsupported feature in parser.py: we just don't allow it, period.
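For reference, the supported default path can be seen in a small sketch (current pandas behavior; the '.1' positional suffix is the documented mangling scheme):

```python
import io
import pandas as pd

# With the default settings, duplicate header names are deduplicated by
# appending a positional suffix instead of silently overwriting data.
df = pd.read_csv(io.StringIO("a,a,b\n1,2,3"))
assert list(df.columns) == ['a', 'a.1', 'b']
assert df['a.1'].iloc[0] == 2
```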

@@ -120,7 +120,8 @@ header : int or list of ints, default ``'infer'``
     rather than the first line of the file.
 names : array-like, default ``None``
     List of column names to use. If file contains no header row, then you should
-    explicitly pass ``header=None``.
+    explicitly pass ``header=None``. If this list contains any duplicates, then
+    you should ensure that ``mangle_dupe_cols=True``.
Contributor

what will a user think about this? maybe say that duplicate names are not allowed unless mangle_dupe_cols=True (which is the default)?

Member Author

Fair enough. Done.

@jreback
Contributor

jreback commented May 19, 2016

@gfyoung some minor comments

@gfyoung gfyoung force-pushed the dupe-col-names branch 5 times, most recently from b44886f to 55481ff Compare May 22, 2016 20:11
@gfyoung
Member Author

gfyoung commented May 22, 2016

@jreback : Made all of the requested changes, and Travis is giving the green light. Ready to merge if there is nothing else.

Deduplicates the 'names' parameter by default if
there are duplicate names. Also raises when
'mangle_dupe_cols' is False to prevent data overwrite.

Closes pandas-devgh-7160.
Closes pandas-devgh-9424.
@jreback
Contributor

jreback commented May 23, 2016

thanks @gfyoung nice PR!

@jreback
Contributor

jreback commented May 23, 2016

as usual, pls review built docs & issue a follow up if needed for any corrections.

@NickWoodhams

Can you please merge this pull request? This is exactly the issue I ran into, and when I tried to use the arg it raised "ValueError: Setting mangle_dupe_cols=False is not supported yet".

@gfyoung
Member Author

gfyoung commented Jun 21, 2017

@NickWoodhams : This PR actually got merged (@jreback closed it by committing my changes to master), but we deliberately disabled support for mangle_dupe_cols=False because it was vulnerable to data overwriting. I tried a couple of ways of implementing support for it, but unfortunately we were not able to reach consensus on an acceptable implementation.

If you would like to take a stab at it, you are more than welcome!

@NickWoodhams

NickWoodhams commented Jun 22, 2017

@gfyoung Thank you for the thoughtful response. So I realized I actually had things mixed up a bit.

I was unaware that mangle_dupe_cols actually meant that it does not overwrite the column names. I was having an issue with missing rows, but I realized it was actually due to the to_dict() method, specifically to_dict('index').

Given a dataframe like
ID    Value
MW-1  49.3
MW-1  39.9
MW-2  12.2
MW-2  11.5

I would end up with just MW-1 and MW-2 in my indexed dict and lose half the data. It would look like:
{
MW-1: { value: 39.9 },
MW-2: { value: 11.5 }
}

The way I found around it was to add the reset_index() method to my data frame: df.reset_index().to_dict('index') prevented the overwrite, since it was no longer indexed by that column, and I was all set!

{
0: { MW-1: { value: 49.3 } },
1: { MW-1: { value: 39.9 } },
2: { MW-2: { value: 12.2 } },
3: { MW-2: { value: 11.5 } },
}

Hopefully understanding my use case helps!
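A runnable sketch of that workaround, with the data reconstructed from the example above:

```python
import io
import pandas as pd

data = "ID,Value\nMW-1,49.3\nMW-1,39.9\nMW-2,12.2\nMW-2,11.5"
df = pd.read_csv(io.StringIO(data), index_col='ID')

# With the duplicated 'ID' values moved back into a column, the default
# RangeIndex is unique and to_dict('index') keeps every row.
result = df.reset_index().to_dict('index')
assert len(result) == 4
assert result[0] == {'ID': 'MW-1', 'Value': 49.3}
assert result[3] == {'ID': 'MW-2', 'Value': 11.5}
```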

@gfyoung
Member Author

gfyoung commented Jun 22, 2017

@NickWoodhams : We'll definitely take that into consideration, though I'm a little concerned about performance (especially if you have multiple duplicate columns). Thanks for letting me know!

Labels
Bug IO CSV read_csv, to_csv

6 participants