Strip header names? #9067

Closed
mrocklin opened this issue Dec 12, 2014 · 7 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@mrocklin
Contributor

In csv files that have odd spacing

name,    amount
Alice,100
Bob,200

Loaded dataframes can sometimes have column names with leading or trailing whitespace. Should these column names be stripped by default?

In [4]: pd.read_csv('foo.csv').columns
Out[4]: Index([u'name', u'    amount'], dtype='object')
@jorisvandenbossche
Member

@mrocklin You have the skipinitialspace=True argument for this (it is False by default). This is not specifically for the header, but you can use it for that:

In [1]: s = """name,    amount
   ...: Alice,100
   ...: Bob,200"""

In [5]: pd.read_csv(StringIO(s)).columns
Out[5]: Index([u'name', u'    amount'], dtype='object')

In [6]: pd.read_csv(StringIO(s), skipinitialspace=True).columns
Out[6]: Index([u'name', u'amount'], dtype='object')

So your question is more about the default value of skipinitialspace (or would you treat column names differently from the values?)
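
A minimal sketch of what skipinitialspace does and does not cover, assuming a recent pandas on Python 3. The sample data is a hypothetical variant of the file above with trailing padding added to the first header; only spaces that follow a delimiter are skipped, so trailing padding needs an explicit strip on the column index.

```python
from io import StringIO

import pandas as pd

# Hypothetical variant of the sample file: "name " carries trailing padding,
# "    amount" carries leading padding.
raw = "name ,    amount\nAlice,100\nBob,200"

# Default parsing keeps every space as part of the header.
default_cols = pd.read_csv(StringIO(raw)).columns.tolist()

# skipinitialspace=True drops spaces that follow a delimiter,
# but leaves the trailing space in "name " untouched.
skip_cols = pd.read_csv(StringIO(raw), skipinitialspace=True).columns.tolist()

# Stripping the column index afterwards handles both sides.
df = pd.read_csv(StringIO(raw))
df.columns = df.columns.str.strip()
stripped_cols = df.columns.tolist()

print(default_cols)
print(skip_cols)
print(stripped_cols)
```

Stripping after the read leaves the data values exactly as parsed, which may or may not be what you want.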

@jorisvandenbossche jorisvandenbossche added the IO CSV read_csv, to_csv label Dec 13, 2014
@mrocklin
Contributor Author

Ah, this resolves my underlying issue. Thank you for pointing me to the right keyword argument. I apologize for not being sufficiently thorough.

The two directions you pointed out both seem like valid questions. I'll state them again below:

  1. Should the default value of skipinitialspace be set to True?
  2. Should we skip initial spaces by default in columns, even if skipinitialspace=False?

I have not personally encountered a dataset where positive answers to both of these questions would have had a negative effect, and I have encountered a few where they would have had a positive one. Admittedly, though, I have only encountered a small subset of datasets.
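
To make the difference between the two questions concrete, here is a rough sketch with hypothetical data in which the values are padded as well. Option 1 is the existing skipinitialspace=True keyword. Option 2 is not an existing read_csv mode, so it is emulated here by left-stripping the column names after a default read.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: both the header and the values have a space after the comma.
data = "name, city\nAlice, Paris\nBob, Lima"

# Option 1: skip initial spaces everywhere (header and values).
option1 = pd.read_csv(StringIO(data), skipinitialspace=True)

# Option 2 (emulated, not a pandas option): strip leading spaces
# from the header only and leave the values as parsed.
option2 = pd.read_csv(StringIO(data)).rename(columns=str.lstrip)

print(option1.columns.tolist(), option1["city"].tolist())
print(option2.columns.tolist(), option2["city"].tolist())
```

Both options give the same column names; they differ only in whether the string values keep their leading space.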

@cpcloud
Member

cpcloud commented Dec 13, 2014

I wonder if @wesm would be able to comment on why it doesn't behave the way @mrocklin says above.

@imadcat

imadcat commented Aug 7, 2018

This default behavior just cost me several hours trying to figure out why there was a KeyError when joining two data frames read from CSVs. It's really difficult to see that there are leading spaces in a column's name from Python terminal output.
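
A small sketch of how this failure can be reproduced and diagnosed, assuming a recent pandas on Python 3. The two frames and the "tier" column are hypothetical stand-ins for the CSVs described above. Printing the column names as a list shows the quotes, and therefore the padding, which the tabular display hides.

```python
from io import StringIO

import pandas as pd

# Hypothetical inputs: the left file has a padded "amount" header.
left = pd.read_csv(StringIO("name,    amount\nAlice,100\nBob,200"))
right = pd.read_csv(StringIO("amount,tier\n100,basic\n200,premium"))

# The list repr quotes each name, so leading spaces become visible.
print(left.columns.tolist())

# Joining on "amount" fails because the left key is really "    amount".
try:
    left.merge(right, on="amount")
    merge_failed = False
except KeyError:
    merge_failed = True

# Stripping the column names first makes the keys line up.
cleaned = left.rename(columns=str.strip)
merged = cleaned.merge(right, on="amount")
print(merged)
```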

@gitgithan

Could it be a useful feature to warn users about leading/trailing spaces in index/column names when using the import functions?
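
Such a warning can be approximated in user code today. The sketch below is only an illustration: read_csv_warn_whitespace is a hypothetical helper, not part of pandas, and it checks column names only, not the index.

```python
import warnings
from io import StringIO

import pandas as pd


def read_csv_warn_whitespace(source, **kwargs):
    """Hypothetical wrapper: read a CSV and warn about padded column names."""
    df = pd.read_csv(source, **kwargs)
    # Collect string column names that change when stripped.
    padded = [c for c in df.columns if isinstance(c, str) and c != c.strip()]
    if padded:
        warnings.warn(
            f"Column names with leading/trailing whitespace: {padded!r}",
            UserWarning,
            stacklevel=2,
        )
    return df


# Capture warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = read_csv_warn_whitespace(StringIO("name,    amount\nAlice,100\nBob,200"))

messages = [str(w.message) for w in caught]
print(messages)
```

The wrapper returns the frame unchanged, so it reports the padding without altering what read_csv produced.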

@rhshadrach
Member

I'm -0 here; to me the most expected behavior of pd.read_csv('filename.csv') is to read the csv file with no additional processing. Changing the result, even if it's thought to be a very common use case, is unexpected. This agrees with a common CSV specification:

Spaces are considered part of a field and should not be ignored.

But of course, there is no true standard.

@phofl
Member

phofl commented Nov 26, 2021

Duplicate of #14460

@phofl phofl closed this as completed Nov 26, 2021

8 participants