Read fwf try2 #51018
Looking at the errors thrown by Many test results depend on stripping all whitespace from tabular data without explicitly specifying such behaviour.
What are the maintainers' thoughts on In my background and recent use case, a fixed width file possibly came from a tape, likely does not even have line terminators, may contain packed data,... In this scenario, leading whitespace within fields is critical, delimiters are unheard of, and visually / programmatically inferring field / column boundaries wouldn't be possible. Hence, my opinion is that Thanks!
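To make the tape-style scenario above concrete, here is a minimal pure-Python sketch of reading fixed-width records that have no line terminators and no delimiters. The helper name `split_fixed_width`, the record length, the column boundaries, and the sample data are all hypothetical and not taken from the PR; the point is only that fields are recovered by position, so leading and trailing spaces survive untouched.

```python
from typing import List, Tuple


def split_fixed_width(
    blob: str, record_len: int, colspecs: List[Tuple[int, int]]
) -> List[List[str]]:
    """Cut a terminator-free string into records, then into fields by position.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    if len(blob) % record_len != 0:
        raise ValueError("data length is not a multiple of the record length")
    # Each record is exactly record_len characters; there are no newlines.
    records = [blob[i:i + record_len] for i in range(0, len(blob), record_len)]
    # Fields are half-open [start, stop) slices; nothing is stripped.
    return [[rec[start:stop] for start, stop in colspecs] for rec in records]


# Made-up data: two 8-character records, two 4-character fields each.
blob = "  ab  12cd     3"
rows = split_fixed_width(blob, record_len=8, colspecs=[(0, 4), (4, 8)])
print(rows)
```

Because no stripping happens, `"  ab"` and `"cd  "` remain distinct from `"ab"` and `"cd"`, which is the property the comment argues a fixed-width reader should preserve.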
I've given this PR further thought, and it'd be nice to also accept a tuple(bool, bool) for I'll hold off on developing this feature 'til I hear feedback on the PR and the whole delimiters vs fixed width files situation. Thank you.
Not totally sure how to fix. My understanding is that delimiter should be the number of whitespace characters that separate the fields, so that additional whitespace that is not part of the field delimiter is preserved? Why exactly does this not work right now?
That's how it works, but if there are delimiters of any kind, then the file is either a table (delimiters are spaces) or a form of CSV. The file may have columns / fields and lines / records of a fixed width, but it's not a Fixed Width File™.
Because In a Fixed Width File™, leading spaces are not simply padding, they're important - like in a Python script file. In this anonymized and trimmed fixed-width data, there are several fields, but no delimiters:
The default behaviour for An "anti-pattern", IMHO. Also, this isn't mentioned at all in the API reference. It's mentioned obliquely in the user guide, but is confusing:
I read that, not thinking of potential leading whitespace as "filler" and didn't realize data was being mangled. This is fairly clear, though further down the page:
It appears others have similar expectations: https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space
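The stripping described in this comment can be reproduced with a small sketch. The sample data and column boundaries below are made up. The second call passes `delimiter="\n\t"`, which is the workaround commonly suggested for the Stack Overflow questions linked above; it relies on `delimiter` being treated as the set of filler characters to strip, so spaces are no longer removed. `dtype=str` is used so that no numeric conversion hides the difference.

```python
import io

import pandas as pd

# Made-up fixed-width data: two 4-character fields per line, no delimiters.
data = "  ab  12\ncd     3\n"
colspecs = [(0, 4), (4, 8)]

# Default behaviour: spaces count as filler and are stripped from each field.
default = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None, dtype=str)

# Workaround: name filler characters that exclude the space, so spaces are kept.
kept = pd.read_fwf(
    io.StringIO(data), colspecs=colspecs, header=None, dtype=str, delimiter="\n\t"
)

print(default[0].tolist())
print(kept[0].tolist())
```

With the default call the first column comes back without its padding, while the `delimiter` override returns each field at its full four-character width.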
Agree - it's tricky. I don't think this will be a popular idea, but fixed width files and tables are distinctly different and there already exists a method for reading tables; users ought to be directed to the proper tool. Merely documenting this anti-pattern of delimiter + colspecs (#49832) doesn't get to the root of the problem. I suggest people reading data where the file may be of a fixed width, but is basically tabular data should be using Perhaps Thank you for looking at this issue.
Force-pushed from 31604c7 to 13db83a.
I messed up a I shall re-do the work, create a new issue & PR from a fresh clone of pandas and start fresh.
`doc/source/whatsnew/v2.0.0.rst` file if fixing a bug or adding a new feature.
Addresses "anti-patterns" discussed in #49832, where fixed width files have whitespace removed from fields by default and require the `delimiter` option to override this behaviour.
Specifically, @phofl commented in #49832 about preferring a fix to documenting current behaviour.
Fixed width files shouldn't have delimiters, rather `colspecs` or `widths`. There seems to be some conflation between, say, `read_table` and `read_fwf` (and even `read_csv`).
This PR shouldn't interfere with `colspecs="infer"`, but IMHO that usage should be discouraged with fixed width files; sounds like a job for `read_table` or `read_csv`.
Feedback welcome, thanks.
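As a small illustration of describing a fixed width file by position rather than by delimiter, the sketch below reads the same made-up data once with `widths` and once with the equivalent `colspecs`. The data and field sizes are assumptions for the example only; contiguous widths of 4 and 4 correspond to the half-open intervals `(0, 4)` and `(4, 8)`.

```python
import io

import pandas as pd

# Made-up fixed-width data: two contiguous 4-character fields per line.
data = "  ab  12\ncd     3\n"

# Field layout given as contiguous widths...
by_widths = pd.read_fwf(io.StringIO(data), widths=[4, 4], header=None, dtype=str)

# ...and the same layout given as explicit half-open (start, stop) intervals.
by_colspecs = pd.read_fwf(
    io.StringIO(data), colspecs=[(0, 4), (4, 8)], header=None, dtype=str
)

print(by_widths.equals(by_colspecs))
```

Both calls state the layout explicitly, so nothing has to be inferred from the whitespace in the data itself.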