Read fwf try2 #51018

RonaldBarnes · 2023-01-27T10:43:58Z

closes pd.read_fwf removes leading and trailing whitespace #16772
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Addresses "anti-patterns" discussed in #49832 where fixed width files have whitespace removed from fields by default, and require delimiter option to override this behaviour.

Specifically, @phofl commented here: #(#49832 (comment)) about preferring a fix to documenting current behaviour.

Fixed width files shouldn't have delimiters, rather colspecs or widths. Seems to be some conflation between, say, read_table and read_fwf (and even read_csv).

This PR shouldn't interfere with colspecs="infer", but IMHO that usage should be discouraged with fixed width files; sounds like a job for read_table or read_csv.

Feedback welcome, thanks.

RonaldBarnes · 2023-01-28T21:15:19Z

Looking at the errors thrown by pytest pandast (I now know how to run that pre-commit), I could use some guidance from the maintainers before continuing.

Many test results depend on stripping all whitespace from tabular data without explicitly specifying such behaviour.

Should I change the test parameters to get the same results?
Change the results to match the proposed new behaviour?
Change the data to non-tabular data?
Move the tests into read_table (there is no existing test file for read_table)?

What are the maintainers' thoughts on read_fwf being so intertwined with the concept of delimiters?

In my background and recent use case, a fixed width file possibly came from a tape, likely does not even have line terminators, may contain packed data,...

In this scenario, leading whitespace within fields is critical, delimiters are unheard of, and visually / programmatically inferring field / column boundaries wouldn't be possible.

Hence, my opinion is that read_table should be encouraged for many scenarios the documentation are actually using read_fwf for.

Thanks!

RonaldBarnes · 2023-01-31T17:13:36Z

I've given this PR further thought, and it'd be nice to also accept a tuple(bool, bool) for keep_whitespace for indicating a (left,right) stripping of whitespace.

I'll hold off on developing this feature 'til I hear feedback on the PR and the whole delimiters vs fixed width files situation.

Thank you.

phofl · 2023-01-31T23:43:37Z

Not totally sure how to fix. My understanding is that delimiter should be the number of whitespaces that separate the fields to preserve additional whitespaces that are not part of the filed-delimiter?

Why exactly does this not work right now?

RonaldBarnes · 2023-02-02T18:50:14Z

My understanding is that delimiter should be the number of whitespaces that separate the fields to preserve additional whitespaces that are not part of the filed-delimiter?

That's how it works, but if there are delimiters of any kind, then the file is either a table (delimiters are spaces) or a form of CSV. The file may have columns / fields and lines / records of a fixed width, but it's not a Fixed Width File™.

Why exactly does this not work right now?

Because read_fwf changes the data as it reads it in and that missing data cannot be recovered (leading spaces).

In a Fixed Width File™, leading spaces are not simply padding, they're important - like in a Python script file.

In this anonymized and trimmed fixed-width data, there are several fields, but no delimiters:

00000013753x arxb91 ...
00000021308xjrxb104 ...
00000037098xjrxb105 ...
00000048336xjrxb106 ...
00000059290xjrxb108 ...
00000067863xjrxb109 ...

The default behaviour for read_fwf will trim delimiter (default: \r\n\t ) from each field, unless a delimiter is set to a character that (hopefully) does not exist at a field boundary within the data (\r or \n are safe bets).

An "anti-pattern", IMHO.

Also, this isn't mentioned at all in the API reference.

It's mentioned obliquely in the user guide, but is confusing:

delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., ‘~’).

I read that, not thinking of potential leading whitespace as "filler" and didn't realize data was being mangled.

This is fairly clear, though further down the page:

The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between the columns in the file.

It appears others have similar expectations:

https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space

https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space

Not totally sure how to fix.

Agree - it's tricky.

I don't think this will be a popular idea, but fixed width files and tables are distinctly different and there already exists a method for reading tables; users ought to be directed to the proper tool.

Merely documenting #49832 this anti-pattern of delimiter + colspecs doesn't get to the root of the problem.

I suggest people reading data where the file may be of a fixed width, but is basically tabular data should be using read_table.

Perhaps delimiter in the context of read_fwf can be retained for colspecs="infer", but its other functionality should be moved to, say, keep_whitespace.

Thank you for looking at this issue.

RonaldBarnes · 2023-02-12T09:13:08Z

I messed up a git merge and closed this PR.

I shall re-do the work, create a new issue & PR from a fresh clone of pandas and start fresh.

mroeschke requested a review from phofl January 31, 2023 21:29

mroeschke added the IO Data IO issues that don't fit into a more specific label label Feb 1, 2023

RonaldBarnes closed this Feb 9, 2023

RonaldBarnes force-pushed the read_fwf_try2 branch from 31604c7 to 13db83a Compare February 9, 2023 11:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Read fwf try2 #51018

Read fwf try2 #51018

Uh oh!

RonaldBarnes commented Jan 27, 2023 •

edited

Loading

Uh oh!

RonaldBarnes commented Jan 28, 2023

Uh oh!

RonaldBarnes commented Jan 31, 2023

Uh oh!

phofl commented Jan 31, 2023

Uh oh!

RonaldBarnes commented Feb 2, 2023

Uh oh!

RonaldBarnes commented Feb 12, 2023

Uh oh!

Uh oh!

Uh oh!

Read fwf try2 #51018

Read fwf try2 #51018

Uh oh!

Conversation

RonaldBarnes commented Jan 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RonaldBarnes commented Jan 28, 2023

Uh oh!

RonaldBarnes commented Jan 31, 2023

Uh oh!

phofl commented Jan 31, 2023

Uh oh!

RonaldBarnes commented Feb 2, 2023

Uh oh!

RonaldBarnes commented Feb 12, 2023

Uh oh!

Uh oh!

RonaldBarnes commented Jan 27, 2023 •

edited

Loading