add support for parsing CRLF line endings #3

mvdan · 2021-09-08T11:05:10Z

(see commit message)

With a test.

warpfork

Confirming my read of the behavior of this:

The hunk body will never be parsed as containing trailing 'CR' bytes. In other words, it effectively normalizes the content seen in memory after parsing to be unix-style linebreaks.
The patch behavior is unchanged. (So, patch operations still blithely assume 'LF' only.)

What does this mean for users?

I think this will have the effect of letting someone in the user story of "I work on windows, and my git checkouts switch files to CRLF, and I still want them to be read as if they contain the same data as if I was working on linux, or had configured my git to be sane" be happy.

Okay. I guess this seems better than not. I don't think there's any strong justification to make sure 'CR' bytes are preserved; it would be fragile anyway, seeing as how other systems (again, e.g. git) will fiddle with those things readily.

But anyone using the patch system will still get a file out with no 'CR' bytes.

I guess that also still seems acceptable, at least. This PR doesn't make anything about patch worse. (And if we did want to make it better, I frankly don't know how: I don't know what a "better" behavior for patching should be, nor am I really familiar with what the user journey is like for someone living in that git-fiddles-with-line-break-normalziation world, so I can't speculate. Someone who comes in with an idea for this, and a PR, would be welcome, I suppose.)

Okay. SGTM. It's better, and it's not worse. 👍

mvdan · 2021-09-08T11:25:01Z

I didn't really think about other APIs at all. I just fixed the one that was affecting me :)

And yup, the change makes it ignore the CR byte in CRLF line endings. Kinda like if we ran all input files through dos2unix. I think most compilers/parsers of plaintext files work this way, as do libraries that "read lines" at a high level.

warpfork · 2021-09-08T12:02:32Z

I have tagged v0.4.0 to include this. :D

rvagg · 2021-09-09T01:40:15Z

Ugh, I'm not so keen on this, I opted to not solve it this way in the JS implementation. You're changing bytes in test fixture data. I think even dos2unix is a bit more intelligent than this in that it won't just do a global replace but detects whether the whole file is "dos" or not. This change is on a per-line basis so even if you have a file that's entirely unix but happen to have a single CR in your test fixture at the end of a line, then it's gone.

Not impossible to work around, but the user then has to do data munging on their own (e.g. convert everything to base64) and fixtures can become less descriptive as a document when that happens. My preferred approach for now was to tell users to deal with this problem themselves: https://github.com/rvagg/testmark.js#note-regarding-windows, but admittedly that's not awesome either because many users are going to end up being surprised at some point when a windows contributor comes along and finds out that stuff is breaking.

The other approach I considered, but was too much effort, was to check if all line endings were crlf and only then do a global replace.

warpfork · 2021-09-10T06:49:42Z

Can't say I'm "keen" on it either, honestly. I wish this problem would go away at the root: that windows would stop being pointlessly weird. (And you saw me let out a little steam and opinions, already, with my juxaposition of the words "git configuration" and "sane", above...)

@mvdan's convinced me that testmark shrugging at CR preservation is fine, though. It's never advertised being binary safe. (It also, already, can't actually support content that doesn't end in a linebreak, fun fact. Code blocks, after all, are always multiple lines in markdown.) Most critically: the most common user workflow is almost certainly going to involve a git checkout, and that git checkout is already manipulating CRs, so the battle for byte-exactitude on linebreaks is already lost before we've even drawn our swords.

Given that, I think then taking "the dumbest possible approach that works" -- so, blithely stripping all CRs -- is not an unreasonable approach.

If we didn't want to unconditionally strip CRs, the next alternative I would think of is: looking for the first LF in the file, and then looking to see if the byte right before it is CR, and if it is, then doing the trailing-CR removal for the rest of the file. But... why? Is that conditional branch really going to make anyone's life easier, clearer, or less surprising? I sort of doubt it.

The rare case where this very direct approach of nuking CR characters on read is going to matter is when someone intentionally put CR characters in their content, in just some parts of the file, when writing from linux (or, just generally some context where git won't already strip them back out). And I think the base rate for that is near zero. And at that point, the person in that scenario is probably working in binary data in a text file and... yeah, they should probably shape up and use hexidecimal or figure out some other appropriate escaping format, asap, because it's not just testmark that's going to be fighting with them.

Sidenote, I also don't generally consider it a problem if people have to add b64 or hexidemical munging wrappers to their data. I've done it myself a couple of times already. There's always some code to glue the raw bytes hunks into meaningful, executable tests. Usually I'm finding it's on the order of at least a dozen lines, or more if I'm setting up a whole scenario (example 1, example 2)... and so adding one more line to munge b64 or hex or other forms of escaping is not usually a meaningfully large increase in barriers.

warpfork · 2021-09-10T07:04:23Z

(Thank you both for working through this, and out loud, btw. This is indeed a tricky topic and I'd be uncomfortable going it alone. This conversation helps refine and understand (and build explanations for!) the choices, and is very valued. 🙇 )

Ref: warpfork/go-testmark#3

rvagg · 2021-09-10T07:16:56Z

it's all gross rvagg/testmark.js#3, but I know you two feel that too
rvagg/testmark.js#3

Ref: warpfork/go-testmark#3

mvdan · 2021-09-10T08:03:18Z

Nod to the ugliness. Agreed with Eric that we can't "win" this battle here. Markdown is plaintext, so we can only parse lines. If we try to be strict with LF vs CRLF, we'll be in a world of pain, and that was already evident with git's default behavior on Windows.

At the end of the day, I see formats like testmark and txtar as plaintext, so they are well suited for anything that is valid UTF-8 and formed of printable characters (and "newlines"). Anything outside that scope needs to either be encoded, or stored elsewhere.

add support for parsing CRLF line endings

b342782

With a test.

warpfork approved these changes Sep 8, 2021

View reviewed changes

warpfork merged commit d583ad6 into warpfork:master Sep 8, 2021

rvagg added a commit to rvagg/testmark.js that referenced this pull request Sep 10, 2021

fix: handle windows crlf with more brutality

805e176

Ref: warpfork/go-testmark#3

rvagg mentioned this pull request Sep 10, 2021

fix: handle windows crlf with more brutality rvagg/testmark.js#3

Merged

rvagg added a commit to rvagg/testmark.js that referenced this pull request Sep 10, 2021

fix: handle windows crlf with more brutality

7aa1c75

Ref: warpfork/go-testmark#3

warpfork mentioned this pull request Sep 24, 2021

fix parsing CRLF files, part 3 #5

Merged

aschmahmann mentioned this pull request Dec 1, 2021

Release v0.11 ipfs/kubo#8343

Closed

80 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add support for parsing CRLF line endings #3

add support for parsing CRLF line endings #3

mvdan commented Sep 8, 2021

warpfork left a comment

mvdan commented Sep 8, 2021

warpfork commented Sep 8, 2021

rvagg commented Sep 9, 2021

warpfork commented Sep 10, 2021 •

edited

Loading

warpfork commented Sep 10, 2021

rvagg commented Sep 10, 2021

mvdan commented Sep 10, 2021 •

edited

Loading

add support for parsing CRLF line endings #3

add support for parsing CRLF line endings #3

Conversation

mvdan commented Sep 8, 2021

warpfork left a comment

Choose a reason for hiding this comment

mvdan commented Sep 8, 2021

warpfork commented Sep 8, 2021

rvagg commented Sep 9, 2021

warpfork commented Sep 10, 2021 • edited Loading

warpfork commented Sep 10, 2021

rvagg commented Sep 10, 2021

mvdan commented Sep 10, 2021 • edited Loading

warpfork commented Sep 10, 2021 •

edited

Loading

mvdan commented Sep 10, 2021 •

edited

Loading