Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for parsing CRLF line endings #3

Merged
merged 1 commit into from
Sep 8, 2021
Merged

add support for parsing CRLF line endings #3

merged 1 commit into from
Sep 8, 2021

Conversation

mvdan
Copy link
Contributor

@mvdan mvdan commented Sep 8, 2021

(see commit message)

Copy link
Owner

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming my read of the behavior of this:

  • The hunk body will never be parsed as containing trailing 'CR' bytes. In other words, it effectively normalizes the content seen in memory after parsing to be unix-style linebreaks.
  • The patch behavior is unchanged. (So, patch operations still blithely assume 'LF' only.)

What does this mean for users?

I think this will have the effect of letting someone in the user story of "I work on windows, and my git checkouts switch files to CRLF, and I still want them to be read as if they contain the same data as if I was working on linux, or had configured my git to be sane" be happy.

Okay. I guess this seems better than not. I don't think there's any strong justification to make sure 'CR' bytes are preserved; it would be fragile anyway, seeing as how other systems (again, e.g. git) will fiddle with those things readily.

But anyone using the patch system will still get a file out with no 'CR' bytes.

I guess that also still seems acceptable, at least. This PR doesn't make anything about patch worse. (And if we did want to make it better, I frankly don't know how: I don't know what a "better" behavior for patching should be, nor am I really familiar with what the user journey is like for someone living in that git-fiddles-with-line-break-normalziation world, so I can't speculate. Someone who comes in with an idea for this, and a PR, would be welcome, I suppose.)


Okay. SGTM. It's better, and it's not worse. 👍

@mvdan
Copy link
Contributor Author

mvdan commented Sep 8, 2021

I didn't really think about other APIs at all. I just fixed the one that was affecting me :)

And yup, the change makes it ignore the CR byte in CRLF line endings. Kinda like if we ran all input files through dos2unix. I think most compilers/parsers of plaintext files work this way, as do libraries that "read lines" at a high level.

@warpfork warpfork merged commit d583ad6 into warpfork:master Sep 8, 2021
@warpfork
Copy link
Owner

warpfork commented Sep 8, 2021

I have tagged v0.4.0 to include this. :D

@rvagg
Copy link

rvagg commented Sep 9, 2021

Ugh, I'm not so keen on this, I opted to not solve it this way in the JS implementation. You're changing bytes in test fixture data. I think even dos2unix is a bit more intelligent than this in that it won't just do a global replace but detects whether the whole file is "dos" or not. This change is on a per-line basis so even if you have a file that's entirely unix but happen to have a single CR in your test fixture at the end of a line, then it's gone.

Not impossible to work around, but the user then has to do data munging on their own (e.g. convert everything to base64) and fixtures can become less descriptive as a document when that happens. My preferred approach for now was to tell users to deal with this problem themselves: https://github.com/rvagg/testmark.js#note-regarding-windows, but admittedly that's not awesome either because many users are going to end up being surprised at some point when a windows contributor comes along and finds out that stuff is breaking.

The other approach I considered, but was too much effort, was to check if all line endings were crlf and only then do a global replace.

@warpfork
Copy link
Owner

warpfork commented Sep 10, 2021

Can't say I'm "keen" on it either, honestly. I wish this problem would go away at the root: that windows would stop being pointlessly weird. (And you saw me let out a little steam and opinions, already, with my juxaposition of the words "git configuration" and "sane", above...)

@mvdan's convinced me that testmark shrugging at CR preservation is fine, though. It's never advertised being binary safe. (It also, already, can't actually support content that doesn't end in a linebreak, fun fact. Code blocks, after all, are always multiple lines in markdown.) Most critically: the most common user workflow is almost certainly going to involve a git checkout, and that git checkout is already manipulating CRs, so the battle for byte-exactitude on linebreaks is already lost before we've even drawn our swords.

Given that, I think then taking "the dumbest possible approach that works" -- so, blithely stripping all CRs -- is not an unreasonable approach.

If we didn't want to unconditionally strip CRs, the next alternative I would think of is: looking for the first LF in the file, and then looking to see if the byte right before it is CR, and if it is, then doing the trailing-CR removal for the rest of the file. But... why? Is that conditional branch really going to make anyone's life easier, clearer, or less surprising? I sort of doubt it.

The rare case where this very direct approach of nuking CR characters on read is going to matter is when someone intentionally put CR characters in their content, in just some parts of the file, when writing from linux (or, just generally some context where git won't already strip them back out). And I think the base rate for that is near zero. And at that point, the person in that scenario is probably working in binary data in a text file and... yeah, they should probably shape up and use hexidecimal or figure out some other appropriate escaping format, asap, because it's not just testmark that's going to be fighting with them.


Sidenote, I also don't generally consider it a problem if people have to add b64 or hexidemical munging wrappers to their data. I've done it myself a couple of times already. There's always some code to glue the raw bytes hunks into meaningful, executable tests. Usually I'm finding it's on the order of at least a dozen lines, or more if I'm setting up a whole scenario (example 1, example 2)... and so adding one more line to munge b64 or hex or other forms of escaping is not usually a meaningfully large increase in barriers.

@warpfork
Copy link
Owner

(Thank you both for working through this, and out loud, btw. This is indeed a tricky topic and I'd be uncomfortable going it alone. This conversation helps refine and understand (and build explanations for!) the choices, and is very valued. 🙇 )

@rvagg
Copy link

rvagg commented Sep 10, 2021

it's all gross rvagg/testmark.js#3, but I know you two feel that too
rvagg/testmark.js#3

rvagg added a commit to rvagg/testmark.js that referenced this pull request Sep 10, 2021
@mvdan
Copy link
Contributor Author

mvdan commented Sep 10, 2021

Nod to the ugliness. Agreed with Eric that we can't "win" this battle here. Markdown is plaintext, so we can only parse lines. If we try to be strict with LF vs CRLF, we'll be in a world of pain, and that was already evident with git's default behavior on Windows.

At the end of the day, I see formats like testmark and txtar as plaintext, so they are well suited for anything that is valid UTF-8 and formed of printable characters (and "newlines"). Anything outside that scope needs to either be encoded, or stored elsewhere.

@aschmahmann aschmahmann mentioned this pull request Dec 1, 2021
80 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants