Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change EOL convention to LF? #2

Closed
masinter opened this issue Jul 26, 2020 · 55 comments
Closed

Change EOL convention to LF? #2

masinter opened this issue Jul 26, 2020 · 55 comments
Assignees

Comments

@masinter
Copy link
Member

masinter commented Jul 26, 2020

@rmkaplan reported:

Windows may still be the outlier, Mac OSX and Unix/Linux are LF.

It’s really a question of what the default EOL convention should be when output streams are created. It shouldn’t matter for input streams, at least for the operations that read characters and not bytes. The character reading functions should all recognize any of the EOL conventions on files and map them into the internal CR (the value of (CHARCODE EOL). (I remember setting it up that way—silly to have to know where or how a file was created before you could read it).

I thought that Unicode might have something to say about this, but they aren’t very helpful. They point out (in the Unicode 3.0 book that I have) that the CR/LF/CRLF conventions are confused…and then they add to the confusion. They define a new code U+2028 as the unambiguous “line separator” (also an unambiguous “paragraph separator" U+2029).

My Xerox XCCS book doesn’t say anything about this, so I’m not sure what the representation is in XCCS-compliant files (which would have run-codes for mixed character-set strings, but control characters are unique in any run).

My temptation is to change the default, so that we are more compatible with Unix/Mac files. We were never compatible with Windows/CRLF. If not for all files, then at least for UTF8 and UTF16 files.

In prowling around, I have also discovered that some of the low-level files got corrupted by the Japanese. A substantial part of the LLREAD file, for example, is filled with conversion tables for various Japanese coding systems, and this stuff is mixed in in a number of other places. Should have been in separate and later files—hard to imagine that these would be needed in the INIT.SYSOUT.

@masinter
Copy link
Member Author

Should we make sure the files are all LF before we push the Interlisip source code into the GitHub repo?

@johnwcowan
Copy link

Windows can mostly cope with LF-terminated text files nowadays. For a long time the last holdout was WIndows Notepad, but that's finally been fixed; in addition, there are many third-party plain text editors that have no problem with LF.

@masinter
Copy link
Member Author

more @rmkaplan notes:
But there is an ancient EOL issue that I thought had been resolved many years ago, but seems to still be there in the code.

I did a lot of the character reading stuff in the early days, and apparently I also implemented the notion of an external format in the mid 90’s (which I don’t remember at all). But that’s the interface that makes it easy to add the UTF8 stuff.

The issue has to do with the low-level character reading macros. They all go through a macro \inchar that basically wraps a call to \nsin, that then does xccs/ns stuff inline but otherwise calls out to the external format character function.

But \inchar wraps the \nsin call inside another macro \checkeolc, which triggers on CR and LF, and does coercion to internal EOL (which happens to be CR) when the byte sequence matches the EOLCONVENTION of the file.

If it sees a CR and the EOLCONVENTION is CRLF, it peeks at the next byte and reads it if it sees LF. If the macros are told to decrement the byte count, then the CRLF cause the necessary extra decrement. If it doesn’t see the LF, then it returns CR by itself, which is OK because that’s the internal. But if the convention is CR (which it defaults to now) and the file has CRLF, the LF is left in the file, and that screws things up. And if sees a naked LF, it doesn’t get converted to EOL.

In aboriginal times—before text files moves back and forth between operating system environments with different conventions—there was a lot of fussiness about properly interpreting the EOL. This is still the correct thing to do for output files, so that files will look good in their home environment.

But at one point it became apparent that this is a mistake for input files. Since you don’t know the provenance of a file, if you are operating on a file—or a region of a file—with text or character input functions, then any of the 3 eol indicators that happen to appear in the file should be mapped to the internal EOL.

In fact, that is what PFCOPYBYTES is doing—it calls \NSIN directly instead of \INCHAR because it doesn’t want the accidental EOL convention of the file to get in the way.

I thought I had cleaned this up a long time ago, but apparently not. I’m tempted to take another crack at it: to change the \CHECKEOLC macro to scoop up all the options, and then to recompile the relatively few functions that contain the macros (which may run into the problem in LLREAD).

@masinter
Copy link
Member Author

masinter commented Jan 5, 2021

@rmkaplan @nbriggs re comments on #128
Isn't reading just a matter of having the right entry for LF in the readable? To treat it the same as CR?
Why would you need to patch LLREAD?

@nbriggs
Copy link
Contributor

nbriggs commented Jan 5, 2021

Have a look at the source for PFCOPYBYTES in PRINTFN

@masinter
Copy link
Member Author

masinter commented Jan 5, 2021

In fact, could you just add CHARSET of xccs vs. Unicode to the READTABLE too?
Instead of a different independent property?

@masinter
Copy link
Member Author

masinter commented Jan 5, 2021

Have a look at the source for PFCOPYBYTES in PRINTFN

looks like at a minimum PFCOPYBYTES is misnamed or the patch was put into the wrong place.

@rmkaplan
Copy link
Contributor

rmkaplan commented Jan 5, 2021 via email

@masinter
Copy link
Member Author

masinter commented Jan 5, 2021

i'm confused. This is the thread to talk about "Change EOL convention to LF".
There's another bug #19 about stack overflow with multiple values.
It's best to keep these discussions separate by choosing "View it on GitHub" rather than replying to the email notification.

@rmkaplan
Copy link
Contributor

rmkaplan commented Jan 5, 2021

It is PFCOPYBYTES (or SEE) that provokes the stack crash. If PFCOPYBYTES didn't do what it is doing, then it wouldn't be the provoker...but the bug would still be there

@johnwcowan
Copy link

CR by itself was last used on Mac Classic, and although WIndows still likes to write CRLFs, as I said above it can handle either LF or CRLF formats: Cygwin writes LF by default. As for those Unicode line and paragraph terminators, absolutely nobody uses them: they were a fine example "There are so many standards to choose from, and if you don't like any of them, wait till next year." So switching from CR-only to LF-only would be the Right Thing for usability's sake.

@masinter
Copy link
Member Author

masinter commented Apr 3, 2021

@rmkaplan I meant READFILE. Use REANDFILE to tell you which CRs you can turn into LF .
eolhack.zip

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 4, 2021 via email

@masinter
Copy link
Member Author

masinter commented Apr 4, 2021

I was looking for some method to apply that didn't depend on examining the sources and understanding them. Which CRs can be turned into LF's without changing the meaning of the code? What is the invariant? I took 'READFILE' a good invariant because that's what COMPARESOURCES compares. If you change a CR to a LF and it doesn't change the value returned by READFILE, it's OK.

This will leave alone CRs that appear between string quotes, which seems like the most common exception in these files.

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 4, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 5, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 5, 2021

Based on the discussion at our meeting today...

For Lisp source files, the simple way of doing the CR → LF conversion is just to TR. Characters in whitespace contexts are completely safe. Because LF's are escaped in strings, the string reader can recognize unescaped LF's and turn them back into the internal EOL (= CR).

But LF and CR in atom names are not discriminated, nor are LF's that were originally intended to be read as CR's (e.g. a user call to PRINTCCODE).

So it is not entirely safe to do a brute force translation, unless we know that there are no CR's on the file that are intended to remain as CR's. Better to LOAD PROP the file with a CR EOLCONVENTION and MAKEFILE NEW with LF EOLCONVENTION. This can be done safely (I think) with a few code adjustments:

The string and atom printers should trap the CR's and print them as CR's, not letting them go through to the EOLCONVENTION coercion. Same for (PRINTCCODE (CHARCODE EOL)) (which will now be semantically different than TERPRI). Thus, the file willl have LF's everywhere, and CR's only where semantically required. The readers should not need to be modified.

This is about the READ/PRIN2 sequence, where presumably we want bidirectional equivalence. I'm not sure about PRIN1, which will print things that presumably aren't going to be read by READ, but might actually read and parsed by the user's own code. I suppose that the best approximation would also be to do the same thing as PRIN2, preserve the CR's instead of letting them coerce to LF. If that doesn't format properly when viewed from the outside, then the user can fix their code (convert the LF's in their strings). But then we are breaking the intuitive connection between internal EOL and newline on the outside. (Which raises the question: What if we changed the internal EOL to LF? How many places in the code do we see (CHARCODE EOL) or (CHARCODE CR)?)

In sum, it should be (pretty much) safe if we:

  1. Change the string/atom/charcode printers to preserve CR in their inputs.
  2. TR files that don't have CR's in atoms.
  3. For files with CR's in atoms, LOAD/MAKEFILE NEW, where the MAKEFILE is done to an LF file device.

Legacy files that are not converted, one way or the other, should still load properly with an LF EOLCONVENTION, as long as they don't contain atoms with LF's. The low-level character-reader will convert those to EOL.

@nbriggs
Copy link
Contributor

nbriggs commented Apr 5, 2021 via email

@nbriggs
Copy link
Contributor

nbriggs commented Apr 5, 2021

Also -- what about files with EOLCONVENTION CRLF ? (which Lisp will happily generate)

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 6, 2021 via email

@nbriggs
Copy link
Contributor

nbriggs commented Apr 6, 2021

If a CRLF file were converted with tr for CR=>LF I would presume the result would be a "LFLF" file being read with LF convention. I agree that in all these cases the whitespace isn't going to be a problem. I haven't looked at exactly what we print for the case of a string containing CRLF when we print it with each of the three EOLCONVENTIONs.

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 6, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 7, 2021

In the last comment, I meant "EOL type", not "time".

@masinter
Copy link
Member Author

masinter commented Apr 7, 2021

I'm sorry I wasn't clear. My "EOLHACK" was meant to be loaded in and run ONCE, on all the Interlisp source files in the repository.
It wasn't meant as a permanent patch to READFILE or anything else.

Then LOGOUT(T). commit -a all of the files that will appear changed in Git.
I looked at the results and I think there might be some additional sepr CR.

@rmkaplan
Copy link
Contributor

rmkaplan commented Apr 7, 2021 via email

@masinter
Copy link
Member Author

masinter commented Apr 7, 2021

i did the equivalent of that by just creating another directory parallel to the medley directory

A lot of the files differed because the (* ;;; "Copyright ..." ) strings have an embedded CR in them. So maybe not flag those as important?

With Git backing up everything, there's no need to increase the version number?

@rmkaplan
Copy link
Contributor

I ran across a function that is reading text as bytes and is there fore not synchronized with the EOL LF conventions or potential changes to external character encodings (like Unicode). This is the function FASL:READ-TEXT, used in the reading of DFASL files.

@rmkaplan rmkaplan reopened this Jun 10, 2021
@rmkaplan
Copy link
Contributor

Continuing my last comment...

This shows up immediately in the fact that the LF's that are now printed for EOL's in the header to DFASL files are copied to the screen as LFs, not as EOL's.

An obvious fix would be to change it that so that it does READCCODE instead of BIN. But I presume that this is used to read characters in the compiled code as well, which raises a caution flag. But if those are printed with \PRINDATUM, as I believe they are, then the reading should mirror that.

Is there any that should be done before proceeding with this change?

@masinter
Copy link
Member Author

I'm confused; how can you print an EOL to the header of a DFASL file when that is "copied to the screen".
This is for what? The end-of-line inside a quoted string?

@rmkaplan
Copy link
Contributor

rmkaplan commented Jun 10, 2021 via email

@masinter
Copy link
Member Author

could you give just a bit of context? like what file did you make in what context, and what happened when you loaded it?

Under what circumstance would you find a file NAME that includes (non-ASCII) XCCS characters?
I think we don't have support for Unicode filenames so I'm not sure how that's relevant to the EOL issue.

@rmkaplan
Copy link
Contributor

rmkaplan commented Jun 10, 2021 via email

@masinter
Copy link
Member Author

my inclination is to change the display device to treat CR, LF, and CRLF as equivalent.

@nbriggs
Copy link
Contributor

nbriggs commented Jun 11, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Jun 11, 2021 via email

@nbriggs
Copy link
Contributor

nbriggs commented Jun 11, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Jun 11, 2021 via email

@masinter
Copy link
Member Author

masinter commented Jul 2, 2021

should we delete the EOLCONV scripts ?

@johnwcowan
Copy link

johnwcowan commented Jul 2, 2021 via email

@rmkaplan
Copy link
Contributor

rmkaplan commented Jul 3, 2021

Medley source files can be effectively converted by LOAD-PROP and MAKEFILE-NEW, so it isn't strictly necessary. But is there any harm in keeping it?

@masinter
Copy link
Member Author

masinter commented Jul 3, 2021

I think people should de-install the script in current repos since the tr isn't needed anymore and would possibly hinder us in future (e.g. it was incompatible with lfe for github large files); my note was cryptic.

@masinter
Copy link
Member Author

masinter commented Jul 5, 2021

There might be some issues left which should be opened as new issues

@masinter masinter closed this as completed Jul 5, 2021
masinter pushed a commit that referenced this issue Oct 21, 2021
* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Build loadup (#1)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Build loadup (#2)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup

* Build loadup (#3)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup

* Move sysouts to correct location

* Build loadup (#4)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup
masinter pushed a commit that referenced this issue Oct 23, 2021
* Build loadup (#1)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Build loadup (#2)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Build loadup (#3)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup

* Build loadup (#4)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup

* Build loadup (#5)

* Add new GitHub action to create medley release

* Update to manual trigger with release name as input

* Cleanup

* Cleanup

* Move sysouts to correct location

* Set root directory to medley
masinter pushed a commit that referenced this issue Sep 27, 2023
… Windows docker build / install (#1337)

* Add cygwin-sdl build to buildLoadup workflow; add installer for cygwin-sdl on windows

* Change how buildLoadup computes latest maiko release to accomodate draft releases

* Fix call to gh release list for maiko

* Debugging call to gh release list for maiko

* Debugging call to gh release list for maiko #2

* Debugging call to gh release list for maiko #3

* Debugging call to gh release list for maiko #4

* Debugging call to gh release list for maiko #5

* Debugging call to gh release list for maiko #6

* Change maiko downloads to accoiunt for draft releases

* Change maiko downloads to account for draft releases #2

* Specify shell (powershell) for Download cygwin installler

* Few cleanup items on cygwin-install

* Update ShellWhich to use command -v instead of which because which returns to much crap on cygwin and command -v is more portable overall

* Switch from using medley-loadup & -runtime tars to medley-full-*.tgz so we get full release incl notecards; delete maiko on install and replace with cygwin maiko

* Make sure Notecards doesn't try to load its HASH fileon PostGreet - for apps.sysout

* Add xdg-utils to cygwin install to support ShellBrowser

* Odds and ends on cygwin build

* Redo medley.iss install script to use tar from Windows rather than cygwin tar because cygwin tar was messing up ACLs in windows.  Needed to change creation of medley.bat accordingly.

* Remove junk lines from buildLoadup.yml

* Restore accidently deleted line to buildLoadup.yml

* Fix multiple issues with cygwin_installer filename; arrange to remove placeholder.txt from the release assets at the end of cygwin installer

* Change name of job from windows_installer to cygwin_installer

* Fix missing GH_TOKEN is removal of placeholder.txt; fix naming of output file in medley.iss

* Fiddling with getting cygwin-installer name right

* Redoing merge of medley.sh/medley.command to handle the Darwin plus Cygwin cases; is medley.iss recreate symbolic links surrounding the medley.sh script

* Fix typos/syntrax errors in medley.sh/medley.command
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants