Description
When I did the external formats, I didn't understand all the ways that character sets (presumably only for XCCS encoded files) are threaded through the Medley code, so I only cleaned up the most obvious scenarios. I left some of the other confusing pathways for later, or hopefully for never. Now I want to address the rest of it. This is a long description of what I now understand is the current situation--bear with me--and what I propose to do.
Conceptually, accessing and changing the character set should operate like reading and writing characters, so that pieces of code (like Tedit) don't have to branch on the format of the backing stream they are operating on; they just call a generic function to do whatever they think they need. Currently the code mostly, but not completely, works that way.
A stream has a charset field CHARSET that was put there originally to support the runcode representation of XCCS characters. The purpose of the runcode representation is to keep files small when scaling up to multi-byte characters, on the assumption that sequences of characters tend to belong to the same character set (high-order byte)--a Greek character is likely to be followed by another Greek character, ASCII by ASCII, etc.
The run encoding thus allows a sequence of 16-bit characters with the same high-order byte to be represented as a sequence of single bytes. When a character is read from an XCCS file, the XCCS implementation of the generic \INCCODE function packs the current CHARSET byte onto the front of each byte it retrieves from the file, and when a character is written with a high-order byte that matches the CHARSET, only the low-order byte is output (the XCCS implementation of the generic \OUTCHAR).
When a character is written whose charset is not the same as the stream's view of the current CHARSET, the CHARSET field is changed and charset shifting bytes (255 followed by the new charset byte) are written in the file. So the stream's field and the bytes in the file are consistent.
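To make the run encoding concrete, here is a minimal Python sketch of the scheme as described above. It is a model, not the Medley code: the names `encode_runs` and `decode_runs` are hypothetical, the charset value 0x26 is just an illustrative high-order byte, and the sketch assumes that neither byte of a character is 255 and ignores the non-runcoded mode.

```python
SHIFT = 0xFF  # charset-shift marker byte (255), as described above


def encode_runs(codes, charset=0):
    """Encode 16-bit character codes as run-coded bytes.

    `charset` plays the role of the stream's CHARSET field.
    Assumption: neither byte of any code is 255.
    """
    out = bytearray()
    for code in codes:
        high, low = code >> 8, code & 0xFF
        if high != charset:
            # The stream's view of the charset changes, so record the
            # shift in the file: 255 followed by the new charset byte.
            out += bytes([SHIFT, high])
            charset = high
        out.append(low)  # only the low-order byte is written
    return bytes(out), charset


def decode_runs(data, charset=0):
    """Decode run-coded bytes back into 16-bit character codes."""
    codes = []
    i = 0
    while i < len(data):
        if data[i] == SHIFT:
            charset = data[i + 1]  # shift: next byte is the new charset
            i += 2
        else:
            # Pack the current charset onto the front of the byte read.
            codes.append((charset << 8) | data[i])
            i += 1
    return codes, charset
```

Two characters in charset 0x26 followed by one in charset 0 encode to seven bytes: a shift pair, two low bytes, another shift pair, and one low byte. The field and the file stay in step because every change to `charset` is accompanied by a shift pair in the output.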
A stream can also be put into a non-runcoded mode, in which case the characters are all written as 2 bytes. But again, the CHARSET field and shifting bytes in the file are synchronized to say that that mode is in play (at least for that region of the file, until another shifting byte is seen).
Of course, none of this is relevant or makes sense for external formats (like UTF-8) that map bytes to characters locally, without depending on a character-set context maintained in some other way. And in particular, it should never happen that charset shifting bytes (255...) get written into a file that is UTF-8 encoded.
The big change in the external format design was to remove from the generic \OUTCHAR function any knowledge of character sets or shifting bytes, and to push those down into the XCCS implementation function (and obviously not into the UTF-8 function). As a result, implicit charset-shifting bytes are put out only as needed for XCCS-formatted files.
The legacy setup also included another tangled pathway for manipulating the CHARSET field of a stream. I thought it was just fooling with the CHARSET field, but I now see that it could also operate under the table to put out the magic shifting bytes, even in UTF-8 files. That needs to be rearchitected.
The user entry for manipulating the stream's charset is the function (CHARSET stream charset). Notionally, this just sets the stream's CHARSET field to charset, and returns the old value. The documentation says, however, that it may also put out extra bytes in an output stream, if those are needed.
This probably started out with a simple implementation, but I think it got more complicated when Common Lisp's meta streams (broadcast, concatenated...) were introduced.
So CHARSET doesn't operate directly on the stream it is given; it calls a helper function ACCESS-CHARSET to do the work. That function then passes off responsibility to a function that it retrieves from the stream's file device (FDEV), its CHARSETFN. Although that's already a little surprising--why the FDEV and not the stream itself--the point is that each of the meta-stream methods knows how to run over its subsidiary streams and apply ACCESS-CHARSET to each of those. The recursion walks through all of the streams until it finally ends up at the leaf streams of the meta-stream tree. And there it finally finds a CHARSETFN function \GENERIC.CHARSET in all of those streams' FDEVs (DSK, UNIX...) that actually changes the CHARSET field of a real stream.
\GENERIC.CHARSET only changes the CHARSET field; it doesn't actually write any magic bytes. That happens back at the top-level CHARSET function, through another indirection.
Each of our streams has another vector of functions (its IMAGEOPS array) that implement the various image-stream operations (DSPFONT, DSPDRAWLINE, etc.). This includes IMCHARSET as one of its methods, and that's what CHARSET invokes after its call to ACCESS-CHARSET. The IMCHARSET method for the "noimage" imageops that most file-streams use is the one that puts out the magic character-shifting bytes, if the charset argument to CHARSET doesn't match the CHARSET field of the stream.
Those bytes are written by BOUT to the top-level stream; if it's a meta stream, then presumably the BOUT function of the meta stream invokes the corresponding BOUT on all its daughter streams, recursively, until it eventually gets to the streams where \GENERIC.CHARSET was found in the earlier recursion.
I was not able to decipher this twisty configuration when I looked at it several years ago--and I passed. It has several problems: the IMCHARSET method is unaware of external formats, so calling CHARSET might result in magic bytes even if its top-level argument stream is formatted as UTF-8. And the leaf streams of a meta stream might have different external formats, some that want the bytes and some that don't, but they would all get them.
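The second problem can be illustrated with a simplified Python model of the legacy pathway. This is only a sketch of the behavior described above, not the actual code: `LeafStream`, `BroadcastStream`, and `legacy_charset` are hypothetical names, the external format is reduced to a string, and the FDEV and IMAGEOPS indirections are collapsed into plain methods.

```python
class LeafStream:
    """Stand-in for a real file stream (hypothetical model)."""

    def __init__(self, external_format):
        self.external_format = external_format  # e.g. "XCCS" or "UTF-8"
        self.charset = 0                        # the CHARSET field
        self.bytes_out = bytearray()

    def access_charset(self, new):
        # Models \GENERIC.CHARSET: only the field changes, nothing is written.
        old, self.charset = self.charset, new
        return old

    def bout(self, byte):
        self.bytes_out.append(byte)


class BroadcastStream:
    """Stand-in for a meta stream that fans out to its children."""

    def __init__(self, children):
        self.children = children

    def access_charset(self, new):
        olds = [child.access_charset(new) for child in self.children]
        return olds[0] if olds else None

    def bout(self, byte):
        for child in self.children:
            child.bout(byte)


def legacy_charset(stream, new):
    """Models the legacy CHARSET entry: set the field, then mark the file."""
    old = stream.access_charset(new)
    if old != new:
        # Models the "noimage" IMCHARSET method: it writes the shift
        # bytes without consulting any external format.
        stream.bout(0xFF)
        stream.bout(new)
    return old
```

In this model, calling `legacy_charset` on a broadcast stream over one XCCS leaf and one UTF-8 leaf leaves the bytes 255 and the new charset in both leaves, including the UTF-8 one that should never see them.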
So here's what I propose:
1. Remove IMCHARSET as an IMAGEOP method, and remove its invocation from the CHARSET function.
2. Add FORMATCHARSETFN as a new field in the EXTERNALFORMAT datatype.
3. Change \GENERIC.CHARSET so that it applies the FORMATCHARSETFN of the stream's external format, if such a function is specified. Otherwise, for backward compatibility, default to just setting the CHARSET field (and I suppose also putting out the magic bytes).
And, for finer control:
4. Add an optional argument DONTMARKFILE to the CHARSET entry that inhibits writing bytes, even for XCCS streams, because the client (e.g. TEDIT) wants to manage things more carefully.
The intended effect is to change charset management so that its interpretation (if any) is implemented by a stream's external format.
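As a rough illustration of the proposal, the following Python sketch models charset handling that is dispatched through the external format. Everything here is hypothetical modeling rather than Medley code: the class and function names, the `dont_mark_file` keyword standing in for DONTMARKFILE, and the choice to give UTF-8 a do-nothing function are all assumptions made for the example.

```python
SHIFT = 0xFF  # charset-shift marker byte (255)


class ExternalFormat:
    def __init__(self, name, format_charset_fn=None):
        self.name = name
        self.format_charset_fn = format_charset_fn  # models FORMATCHARSETFN


def xccs_charset_fn(stream, new, dont_mark_file):
    """XCCS: change the field and, unless inhibited, mark the file."""
    old = stream.charset
    if new != old:
        stream.charset = new
        if not dont_mark_file:
            stream.bytes_out += bytes([SHIFT, new])
    return old


def utf8_charset_fn(stream, new, dont_mark_file):
    """UTF-8 (assumed behavior): no charset context, never writes bytes."""
    return stream.charset


XCCS = ExternalFormat("XCCS", xccs_charset_fn)
UTF8 = ExternalFormat("UTF-8", utf8_charset_fn)


def generic_charset(stream, new, dont_mark_file=False):
    """Models the revised \\GENERIC.CHARSET: defer to the external format."""
    fn = stream.external_format.format_charset_fn
    if fn is not None:
        return fn(stream, new, dont_mark_file)
    # Backward-compatible default: just set the field. Whether the default
    # should also write shift bytes is left open in the proposal.
    old, stream.charset = stream.charset, new
    return old


class LeafStream:
    def __init__(self, external_format):
        self.external_format = external_format
        self.charset = 0
        self.bytes_out = bytearray()

    def access_charset(self, new, dont_mark_file=False):
        return generic_charset(self, new, dont_mark_file)


class BroadcastStream:
    def __init__(self, children):
        self.children = children

    def access_charset(self, new, dont_mark_file=False):
        olds = [c.access_charset(new, dont_mark_file) for c in self.children]
        return olds[0] if olds else None


def charset(stream, new, dont_mark_file=False):
    """Models the CHARSET entry: no image-op call, only the recursion."""
    return stream.access_charset(new, dont_mark_file)
```

With this shape, a broadcast stream over an XCCS leaf and a UTF-8 leaf writes shift bytes only to the XCCS leaf, and a client that passes `dont_mark_file=True` changes the XCCS field without touching the file. Each leaf decides for itself, which is the point of moving the interpretation into the external format.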