Unicode support #1
-
Ron reports " I figured out the format of the binary file, so I can extract the mappings that it contains. ... it has an enormous number of Japanese mappings, but then trails off on some of the higher post-Japanese character sets, like Hebrew, and some of the mathematical symbols. But it may be getting to the point of usability, so that we can read and write Unicode/UTF-8 files. The tricky part is that it only has entries for 188 characters per character set, with no 2-byte cells for the undefined regions of each 128 code panel of each character set. So at each panel boundary you have to bump the running XCC code by 33 and then skip code 127. The hexcode at position 0 is for 0,41—the space 0,40 isn’t represented. Within that, Unicode FFFD (the unicode missing-char slug) is used for unassigned XCC codes, and it seems that FFFF is used when whole panels are missing (the higher order panel for most of the Japanese). And finally, no cells are allocated for the unused/reserved character sets (1 through 40Q), so that the Unicode after 0,376 is for 41,41. |
-
This work is pretty much done and installed in lispcore/xfull.sysout. There are two aspects: the collection of XCCS-to-Unicode mapping files (in the directory lispcore/unicode/xerox; see the README.TXT), and the lispcore/library package UNICODE (see UNICODE.TXT), which defines the :UTF8 external file format and implements the read/write file-mapping behavior inside Medley.

The mapping files are fairly (but not completely) comprehensive and accurate for a substantial number of XCCS character sets. Future editing is best done outside of Medley in a UTF8/Unicode editor (e.g. TextEdit on the Mac). The mapping tables can be edited in Tedit, but it is more difficult because of our fairly incomplete XCCS display fonts (you see a lot of black boxes; some fonts are better than others).

The UNICODE package initializes the internal mapping tables for a common selection of character sets: Latin and extended Latin, plus the various symbol character sets. It also sets :UTF8 as the default external format for streams opened on the {UNIX} file device. {UNIX} is an alternative to {DSK} as a view of the local file system, but it is a little bit raw and should be used with caution: it does not simulate the conventional Medley file-versioning behavior, and files can get out of step.

As described in UNICODE.TXT, :UTF8 can be installed for particular streams by specifying the EXTERNALFORMAT parameter when they are opened with OPENSTREAM (or :EXTERNALFORMAT for CL:OPEN), or by STREAMPROP on an already open stream. It remains to design and provide a user-defined method for causing the external format to be changed for particular streams when they are opened deep inside some other function or subsystem (SEE, TEDIT, MAKEFILE) that gets only a filename at the interface level. One possibility is to extend OPENSTREAM with a user-definable function that applies to any newly opened stream before it is released to any consuming code. It could examine the open stream (full name, direction, property list associated with the root filename, etc.) and decide whether to change the format (via STREAMPROP); see the sketch after this comment.

One other note: the library package CLIPBOARD (also in the full sysout) implements an interface to the (currently Mac-only) clipboard, so that meta-C and meta-V can be used in TEDIT and SEDIT (for now) to move information into and out of Medley (or between two Medley windows, or windows in two Medleys running side by side). The clipboard stream is by definition :UTF8; it does conversion into and out of XCCS.
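The hook idea in the next-to-last paragraph might look something like the following Python sketch. It only models the control flow; STREAM_AFTER_OPEN_FNS, the Stream class, and the {UNIX} name test are hypothetical stand-ins, not Medley's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stream:                          # stand-in for a Medley stream object
    name: str
    direction: str = 'INPUT'
    external_format: str = 'XCCS'      # default format before any hook runs

STREAM_AFTER_OPEN_FNS: List[Callable[[Stream], None]] = []

def open_stream(name, direction='INPUT'):
    """Analogue of the proposed OPENSTREAM extension: open the stream, then
    offer it to each registered hook before any consuming code sees it."""
    stream = Stream(name, direction)
    for fn in STREAM_AFTER_OPEN_FNS:
        fn(stream)                     # hook may inspect name/direction and
    return stream                      # change the format, like STREAMPROP

# Example hook: treat every {UNIX} file as :UTF8.
def utf8_for_unix(stream):
    if stream.name.startswith('{UNIX}'):
        stream.external_format = 'UTF8'

STREAM_AFTER_OPEN_FNS.append(utf8_for_unix)
print(open_stream('{UNIX}/home/me/notes.txt').external_format)   # -> UTF8
```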
-
Can the mapping files be pulled out and posted here? The Unicode Consortium maintains mapping files between all sorts of character sets and Unicode. I have some (very old) experience with translating various mapping files to the format Unicode expects.
-
I have no objection. They conform to the format of other mapping files that I extracted from an old Unicode 3.0 disk. Perhaps someone at Unicode would have other data that would fill in some of the still-missing code points, or make sure that these are compatible with any changes between Unicode 3.0 (which I had) and modern versions.
-
Sounds good. The meaning of assigned Unicode code points has been stable since 2.0, so if your mappings were correct as of 3.0, they'll be correct indefinitely.
-
I figured that assigned points were presumably stable, that changes were additive.
There may also be a later version of XCCS than the Version 3 book that I have, with more codes filled in. But unless we can come up with more complete display fonts, I figured it wouldn't matter. I set it up, following a suggestion from Larry, so that unmapped codes on either side are mapped on the fly to values in unassigned or private regions, so that they will be preserved under read/write round-trips.
-
It's safe to use private-use codes that way (there are 6400 of them, U+E000–U+F8FF), but unassigned regions can't be, because essentially all of them below U+FFFF are assigned now. If 6400 code points isn't enough, there is U+F0000–U+FFFFD, which is more than enough for anything. I'll look into that when I get access to the mapping files.
-
That’s what I used on the Unicode side. On the XCCS side I used codes in the unused character sets starting at octal 5 (I think some Medley programs (Tedit? Sedit?) may use some of the lower code sets to signal different behaviors).
It only needs enough to represent the number of distinct unmapped characters that are seen in documents in a given Medley session. It would probably run out, for example, on a collection of Japanese documents if only the default character-set mappings, not the complete set of JIS mappings, are loaded.
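A minimal sketch of that on-the-fly allocation, on the Unicode side, assuming the BMP private-use area John describes; the dictionary and function names are hypothetical:

```python
PUA_START, PUA_END = 0xE000, 0xF8FF        # BMP private-use area: 6400 codes
xccs_to_pua, pua_to_xccs = {}, {}

def pua_for_unmapped(xccs_code):
    """Give an unmapped XCCS code a stable private-use code point, so it
    survives a write/read round-trip; the XCCS-side analogue would allocate
    codes in the unused character sets instead."""
    if xccs_code not in xccs_to_pua:
        pua = PUA_START + len(xccs_to_pua)
        if pua > PUA_END:
            raise RuntimeError('private-use area exhausted')
        xccs_to_pua[xccs_code] = pua
        pua_to_xccs[pua] = xccs_code
    return xccs_to_pua[xccs_code]
```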
-
I'm confused -- I thought the overflow was in the Unicode -> XCCS direction (there are some Unicode code points that aren't represented in XCCS), but XCCS -> Unicode could be complete? So the "unassigned" codes of Unicode wouldn't matter, except that they'd be codes without XCCS mappings (likely an error); it's the other direction, Unicode -> XCCS from outside UTF8 files with new additions like emoji, where you'd want some way to "stash" them inside an XCCS string.
-
No, there are a few things in XCCS that I can't find in Unicode. For example, the underlined capital letters and the arrows inside circles.
There are some others where the glyphs don't look the same but might be logically equivalent, in which case they seem like reasonable approximations.
-
The way to solve those two problems is with a couple of combining diacritics, U+0332 COMBINING LOW LINE and U+20DD COMBINING ENCLOSING CIRCLE. Thus underlined A is U+0041 U+0332 A̲, and circled up arrow is U+2191 U+20DD ↑⃝. The appearance may be suboptimal (it's extremely suboptimal on this Mac), but the codes are correct, and you put them in the mapping table separated by a space. If you attach the mapping table I can look it over.
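Those two sequences can be checked directly; a two-line illustration in Python (any language with Unicode strings will do):

```python
print('\u0041\u0332')   # LATIN CAPITAL LETTER A + COMBINING LOW LINE -> A̲
print('\u2191\u20DD')   # UPWARDS ARROW + COMBINING ENCLOSING CIRCLE -> ↑⃝
```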
-
John,
Here is a Dropbox link to all of the mapping tables. All but the xerox/ ones came off of the old Unicode 3.0 CD-ROM; the xerox/ subdirectory contains the ones that I have put together. I have tried to match the simple format of the other tables.
https://www.dropbox.com/sh/e3nrjc2o7ot6v13/AABmvaKl_6JHLpDP_a_6I6dya?dl=0
It would be simple to specify character-code sequences in the Xerox mapping tables, but I don't know whether the format from the Unicode 3.0 CD allows for that. And Medley as currently constituted would still have to ignore these, since the primitive interface to reading and writing traffics in single character codes.
But more complete tables might have value for other purposes. Is it a standard convention to allow multiple codes separated by spaces in files with this format? Or should I upgrade to a more general format?
Let me know what you think.
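For what it's worth, a parser for tables in the Unicode-Consortium style (hex source code, tab, hex Unicode value(s), with "#" comments) needs only one small change to tolerate multi-code entries: read the Unicode field as a tuple rather than a single value. A sketch, with the exact column layout assumed rather than taken from a spec:

```python
def parse_mapping_file(path):
    """Return {xccs_code: (unicode_code, ...)}; a one-element tuple is the
    ordinary single-character case, longer tuples are sequences."""
    table = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()    # drop comments and blanks
            if not line:
                continue
            fields = line.split()
            table[int(fields[0], 16)] = tuple(int(tok, 16) for tok in fields[1:])
    return table
```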
-
Thanks for the link. I can't find anything like that in the old mapping tables, because Unicode pretty much made sure that if something existed as a single codepoint in a character set at the time, it had a single-codepoint equivalent in Unicode. The only thing I can find that is vaguely similar is the APL mapping, which maps both single 8-bit characters and sequences of the form xx08yy, in other words character-backspace-character, into single Unicode characters. But even though APL's underlined caps were encoded on the APL side using the same approach of A-backspace-underscore, for whatever reason they weren't put in the mapping table using a combining low line. (Modern APL systems use lower case instead of underlined caps.) Still, I can't believe that it would be Wrong to use such sequences in the table, and even if Medley can't process them, outboard programs would be able to.
-
On further reflection, I think I understand the logic of sequences a little better. It reminds me of discussions I had with Joe Becker in the early days about the difference between the Lisp-internal encoding of atoms and strings that programs operate on and the mapping to their file and rendering representations.

The internal representation for characters with missing mappings is the codepoint assigned in the XCCS standard. Thus underlined A is seen by internal programs as 42,301 and circled right arrow appears as 357,333. Those would be rendered, when Medley displays strings and atoms, by the corresponding bitmaps in the various font tables (which probably don't exist, a separate issue).

The mapping tables associate the best (most likely?) Unicode code points with given XCCS characters, and the UTF8 transformation maps those Unicode code points to file byte sequences. But there is no reason why the table couldn't specify a sequence of Unicode characters for a given XCCS character, on the theory that such a sequence would be the most likely file representation for something with the intended appearance. A UTF8 byte sequence encoding 2 or more Unicode characters could be recognized as mapping to the single XCCS/Medley code. Presumably external editing or rendering programs will produce the intended images for the Unicode character sequences.

So we allow code to code-sequence mappings, and each side is then responsible for its own rendering. The probability of ever seeing any of these particular examples is infinitesimally small, but there may be more likely occurrences, and extending the machinery would allow for more complete specifications.
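On the write side such one-to-many entries are cheap; a sketch, where the two table entries are made-up examples keyed to the XCCS codes mentioned above:

```python
# Hypothetical table: XCCS code -> tuple of Unicode code points.
xccs_to_uniseq = {
    (0o42 << 8) | 0o301:  (0x0041, 0x0332),   # underlined A, 42,301 (assumed)
    (0o357 << 8) | 0o333: (0x2192, 0x20DD),   # circled right arrow, 357,333 (assumed)
}

def write_xccs(out, codes):
    """Write XCCS codes to a UTF-8 text stream, expanding any code whose
    mapping is a multi-character Unicode sequence."""
    for x in codes:
        for cp in xccs_to_uniseq.get(x, (0xFFFD,)):   # FFFD for the unmapped
            out.write(chr(cp))
```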
-
> But there is no reason why the table couldn't specify a sequence of Unicode characters for a given XCCS character, on the theory that such a sequence would be the most likely file-representation for something with the intended appearance. A UTF8 byte sequence encoding 2 or more Unicode characters could be recognized as mapping to the single XCCS/Medley code. Presumably external editing or rendering programs will produce the intended images for the Unicode character sequences.

Exactly so.

> So we allow code to code-sequence mappings, and each side is then responsible for its own rendering. The probability of ever seeing any of these particular examples is infinitesimally small,

Unless there is an APL interpreter hiding somewhere in the image!
-
Turns out that the sequence idea, though logically correct, does not have a very attractive implementation in Medley. The Unicode standard puts the combining characters after the base character, so that A would come before the underline. As noted, this particular case is very unlikely, but the much more likely case is for combining characters that represent diacritics (e.g. the ¨ in ü). ü appears as the base character u followed by the combining ¨, and that sequence should logically map into Medley as the single u-umlaut code in character set 361 (F1).

But the basechar-combiningchar file order means that whenever a UTF8 byte sequence for the character u appears, we would have to read the next sequence of UTF8 bytes to see whether it is followed by ¨ or some other combining character. The stream would have to advance perhaps several bytes to see what follows, and then back up its state if it doesn't see a combiner (e.g. reset the file pointer so that the following base char will be delivered on the next call to READC). Apart from a bit more complexity in the code, it might slow down character reading by a factor of two (the effect on writing should be negligible).

However, if we don't set this up, an external string like "Jürgen" with 6 characters will appear inside Medley as the 7-character string "Ju¨rgen". So maybe we have to take the hit.
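The reading side of that, modeled in Python with a one-character pushback instead of a file-pointer reset (the two mapping dictionaries are hypothetical):

```python
def read_xccs(f, seq_to_xccs, char_to_xccs):
    """Decode a UTF-8 text stream into XCCS codes, looking one code point
    ahead so that base+combining pairs (u then ¨) become a single XCCS code."""
    pending = None
    while True:
        ch = pending if pending is not None else f.read(1)
        pending = None
        if not ch:
            return
        nxt = f.read(1)
        if nxt and (ch + nxt) in seq_to_xccs:
            yield seq_to_xccs[ch + nxt]     # e.g. 'u\u0308' -> u-umlaut code
        else:
            pending = nxt or None           # push the unconsumed character back
            yield char_to_xccs.get(ch, ord(ch))

# import io
# table = {'u\u0308': (0o361 << 8) | 0o165}   # XCCS code here is assumed
# print(list(read_xccs(io.StringIO('Ju\u0308rgen'), table, {})))
```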
-
Unicode "normalization" NFC takes uncombined characters U + Umlaut and turns them into combined characters |
-
On reflection I think there is an implementation strategy that will minimize the added cost when a base character is not followed by one of its associated combining characters.
So the policy question is whether we want to try to ensure byte-to-byte equivalence when we read and then write, vs. try to provide an equivalence between external file appearance and Medley-internal appearance, and especially Medley-internal processing (what is the nth char, u or u-umlaut?).
I think this is independent of the problem of dealing with other kinds of external encodings and formats. If you make the mistake of treating an ISO8859-1 (or 2 or …) single-byte file as a UTF8 file and the file has any non-ASCII bytes, then you will likely trip over an illegal UTF8 byte sequence and get an error.
There is a separate package that defines various ISO external formats (although it probably needs to be updated to pivot through the larger set of Unicode mappings). We (or someone) could provide an additional heuristic, no-guarantees program for format/encoding discovery (along the lines of language-identification programs in NLP), and the user could stick that on the newly provided STREAM-AFTER-OPEN-FNS and switch the external format. But I don't think we should take responsibility for that.
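The crudest useful version of such a discovery heuristic is just "try UTF8, fall back": random 8-bit text is rarely valid UTF8, while ISO 8859-1 decoding never fails. A no-guarantees sketch of the kind of thing a user might hang on that hook:

```python
def sniff_external_format(path, nbytes=4096):
    """Guess between UTF8 and a single-byte format by trial-decoding the
    first few KB; heuristic only, with no guarantees."""
    with open(path, 'rb') as f:
        head = f.read(nbytes)
    try:
        head.decode('utf-8')
        return 'UTF8'
    except UnicodeDecodeError:
        return 'ISO8859-1'
```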
-
If you start from Lisp and write a Lisp XCCS string or symbol and read it back into Lisp, you should get the same XCCS codes, even if the external format was UTF8.
-
XCCS to Unicode and back is the easy case, since we can choose which of possibly several canonically equivalent Unicode representations to use for a given XCCS string—choose the one that we know will read back into the original. We are in control of the internal source, so we can control the external target.
The other direction is more delicate: if there are 2 or more Unicode sequences that are defined as canonically equivalent (= have the same appearance), and we want to choose the XCCS representation that preserves the apparent string-character relations (for a user who doesn’t know or care about combining chars and the like), then the write-out of that XCCS string may not have the original Unicode representation (although by Unicode specification it would look as if it does—it would be canonically equivalent according to Unicode even though the bytes are different).
I think it would be a mistake if NCHARS and NTHCHAR gave different results for strings or atoms that look exactly the same, depending on how they happened to be represented in a Unicode file. And certainly a mistake for there to be 2 or more atoms floating around with exactly the same visible print-names as determined by how some random external editors chose to represent the characters. That would be an invitation to hours-long debugging sessions.
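The NCHARS/NTHCHAR point is easy to demonstrate with the "Jürgen" example from earlier in the thread:

```python
import unicodedata

nfc = 'J\u00fcrgen'            # ü as one precomposed code point
nfd = 'Ju\u0308rgen'           # u followed by combining diaeresis
print(len(nfc), len(nfd))      # 6 7 -- the "same" string has two lengths
print(nfc == nfd)              # False, though the two render identically
print(unicodedata.normalize('NFC', nfd) == nfc)    # True after normalization
```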
-
It's my impression that the occurrence of unnormalized Unicode is rare.
-
Normalization Form C (aka NFC) could be applied at file read from UTF8 files.
-
Here are some principles I'd like to put forward:

1. When writing XCCS format to the outside world, it should be written as fully correct Unicode that correctly captures the meaning.
2. Insofar as practicable, generated Unicode should conform to Normalization Form C <https://unicode.org/reports/tr15/>. (This means that wherever possible precomposed Unicode characters are used instead of base characters followed by diacritics. This normalization form is used for almost all text, particularly text on the Web, so it is important to generate it.)
3. When reading Unicode and storing it in XCCS format, it does not matter if complete round-tripping is lost in edge cases.

Here is my attempt to spell out the consequences of these principles. (Note that my understanding of XCCS architecture is about 35 years old, from when I was last working with Star/Viewpoint and Interlisp. I have forgotten much, but I have been faithful to thee, Cynara — in my fashion.)

Principle 1 means that an XCCS A-with-underscore is output as LATIN CAPITAL LETTER A followed by COMBINING LOW LINE. It also means that an XCCS combining dot above followed by Q (heat transfer per unit time in thermodynamics) is output as LATIN CAPITAL LETTER Q followed by COMBINING DOT ABOVE. These are correct Unicode.

Principle 2 means that correct Unicode is not always enough. Thus not only is XCCS û output as LATIN SMALL LETTER U WITH CIRCUMFLEX, but in addition XCCS combining-circumflex followed by u is output in the same way, because that is the way in which û is normally represented in Unicode. It also means (assuming that ć doesn't have its own character in XCCS) that XCCS combining acute followed by c is output as LATIN SMALL LETTER C WITH ACUTE.

Principle 3 means that LATIN SMALL LETTER A followed by COMBINING GRAVE does not have to be handled in any special way, because it is most unlikely to occur in Unicode text. And if I'm right that there are no precomposed characters in XCCS that aren't also in Unicode, then we never have to deal with Unicode combining-character sequences on input. However, we do have to input LATIN SMALL LETTER AE WITH MACRON as combining macron followed by æ. Note also that Hebrew and Arabic vowel signs aren't treated as diacritics by either XCCS or Unicode, so they need no special treatment.

Perfect round-tripping from Unicode to XCCS to Unicode is simply not practical. There are 143,859 assigned codepoints in Unicode at present, and the number will only grow. So there will be many, many characters, and indeed character sequences, that XCCS simply can't handle without loss. But although perfect round-tripping from XCCS to Unicode to XCCS is possible, it isn't something that should be routinely done. Medley should be able to read and write either UTF-8 or XCCS files, and then a batch converter running outboard of Medley can do exactly the right thing with Unicode composing characters.

By the way, is there any chance of scanning the XCCS book and putting it on bitsavers.org? I have no clue where my 2.x book is, and I have never seen a later book at all.
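The ć and AE WITH MACRON cases from principles 2 and 3 can be checked against the Unicode database directly; a quick verification in Python:

```python
import unicodedata

# Principle 2: combining acute after c composes to the precomposed character.
print(unicodedata.normalize('NFC', 'c\u0301') == '\u0107')          # True (ć)

# Input side: AE WITH MACRON decomposes to æ + COMBINING MACRON.
print([hex(ord(c)) for c in unicodedata.normalize('NFD', '\u01E3')])
# ['0xe6', '0x304']
```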
-
One more point: U+FFFE and U+FFFF are undefined codepoints in themselves (they should never be used); they are not representations of undefined characters. U+FFFD is the Unicode codepoint used to mark an unconvertible character. It's also, indeed more often, used to represent an undecodable sequence of bytes.
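That more common role of U+FFFD is what any lenient UTF8 decoder produces for bad bytes, e.g.:

```python
print(b'A\xffB'.decode('utf-8', errors='replace'))   # A�B -- U+FFFD marks the bad byte
```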
-
Those codes are used in my mapping tables just to indicate to the Medley table-processing algorithms that a given XCCS code will never need a mapping (undefined in the XCCS standard) vs. one for which a mapping hasn't been determined.
They are interpreted only when the tables are being manipulated; they never make it into the online mapping tables.
-
So, let’s say that our mapping tables associate with an XCCS code the NFC-canonical code or code sequence that best corresponds to that XCCS code (as hopefully they do now). That is the code or code sequence that is used when the XCCS code is written to a file. And those Unicode codes/sequences will be read back as the corresponding XCCS code, as per the table entry. The round-trip from XCCS-to-Unicode-to-XCCS will be an identity. (Your principles 1 and 2.)
Separately, we should get (presumably from Unicode) a table that maps all non-canonical Unicode codes and code sequences to their NFC canonical representatives. Then we can extend the Medley Unicode-to-XCCS online translation table (but not the XCCS-to-Unicode one) so that all of those variants are also mapped to the XCCS code that corresponds to their representative.
This means that the round-trip from Unicode-to-XCCS-to-Unicode may not be an identity, because it will do implicit canonicalization. (Principle 3.) But it will preserve Unicode character meanings under the NFC equivalence.
Note that the way it is implemented now, it will also preserve round-trip equivalence in both directions for characters that have no correspondences, since those are mapped on the fly to unique unused/private codes.
There is a separate question about equivalence and canonicalization on the XCCS side (whether, for example, the virtual keyboard entry for ü produces a single XCCS composite character or a ¨u sequence (XCCS puts the combiners in front)). I think that’s out of scope for current goals.
Does anybody have a pointer to Unicode NFC canonicalization tables? It will be relatively easy to take those into account, but I don’t want to start guessing the possible variants. I like the strategy of letting our tables map only canonicals to canonicals (even if some of them, like underlined A, may be sequences), and then dealing with variations separately on either side.
—Ron
P.S. At some point I can scan the XCCS Version 3 book that I have. It’s beginning to fall apart anyway. But, to put it up, do we need to get copyright permission from Xerox?
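As far as I know, the variant-to-canonical table doesn't ship as a single file; it is derived algorithmically, but it can be generated from any implementation's normalizer. A sketch using Python's unicodedata, covering single-character decompositions in the BMP (longer combining sequences would need the full algorithm John describes below):

```python
import unicodedata

def variant_table(max_cp=0x10000):
    """Map each decomposed (NFD) spelling back to its NFC canonical form,
    for BMP characters that have a canonical decomposition."""
    table = {}
    for cp in range(max_cp):
        if 0xD800 <= cp <= 0xDFFF:          # skip the surrogate range
            continue
        ch = chr(cp)
        nfd = unicodedata.normalize('NFD', ch)
        if nfd != ch and unicodedata.normalize('NFC', nfd) == ch:
            table[nfd] = ch
    return table

print(len(variant_table()))                 # thousands of entries
```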
-
On Sun, Aug 16, 2020 at 12:52 AM rmkaplan wrote:

> So, let’s say that our mapping tables associate with an XCCS code the NFC-canonical code or code sequence that best corresponds to that XCCS code

Unfortunately, that can't be done in a purely tabular way, because XCCS is a generative character encoding, just as Unicode is, and unlike fixed encodings like ISO-Latin-*. For example, you can (IIRC) specify pretty much any diacritic followed by any base character, so that the repertoire of what a human being will recognize as a "character" is very, very large, much larger than the code set itself.

In Unicode it is effectively infinite, because Unicode allows any number of diacritics to be associated with a base character (e.g. in the Vietnamese name "Hồ Chí Minh", where the ồ carries both a circumflex (vowel quality) and a grave (tone)). I'm not sure if XCCS allows this or not. (This is one reason why I need the XCCS architecture documentation: see below.)

Here's my first take at a pipeline. Of course, many of these stages are table-driven, but that doesn't mean they can be done with just a table.

1) Decompose certain XCCS characters. For example, 0xF1C1, or Á, should be decomposed into 23C6 (acute diacritic) followed by 0041 (A).

2) Rearrange diacritics. If only one diacritic is present/allowed, this is just a matter of putting the diacritic after its base rather than before. If there are multiple diacritics, then it's also necessary to put the diacritics in the correct order, as determined by their Combining Character Class (there's a Unicode table for that). Lower-numbered classes go before higher-numbered ones, except that class 0 diacritics are fences: other diacritics are not rearranged around them.

3) Translate character by character to Unicode, expanding the relatively few cases where no single Unicode equivalent exists. We now have Unicode in something very close to Normalization Form D (details will depend on things like how Korean hangul characters are encoded in XCCS).

4) Convert to NFC using the Unicode Canonical Composition Algorithm. There is a table of characters that need not be examined further, and some of the work was already done in step 2.

> Separately, we should get (presumably from Unicode) a table that maps all non-canonical Unicode codes and code sequences to their NFC canonical representatives. Then we can extend the Medley Unicode-to-XCCS online translation table (but not the XCCS-to-Unicode) so that all of those variants are also mapped to the XCCS code that corresponds to their representative.

For the same reasons as above, no such table is feasible: to do a proper job requires an algorithm, though it has only three stages: decompose, translate to XCCS, reorder diacritics in front of their base.

There may be other issues too. How does XCCS handle Chinese? If it is encoded in a separate part of the XCCS space, Unicode->XCCS conversion will be ambiguous.

> There is a separate question about equivalence and canonicalization on the XCCS side (whether for example the virtual keyboard entry for ü produces a single XCCS composite character or a ¨u sequence (XCCS puts the combiners in front)). I think that’s out of scope for current goals.

The above algorithm will do the latter, although there is no problem with adding an XCCS recomposition step to handle this.

> P.S. At some point I can scan the XCCS Version 3 that I have. It’s beginning to fall apart anyway. But, to put it up, do we need to get copyright permission from Xerox?

I can't believe that they actually care at this point. But if you are worried about it, you can scan just the prose and the list of tables (I don't need the detailed character tables themselves) and send it to me at cowan@ccil.org, and I'll keep it on the down low. Once I get my hands on that, I can be much more definite.
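Steps 2 and 4 of that pipeline are exactly what a standard normalizer performs once the text is in Unicode order; the Vietnamese example above, worked in Python:

```python
import unicodedata

seq = 'H\u006F\u0302\u0300'     # H, o, combining circumflex, combining grave
nfc = unicodedata.normalize('NFC', seq)
print(nfc, [hex(ord(c)) for c in nfc])
# Hồ ['0x48', '0x1ed3'] -- both diacritics fold into one precomposed character
```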
-
https://www.git-scm.com/docs/gitattributes
-
How much of this discussion belongs in the Unicode documentation?
-
It would be useful to have some mapping between XCCS (which is used by Medley) and Unicode, for things like hardcopy of Interlisp files with the control-character font shifts.
Medley supports PostScript with raster fonts, I'm pretty sure, because outline fonts weren't feasible at the time with the performance of printers.