Add back character sets that had characters outside 16 bit plane #1964

Open · wants to merge 12 commits into master

Conversation

@rmkaplan (Contributor)

Some of the mappings had Unicode code points outside of 16 bits; those character sets had been excluded before.

Now the character sets are included, but those particular lines in the mapping file (e.g. the Gothic characters in the Runic-Gothic character set) have been commented out, so that the other characters can be included.

@rmkaplan (Contributor, Author)

Based on @hjellinek's suggestion, I did some timing tests comparing the original table format with formats that use the same top-level array but use either a hash table or a digital search for the second level. The hash and the digital search were both better than what I had before, so I simplified the code to the hash. (I might eventually go to the digital search, but I would first have to move my MULTI-ALIST macros over to Lispusers.)

I also added a new format, :UTF-8-SLUG, just like LUTF-8 except that its OUTCHARFN produces the Unicode slug for codes whose mappings are not found in the table files.

There are also new functions, XTOUCODE? and UTOXCODE?, that return the corresponding mapping for codes in the table files, and NIL otherwise.

If multiple XCCS codes map to the same Unicode, the normal UNICODE.TRANSLATE (and XTOUCODE) will return the lowest Unicode. But XTOUCODE? will return the list; the caller has to decide what to do. Alternatives in the inverse direction behave in the same way.
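A hedged summary of that contract in call form (the functions are the ones named above; the comments only restate the behavior described here):

```lisp
(XTOUCODE XCODE)    ; normal translation; returns the single (lowest) mapped code
(XTOUCODE? XCODE)   ; returns the mapping from the table files, a list when there
                    ; are several alternatives, NIL when XCODE has no entry
(UTOXCODE? UCODE)   ; same convention in the inverse (Unicode-to-XCCS) direction
```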

Note that callers of UNICODE.TRANSLATE must be recompiled.

Please test this functionality/interface. I also hope that the previously reported performance issues have been fixed.

@rmkaplan (Contributor, Author)

I did some timing using only a single global hash array for all the characters; that is at least as fast, maybe faster, than doing an initial array branch into smaller hash arrays. And simpler still.

@hjellinek (Contributor)

> I did some timing using only a single global hash array for all the characters; that is at least as fast, maybe faster, than doing an initial array branch into smaller hash arrays. And simpler still.

Thanks, @rmkaplan, for the new functionality. The increased speed is a bonus. I'm glad my suggestion worked out so well.

@rmkaplan (Contributor, Author)

I did some more careful speed testing with mapping tables that contained all of the X-to-U pairs, not just the common ones, and with lookups of all of the possible codes, not just charset 0. The single hash was significantly worse with the much larger mappings, by a factor of 6, so I reverted to a top-level branch into hash arrays that contain no more than 128 characters.

The multi-alist is slightly better than the multi-hash for a 512-way branching array, and significantly better (~25%) with a 1024-way branch. But I'll stick with the hash for now.
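For readers following along, here is a minimal sketch of the kind of two-level structure described above, assuming a 7-bit split of a 16-bit XCCS code; the names and helpers are invented for illustration and are not the actual UNICODE internals:

```lisp
;; Minimal sketch of the branched table shape (names invented for this
;; illustration, not the actual UNICODE internals).  A 512-way top-level
;; array is selected by the high bits of the 16-bit XCCS code; each slot
;; holds a small hash array (at most 128 codes) keyed on the full code.
(SETQ XTOU.DEMO.TABLE (ARRAY 512))
(for I from 1 to 512 do (SETA XTOU.DEMO.TABLE I (HASHARRAY 128)))

(DEFINEQ
  (XTOU.DEMO.ADD (LAMBDA (XCODE UCODE)          ; record one X-to-U pair
    (PUTHASH XCODE UCODE (ELT XTOU.DEMO.TABLE (ADD1 (LRSH XCODE 7))))))
  (XTOU.DEMO.LOOKUP (LAMBDA (XCODE)             ; find the Unicode, or NIL
    (GETHASH XCODE (ELT XTOU.DEMO.TABLE (ADD1 (LRSH XCODE 7)))))))
```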

@rmkaplan (Contributor, Author)

I reworked the UNICODE.TRANSLATE macro so that it could be shared by XTOUCODE and XTOUCODE? etc.

The macro should not be called directly by functions outside of UNICODE, to avoid dependencies on internal structures; use the XTOUCODE etc. function interface instead.
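In other words (a hedged usage note, not code from this PR), outside callers should write something like:

```lisp
;; Preferred from code outside the UNICODE sources:
(XTOUCODE XCODE)
;; rather than expanding UNICODE.TRANSLATE directly, which would tie the
;; caller to the internal table structures.
```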

@hjellinek (Contributor) commented Jan 21, 2025

I'm testing it now. For whatever reason, my htmltest.funny-chars function is able to display characters in more character sets than before, e.g., Arabic and Hebrew work now. Which is good!

I did a spot test with Runic. XCCS defines characters in several Runic variants, and, as I just learned with the help of the new APIs, Unicode seems only to define characters in a single Runic script.

I guessed that there's an invariant such that, given an open output stream STREAM with format set to :UTF-8-SLUG, it is the case that:
for all X such that (XTOUCODE? X) returns NIL, (\OUTCHAR STREAM X) should write the Unicode slug, REPLACEMENT CHARACTER U+FFFD (�), to the output stream STREAM.

However, instead of REPLACEMENT CHARACTER U+FFFD (�) I see U+E000, which is the initial codepoint of the Unicode private use area. Does this mean that the :UTF-8-SLUG format is acting like the :UTF-8 format, adding to the unmapped character table instead of outputting slugs? (EDIT: no, if it were acting like the :UTF-8 format I'd see U+E000, U+E001, U+E002, etc.)
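A minimal sketch of one way to probe that invariant, assuming STREAM is already open with the :UTF-8-SLUG format and simply looping over candidate codes numerically (the bounds are arbitrary):

```lisp
;; Emit only the codes the table files don't cover; under the stated
;; invariant every one of these should come out as U+FFFD in the output.
(for X from 0 to 65535
   when (NULL (XTOUCODE? X))
   do (\OUTCHAR STREAM X))
```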

Here's a screenshot from Chrome:

[screenshot]
