Add back character sets that had characters outside 16 bit plane #1964
base: master
Conversation
Update title line
Based on @hjellinek's suggestion, I did some timing tests comparing the original table format with formats that use the same top-level array but use either a hash table or a digital search for the second level. The hash and digital search were both better than what I had before, so I simplified the code to the hash. (I might eventually go to the digital search, but I would first have to move my MULTI-ALIST macros over to Lispusers.)

I also added a new format, :UTF-8-SLUG, just like LUTF-8 except that its OUTCHARFN produces the Unicode slug for codes whose mappings were not found in the table files. And new functions XTOUCODE? and UTOXCODE? that return the corresponding mapping for codes in the table files, NIL otherwise. If multiple XCCS codes map to the same Unicode, the normal UNICODE.TRANSLATE (and XTOUCODE) will return the lowest Unicode, but XTOUCODE? will return the list--the caller has to decide what to do. Alternatives in the inverse direction behave in the same way.

Note that callers of UNICODE.TRANSLATE must be recompiled. Please test this functionality/interface. Also, I hope that the previously reported performance issues have been fixed.
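For anyone trying the new interface, here is a rough sketch of the calling convention as I read it from the description above; the exact argument lists are not shown in this PR, so the single-argument arity is an assumption:

```lisp
;; Hedged sketch only -- argument lists are assumed, not taken from the code.

(XTOUCODE xccs-code)    ; lowest Unicode mapped to xccs-code (same choice UNICODE.TRANSLATE makes)
(XTOUCODE? xccs-code)   ; mapping(s) from the table files, or NIL if none;
                        ;   if several Unicodes map, returns the list -- caller decides
(UTOXCODE unicode)      ; lowest XCCS code mapped to unicode
(UTOXCODE? unicode)     ; XCCS mapping(s) or NIL, symmetrically
```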
I did some timing using only a single global hash array for all the characters; that is at least as fast, maybe faster, than doing an initial array branch into smaller hash arrays. And simpler still.
Thanks, @rmkaplan, for the new functionality. The increased speed is a bonus. I'm glad my suggestion worked out so well. |
I did some more careful speed testing with mapping tables that contained all of the X-to-U pairs, not just the common ones, and with lookups of all of the possible codes, not just charset 0. The single hash was a significant loser with the much larger mappings, by a factor of 6. So I reverted to a top-level branch into hash arrays that contain no more than 128 characters. The multi-alist is slightly better than the multi-hash for a 512-way branching array, but significantly better (~25%) with a 1024-way branch. Still, I'll stick with the hash for now.
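To make the two-level layout concrete, here is a minimal Common Lisp sketch of the idea (Medley also supports Common Lisp); the names, branch factor, and table sizes are illustrative assumptions, not the actual Medley code:

```lisp
;; Sketch of a two-level lookup: a top-level array indexed by the high bits
;; of the XCCS code, whose entries are small hash tables keyed by the full
;; code.  Branch factor and helper names are assumptions for illustration.

(defconstant +branch-bits+ 9)          ; 512-way top-level branch (assumed)

(defvar *x-to-u*
  (make-array (expt 2 +branch-bits+) :initial-element nil))

(defun xccs-bucket (xccs-code)
  "Return (creating if needed) the small hash table for XCCS-CODE's branch."
  (let ((i (ash xccs-code (- +branch-bits+ 16))))  ; high bits select the branch
    (or (aref *x-to-u* i)
        (setf (aref *x-to-u* i) (make-hash-table :size 128)))))

(defun xccs-put (xccs-code unicode)
  (setf (gethash xccs-code (xccs-bucket xccs-code)) unicode))

(defun xccs-get (xccs-code)
  (gethash xccs-code (xccs-bucket xccs-code)))
```

Keeping each second-level table at or under 128 entries is what the comment above argues for: the big single hash degrades once the full X-to-U pairs are loaded, while small per-branch tables stay fast.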
I reworked the UNICODE.TRANSLATE macro so that it could be shared by XTOUCODE and XTOUCODE? etc. This should not be called directly by functions outside of UNICODE, to avoid dependencies on internal structures. Use the XTOUCODE etc. function interface. |
I'm testing it now. For whatever reason, my [...]. I did a spot test with Runic. XCCS defines characters in several Runic variants, and, as I just learned with the help of the new APIs, Unicode seems only to define characters in a single Runic script. I guessed that there's an invariant such that, given an open output stream [...]. However, instead of [...]. Here's a screenshot from Chrome: [screenshot]
Some of the mappings had Unicodes outside of 16 bits; those character sets had been excluded before.
Now the character sets are included, but those particular lines in the mapping file (e.g. the Gothic characters in the Runic-Gothic character set) have been commented out, so that the other characters can be included.
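Purely as an illustration of the commenting-out, a mapping-file fragment might look roughly like the following; the XCCS code values, the comment character, and the column layout here are guesses, not copied from the actual table files (Gothic does sit at U+10330 and up in Unicode, outside the 16-bit plane):

```
% 0xNNNN  0x10330  GOTHIC LETTER AHSA      (commented out: Unicode > 16 bits)
% 0xNNNN  0x10331  GOTHIC LETTER BAIRKAN   (commented out: Unicode > 16 bits)
```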