[66_13] Reasonable herk->utf8 and utf8->herk #2150

Merged: 20 commits, Oct 12, 2024

Contributor

@da-liii da-liii commented Oct 11, 2024

Why

Try to solve the Cork encoding defects by introducing the Herk encoding with minimal changes.

Herk encoding is adopted in TMU serialization and deserialization. It is much better than utf8->cork and cork->utf8, because in those conversions two Unicode code points may map to the same Cork code.

Adopting it is a breaking change for the TMU format, which is why we need to bump the version. But it is not a big change.

What

  1. Unicode code points U+0000 to U+001F should be encoded as <#0> to <#1F> in Herk encoding
  2. Unicode code points U+00A0 to U+00FF should be encoded as <#A0> to <#FF> in Herk encoding if no Cork encoding is found
  3. Herk DF should be mapped to U+1E9E
  4. Herk 17 should be mapped to U+200B
  5. Herk 18 should be mapped to U+2080
  6. Herk 1A should be mapped to U+0237
  7. Herk 7F should be mapped to U+00AD
  8. Bump to TMU 1.0.5
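
As a rough illustration of items 1 and 2 only (this is not the code in this pull request), the escaping rule could be sketched in Python as follows. The names hex_tag, escape_ranges and has_cork_code are made up for this sketch; has_cork_code stands in for the real table lookup and, by default, reports that no Cork code exists.

```python
def hex_tag(cp):
    """Format a code point as <#HEX>: uppercase hex digits, no leading zeros."""
    return "<#{:X}>".format(cp)


def escape_ranges(text, has_cork_code=lambda cp: False):
    """Escape U+0000..U+001F always, and U+00A0..U+00FF when no Cork code exists.

    has_cork_code is a hypothetical predicate (an assumption of this sketch);
    the real lookup lives in the encoding tables of the project.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:
            out.append(hex_tag(cp))          # item 1
        elif 0xA0 <= cp <= 0xFF and not has_cork_code(cp):
            out.append(hex_tag(cp))          # item 2
        else:
            out.append(ch)                   # everything else is left alone here
    return "".join(out)


# U+00A0 arrives as the two UTF-8 bytes 194 160, as in the unit test.
nbsp = bytes([194, 160]).decode("utf-8")
print(escape_ranges("\x00"), escape_ranges("\x10"), escape_ranges(nbsp))
```

The sketch deliberately ignores items 3 to 7, which concern the opposite direction (Herk to Unicode).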

How to test

Unit tests on branch-1.2

Before

(utf8->herk (string #\null)) =>  ; *** failed ***
; expected result: <#0>

(herk->utf8 (string #\x18)) => ▒ ; *** failed ***
; expected result: ₀

(herk->utf8 (string #\x1a)) => ▒ ; *** failed ***
; expected result: ȷ

(utf8->herk (string #\x10)) =>  ; *** failed ***
; expected result: <#10>

(utf8->herk (utf8->string #u(194 160))) =>   ; *** failed ***
; expected result: <#A0>

(herk->utf8 (string #\xdf)) => � ; *** failed ***
; expected result: ẞ

(utf8->herk (string #\xff)) => � ; *** failed ***
; expected result: <#FF>

Now TeXmacs/tests/66_13.scm should work fine!

Test doc

Several test cases are listed in TeXmacs/tests/tmu/unicode_256.tmu

The bug lies in the TMU reader.

@da-liii da-liii changed the title [66_13] Reversible herk->utf8 and utf8->herk [66_13] Reasonable herk->utf8 and utf8->herk Oct 11, 2024
@da-liii
Contributor Author

da-liii commented Oct 11, 2024

Unicode code points U+00A0 to U+00FF should be encoded as <#A0> to <#FF> in Herk encoding if no Cork encoding is found

Rendering of U+00A9 will be fixed in later pull requests. This pull request aims to pin down the definition of the Herk encoding.

@ProfFan ProfFan left a comment

First, thank you for the effort to improve TeXmacs. As I said before, I have a fair amount of experience in encoding and i18n but not much in TeXmacs itself, so please take what I say below with a grain of salt.

I do have a few questions here:

  1. From my understanding, TeXmacs Cork is a variation of the TeX Cork encoding used in early TeX T1 fonts. However, since the internals of Mogan are now (are they?) all in Unicode (UTF-8/16?), there is no need to convert anymore.
  2. If (1) is true, then is the new encoding just UTF-8 (maybe with escapes for <> tags)?
  3. If (1) is false, then what is preventing the internals from being Unicode? Upon reviewing the TM source code, I suppose it is because TeXmacs still uses the Cork-based hyphenation rules, and this requires Cork encoding. Is this true?
  4. If (3) is true, then is it possible to take the same approach as https://hyphenation.org/pdf/tb93miklavec.pdf and get rid of cork-specific code in the hyphenation engine, and ideally just use the same LaTeX hyphenation rules in https://hyphenation.org and https://ctan.org/pkg/hyph-utf8?

@da-liii
Contributor Author

da-liii commented Oct 11, 2024

From my understanding, TeXmacs Cork is a variation of the TeX Cork encoding used in early TeX T1 fonts. However, since the internals of Mogan are now (are they?) all in Unicode (UTF-8/16?), there is no need to convert anymore.

The internals of Mogan are still in TeXmacs Cork encoding. The TMU format tries to get rid of the TeXmacs Cork encoding at the file-format level. I believe it is the first step toward removing the TeXmacs Cork encoding from the codebase.

@da-liii
Contributor Author

da-liii commented Oct 12, 2024

If (1) is false, then what is preventing the internals from being Unicode? Upon reviewing the TM source code, I suppose it is because TeXmacs still uses the Cork-based hyphenation rules, and this requires Cork encoding. Is this true?

The full codebase is based on Cork encoding, for example:

  1. Scheme: kbd-map is based on Cork encoding
  2. C++: the font-related code uses Cork encoding in its interfaces

what is preventing the internals from being Unicode?

Nothing is preventing the internals from being Unicode. But if you want to use Unicode, you first have to support Unicode in S7 Scheme: S7 Scheme does not support Unicode strings or Unicode chars. GNU TeXmacs uses GNU Guile 1.8.x, which does not support Unicode strings either. GNU Guile 3 does support Unicode strings, but if we adopted GNU Guile 3, making it work on Windows would be a nightmare.

I started the Goldfish Scheme project. Even assuming that Unicode support (string and char) in Goldfish Scheme were complete, there would still be a lot to do to make the codebase Unicode-based rather than TeXmacs Cork-based.

And first of all, we have to introduce a UTF-8 format: TMU. The TM format uses the ISO-8859 series, and the charset depends on the natural language:

System/Language/locale.cpp:  if (s == "bulgarian") return "ISO-8859-5";
System/Language/locale.cpp:  if (s == "croatian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "czech") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "greek") return "ISO-8859-7";
System/Language/locale.cpp:  if (s == "hungarian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "polish") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "romanian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "russian") return "ISO-8859-5";
System/Language/locale.cpp:  if (s == "slovak") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "slovene") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "ukrainian") return "ISO-8859-5";
System/Language/locale.cpp:  return "ISO-8859-1";
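
To make the consequence concrete, here is a small Python sketch of the same language-to-charset table (an illustration only, not a port of locale.cpp; the names LANGUAGE_CHARSETS, charset_for and legacy_bytes_to_utf8 are invented for it). It shows that one and the same byte means different characters depending on the document language, which is the ambiguity a UTF-8 based format avoids.

```python
# Table copied from the grep output above; anything not listed falls back to ISO-8859-1.
LANGUAGE_CHARSETS = {
    "bulgarian": "ISO-8859-5",
    "croatian": "ISO-8859-2",
    "czech": "ISO-8859-2",
    "greek": "ISO-8859-7",
    "hungarian": "ISO-8859-2",
    "polish": "ISO-8859-2",
    "romanian": "ISO-8859-2",
    "russian": "ISO-8859-5",
    "slovak": "ISO-8859-2",
    "slovene": "ISO-8859-2",
    "ukrainian": "ISO-8859-5",
}


def charset_for(language):
    """Return the ISO-8859 charset name used for a given natural language."""
    return LANGUAGE_CHARSETS.get(language, "ISO-8859-1")


def legacy_bytes_to_utf8(data, language):
    """Re-encode bytes from the language-dependent charset to UTF-8."""
    return data.decode(charset_for(language)).encode("utf-8")


# The single byte 0xE0 decodes differently per language.
for lang in ("french", "russian"):
    print(lang, charset_for(lang), bytes([0xE0]).decode(charset_for(lang)))
```
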

@da-liii
Contributor Author

da-liii commented Oct 12, 2024

If (3) is true, then is it possible to take the same approach as https://hyphenation.org/pdf/tb93miklavec.pdf and get rid of cork-specific code in the hyphenation engine, and ideally just use the same LaTeX hyphenation rules in https://hyphenation.org/ and https://ctan.org/pkg/hyph-utf8?

Thanks for pointing out the relation of Cork encoding and the hyphenation engine.

@da-liii
Contributor Author

da-liii commented Oct 12, 2024

Cork and TeXmacs Cork and Herk

see https://en.wikipedia.org/wiki/Cork_encoding

Cork -> Unicode: Cork+0000 to Cork+00FF

  1. Cork encoding only defines Cork+0000 to Cork+00FF.
  2. When converting Cork+0000 to Cork+00FF to UTF-8 in TeXmacs Cork, the conversion is mainly defined in TeXmacs/langs/encoding/corktounicode.scm. For some reason, Cork+0018, Cork+001A, and Cork+00DF are not converted to the expected characters.
  3. In Herk encoding, I've corrected this and added unit tests to make sure that every Cork code (256 characters) is converted to the expected UTF-8. See TeXmacs/langs/encoding/herktounicode.scm.

You can run diff -u TeXmacs/langs/encoding/corktounicode.scm TeXmacs/langs/encoding/herktounicode.scm to view the difference.

Cork to Unicode: > Cork+00FF

Characters beyond the Cork range are encoded in a hex format. For example:

  • Cork <#0100> -> U+0100
  • Cork <#100> -> U+0100

Leading zeros may be stripped. This is the same in TeXmacs Cork encoding and in Herk encoding.
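
A minimal Python sketch of that hex form, assuming tags look exactly like <#HEX_DIGITS> (the function names decode_hex_tags and hex_tag are hypothetical, not identifiers from the codebase). It shows that both spellings decode to the same code point, and that re-encoding gives the form without leading zeros.

```python
import re

# One or more hex digits between "<#" and ">".
HEX_TAG = re.compile(r"<#([0-9A-Fa-f]+)>")


def decode_hex_tags(text):
    """Replace every <#HEX> tag with the character it names; other text is untouched."""
    return HEX_TAG.sub(lambda m: chr(int(m.group(1), 16)), text)


def hex_tag(cp):
    """Encode a code point as <#HEX> with uppercase digits and no leading zeros."""
    return "<#{:X}>".format(cp)


same = decode_hex_tags("<#0100>") == decode_hex_tags("<#100>")
print(same, hex_tag(ord(decode_hex_tags("<#0100>"))))
```

Note that chr raises ValueError for values above 0x10FFFF, so a real reader would need to decide how to treat such tags.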

Unicode to Cork

Unicode to Cork conversion is much more complicated.

There are several reasons why I had to introduce Herk encoding:

  1. In TeXmacs Cork encoding, U+2019 and U+0027 are both converted to Cork+0027, which means there is no way to distinguish U+2019 from U+0027 afterwards.
  2. In TeXmacs Cork encoding, there are ad hoc rules of the form U+03C0 -> <mathpi>, which only make sense in math mode. For general Unicode to Cork and Cork to Unicode conversion, these rules make the conversion too complicated.
  3. In TeXmacs Cork encoding, both Cork to Unicode and Unicode to Cork use TeXmacs/langs/encoding/corktounicode.scm and, as I pointed out above, three Cork codes are left out: Cork+0018, Cork+001A, Cork+00DF.
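
The first reason is a reversibility problem, and it can be checked mechanically. The following Python sketch (the helper find_collisions and the three-entry SAMPLE table are made up for illustration; SAMPLE is not the real TeXmacs table) lists every target code that more than one Unicode code point maps to. A mapping can only be inverted without loss if this result is empty.

```python
from collections import defaultdict


def find_collisions(unicode_to_code):
    """Return {code: [code points]} for every code shared by two or more code points."""
    by_code = defaultdict(list)
    for cp, code in unicode_to_code.items():
        by_code[code].append(cp)
    return {code: sorted(cps) for code, cps in by_code.items() if len(cps) > 1}


# Toy excerpt: the first two entries reflect reason 1, the third is an unrelated letter.
SAMPLE = {0x2019: 0x27, 0x0027: 0x27, 0x0041: 0x41}

for code, cps in find_collisions(SAMPLE).items():
    print(hex(code), [hex(cp) for cp in cps])
```
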

For Herk encoding, the rules are much simpler:

  1. [Cork+0000, Cork+00FF] are converted to Unicode according to the Cork encoding spec.
  2. Codes beyond that range are written as <#HEX_DIGITS>; converting them to Unicode just decodes the hex digits.
  3. A Unicode character is converted to a code in [Cork+0000, Cork+00FF] if it exists in the Cork encoding spec; otherwise it is encoded in <#HEX_DIGITS> format.
  4. Tags like <langle> and <mathpi> do exist in the TMU format and stay the same when converted to Herk encoding. The font engine has to take care of those special tags.
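
Put together, the four rules could be sketched in Python as below. This is a simplified model, not the implementation in this pull request: the table holds only the five codes named in the pull request description (the real one in herktounicode.scm covers all 256), printable ASCII is assumed to map to itself, tags are assumed to consist of ASCII letters, digits and hyphens, and all function names are invented for the sketch.

```python
import re

# Assumption: five-entry excerpt of the 256-entry table, Herk code -> code point.
HERK_TO_UNICODE = {0x17: 0x200B, 0x18: 0x2080, 0x1A: 0x0237, 0x7F: 0x00AD, 0xDF: 0x1E9E}
UNICODE_TO_HERK = {cp: code for code, cp in HERK_TO_UNICODE.items()}

# A token is a hex tag, a named tag such as <langle>, or one single character/byte.
TEXT_TOKEN = re.compile(r"<#[0-9A-Fa-f]+>|<[A-Za-z0-9-]+>|.", re.DOTALL)
HERK_TOKEN = re.compile(rb"<#([0-9A-Fa-f]+)>|<[A-Za-z0-9-]+>|.", re.DOTALL)


def unicode_to_herk(text):
    """Rules 3 and 4: Unicode text (possibly with tags) -> Herk bytes."""
    out = bytearray()
    for m in TEXT_TOKEN.finditer(text):
        tok = m.group(0)
        if len(tok) > 1:                       # rule 4: tags pass through unchanged
            out += tok.encode("ascii")
        elif ord(tok) in UNICODE_TO_HERK:      # rule 3: character has a one-byte code
            out.append(UNICODE_TO_HERK[ord(tok)])
        elif 0x20 <= ord(tok) <= 0x7E:         # assumption: printable ASCII is identity
            out.append(ord(tok))
        else:                                  # rule 3: everything else becomes <#HEX>
            out += "<#{:X}>".format(ord(tok)).encode("ascii")
    return bytes(out)


def herk_to_unicode(data):
    """Rules 1, 2 and 4: Herk bytes -> Unicode text."""
    out = []
    for m in HERK_TOKEN.finditer(data):
        tok = m.group(0)
        if m.group(1) is not None:             # rule 2: decode the hex digits
            out.append(chr(int(m.group(1).decode("ascii"), 16)))
        elif len(tok) > 1:                     # rule 4: named tag stays as it is
            out.append(tok.decode("ascii"))
        elif tok[0] in HERK_TO_UNICODE:        # rule 1: table lookup
            out.append(chr(HERK_TO_UNICODE[tok[0]]))
        elif 0x20 <= tok[0] <= 0x7E:           # assumption: printable ASCII is identity
            out.append(chr(tok[0]))
        else:
            raise ValueError("code 0x{:02X} is outside this excerpt".format(tok[0]))
    return "".join(out)


sample = "a\u1e9e<langle>\u03c0\x00"
print(unicode_to_herk(sample), herk_to_unicode(unicode_to_herk(sample)) == sample)
```

Because every character either has exactly one code in the table or gets its own hex tag, the round trip is lossless for the inputs this sketch accepts, which is the property the old utf8->cork conversion lacked.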

Herk is named after the first two letters of Hex and the last two letters of Cork. Just like Cork, it is also the name of a city in Europe.

@da-liii da-liii merged commit 3b9c0a5 into branch-1.2 Oct 12, 2024
7 checks passed
@da-liii da-liii deleted the da/66_13/better_herk branch October 12, 2024 02:25
@da-liii
Contributor Author

da-liii commented Oct 12, 2024

I'm moving on to improve font rendering on top of the now well-defined Herk encoding. That's why I merged this pull request in a hurry.
