[66_13] Reasonable herk->utf8 and utf8->herk #2150

da-liii · 2024-10-11T15:54:51Z

Why

Try to solve the Cork encoding defects by introducing the Herk encoding with minimal changes.

The Herk encoding is adopted in TMU serialization and deserialization. It is much better than utf8->cork and cork->utf8, because with those conversions two different Unicode code points may map to the same Cork code.

This is a breaking change for the TMU format, which is why we need to bump the version, but it is not a big change.

What

- UTF-8 from 00 to 1F should be encoded as <#0> to <#1F> in the Herk encoding
- UTF-8 from A0 to FF should be encoded as <#A0> to <#FF> in the Herk encoding if no Cork encoding is found
  - This will fix copy and paste of © (https://symbl.cc/en/00A9-copyright-emoji/) once the Herk encoding is used for copy and paste
- Herk DF should be mapped to U+1E9E
- Herk 17 should be mapped to U+200B
- Herk 18 should be mapped to U+2080
- Herk 1A should be mapped to U+0237
- Herk 7F should be mapped to U+00AD
- Bump to TMU 1.0.5
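
The encoding direction of the list above can be sketched in Python. This is only an illustration under stated assumptions, not the code in this pull request: `UNICODE_TO_HERK` and `encode_codepoint` are hypothetical names, and the table holds just the five mappings listed here rather than all 256 Cork slots.

```python
# Hypothetical partial table: Unicode code point -> Herk byte.
# Only the five mappings named in this pull request are included.
UNICODE_TO_HERK = {
    0x1E9E: 0xDF,
    0x200B: 0x17,
    0x2080: 0x18,
    0x0237: 0x1A,
    0x00AD: 0x7F,
}


def encode_codepoint(cp: int, table=UNICODE_TO_HERK) -> bytes:
    """Encode one code point according to the rules listed above."""
    byte = table.get(cp)
    if byte is not None:
        # A Cork slot exists for this code point: emit the single byte.
        return bytes([byte])
    if cp <= 0x1F or 0xA0 <= cp <= 0xFF:
        # No Cork slot: fall back to an uppercase hex escape without padding.
        return b"<#%X>" % cp
    # Everything else needs the full table, which this sketch does not have.
    raise NotImplementedError("code point outside the ranges covered here")
```

With this partial table, U+00AD becomes the single byte 7F because a Cork slot exists, while U+00A0 and U+00A9 fall back to `<#A0>` and `<#A9>`.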

How to test

Unit tests on branch-1.2

Before

(utf8->herk (string #\null)) =>  ; *** failed ***
; expected result: <#0>

(herk->utf8 (string #\x18)) => ▒ ; *** failed ***
; expected result: ₀

(herk->utf8 (string #\x1a)) => ▒ ; *** failed ***
; expected result: ȷ

(utf8->herk (string #\x10)) =>  ; *** failed ***
; expected result: <#10>

(utf8->herk (utf8->string #u(194 160))) =>   ; *** failed ***
; expected result: <#A0>

(herk->utf8 (string #\xdf)) => � ; *** failed ***
; expected result: ẞ

(utf8->herk (string #\xff)) => � ; *** failed ***
; expected result: <#FF>

Now TeXmacs/tests/66_13.scm should work fine!

Test doc

Several test cases are listed in TeXmacs/tests/tmu/unicode_256.tmu

The bug lies in the TMU reader.

da-liii · 2024-10-11T17:13:36Z

> UTF8 from A0 to FF should be encoded as <#A0> to <#FF> in Herk encoding if there is not cork encoding found

Rendering of U+00A9 will be fixed in later pull requests. This pull request aims to pin down the definition of the Herk encoding.

ProfFan

First, thank you for the efforts to improve TeXmacs. As I said before, I have a fair amount of experience in encoding & i18n but not much in TeXmacs itself, so please take what I say below with a grain of salt.

I do have a few questions here:

1. From my understanding, TeXmacs Cork is a variation of the TeX Cork that is used in early TeX T1 fonts. However, since now the internals of Mogan is (is it?) all in Unicode (UTF-8/16?) there is no need to convert anymore.
2. If (1) is true, then is the new encoding just UTF-8 (maybe with escapes for <> tags)?
3. If (1) is false, then what is preventing the internals to be Unicode? Upon reviewing the TM source code, I suppose it is because TeXmacs still uses the cork-based hyphenation rules, and this requires Cork encoding. Is this true?
4. If (3) is true, then is it possible to take the same approach as https://hyphenation.org/pdf/tb93miklavec.pdf and get rid of cork-specific code in the hyphenation engine, and ideally just use the same LaTeX hyphenation rules in https://hyphenation.org and https://ctan.org/pkg/hyph-utf8?

da-liii · 2024-10-11T23:58:16Z

> From my understanding, TeXmacs Cork is a variation of the TeX Cork that is used in early TeX T1 fonts. However, since now the internals of Mogan is (is it?) all in Unicode (UTF-8/16?) there is no need to convert anymore.

The internals of Mogan still use the TeXmacs Cork encoding. The TMU format tries to get rid of the TeXmacs Cork encoding at the file-format level. I believe it is the first step toward removing the TeXmacs Cork encoding from the codebase.

da-liii · 2024-10-12T00:08:47Z

> If (1) is false, then what is preventing the internals to be Unicode? Upon reviewing the TM source code, I suppose it is because TeXmacs still uses the cork-based hyphenation rules, and this requires Cork encoding. Is this true?

The whole codebase is based on the Cork encoding, for example:

- Scheme: kbd-map is based on the Cork encoding
- C++: the font-related code uses the Cork encoding in its interfaces

> what is preventing the internals to be Unicode?

Nothing prevents the internals from being Unicode. But if you want to use Unicode, you first have to support Unicode in S7 Scheme, which supports neither Unicode strings nor Unicode chars. GNU TeXmacs uses GNU Guile 1.8.x, which does not support Unicode strings either. GNU Guile 3 does support Unicode strings, but making it work on Windows would be a nightmare.

I started the Goldfish Scheme project. Even assuming that Unicode support (string and char) is completed in Goldfish Scheme, there is still a lot to do to make the codebase Unicode-based rather than TeXmacs Cork-based.

First of all, we have to introduce a UTF-8 format: TMU. The TM format uses the ISO-8859 series, and the charset depends on the document's natural language:

System/Language/locale.cpp:  if (s == "bulgarian") return "ISO-8859-5";
System/Language/locale.cpp:  if (s == "croatian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "czech") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "greek") return "ISO-8859-7";
System/Language/locale.cpp:  if (s == "hungarian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "polish") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "romanian") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "russian") return "ISO-8859-5";
System/Language/locale.cpp:  if (s == "slovak") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "slovene") return "ISO-8859-2";
System/Language/locale.cpp:  if (s == "ukrainian") return "ISO-8859-5";
System/Language/locale.cpp:  return "ISO-8859-1";
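
To illustrate why a per-language charset is fragile, here is a minimal Python sketch of the same lookup. The names `CHARSET_BY_LANGUAGE`, `language_to_charset` and `decode_tm_bytes` are hypothetical and only mirror the grep output above; they are not functions from the TeXmacs sources.

```python
# Language -> charset pairs copied from the grep output above.
CHARSET_BY_LANGUAGE = {
    "bulgarian": "ISO-8859-5",
    "croatian": "ISO-8859-2",
    "czech": "ISO-8859-2",
    "greek": "ISO-8859-7",
    "hungarian": "ISO-8859-2",
    "polish": "ISO-8859-2",
    "romanian": "ISO-8859-2",
    "russian": "ISO-8859-5",
    "slovak": "ISO-8859-2",
    "slovene": "ISO-8859-2",
    "ukrainian": "ISO-8859-5",
}


def language_to_charset(language: str) -> str:
    """Return the charset for a language, defaulting to ISO-8859-1."""
    return CHARSET_BY_LANGUAGE.get(language, "ISO-8859-1")


def decode_tm_bytes(raw: bytes, language: str) -> str:
    """Decode raw bytes with the charset implied by the language."""
    return raw.decode(language_to_charset(language))
```

The same byte B0 decodes to U+0410 under "russian" but to U+00B0 under the ISO-8859-1 default, so the bytes alone do not identify the text. A UTF-8 format such as TMU avoids that dependency.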

da-liii · 2024-10-12T00:10:23Z

> If (3) is true, then is it possible to take the same approach as https://hyphenation.org/pdf/tb93miklavec.pdf and get rid of cork-specific code in the hyphenation engine, and ideally just use the same LaTeX hyphenation rules in https://hyphenation.org/ and https://ctan.org/pkg/hyph-utf8?

Thanks for pointing out the relation of Cork encoding and the hyphenation engine.

da-liii · 2024-10-12T00:44:26Z

Cork and TeXmacs Cork and Herk

see https://en.wikipedia.org/wiki/Cork_encoding

Cork -> Unicode: Cork+0000 to Cork+00FF

The Cork encoding only defines Cork+0000 to Cork+00FF.
In TeXmacs Cork, the conversion of Cork+0000 to Cork+00FF to UTF-8 is mainly defined in TeXmacs/langs/encoding/corktounicode.scm. For some reason, Cork+0018, Cork+001A and Cork+00DF are not converted to the expected characters.
In the Herk encoding, I've corrected that and added unit tests to make sure that every Cork code (256 characters) is converted to the expected UTF-8. See TeXmacs/langs/encoding/herktounicode.scm.

You can run diff -u TeXmacs/langs/encoding/corktounicode.scm TeXmacs/langs/encoding/herktounicode.scm to view the difference.

Cork to Unicode: > Cork+00FF

Characters beyond the Cork range are encoded in a hex format. For example:

Cork <#0100> -> U+0100
Cork <#100> -> U+0100

Leading zeros may be stripped. This is the same in the TeXmacs Cork encoding and in the Herk encoding.
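
Because two spellings can denote the same code point, it can help to compare escapes by value or to normalize them. The Python sketch below assumes the `<#HEX>` syntax shown in the examples. The helper names `escape_to_codepoint` and `normalize_escapes` are hypothetical, and the uppercase output without padding is a choice that matches the `<#A0>` and `<#10>` forms expected by the tests in this pull request.

```python
import re

# Matches one hex escape such as <#0100> or <#100>.
HEX_ESCAPE = re.compile(r"<#([0-9A-Fa-f]+)>")


def escape_to_codepoint(escape: str) -> int:
    """Return the code point denoted by a single <#HEX> escape."""
    match = HEX_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError("not a hex escape: %r" % escape)
    return int(match.group(1), 16)


def normalize_escapes(text: str) -> str:
    """Rewrite every hex escape in uppercase without leading zeros."""
    return HEX_ESCAPE.sub(lambda m: "<#%X>" % int(m.group(1), 16), text)
```

Both `<#0100>` and `<#100>` yield the code point 0x100, and normalizing either one gives `<#100>`.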

Unicode to Cork

The Unicode to Cork conversion is much more complicated.

There are several reasons why I had to introduce the Herk encoding:

- In the TeXmacs Cork encoding, U+2019 and U+0027 are both converted to Cork+0027, so there is no way to distinguish U+2019 from U+0027 afterwards
- In the TeXmacs Cork encoding, there are ad hoc rules of the form U+03C0 -> <mathpi>, which only make sense in math mode. For general Unicode to Cork and Cork to Unicode conversion, these rules make the conversion too complicated.
- In the TeXmacs Cork encoding, both the Cork to Unicode and the Unicode to Cork conversions use TeXmacs/langs/encoding/corktounicode.scm and, as I pointed out above, three Cork codes are left out: Cork+0018, Cork+001A, Cork+00DF
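
The first problem is a many-to-one mapping, which can be detected mechanically. The following Python sketch uses a hypothetical helper, `colliding_targets`, and a three-entry sample built from the U+2019 / U+0027 case described above. It is not the actual TeXmacs table.

```python
def colliding_targets(mapping: dict) -> dict:
    """Return the target codes that more than one source code point maps to."""
    sources_by_target = {}
    for source, target in mapping.items():
        sources_by_target.setdefault(target, []).append(source)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


# Sample entries only: both quote characters share Cork+0027.
SAMPLE_UNICODE_TO_CORK = {0x2019: 0x27, 0x0027: 0x27, 0x0041: 0x41}
```

Any target reported by `colliding_targets` cannot be decoded back to a unique source, so a round trip loses information for at least one of the colliding code points.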

For the Herk encoding, the rules are much simpler:

- [Cork+0000, Cork+00FF] are converted to Unicode according to the Cork encoding spec
- Codes beyond that range are written as <#HEX_DIGITS> and simply decoded when converted to Unicode
- A Unicode character is converted to [Cork+0000, Cork+00FF] if it exists in the Cork encoding spec; a Unicode character that does not exist in the Cork encoding spec is encoded in the <#HEX_DIGITS> format
- Tags such as <langle> and <mathpi> do exist in the TMU format and are kept unchanged when converted to the Herk encoding. The font engine has to take care of those special tags.
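
The decoding side of these rules can be sketched in Python as follows. This is an illustration under assumptions, not the implementation in this pull request: `HERK_TO_UNICODE` holds only the five byte mappings named in the description, printable ASCII is assumed to decode to itself (a simplification), and `herk_to_unicode` is a hypothetical name.

```python
import re

# Hypothetical partial table: Herk byte -> Unicode code point.
HERK_TO_UNICODE = {
    0x17: 0x200B,
    0x18: 0x2080,
    0x1A: 0x0237,
    0x7F: 0x00AD,
    0xDF: 0x1E9E,
}

# Either a hex escape (<#100>) or a named tag (<mathpi>).
TAG = re.compile(rb"<(#[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9-]*)>")


def herk_to_unicode(data: bytes) -> str:
    """Decode Herk bytes into a Unicode string, keeping named tags as-is."""
    out = []
    i = 0
    while i < len(data):
        match = TAG.match(data, i)
        if match:
            body = match.group(1)
            if body.startswith(b"#"):
                # Hex escape: decode the digits to a code point.
                out.append(chr(int(body[1:], 16)))
            else:
                # Named tag: copied through unchanged.
                out.append(match.group(0).decode("ascii"))
            i = match.end()
            continue
        byte = data[i]
        if byte in HERK_TO_UNICODE:
            out.append(chr(HERK_TO_UNICODE[byte]))
        elif 0x20 <= byte <= 0x7E:
            # Assumption: printable ASCII maps to itself.
            out.append(chr(byte))
        else:
            raise ValueError("byte 0x%02X is not in the partial table" % byte)
        i += 1
    return "".join(out)
```

For example, the bytes `a`, 18, `<mathpi>`, `<#A9>`, `<#0100>` decode to "a", U+2080, the unchanged tag `<mathpi>`, U+00A9 and U+0100. A complete decoder would replace the partial table with all 256 Cork slots.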

Herk is named after the first two letters of Hex and the last two letters of Cork. Like Cork, it is also the name of a city in Europe.

da-liii · 2024-10-12T02:26:08Z

I'm moving on to improve font rendering on top of the now well-defined Herk encoding, which is why I merged this pull request in a hurry.

da-liii added 11 commits October 11, 2024 22:35

herk 0x

456b7cd

wip

03432dd

iwp

b42da07

iwp

b197699

wip

1b1d7c5

wip

b995858

wip

d01c1b4

wip

b05a871

wip

2a52cc1

wip

9d7be18

wip

0cab3a9

da-liii changed the title ~~[66_13] Reversible herk->utf8 and utf8->herk~~ [66_13] Reasonable herk->utf8 and utf8->herk Oct 11, 2024

wip

eea3188

da-liii requested review from jingkaimori, Oyyko, Minzihao and JackYansongLi October 11, 2024 16:33

wip

07fc54b

wip

adb4f98

ProfFan reviewed Oct 11, 2024

da-liii added 5 commits October 12, 2024 08:54

wip

f99fe74

wip

1f07d78

wip

f25671d

wip

02cea50

wip

354dbd9

wip

8b50e14

da-liii merged commit 3b9c0a5 into branch-1.2 Oct 12, 2024
7 checks passed

da-liii deleted the da/66_13/better_herk branch October 12, 2024 02:25
