Fix the out-of-alphabet token handling in analyses generation
Solves #45
Consider alphanumeric characters to be part of the vocabulary.
AMR-KELEG authored and TinoDidriksen committed May 11, 2019
1 parent 38d22d4 commit 944ed25
Showing 1 changed file with 1 addition and 1 deletion: lttoolbox/fst_processor.cc
@@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const
 bool
 FSTProcessor::isAlphabetic(wchar_t const c) const
 {
-  return alphabetic_chars.find(c) != alphabetic_chars.end();
+  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != alphabetic_chars.end();
 }

 void
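
To make the one-line change above concrete, here is a minimal standalone sketch (not lttoolbox code; the std::set standing in for the characters collected from <alphabet>, and the en_US.UTF-8 locale, are assumptions for illustration):
---
// Sketch of the old vs. new isAlphabetic() logic. `alphabetic_chars`
// stands in for the characters collected from the dictionary's
// <alphabet> element; here it is left empty on purpose.
#include <cwctype>
#include <clocale>
#include <cstdio>
#include <set>

static std::set<wchar_t> alphabetic_chars; // assume an empty <alphabet>

static bool isAlphabeticOld(wchar_t c) {
    return alphabetic_chars.find(c) != alphabetic_chars.end();
}

static bool isAlphabeticNew(wchar_t c) {
    // New behaviour: anything the locale machinery classifies as
    // alphanumeric counts as alphabetic, regardless of <alphabet>.
    return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != alphabetic_chars.end();
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8"); // classification is locale-dependent
    wchar_t const hanzi = L'我'; // CJK character, not in alphabetic_chars
    std::printf("old: %d new: %d\n",
                (int)isAlphabeticOld(hanzi), (int)isAlphabeticNew(hanzi));
    // With an empty <alphabet> this typically prints "old: 0 new: 1":
    // the old check treats 我 as a potential blank, the new one as
    // alphabetic, which changes tokenisation (see the comments below).
}
---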

4 comments on commit 944ed25

@Fred-Git-Hub

Under a Unicode locale, std::iswalnum returns TRUE for some intuitively non-alphanumeric Unicode characters. Please see the example at the following URL:
https://en.cppreference.com/w/cpp/string/wide/iswalnum

With this code, lt-proc analysis will regard some Unicode words as unknown even if they are defined in the monodix.
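
For reference, a minimal sketch along the lines of the cppreference example, showing that the classification depends on the active locale (assuming an en_US.UTF-8 locale is installed):
---
#include <cwctype>
#include <clocale>
#include <cstdio>

int main() {
    wchar_t const ha = L'\u13AD'; // Cherokee letter HA, 'Ꭽ'

    // In the default "C" locale, iswalnum() only knows ASCII.
    std::printf("C locale:     iswalnum = %d\n", (int)(bool)std::iswalnum(ha));

    // Under a Unicode locale it also classifies many non-ASCII
    // letters as alphanumeric.
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::printf("UTF-8 locale: iswalnum = %d\n", (int)(bool)std::iswalnum(ha));
    // Typically prints 0 then 1.
}
---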

@unhammer
Member

Which example at https://en.cppreference.com/w/cpp/string/wide/iswalnum? The Cherokee letter HA ('Ꭽ')?

Can you give a concrete example of something marked as unknown even though it is defined in the monodix, along with the relevant extracts from the monodix? I don't understand how this code could contribute to that. (You might want to open an issue.)

@Fred-Git-Hub

Thank you for your reply, unhammer.

test.dix (Chinese; you may need to install a Simplified Chinese font)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="noun"/>
      <sdef n="verb"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>我</l><r>我<s n="noun"/></r></p></e>
      <e><p><l>爱</l><r>爱<s n="verb"/></r></p></e>
      <e><p><l>你</l><r>你<s n="noun"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "我爱你" | lt-proc test.bin
^我/我<noun>$^爱/爱<verb>$^你/你<noun>$

[3.5.1]
$ echo "我爱你" | lt-proc test.bin
^我爱你/*我爱你$

This error happens when both of the following conditions hold:
1. Unicode characters are combined without spaces between them.
2. The string should split into two or more entries with separate POS tags.

Other examples.
(Korean)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="verb"/>
      <sdef n="tail"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>보</l><r>보다<s n="verb"/></r></p></e>
      <e><p><l>면</l><r>면<s n="tail"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "보면" | lt-proc test.bin
^보/보다<verb>$^면/면<tail>$

[3.5.1]
$ echo "보면" | lt-proc test.bin
^보면/*보면$

(Japanese)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="verb"/>
      <sdef n="tail"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>食べ</l><r>食べる<s n="verb"/></r></p></e>
      <e><p><l>ない</l><r>ない<s n="tail"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "食べない" | lt-proc test.bin
^食べ/食べる<verb>$^ない/ない<tail>$

[3.5.1]
$ echo "食べない" | lt-proc test.bin
^食べない/*食べない$

Regards,
Fred

@unhammer
Member

Aha, so the problem is actually that lttoolbox expects analyses to be delimited by blanks. If you have an empty alphabet in lttoolbox 3.5.0, then any symbol is a potential blank, and that allows 보면 to get two analyses even though they're not separated by regular spaces. But if you put any of the symbols into <alphabet>, they would stop being potential blanks, so e.g. on trying to analyse 보 we would see that the next symbol is alphabetic, and so the analysis "has to" include the next symbol in the form; 보 alone can't get an analysis.
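
As a rough sketch of that tokenisation rule (simplified standalone code, not the actual lttoolbox implementation; isAlphabetic() here mirrors the 3.5.1 behaviour with an empty <alphabet>):
---
#include <cwctype>
#include <clocale>
#include <string>
#include <vector>
#include <cstdio>

static bool isAlphabetic(wchar_t c) {
    return (bool)std::iswalnum(c); // 3.5.1 behaviour, empty <alphabet>
}

// Split the input into maximal runs of alphabetic characters; a run is
// the smallest unit the analyser will try to match against the
// dictionary, because a form keeps growing while the next symbol is
// alphabetic and an analysis can only end at a non-alphabetic "blank".
static std::vector<std::wstring> tokens(std::wstring const &input) {
    std::vector<std::wstring> out;
    std::wstring cur;
    for (wchar_t c : input) {
        if (isAlphabetic(c)) {
            cur += c; // the form "has to" include the next symbol
        } else if (!cur.empty()) {
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");
    // 보면 becomes one token of two symbols, so 보 alone never reaches
    // the dictionary lookup; with 3.5.0's empty alphabet, every hangul
    // character was a potential token boundary.
    for (auto const &t : tokens(L"보면"))
        std::printf("token of %zu symbol(s)\n", t.size());
}
---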

The change in 3.5.1 puts every alphanumeric character into <alphabet> even if you don't do it yourself, which is basically the same as writing <alphabet>abcd…보면…食べない…etc</alphabet>. I've opened an issue to support your use case: #75
