Fix the out-of-alphabet token handling in analyses generation
Solves #45
Consider alphanumeric characters to be part of the vocabulary.
AMR-KELEG authored and TinoDidriksen committed May 11, 2019
1 parent 38d22d4 commit 944ed25
Showing 1 changed file with 1 addition and 1 deletion: lttoolbox/fst_processor.cc
@@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const
 bool
 FSTProcessor::isAlphabetic(wchar_t const c) const
 {
-  return alphabetic_chars.find(c) != alphabetic_chars.end();
+  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != alphabetic_chars.end();
 }

 void
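
To make the one-line change above concrete, here is a minimal standalone sketch (not lttoolbox code; the std::set standing in for the characters collected from <alphabet>, and the en_US.UTF-8 locale, are assumptions for illustration):
---
// Sketch of the old vs. new isAlphabetic() logic. `alphabetic_chars`
// stands in for the characters collected from the dictionary's
// <alphabet> element; here it is left empty on purpose.
#include <cwctype>
#include <clocale>
#include <cstdio>
#include <set>

static std::set<wchar_t> alphabetic_chars; // assume an empty <alphabet>

static bool isAlphabeticOld(wchar_t c) {
    return alphabetic_chars.find(c) != alphabetic_chars.end();
}

static bool isAlphabeticNew(wchar_t c) {
    // New behaviour: anything the locale machinery classifies as
    // alphanumeric counts as alphabetic, regardless of <alphabet>.
    return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != alphabetic_chars.end();
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8"); // classification is locale-dependent
    wchar_t const hanzi = L'我'; // CJK character, not in alphabetic_chars
    std::printf("old: %d new: %d\n",
                (int)isAlphabeticOld(hanzi), (int)isAlphabeticNew(hanzi));
    // With an empty <alphabet> this typically prints "old: 0 new: 1":
    // the old check treats 我 as a potential blank, the new one as
    // alphabetic, which changes tokenisation (see the comments below).
}
---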

4 comments on commit 944ed25

@Fred-Git-Hub

Under a Unicode locale, std::iswalnum returns TRUE for some intuitively non-alphanumeric Unicode characters. Please see the example at the following URL:
https://en.cppreference.com/w/cpp/string/wide/iswalnum

With this code, lt-proc analysis will regard some Unicode words as unknown even if they are defined in the monodix.
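
For reference, a minimal sketch along the lines of the cppreference example, showing that the classification depends on the active locale (assuming an en_US.UTF-8 locale is installed):
---
#include <cwctype>
#include <clocale>
#include <cstdio>

int main() {
    wchar_t const ha = L'\u13AD'; // Cherokee letter HA, 'Ꭽ'

    // In the default "C" locale, iswalnum() only knows ASCII.
    std::printf("C locale:     iswalnum = %d\n", (int)(bool)std::iswalnum(ha));

    // Under a Unicode locale it also classifies many non-ASCII
    // letters as alphanumeric.
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::printf("UTF-8 locale: iswalnum = %d\n", (int)(bool)std::iswalnum(ha));
    // Typically prints 0 then 1.
}
---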

@unhammer
Member

Which example at https://en.cppreference.com/w/cpp/string/wide/iswalnum? The Cherokee letter HA ('Ꭽ')?

Can you give a concrete example of something marked as unknown even though it is defined in the monodix, along with the relevant extracts from the monodix? I don't understand how this code could contribute to that. (You might want to open an issue.)

@Fred-Git-Hub

Thank you for your reply, unhammer.

test.dix (Chinese; you may need to install a Simplified Chinese font)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="noun"/>
      <sdef n="verb"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>我</l><r>我<s n="noun"/></r></p></e>
      <e><p><l>爱</l><r>爱<s n="verb"/></r></p></e>
      <e><p><l>你</l><r>你<s n="noun"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "我爱你" | lt-proc test.bin
^我/我<noun>$^爱/爱<verb>$^你/你<noun>$

[3.5.1]
$ echo "我爱你" | lt-proc test.bin
^我爱你/*我爱你$

This error happens when both of the following conditions hold:
1. Unicode characters are combined without spaces between them.
2. The string should split into two or more entries with separate POS tags.

Other examples.
(Korean)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="verb"/>
      <sdef n="tail"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>보</l><r>보다<s n="verb"/></r></p></e>
      <e><p><l>면</l><r>면<s n="tail"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "보면" | lt-proc test.bin
^보/보다<verb>$^면/면<tail>$

[3.5.1]
$ echo "보면" | lt-proc test.bin
^보면/*보면$

(Japanese)
---
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

   <alphabet>
   </alphabet>

   <sdefs>
      <sdef n="verb"/>
      <sdef n="tail"/>
   </sdefs>

   <section id="main" type="standard">
      <e><p><l>食べ</l><r>食べる<s n="verb"/></r></p></e>
      <e><p><l>ない</l><r>ない<s n="tail"/></r></p></e>
   </section>

</dictionary>
---

[3.5.0]
$ echo "食べない" | lt-proc test.bin
^食べ/食べる<verb>$^ない/ない<tail>$

[3.5.1]
$ echo "食べない" | lt-proc test.bin
^食べない/*食べない$

Regards,
Fred

@unhammer
Member

Aha, so the problem is actually that lttoolbox expects analyses to be delimited by blanks. If you have an empty alphabet in lttoolbox 3.5.0, then any symbol is a potential blank, and that allows 보면 to get two analyses even though they're not separated by regular spaces. But if you put any of the symbols into <alphabet>, they would stop being potential blanks, so e.g. on trying to analyse 보 we would see that the next symbol is alphabetic, and so the analysis "has to" include the next symbol in the form; 보 alone can't get an analysis.
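
As a rough sketch of that tokenisation rule (simplified standalone code, not the actual lttoolbox implementation; isAlphabetic() here mirrors the 3.5.1 behaviour with an empty <alphabet>):
---
#include <cwctype>
#include <clocale>
#include <string>
#include <vector>
#include <cstdio>

static bool isAlphabetic(wchar_t c) {
    return (bool)std::iswalnum(c); // 3.5.1 behaviour, empty <alphabet>
}

// Split the input into maximal runs of alphabetic characters; a run is
// the smallest unit the analyser will try to match against the
// dictionary, because a form keeps growing while the next symbol is
// alphabetic and an analysis can only end at a non-alphabetic "blank".
static std::vector<std::wstring> tokens(std::wstring const &input) {
    std::vector<std::wstring> out;
    std::wstring cur;
    for (wchar_t c : input) {
        if (isAlphabetic(c)) {
            cur += c; // the form "has to" include the next symbol
        } else if (!cur.empty()) {
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");
    // 보면 becomes one token of two symbols, so 보 alone never reaches
    // the dictionary lookup; with 3.5.0's empty alphabet, every hangul
    // character was a potential token boundary.
    for (auto const &t : tokens(L"보면"))
        std::printf("token of %zu symbol(s)\n", t.size());
}
---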

The change in 3.5.1 puts every alphanumeric character into <alphabet> even if you don't do it yourself, which is basically the same as writing <alphabet>abcd…보면…食べない…etc</alphabet>. I've opened an issue to support your use case: #75
