Dictionary is silently ignored when character encoding is not supported #120

schra · 2020-04-14T00:46:22Z

Description

A dictionary passed via --dict is silently ignored when it's in a character encoding that can't be handled.

This is especially problematic because aspell writes its personal word list, ~/.aspell.en.pws, in ISO-8859-1 by default, which triggers this bug.

How to reproduce

Create the file test.tex with the following content:

\begin{document}
My name is André and I work at TomTom.
\end{document}

Running textidote on this file reports two spelling mistakes:

$ textidote --check en_us test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 2 warning(s)
Total analysis time: 1 second(s)

* L2C12-L2C17 Possible spelling mistake found. Suggestions: [Andre, Andrew,
  Andrea, Andrei, Andres] (11) [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
             ^^^^^^
* L2C32-L2C38 Possible spelling mistake found. Suggestions: [Tom Tom] (31)
  [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
                                 ^^^^^^^

Next, I deleted my aspell dictionary and created a new one by adding both "André" and "TomTom":

$ rm ~/.aspell.en.pws
$ aspell check test.tex
# in aspell, press "a" twice to add both words

Note the character encoding of the generated file:

$ file ~/.aspell.en.pws
/home/andre/.aspell.en.pws: ISO-8859 text
$ cat ~/.aspell.en.pws
personal_ws-1.1 en 2
Andr�
     TomTom

Now I call textidote again with the newly created dictionary. Since both words are whitelisted, I would expect no mistakes to be reported. However, this is not the case: every word in the dictionary is silently ignored.

$ textidote --check en_us --dict ~/.aspell.en.pws test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 2 warning(s)
Total analysis time: 1 second(s)

* L2C12-L2C17 Possible spelling mistake found. Suggestions: [Andre, Andrew,
  Andrea, Andrei, Andres] (11) [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
             ^^^^^^
* L2C32-L2C38 Possible spelling mistake found. Suggestions: [Tom Tom] (31)
  [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
                                 ^^^^^^^

Workaround

As a workaround, we can convert the dictionary to UTF-8, after which everything works:

$ iconv -f ISO-8859-1 -t UTF-8 ~/.aspell.en.pws > .aspell.en.pws
$ file .aspell.en.pws
.aspell.en.pws: UTF-8 Unicode text
$ textidote --check en_us --dict .aspell.en.pws test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 0 warning(s)
Total analysis time: 1 second(s)

Everything is OK!

Remarks

Not sure if this is important, but here is my aspell version:

$ aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)

It would be handy if textidote produced a hard error if the character encoding isn't supported. It took me some time to debug why my dictionary was getting ignored.

And thanks for creating textidote, it's a very helpful program :)


sylvainhalle · 2020-04-14T13:11:04Z

Thank you for your detailed bug report. The culprit lies in this line:

textidote/Source/Core/src/ca/uqac/lif/textidote/Main.java

Line 774 in 204a7af

sc = new Scanner(f);

Actually, the Scanner class itself silently fails when it reads a file that does not match the expected encoding, and just won't read anything from the file. However, no exception is thrown, so I cannot catch the encoding problem. The best that could be done is a warning given to the user if nothing has been read from the dictionary file.
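
For illustration, here is a minimal sketch of the behaviour described above together with the proposed warning. It relies on the assumption that Scanner records a decoding failure internally and exposes it through ioException() instead of throwing it. The class name DictionaryReadDemo and the helpers readWords and writeLatin1Sample are hypothetical and are not taken from TeXtidote's source. The charset is passed explicitly so the result does not depend on the platform default.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DictionaryReadDemo {

    /** Reads all lines of f, decoding with the given charset name (hypothetical helper). */
    static List<String> readWords(File f, String charsetName) throws IOException {
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(f, charsetName)) {
            while (sc.hasNextLine()) {
                words.add(sc.nextLine());
            }
            // Scanner does not throw I/O errors; the last one, if any, is kept here.
            IOException hidden = sc.ioException();
            if (hidden != null) {
                System.err.println("Warning: could not decode " + f + ": " + hidden);
            } else if (words.isEmpty()) {
                System.err.println("Warning: nothing was read from " + f);
            }
        }
        return words;
    }

    /** Writes a small aspell-style word list encoded as ISO-8859-1. */
    static File writeLatin1Sample() throws IOException {
        File f = File.createTempFile("aspell", ".pws");
        f.deleteOnExit();
        String content = "personal_ws-1.1 en 2\nAndr\u00e9\nTomTom\n";
        Files.write(f.toPath(), content.getBytes(StandardCharsets.ISO_8859_1));
        return f;
    }

    public static void main(String[] args) throws IOException {
        File f = writeLatin1Sample();
        System.out.println(readWords(f, "UTF-8").size() + " line(s) read as UTF-8");
        System.out.println(readWords(f, "ISO-8859-1").size() + " line(s) read as ISO-8859-1");
    }
}
```

If that assumption holds, checking ioException() after the read loop would let the warning name the actual cause rather than only reporting an empty dictionary.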

schra · 2020-04-14T19:26:00Z

> The best that could be done is a warning given to the user if nothing has been read from the dictionary file.

That works for me :)

Another solution would be the following:

Scanner has a constructor that takes the charset name: https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html#Scanner(java.io.File,%20java.lang.String)

Apache Tika has a function for guessing the charset: https://tika.apache.org/1.17/api/org/apache/tika/parser/txt/CharsetDetector.html#detect--

It also seems to support the charset in question, ISO-8859-1: https://github.com/apache/tika/blob/5eec28ae0203820364dbcdef58335fd64aeb90ec/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L63

sylvainhalle · 2020-04-14T19:28:48Z

Indeed, but I'm not sure I want to introduce a dependency on Tika (64 MB jar file) just to use a single method...
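
A dependency-free alternative could look like the sketch below. It is an assumption for illustration only, not something proposed in this thread or taken from TeXtidote: try a strict UTF-8 decode first and fall back to ISO-8859-1 (the encoding of the aspell file in this report) if that fails, then pass the chosen name to the Scanner(File, String) constructor. This is only a two-way guess, not general charset detection of the kind Tika provides. CharsetGuess and guessCharset are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public class CharsetGuess {

    /** Returns "UTF-8" if the bytes decode cleanly as UTF-8, otherwise "ISO-8859-1". */
    static String guessCharset(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO-8859-1, so this fallback always decodes.
            return "ISO-8859-1";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CharsetGuess <dictionary-file>");
            return;
        }
        File f = new File(args[0]);
        String charset = guessCharset(Files.readAllBytes(f.toPath()));
        // Hand the guessed charset to Scanner instead of relying on the default.
        try (Scanner sc = new Scanner(f, charset)) {
            while (sc.hasNextLine()) {
                System.out.println(sc.nextLine());
            }
        }
    }
}
```

The trade-off is that a file in some other single-byte encoding would be read as ISO-8859-1 without complaint, so this only covers the two encodings discussed here.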

sylvainhalle added the enhancement New feature or request label Apr 14, 2020

sylvainhalle added this to the v0.9 milestone Apr 14, 2020

sylvainhalle closed this as completed in 949111c May 15, 2020

sylvainhalle modified the milestones: v0.9, v0.8.2 May 15, 2020

sylvainhalle mentioned this issue Aug 15, 2021

No warnings found in CI #185

Closed
