Dictionary is silently ignored when character encoding is not supported #120

Closed
schra opened this issue Apr 14, 2020 · 3 comments
Labels
enhancement New feature or request

schra commented Apr 14, 2020

Description

A dictionary passed via --dict is silently ignored when it's in a character encoding that can't be handled.

This is especially problematic because aspell writes its personal word list, ~/.aspell.en.pws, in ISO-8859-1 by default, which triggers this bug.

How to reproduce

Create the file test.tex with the following content:

\begin{document}
My name is André and I work at TomTom.
\end{document}

Note that this will yield two spelling mistakes:

$ textidote --check en_us test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 2 warning(s)
Total analysis time: 1 second(s)

* L2C12-L2C17 Possible spelling mistake found. Suggestions: [Andre, Andrew,
  Andrea, Andrei, Andres] (11) [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
             ^^^^^^
* L2C32-L2C38 Possible spelling mistake found. Suggestions: [Tom Tom] (31)
  [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
                                 ^^^^^^^

Next, I delete my aspell dictionary and create a new one by adding both "André" and "TomTom":

$ rm ~/.aspell.en.pws
$ aspell check test.tex
# inside aspell, press "a" twice to add both words

Note the character encoding of the generated file:

$ file ~/.aspell.en.pws
/home/andre/.aspell.en.pws: ISO-8859 text
$ cat ~/.aspell.en.pws
personal_ws-1.1 en 2
Andr�
     TomTom

Now I call textidote again with the newly created dictionary. I would expect no mistakes to be reported, since both words are whitelisted. However, that is not the case: all words in the dictionary are silently ignored.

$ textidote --check en_us --dict ~/.aspell.en.pws test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 2 warning(s)
Total analysis time: 1 second(s)

* L2C12-L2C17 Possible spelling mistake found. Suggestions: [Andre, Andrew,
  Andrea, Andrei, Andres] (11) [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
             ^^^^^^
* L2C32-L2C38 Possible spelling mistake found. Suggestions: [Tom Tom] (31)
  [lt:en:MORFOLOGIK_RULE_EN_US]
  My name is André and I work at TomTom.
                                 ^^^^^^^

Workaround

As a workaround, we can convert the dictionary to UTF-8, and then everything works:

$ iconv -f ISO-8859-1 -t UTF-8 ~/.aspell.en.pws > .aspell.en.pws
$ file .aspell.en.pws
.aspell.en.pws: UTF-8 Unicode text
$ textidote --check en_us --dict .aspell.en.pws test.tex
TeXtidote v0.8.1 - A linter for LaTeX documents and others
(C) 2018-2019 Sylvain Hallé - All rights reserved

Found 0 warning(s)
Total analysis time: 1 second(s)

Everything is OK!
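
Where iconv is not available, the same conversion could be sketched in Java. The class name PwsToUtf8 is made up for illustration, and the source encoding ISO-8859-1 is an assumption carried over from the iconv call above; a file in another 8-bit encoding would be converted incorrectly without any error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: re-encodes an ISO-8859-1 word list as UTF-8.
class PwsToUtf8 {
    static void convert(Path source, Path target) throws IOException {
        byte[] raw = Files.readAllBytes(source);
        // ISO-8859-1 maps every byte to a character, so this step never fails.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PwsToUtf8 <source> <target>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The converted file can then be passed to --dict as in the workaround.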

Remarks

Not sure if this is important but here is my aspell version:

$ aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)

It would be handy if textidote produced a hard error if the character encoding isn't supported. It took me some time to debug why my dictionary was getting ignored.

And thanks for creating textidote, it's a very helpful program :)

@sylvainhalle
Owner

Thank you for your detailed bug report. The culprit lies in this line:

Actually, the Scanner class itself silently fails when it reads a file that does not match the expected encoding, and just won't read anything from the file. However, no exception is thrown, so I cannot catch the encoding problem. The best that could be done is a warning given to the user if nothing has been read from the dictionary file.
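
A minimal sketch of such a warning, under stated assumptions: DictionaryLoader and readLines are hypothetical names, not TeXtidote's actual code. As far as I can tell, Scanner does record the exception it swallows and exposes it through the standard ioException() method, so the sketch checks that as well as the "nothing was read" case.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not taken from the TeXtidote sources.
class DictionaryLoader {
    static List<String> readLines(File dict, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(dict, charsetName)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                if (!line.isEmpty()) {
                    lines.add(line);
                }
            }
            // Scanner swallows read errors; ioException() returns the last one, if any.
            IOException swallowed = scanner.ioException();
            if (swallowed != null) {
                throw new IOException("Could not read " + dict + " as " + charsetName, swallowed);
            }
        }
        if (lines.isEmpty()) {
            // Fallback warning for a file that yielded no entries at all.
            System.err.println("Warning: nothing was read from dictionary " + dict);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DictionaryLoader <file> <charset>");
            return;
        }
        System.out.println(readLines(new File(args[0]), args[1]));
    }
}
```

Note that the header line of an aspell word list (personal_ws-1.1 en 2 in the report above) is returned like any other line in this sketch.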

@sylvainhalle sylvainhalle added the enhancement New feature or request label Apr 14, 2020
@sylvainhalle sylvainhalle added this to the v0.9 milestone Apr 14, 2020
@schra
Author

schra commented Apr 14, 2020

> The best that could be done is a warning given to the user if nothing has been read from the dictionary file.

That works for me :)

Another solution would be the following:

There is a ctor of Scanner where you can pass the charset: https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html#Scanner(java.io.File,%20java.lang.String)

Apache Tika has a function for guessing the charset: https://tika.apache.org/1.17/api/org/apache/tika/parser/txt/CharsetDetector.html#detect--

They also seem to support the charset in question, ISO-8859-1: https://github.com/apache/tika/blob/5eec28ae0203820364dbcdef58335fd64aeb90ec/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L63

@sylvainhalle
Owner

Indeed, but I'm not sure I want to introduce a dependency on Tika (64 MB jar file) just to use a single method...
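
A dependency-free alternative might be a simple fallback using only the JDK: try a strict UTF-8 decode first and, if that fails, assume ISO-8859-1. This is a heuristic sketch rather than real charset detection. CharsetFallback is a hypothetical name, and the choice of ISO-8859-1 as the fallback is an assumption based on the aspell file in this report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, ISO-8859-1 as an assumed fallback.
class CharsetFallback {
    static String decode(byte[] bytes) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8. ISO-8859-1 accepts any byte sequence, so this
            // always succeeds, even if the real encoding is something else.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "Andr\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "Andr\u00e9".getBytes(StandardCharsets.UTF_8);
        // Both inputs should decode to the same string.
        System.out.println(decode(latin1).equals(decode(utf8)));
    }
}
```

Because the fallback never fails, a file in a different 8-bit encoding would still be decoded, just with the wrong characters; that is the trade-off against a detector such as Tika's.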
