"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689
Comments
A bit of my own (so far unsuccessful) investigation. The warning seems to originate from xetex-xetex0.c (tectonic/tectonic/xetex-xetex0.c, Lines 11432 to 11436 in ac94ca4),
which seems to be called from Line 356 in 5d8f0b7
when the encoding mode is (inadvertently) UTF8 (Lines 367 to 368 in 5d8f0b7).
The encoding mode seems to be set (or propagated) by set_input_file_encoding (Line 70 in 5d8f0b7),
which seems to be called from either Line 111 in 5d8f0b7
or tectonic/tectonic/xetex-xetex0.c, Line 16313 in ac94ca4.
But here it quickly becomes more complicated.
@hugobuddel I'm not 100% sure, but I don't recall seeing any code that checks for a |
Thanks @pkgw, it could be that the warning is just not shown by default in XeTeX. This code block gave me an idea: tectonic/tectonic/xetex-xetex0.c, Lines 16455 to 16471 in ac94ca4
I now have this in my main tex file, and it works (acronym includes xstring):
What should we do with this issue? Close it?
Hmm, now the code no longer works with pdflatex, but there is probably some conditional construct that I can use.
@hugobuddel There is an |
OK, I think I'm going to close this issue since it is not obviously causing any problems, and there seems to be a general mechanism for working around it if absolutely needed. And, AFAIK, we're behaving the same as XeTeX here. Feel free to add further comments or open new issues as needed. It wouldn't hurt to think about adding/fixing support for the |
Yeah, close it. The information is here for reference. xelatex gives the same warnings, but they are hidden in the log file, whereas Tectonic throws them in your face.
The workaround is suboptimal, because in this case xstring changed encoding between 2018 and 2020, so explicitly setting the encoding for the Latin-1 2020 version will not work with the UTF-8 2018 version. But it allows using the 2020 bundle without warnings, which is good enough for now.
I looked into improving the detection. The best place for that seems to be the Line 111 in ac94ca4
This 'sniffs the encoding', which it seems to do by searching for a byte order mark. (This is also the code that is skipped if an encoding is explicitly given.) It shouldn't be too hard to extend this to check the first line for
If anything I'd prefer to learn Rust over C. Apparently the Lines 66 to 71 in ac94ca4
So wouldn't it be nicer to replace the encoding determination part of
Unfortunately I couldn't get tectonic to compile yet, because this is an old machine that doesn't have an up-to-date harfbuzz. But that should be fixable.
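For reference, the sniffing idea described in this comment could be sketched in Rust roughly as follows. This is only an illustration under stated assumptions, not Tectonic's actual code: the names `SniffedEncoding`, `sniff_bom`, `sniff_magic_comment` and `sniff_encoding` are hypothetical, and the accepted spellings of the `!TeX encoding` value (such as `ISO-8859-1`) are guesses at what such headers commonly contain.

```rust
/// Hypothetical result type for this sketch; not a Tectonic type.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SniffedEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Latin1,
}

/// Look for a byte order mark at the very start of the input.
fn sniff_bom(bytes: &[u8]) -> Option<SniffedEncoding> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(SniffedEncoding::Utf8)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(SniffedEncoding::Utf16Le)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(SniffedEncoding::Utf16Be)
    } else {
        None
    }
}

/// Look for a "% !TeX encoding = NAME" comment on the first line only.
/// The list of recognised names is an assumption made for illustration.
fn sniff_magic_comment(bytes: &[u8]) -> Option<SniffedEncoding> {
    let first_line = bytes.split(|&b| b == b'\n' || b == b'\r').next()?;
    // The header itself is ASCII, so a lossy decode is enough to inspect it.
    let line = String::from_utf8_lossy(first_line).to_ascii_lowercase();
    let rest = line.trim_start().strip_prefix('%')?.trim_start();
    let rest = rest.strip_prefix("!tex")?.trim_start();
    let rest = rest.strip_prefix("encoding")?.trim_start();
    let name = rest.strip_prefix('=')?.trim();
    match name {
        "utf-8" | "utf8" => Some(SniffedEncoding::Utf8),
        "iso-8859-1" | "latin1" | "latin-1" => Some(SniffedEncoding::Latin1),
        _ => None,
    }
}

/// Try the byte order mark first, then fall back to the first-line comment.
fn sniff_encoding(bytes: &[u8]) -> Option<SniffedEncoding> {
    sniff_bom(bytes).or_else(|| sniff_magic_comment(bytes))
}
```

A `None` result would leave the choice to whatever default the engine already uses, so files without a mark or header behave as before.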
Yeah, it would definitely be nice to shift the encoding processing from the existing C stuff to standard Rust libraries. This particular processing layer should be pretty separable from the rest of the engine, and hopefully more extensive use of |
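As a rough idea of how little is needed on the Rust side for the Latin-1 case, here is a minimal sketch using only the standard library. The function names are hypothetical, and the "try UTF-8, otherwise treat as Latin-1" fallback is an assumption made for illustration, not something Tectonic or XeTeX is known to do.

```rust
/// Decode ISO-8859-1: every byte maps directly to the Unicode code point
/// with the same value (U+0000 to U+00FF), so this cannot fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Hypothetical fallback policy: accept valid UTF-8 as-is, otherwise
/// reinterpret the whole input as Latin-1.
fn decode_utf8_or_latin1(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_owned(),
        Err(_) => decode_latin1(bytes),
    }
}
```

A real implementation would presumably decode by the declared or sniffed encoding rather than guess, since some Latin-1 byte sequences also happen to be valid UTF-8.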
Compile
with tectonic, and there are hundreds of warnings like
Indeed, xstring.tex contains many invalid UTF-8 bytes (only in comments, it seems), but it has a proper declaration:
where the é is not valid UTF-8.
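To illustrate why a Latin-1 é trips a UTF-8 decoder: in ISO-8859-1 it is the single byte 0xE9, which in UTF-8 announces a three-byte sequence and must be followed by continuation bytes. A small Rust sketch (the helper name `first_invalid_utf8` and the sample text are made up for this example):

```rust
/// Return the byte offset of the first invalid UTF-8 sequence, if any.
fn first_invalid_utf8(bytes: &[u8]) -> Option<usize> {
    match std::str::from_utf8(bytes) {
        Ok(_) => None,
        // valid_up_to() is the length of the longest valid prefix,
        // i.e. the index of the first offending byte.
        Err(err) => Some(err.valid_up_to()),
    }
}
```

Given the Latin-1 bytes `b"% d\xE9finition"`, this reports offset 3, the position of the é; the same text encoded as UTF-8 passes cleanly.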
There appear to be only a few of these in the 2020 bundle:
However, there are more than in the 2018 bundle, like xstring:
xstring is required by several (69, it seems) other packages, among them acronym, which is the one I'd like to use.
It is my understanding that xelatex can also give this warning, but I have not been able to trigger it. So it seems that tectonic is doing something different.
Maybe tectonic does not properly parse the "!TeX encoding" header?