"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

hugobuddel · 2020-11-23T15:12:41Z

Compile

\documentclass{article}
\usepackage{xstring}
\begin{document}
Test.
\end{document}

with tectonic, and there are hundreds of warnings like

warning: xstring.tex:774: Invalid UTF-8 byte or sequence at line 774 replaced by U+FFFD.
warning: xstring.tex:797: Invalid UTF-8 byte or sequence at line 797 replaced by U+FFFD.
warning: xstring.tex:797: Invalid UTF-8 byte or sequence at line 797 replaced by U+FFFD.
warning: xstring.tex:810: Invalid UTF-8 byte or sequence at line 810 replaced by U+FFFD.

Indeed xstring.tex contains many Invalid UTF-8 bytes, only in comments it seems, but it has a proper declaration:

% !TeX encoding = ISO-8859-1
[...]
\newcount\decimalpart% compteurs utilisés par xstring

where the é is non-utf-8.

There appear to be only a few of these in the 2020 bundle:

$ grep "!TeX encoding" *.tex *.sty
autoaligne.tex:% !TeX encoding = ISO-8859-1
chemfig.tex:% !TeX encoding = ISO-8859-1
hlist.tex:% !TeX encoding = ISO-8859-1
listofitems.tex:% !TeX encoding = ISO-8859-1
listofitemsold.tex:% !TeX encoding = ISO-8859-1
simplekv.tex:% !TeX encoding = ISO-8859-1
systeme.tex:% !TeX encoding = ISO-8859-1
xstring.tex:% !TeX encoding = ISO-8859-1
dijkstra.sty:% !TeX encoding = ISO-8859-1
glosmathtools.sty:% !TeX encoding = UTF-8
hlist.sty:% !TeX encoding = ISO-8859-1
modeles-factures-belges-associations.sty:% !TeX encoding = UTF-8
scratch.sty:% !TeX encoding = ISO-8859-1
scratch3.sty:% !TeX encoding = ISO-8859-1
simplekv.sty:% !TeX encoding = ISO-8859-1
spreadtab.sty:% !TeX encoding = ISO-8859-1

However, there are more than in the 2018 bundle, like xstring:

$ grep "!TeX encoding" *.tex *.sty
chemfig.tex:% !TeX encoding = ISO-8859-1
hlist.tex:% !TeX encoding = ISO-8859-1
simplekv.tex:% !TeX encoding = ISO-8859-1
systeme.tex:% !TeX encoding = ISO-8859-1
dijkstra.sty:% !TeX encoding = ISO-8859-1
hlist.sty:% !TeX encoding = ISO-8859-1
scratch.sty:% !TeX encoding = ISO-8859-1
simplekv.sty:% !TeX encoding = ISO-8859-1

xstring is a required package for several (69 it seems) other packages, among which acronym, which is what I'd like to use.

It is my understanding that xelatex can also give this warning, but I have not been able to trigger it. So it seems that tectonic is doing something different.

Maybe tectonic does not properly parse the "!TeX encoding" header?

The text was updated successfully, but these errors were encountered:

hugobuddel · 2020-11-23T15:53:34Z

A bit of my own (so far unsuccessful) investigation.

The warning seems to originate from xtex-xetex0.c

tectonic/tectonic/xetex-xetex0.c

Lines 11432 to 11436 in ac94ca4

 void bad_utf8_warning(void) 

 { 

 begin_diagnostic(); 

 diagnostic_begin_capture_warning_here(); 

 print_nl_cstr("Invalid UTF-8 byte or sequence");

Which seems to be called from

tectonic/tectonic/xetex-io.c

Line 356 in 5d8f0b7

get_uni_c(UFILE* f)

when the encoding mode is (inadvertently) UTF8

tectonic/tectonic/xetex-io.c

Lines 367 to 368 in 5d8f0b7

 switch (f->encodingMode) { 

 case UTF8:

The encoding mode seems to be set (or propagated) by set_input_file_encoding

tectonic/tectonic/xetex-io.c

Line 70 in 5d8f0b7

set_input_file_encoding(UFILE* f, int32_t mode, int32_t encodingData)

Which seems to be called from either

tectonic/tectonic/xetex-io.c

Line 111 in 5d8f0b7

 u_open_in(UFILE **f, int32_t filefmt, const char *fopen_mode, int32_t mode, int32_t encodingData) 

or

tectonic/tectonic/xetex-xetex0.c

Line 16313 in ac94ca4

void do_extension(void)

But here it quickly becomes more complicated

pkgw · 2020-11-23T16:02:14Z

@hugobuddel I'm not 100% sure, but I don't recall seeing any code that checks for a !TeX encoding line or anything like that — I think that's just an unsupported feature in XeTeX/Tectonic. I believe that XeTeX is probably flagging the same warning, but just that you don't see it in normal operations unless you go poking through the log files.

hugobuddel · 2020-11-23T16:10:43Z

Thanks @pkgw , that could be that it is just not shown by default in XeTeX.

This code block gave me an idea:

tectonic/tectonic/xetex-xetex0.c

Lines 16455 to 16471 in ac94ca4

 case XETEX_INPUT_ENCODING_EXTENSION_CODE: 

 { 

 scan_and_pack_name(); 

 i = get_encoding_mode_and_info(&j); 

 if (i == XETEX_INPUT_MODE_AUTO) { 

 error_here_with_diagnostic("Encoding mode `auto' is not valid for \\XeTeXinputencoding"); 

 capture_to_diagnostic(NULL); 

 { 

 help_ptr = 2; 

 help_line[1] = "You can't use `auto' encoding here, only for \\XeTeXdefaultencoding."; 

 help_line[0] = "I'll ignore this and leave the current encoding unchanged."; 

 } 

 error(); 

 } else 

 set_input_file_encoding(input_file[in_open], i, j); 

 } 

 break;

I now have this in my main tex file, and it works (acronym includes xstring):

\XeTeXdefaultencoding "Latin1"
\usepackage[printonlyused]{acronym}
\XeTeXdefaultencoding "UTF8"

What should we do with this issue? Close it?

hugobuddel · 2020-11-23T16:12:09Z

Hmm now the code does not work anymore with pdflatex, but there is probably some conditional construct that I can use.

pkgw · 2020-11-23T16:49:46Z

@hugobuddel There is an ifxetex package that provides an \ifxetex command that you could use to fence off your \XeTeXdefaultencoding commands if you really want to maintain cross-engine compatibility. As far as I know pdflatex doesn't implement any of the \XeTeX... special commands.

pkgw · 2020-11-24T13:31:56Z

OK, I think I'm going to close this issue since it is not obviously causing any problems, and there seems to be a general mechanism for working around it if absolutely needed. And, AFAIK, we're behaving the same as XeTeX here. Feel free to add further comments or open new issues as needed.

It wouldn't hurt to think about adding/fixing support for the !TeX encoding marker at some point, and PRs for that would be welcome, but it's probably not going to be a priority unless it turns out to be causing more significant problems for people.

hugobuddel · 2020-11-24T14:28:12Z

Yeah close it. The information is here for reference. xelatex gives the same warnings, but they are hidden in the log file. Tectonic throws them in your face though.

The workaround is suboptimal though, because in this case xstring changed encoding between 2018 and 2020. So explicitly setting the encoding for the Latin1 2020 version will not work with the UTF-8 2018 version. But it allows using the 2020 bundle without warnings, which is good enough for now.

I looked into improving the detection. The best place for that seems to be the u_open_in() function:

tectonic/tectonic/xetex-io.c

Line 111 in ac94ca4

 u_open_in(UFILE **f, int32_t filefmt, const char *fopen_mode, int32_t mode, int32_t encodingData) 

This 'sniffs the encoding', which it seems to do by searching for a byte order marker. (This is also the code that is skipped if an encoding is explicitly given.) It shouldn't be too hard to extend this to check the first line for % !TeX encoding = ISO-8859-1. I can probably do that in C, but I'm far from proficient in that.

If anything I'd prefer to learn Rust over C. Apparently the UFILE struct has a rust_input_handle_t member, which is a Rust InputHandle struct. It seem that it is exactly this encoding sniffing that is annoying:

tectonic/src/io/mod.rs

Lines 66 to 71 in ac94ca4

 /// figure out if the TeX engine needs rerunning. TeX makes our life more 

 /// difficult, though, since it has somewhat funky file access patterns. LaTeX 

 /// file opens work by opening a file and immediately closing it, which tests 

 /// whether the file exists, and then by opening it again for real. Under the 

 /// hood, XeTeX reads a couple of bytes from each file upon open to sniff its 

 /// encoding. So we can't just stream data from `read()` calls into the SHA2

So wouldn't it be nicer to replace the encoding determination part of u_open_in with some function of InputHandle? (Not that I would currently know how to approach that.)

Unfortunately I couldn't get tectonic to compile yet, because this is an old machine that doesn't have an up-to-date harfbuzz. But that should be fixable.

pkgw · 2020-11-24T14:54:01Z

Yeah, it would definitely be nice to shift the encoding processing from the existing C stuff to standard Rust libraries. This particular processing layer should be pretty separable from the rest of the engine, and hopefully more extensive use of cbindgen can allow us to migrate this subsystem to a separate helper crate. Now that I have a better understanding of how Rust "-sys" crates can provide C header files to each other, I think I see a path for this. (The same tools should also help us bundle our own Harfbuzz to help avoid compilation challenges like yours, AKA issue #619.)

pkgw closed this as completed Nov 24, 2020

pkgw mentioned this issue Nov 24, 2020

Figure out what to do with harfbuzz dependency - bundle it? #619

Closed

japm48 mentioned this issue Sep 4, 2021

option for additional file search directories #816

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

hugobuddel commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

pkgw commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

pkgw commented Nov 23, 2020

pkgw commented Nov 24, 2020 •

edited

Loading

hugobuddel commented Nov 24, 2020

pkgw commented Nov 24, 2020

"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

Comments

hugobuddel commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

pkgw commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

hugobuddel commented Nov 23, 2020

pkgw commented Nov 23, 2020

pkgw commented Nov 24, 2020 • edited Loading

hugobuddel commented Nov 24, 2020

pkgw commented Nov 24, 2020

pkgw commented Nov 24, 2020 •

edited

Loading