Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Invalid UTF-8 byte or sequence" warning for ISO-8859-1 (Latin-1) packages (e.g. xstring) #689

Closed
hugobuddel opened this issue Nov 23, 2020 · 8 comments

Comments

@hugobuddel
Copy link

Compile

\documentclass{article}
\usepackage{xstring}
\begin{document}
Test.
\end{document}

with tectonic, and there are hundreds of warnings like

warning: xstring.tex:774: Invalid UTF-8 byte or sequence at line 774 replaced by U+FFFD.
warning: xstring.tex:797: Invalid UTF-8 byte or sequence at line 797 replaced by U+FFFD.
warning: xstring.tex:797: Invalid UTF-8 byte or sequence at line 797 replaced by U+FFFD.
warning: xstring.tex:810: Invalid UTF-8 byte or sequence at line 810 replaced by U+FFFD.

Indeed xstring.tex contains many Invalid UTF-8 bytes, only in comments it seems, but it has a proper declaration:

% !TeX encoding = ISO-8859-1
[...]
\newcount\decimalpart% compteurs utilisés par xstring

where the é is non-utf-8.

There appear to be only a few of these in the 2020 bundle:

$ grep "!TeX encoding" *.tex *.sty
autoaligne.tex:% !TeX encoding = ISO-8859-1
chemfig.tex:% !TeX encoding = ISO-8859-1
hlist.tex:% !TeX encoding = ISO-8859-1
listofitems.tex:% !TeX encoding = ISO-8859-1
listofitemsold.tex:% !TeX encoding = ISO-8859-1
simplekv.tex:% !TeX encoding = ISO-8859-1
systeme.tex:% !TeX encoding = ISO-8859-1
xstring.tex:% !TeX encoding = ISO-8859-1
dijkstra.sty:% !TeX encoding = ISO-8859-1
glosmathtools.sty:% !TeX encoding = UTF-8
hlist.sty:% !TeX encoding = ISO-8859-1
modeles-factures-belges-associations.sty:% !TeX encoding = UTF-8
scratch.sty:% !TeX encoding = ISO-8859-1
scratch3.sty:% !TeX encoding = ISO-8859-1
simplekv.sty:% !TeX encoding = ISO-8859-1
spreadtab.sty:% !TeX encoding = ISO-8859-1

However, there are more than in the 2018 bundle, like xstring:

$ grep "!TeX encoding" *.tex *.sty
chemfig.tex:% !TeX encoding = ISO-8859-1
hlist.tex:% !TeX encoding = ISO-8859-1
simplekv.tex:% !TeX encoding = ISO-8859-1
systeme.tex:% !TeX encoding = ISO-8859-1
dijkstra.sty:% !TeX encoding = ISO-8859-1
hlist.sty:% !TeX encoding = ISO-8859-1
scratch.sty:% !TeX encoding = ISO-8859-1
simplekv.sty:% !TeX encoding = ISO-8859-1

xstring is a required package for several (69 it seems) other packages, among which acronym, which is what I'd like to use.

It is my understanding that xelatex can also give this warning, but I have not been able to trigger it. So it seems that tectonic is doing something different.

Maybe tectonic does not properly parse the "!TeX encoding" header?

@hugobuddel
Copy link
Author

A bit of my own (so far unsuccessful) investigation.

The warning seems to originate from xtex-xetex0.c

tectonic/tectonic/xetex-xetex0.c

Lines 11432 to 11436 in ac94ca4

void bad_utf8_warning(void)
{
begin_diagnostic();
diagnostic_begin_capture_warning_here();
print_nl_cstr("Invalid UTF-8 byte or sequence");

Which seems to be called from

get_uni_c(UFILE* f)

when the encoding mode is (inadvertently) UTF8
switch (f->encodingMode) {
case UTF8:

The encoding mode seems to be set (or propagated) by set_input_file_encoding
set_input_file_encoding(UFILE* f, int32_t mode, int32_t encodingData)

Which seems to be called from either
u_open_in(UFILE **f, int32_t filefmt, const char *fopen_mode, int32_t mode, int32_t encodingData)

or
void do_extension(void)

But here it quickly becomes more complicated

@pkgw
Copy link
Collaborator

pkgw commented Nov 23, 2020

@hugobuddel I'm not 100% sure, but I don't recall seeing any code that checks for a !TeX encoding line or anything like that — I think that's just an unsupported feature in XeTeX/Tectonic. I believe that XeTeX is probably flagging the same warning, but just that you don't see it in normal operations unless you go poking through the log files.

@hugobuddel
Copy link
Author

Thanks @pkgw , that could be that it is just not shown by default in XeTeX.

This code block gave me an idea:

tectonic/tectonic/xetex-xetex0.c

Lines 16455 to 16471 in ac94ca4

case XETEX_INPUT_ENCODING_EXTENSION_CODE:
{
scan_and_pack_name();
i = get_encoding_mode_and_info(&j);
if (i == XETEX_INPUT_MODE_AUTO) {
error_here_with_diagnostic("Encoding mode `auto' is not valid for \\XeTeXinputencoding");
capture_to_diagnostic(NULL);
{
help_ptr = 2;
help_line[1] = "You can't use `auto' encoding here, only for \\XeTeXdefaultencoding.";
help_line[0] = "I'll ignore this and leave the current encoding unchanged.";
}
error();
} else
set_input_file_encoding(input_file[in_open], i, j);
}
break;

I now have this in my main tex file, and it works (acronym includes xstring):

\XeTeXdefaultencoding "Latin1"
\usepackage[printonlyused]{acronym}
\XeTeXdefaultencoding "UTF8"

What should we do with this issue? Close it?

@hugobuddel
Copy link
Author

Hmm now the code does not work anymore with pdflatex, but there is probably some conditional construct that I can use.

@pkgw
Copy link
Collaborator

pkgw commented Nov 23, 2020

@hugobuddel There is an ifxetex package that provides an \ifxetex command that you could use to fence off your \XeTeXdefaultencoding commands if you really want to maintain cross-engine compatibility. As far as I know pdflatex doesn't implement any of the \XeTeX... special commands.

@pkgw
Copy link
Collaborator

pkgw commented Nov 24, 2020

OK, I think I'm going to close this issue since it is not obviously causing any problems, and there seems to be a general mechanism for working around it if absolutely needed. And, AFAIK, we're behaving the same as XeTeX here. Feel free to add further comments or open new issues as needed.

It wouldn't hurt to think about adding/fixing support for the !TeX encoding marker at some point, and PRs for that would be welcome, but it's probably not going to be a priority unless it turns out to be causing more significant problems for people.

@pkgw pkgw closed this as completed Nov 24, 2020
@hugobuddel
Copy link
Author

Yeah close it. The information is here for reference. xelatex gives the same warnings, but they are hidden in the log file. Tectonic throws them in your face though.

The workaround is suboptimal though, because in this case xstring changed encoding between 2018 and 2020. So explicitly setting the encoding for the Latin1 2020 version will not work with the UTF-8 2018 version. But it allows using the 2020 bundle without warnings, which is good enough for now.

I looked into improving the detection. The best place for that seems to be the u_open_in() function:

u_open_in(UFILE **f, int32_t filefmt, const char *fopen_mode, int32_t mode, int32_t encodingData)

This 'sniffs the encoding', which it seems to do by searching for a byte order marker. (This is also the code that is skipped if an encoding is explicitly given.) It shouldn't be too hard to extend this to check the first line for % !TeX encoding = ISO-8859-1. I can probably do that in C, but I'm far from proficient in that.

If anything I'd prefer to learn Rust over C. Apparently the UFILE struct has a rust_input_handle_t member, which is a Rust InputHandle struct. It seem that it is exactly this encoding sniffing that is annoying:

tectonic/src/io/mod.rs

Lines 66 to 71 in ac94ca4

/// figure out if the TeX engine needs rerunning. TeX makes our life more
/// difficult, though, since it has somewhat funky file access patterns. LaTeX
/// file opens work by opening a file and immediately closing it, which tests
/// whether the file exists, and then by opening it again for real. Under the
/// hood, XeTeX reads a couple of bytes from each file upon open to sniff its
/// encoding. So we can't just stream data from `read()` calls into the SHA2

So wouldn't it be nicer to replace the encoding determination part of u_open_in with some function of InputHandle? (Not that I would currently know how to approach that.)

Unfortunately I couldn't get tectonic to compile yet, because this is an old machine that doesn't have an up-to-date harfbuzz. But that should be fixable.

@pkgw
Copy link
Collaborator

pkgw commented Nov 24, 2020

Yeah, it would definitely be nice to shift the encoding processing from the existing C stuff to standard Rust libraries. This particular processing layer should be pretty separable from the rest of the engine, and hopefully more extensive use of cbindgen can allow us to migrate this subsystem to a separate helper crate. Now that I have a better understanding of how Rust "-sys" crates can provide C header files to each other, I think I see a path for this. (The same tools should also help us bundle our own Harfbuzz to help avoid compilation challenges like yours, AKA issue #619.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants