Problem writing multi-byte utf-8 text to message output #383

At present this does not happen often. Most warning and error messages are presently in the ASCII range 32 to 126. But tests 427664 and 427672 in the [tidy-html5-tests](https://github.com/htacg/tidy-html5-tests) repo report a warning about a **bad** attribute name, one that does **not** start with a letter, a-z or A-Z. In these two tests, the name starts with an extended ASCII value of 195 (0xc3), so it is an invalid attribute name.

These two tests run with config_default.conf, which contains, among other things, `char-encoding: latin1`. So when Tidy reads this `c3` from the stream, it is valid `latin1` (MIME `ISO-8859-1`), and Tidy converts it to utf-8, `c3 83`, Tidy's internal default encoding, to store it in the lexer. So far no problem.
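To make the `c3` to `c3 83` step concrete, here is a minimal sketch of a Latin-1 to utf-8 conversion. The function name `latin1_to_utf8` is hypothetical and is not Tidy's own code; it only relies on the fact that every Latin-1 byte equals its Unicode code point.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Tidy: convert Latin-1 bytes to utf-8.
   Each Latin-1 byte is its own code point, so values below 0x80 are
   copied and values 0x80..0xFF become a two-byte sequence.
   `out` must have room for 2 * len bytes. Returns bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;                                   /* plain ASCII */
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation */
        }
    }
    return n;
}
```

Fed the single byte `c3`, this produces the two bytes `c3 83` described above.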

When reporting this invalid attribute name, the now utf-8 name is correctly copied into the warning message string using `vsnprintf` or `vsprintf`, which have no problem copying the utf-8 bytes into the message.

The problem comes when outputting that formatted message to either `stderr` or the user's message file. The service `messagePos` presently outputs the message on a **byte-by-byte** basis using `TY_(WriteChar)( *cp, doc->errout );`! There are several problems with this!
1. On most systems, the use of `*cp` does not protect against sign-extending the character to `0xffffffc3`
2. `WriteChar` is an encoding service, so it cannot be used on a **byte-by-byte** basis to output multi-byte utf-8
3. The user has configured `latin1` output, but this `doc->errout` is presently set to utf-8

The result is that we get the wrong output! On a system that sign-extends the byte, we get `EF BF BF EF BF BF` for the two utf-8 bytes, and when the byte is not sign-extended we get `C3 83 C2 83`. Both are **wrong**.
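The two failure modes can be reproduced outside Tidy with a small sketch. Everything here is hypothetical: `encode_utf8` is only a stand-in for an encoding writer (it is not `WriteChar`), and `widen_as_signed` spells out what sign extension of a signed `char` does, so the result does not depend on the platform. The stand-in rejects out-of-range values by returning 0 instead of emitting replacement bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen a byte the way a signed char is widened: the high bit is
   copied into the upper bits. 0xC3 becomes 0xFFFFFFC3. */
uint32_t widen_as_signed(unsigned char byte)
{
    int32_t value = (byte >= 0x80) ? (int32_t)byte - 256 : (int32_t)byte;
    return (uint32_t)value;
}

/* Hypothetical stand-in for an encoding writer: encode one code point
   as utf-8. Returns bytes written, or 0 if cp is above U+10FFFF.
   (Surrogates are not rejected, to keep the sketch short.) */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* The buggy pattern: treat every byte of an already-utf-8 message as
   if it were a code point. `out` must have room for 2 * len bytes. */
size_t write_bytewise(const unsigned char *msg, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i)
        n += encode_utf8(msg[i], out + n);
    return n;
}
```

Passing the bytes `C3 83` to `write_bytewise` yields `C3 83 C2 83`, the double-encoded form, while `widen_as_signed(0xC3)` gives `0xFFFFFFC3`, which no utf-8 encoder can represent as a code point.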

Now the solution depends on whether we wish to respect the user's output encoding choice, in this case `latin1`, or continue to output only utf-8 to the message output.

The message is presently correctly encoded as utf-8, so it could be output as a single text stream, and that would result in a valid utf-8 message.
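As a rough illustration of this first option, the formatted message could be handed to the stream untouched. The helper name `write_message_raw` is hypothetical, and the sketch uses a plain `FILE *` in place of Tidy's own output sink.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write an already-utf-8 message as raw bytes,
   with no per-byte re-encoding. Returns the number of bytes written. */
size_t write_message_raw(const char *msg, FILE *fp)
{
    return fwrite(msg, 1, strlen(msg), fp);
}
```

For example, `write_message_raw(buf, stderr)` would emit the bytes of `buf` exactly as `vsnprintf` left them.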

It gets a little more complicated if we wish to respect the user's encoding choice. Then we would need to pass the output to `WriteChar` as complete utf-8 character sequences, of up to 4 bytes each, to have them correctly translated, to `latin1` in this case.
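A sketch of this second option, under stated assumptions: `decode_utf8` and `utf8_to_latin1` are hypothetical names, the Latin-1 writer is a stand-in for whatever the configured output encoding does, and substituting `?` for malformed or unrepresentable characters is my assumption rather than Tidy's behavior. Overlong sequences are not rejected, to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one utf-8 sequence at s (at most len bytes available).
   Stores the code point in *cp and returns the bytes consumed,
   or 0 if the sequence is malformed or truncated. */
size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { need = 1; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 2; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 3; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead */

    if (len < need + 1)
        return 0;                   /* truncated sequence */
    for (size_t i = 1; i <= need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return need + 1;
}

/* Walk the message one complete character at a time and emit Latin-1.
   `out` must have room for len bytes. Returns bytes written. */
size_t utf8_to_latin1(const unsigned char *msg, size_t len, unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t cp;
        size_t used = decode_utf8(msg + i, len - i, &cp);
        if (used == 0) {            /* malformed: substitute and resync */
            out[n++] = '?';
            i += 1;
            continue;
        }
        out[n++] = (cp <= 0xFF) ? (unsigned char)cp : (unsigned char)'?';
        i += used;
    }
    return n;
}
```

With this approach the two bytes `C3 83` in the message come out as the single Latin-1 byte `C3`, which matches what was read from the input.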

And I am very sure something needs to be done about this to support **localization** of the message strings. As far as I can see, so far we have only tested translated 'block' messages of a sort, and not translated error and warning messages, which will also pass through this **byte-by-byte** `messagePos` service, with very poor results!

As usual, I seek comments on which way to jump!

