Problem to write multi-byte utf-8 text to message output #383

Closed
@geoffmcl

At present this does not happen often, since most warning and error messages contain only ASCII in the range 32 to 126. But tests 427664 and 427672 in the tidy-html5-tests repo report a warning about a bad attribute name, one that does not start with a letter, a-z or A-Z. In these two tests the name starts with an extended ASCII value of 195 (0xc3), so it is an invalid attribute name.

These two tests run with config_default.conf, which contains, among other things, char-encoding: latin1. So when Tidy reads this c3 from the stream, it is valid latin1 (MIME ISO-8859-1), but Tidy converts it to utf-8, c3 83, Tidy's internal default encoding, to store it in the lexer. So far no problem.
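As a minimal sketch of that conversion step (not Tidy's actual code; `latin1_to_utf8` is a hypothetical helper introduced only for illustration), a latin1 byte maps to the Unicode code point of the same value, which takes one or two bytes in utf-8:

```c
#include <stddef.h>

/* Hypothetical helper, not part of Tidy: encode one ISO-8859-1 byte as UTF-8.
 * Latin-1 byte values coincide with Unicode code points U+0000..U+00FF,
 * so the result is at most two bytes long. Returns the number of bytes written. */
static size_t latin1_to_utf8(unsigned char b, unsigned char out[2])
{
    if (b < 0x80) {
        out[0] = b;                               /* ASCII passes through unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation byte: 10xxxxxx */
    return 2;
}
```

For the byte 0xc3 this produces the two bytes c3 83 mentioned above.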

When reporting this invalid attribute name, the now utf-8 name is correctly copied into the warning message string using vsnprintf or vsprintf, which have no problem copying utf-8 bytes into the message.

The problem comes when outputting that formatted message to either stderr or the user's message file. The service messagePos presently outputs the message on a byte-by-byte basis using TY_(WriteChar)( *cp, doc->errout );! There are several problems with this!

  1. On most systems the use of *cp does not protect against sign-extending the character to 0xffffffc3
  2. WriteChar is an encoding service, so it cannot be used on a byte-by-byte basis to output multi-byte utf-8
  3. The user has configured latin1 output, but this doc->errout is presently set to utf-8
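The first point can be illustrated in isolation. This is only a sketch: `widen_unsafe` and `widen_safe` are hypothetical names, the writer is assumed to take an unsigned int, unsigned int is assumed to be 32 bits, and `signed char` is used explicitly to mimic a platform where plain `char` is signed.

```c
/* Hypothetical illustration, not Tidy code.
 * Widening a negative (signed) char to unsigned int sign-extends it. */
static unsigned int widen_unsafe(const signed char *cp)
{
    return (unsigned int)*cp;                 /* byte 0xC3 becomes 0xFFFFFFC3 */
}

/* Going through unsigned char first keeps the value in the range 0..255. */
static unsigned int widen_safe(const char *cp)
{
    return (unsigned int)(unsigned char)*cp;  /* byte 0xC3 stays 0x000000C3 */
}
```

A cast to `unsigned char` at the call site avoids the sign extension, but does not address points 2 and 3.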

The result is that we get the wrong output either way! On a system which sign-extends the byte, we will get EF BF BF EF BF BF for each utf-8 byte, and when the byte is not sign-extended we will get C3 83 C2 83. Both are wrong.
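The non-sign-extended case can be reproduced with a small stand-in for an encoding writer. This is a sketch under assumptions: `put_utf8` and `write_bytewise` are hypothetical names, not Tidy functions, and the encoder only covers code points below U+0800, which is enough for single byte values.

```c
#include <stddef.h>

/* Hypothetical stand-in for an encoding writer: emit code point cp as UTF-8.
 * Only handles cp < 0x800; returns the number of bytes written (0 otherwise). */
static size_t put_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}

/* The faulty pattern: treat each byte of an already-UTF-8 message as if it
 * were a code point, so every non-ASCII byte gets encoded a second time. */
static size_t write_bytewise(const unsigned char *msg, size_t len,
                             unsigned char *out)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++)
        n += put_utf8(msg[i], out + n);
    return n;
}
```

Feeding the two bytes c3 83 through `write_bytewise` yields the four bytes c3 83 c2 83, matching the second wrong output described above.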

Now the solution depends on whether we wish to respect the user's output encoding choice, in this case latin1, or continue to output only utf-8 to the message output.

The message is presently correctly encoded as utf-8, so could be output as a single text stream, and would result in a valid utf-8 message.

It gets a little more complicated if we wish to respect the user's encoding choice. Then we would need to pass the output to WriteChar as complete utf-8 character sequences of up to 4 bytes, so that they are correctly translated, to `latin1` in this case.
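One way this could look, as a rough sketch only: decode each complete utf-8 sequence to a code point first, then hand that code point to the encoder. Here `utf8_decode` and `utf8_to_latin1` are hypothetical helpers, the latin1 step stands in for what WriteChar would do on a latin1 stream, and the decoder does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence (1 to 4 bytes) at s.
 * Stores the code point in *cp and returns the bytes consumed,
 * or 0 if the sequence is truncated or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t len, unsigned int *cp)
{
    size_t need, i;
    unsigned int c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else return 0;                       /* stray continuation or invalid lead */

    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                    /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return need;
}

/* Transcode a UTF-8 message to Latin-1 one whole character at a time.
 * Characters outside Latin-1 and malformed bytes become '?'.
 * out must have room for at least len bytes. */
static size_t utf8_to_latin1(const unsigned char *msg, size_t len,
                             unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned int cp;
        size_t used = utf8_decode(msg + i, len - i, &cp);
        if (used == 0) { cp = '?'; used = 1; }   /* skip one bad byte */
        out[n++] = (cp <= 0xFF) ? (unsigned char)cp : (unsigned char)'?';
        i += used;
    }
    return n;
}
```

With this approach the message bytes c3 83 come out as the single latin1 byte c3, which is what was originally read from the input.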

And I am very sure something needs to be done about this to support localization of the message strings. As far as I can see, so far we have only tested translated 'block' messages, and not translated error and warning messages, which will also pass through this byte-by-byte messagePos service, with very poor results!

As usual, I seek comments on which way to jump!
