Description
At present this does not happen often, since most warning and error messages are presently in the ASCII range 32 to 126. But tests 427664 and 427672 in the tidy-html5-tests repo report a warning about a bad attribute name: one that does not start with a letter, a-z or A-Z. In these two tests the name starts with an extended ASCII value of 195 (0xc3), so it is an invalid attribute name.
When running these two tests, `config_default.conf` is used, which contains, among other things, `char-encoding: latin1`. So when Tidy reads this `c3` from the stream, it is valid `latin1` (MIME `ISO-8859-1`), but Tidy converts it to utf-8, `c3 83`, Tidy's internal default encoding, to store it in the lexer. So far no problem.
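To make the `c3` to `c3 83` step concrete, here is a minimal sketch of how a single latin1 byte maps to utf-8. The function name `latin1_to_utf8` is hypothetical and is not Tidy's actual input code; it only illustrates the arithmetic.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Tidy: encode one ISO-8859-1 byte as
 * utf-8. Writes 1 or 2 bytes to out and returns the number written. */
size_t latin1_to_utf8(unsigned char byte, unsigned char out[2])
{
    if (byte < 0x80) {
        /* ASCII range: identical in latin1 and utf-8 */
        out[0] = byte;
        return 1;
    }
    /* 0x80..0xFF: code point equals the byte value, two-byte utf-8 form */
    out[0] = (unsigned char)(0xC0 | (byte >> 6));
    out[1] = (unsigned char)(0x80 | (byte & 0x3F));
    return 2;
}
```

For the byte `0xC3` this yields the two bytes `C3 83`, matching what ends up in the lexer.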
In reporting this invalid attribute name, the now utf-8 name is correctly copied to the warning message string using `vsnprintf` or `vsprintf`, which also have no problem copying the utf-8 to the message.
The problem comes when outputting that formatted message to either `stderr` or the user's message file. The service `messagePos` presently outputs the message on a byte-by-byte basis using `TY_(WriteChar)( *cp, doc->errout );`! There are several problems with this:
- On most systems the use of `*cp` does not protect against sign extending the character to `0xffffffc3`
- `WriteChar` is an encoding service, so it cannot be used on a byte-by-byte basis to output multi-byte utf-8
- The user has configured `latin1` output, but this `doc->errout` is presently set to utf-8
The result is that we get the wrong output! On a system which sign extends the byte, we will get `EF BF BF EF BF BF` for each utf-8 byte, and when not sign extended we will get `C3 83 C2 83`, both of which are wrong.
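The two failure modes can be reproduced outside Tidy with a small sketch. Everything here is hypothetical: `encode_utf8` is only a stand-in for an encoding writer such as `WriteChar`, `write_bytewise` mimics a byte-by-byte loop, and `sign_extended` assumes a two's complement system with a 32-bit `unsigned int`.

```c
#include <stddef.h>

/* Hypothetical stand-in for an encoding writer: encode code point c as
 * utf-8 into out. Returns the byte count, or 0 if c is out of range. */
size_t encode_utf8(unsigned int c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}

/* The failure mode without sign extension: every byte of an already
 * encoded utf-8 message is treated as a code point and encoded again. */
size_t write_bytewise(const char *msg, unsigned char *out)
{
    const char *cp;
    size_t n = 0;
    for (cp = msg; *cp; ++cp)
        n += encode_utf8((unsigned char)*cp, out + n);
    return n;
}

/* The failure mode with sign extension: a signed byte widened through
 * int before being used as an unsigned code point. */
unsigned int sign_extended(signed char byte)
{
    return (unsigned int)(int)byte;
}
```

Feeding the two bytes `C3 83` through `write_bytewise` produces `C3 83 C2 83`, and `sign_extended` turns the byte `0xc3` (which is -61 as a signed char) into `0xffffffc3`, which is not a valid code point at all.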
Now the solution depends on whether we wish to respect the user's output encoding choice, in this case `latin1`, or continue to output only utf-8 to the message output.
The message is presently correctly encoded as utf-8, so could be output as a single text stream, and would result in a valid utf-8 message.
It gets a little more complicated if we wish to respect the user's encoding choice. Then we would need to pass the output to `WriteChar` as complete utf-8 character sequences of up to 4 bytes, to have them correctly translated, to `latin1` in this case.
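A rough sketch of that second option: decode each complete utf-8 sequence to a code point first, then hand the code point to the output encoder. The names `decode_utf8` and `utf8_to_latin1` are hypothetical, and the latin1 step here is only a stand-in for what `WriteChar` would do with the configured output encoding. The decoder is simplified and does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical decoder: read one utf-8 sequence starting at s and store
 * its code point in *cp. Returns bytes consumed (1 to 4), or 0 if the
 * sequence is malformed or truncated. */
size_t decode_utf8(const unsigned char *s, unsigned int *cp)
{
    size_t len, i;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }
    for (i = 1; i < len; ++i) {
        /* A NUL terminator fails this check, so we never read past it */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Convert a NUL-terminated utf-8 message to latin1, one whole character
 * at a time. Anything latin1 cannot represent becomes '?'. */
size_t utf8_to_latin1(const char *msg, unsigned char *out)
{
    const unsigned char *s = (const unsigned char *)msg;
    size_t n = 0;

    while (*s) {
        unsigned int cp;
        size_t used = decode_utf8(s, &cp);
        if (used == 0) {
            cp = '?';  /* malformed input: substitute and skip one byte */
            used = 1;
        }
        out[n++] = (cp <= 0xFF) ? (unsigned char)cp : (unsigned char)'?';
        s += used;
    }
    return n;
}
```

With this approach the name bytes `C3 83` in the message come out as the single latin1 byte `C3`, which is what the user's input contained. Whether unrepresentable characters should become `?` or something else is a design choice left open here.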
And I am very sure something needs to be done about this to support localization of the message strings. As far as I can see, so far we have only tested translated 'block' messages, and not translated error and warning messages, which will also pass through this byte-by-byte `messagePos` service, with very poor results!
As usual, I seek comments on which way to jump!