Problem to write multi-byte utf-8 text to message output #383
Comments
PS: See Issue 3 in the |
Well, if we accept, agree the message output will always be utf-8 text, then the translation service This can be achieved by the following relatively small patch. Please ignore the removal of the Debug output. This is already now handled in the
Of course the last newline output must still use Further, the acceptance of always utf-8 for the message output seems the best choice, since if the user does not give an output file name then this is to standard Yes, I can read around, and see in my own systems, that Windows can have some problems correctly displaying multi-byte utf-8, despite using And this should be fine for most *nix systems... Of course it will also mean changing the To facilitate testing in other OSes, have pushed this fix to the In the absence of further comments here, or directly on #379, or #380, will merge this branch shortly to |
My expectation is that the output-encoding only affects the actual HTML output; it never occurred to me that it would affect the reporting. I would argue the following:
I.e., I'm fine with your suggestion. |
This needs more work... all message file output is already valid utf-8, and does not need to be |
As in the previous case these messages are already valid utf-8 text, and thus, if output on a byte-by-byte basis, must not use WriteChar, except for the EOL char. Of course this output can be to either a user output file, if configured, otherwise stderr.
Created a
Please check out, and test this branch... thanks... |
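As a rough illustration of the byte-by-byte rule stated above (raw bytes for the already-utf-8 message, character-level output only for the EOL), here is a minimal sketch using plain stdio. The function name `write_utf8_message` is hypothetical and this is not the code in the branch; in Tidy the final newline would go through WriteChar so the configured line ending is applied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not Tidy's API: write a message that is already
   valid utf-8 without re-encoding it, then terminate the line.
   Returns 0 on success, -1 on a write error. */
int write_utf8_message(FILE *fp, const char *msg)
{
    size_t len;

    assert(fp != NULL && msg != NULL);
    len = strlen(msg);

    /* The message bytes go out untouched: no per-byte encoding step. */
    if (fwrite(msg, 1, len, fp) != len)
        return -1;

    /* Only the EOL is written as a character. Assumption: a plain '\n'
       stands in here for the configured newline handling. */
    if (fputc('\n', fp) == EOF)
        return -1;

    return 0;
}
```

Called as `write_utf8_message(stderr, message)`, this leaves a sequence such as `c3 83` intact on output.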
Now merged into master and think all issues here fixed, version 5.1.48, so closing this... |
At present this does not happen often. Most warning and error messages are presently in the ASCII range 32 to 126. But tests 427664 and 427672 in the tidy-html5-tests repo are reporting a warning about a bad attribute name, one that does not start with a 'letter', a-z or A-Z. In the case of these two tests, the name starts with an extended ASCII value of 195 (0xc3), so it is an invalid attr name.
In the running of these two tests, the config_default.conf is used, which contains, among other things, `char-encoding: latin1`. So when Tidy reads this `c3` from the stream, it is valid `latin1`, mime `ISO-8859-1`, but Tidy converts it to utf-8, `c3 83`, Tidy's internal default encoding, to store it in the lexer. So far no problem.
In reporting this invalid attribute name, the now utf-8 name is correctly copied to the warning message string using `vsnprintf` or `vsprintf`, which also have no problem copying the utf-8 to the message.
The problem comes when outputting that formatted message to either `stderr`, or the user's message file. The service `messagePos` presently outputs the message on a byte-by-byte basis using `TY_(WriteChar)( *cp, doc->errout );`! There are several problems with this!
- `*cp` does not protect against sign extending the character to `0xffffffc3`
- a `byte-by-byte` basis cannot be used to output multi-byte utf-8
- the user asked for `latin1` output, but this `doc->errout` is presently set to utf-8
The result is that we get the wrong output! If in a system which sign extends the byte, we will get `EF BF BF EF BF BF` for the two utf-8 bytes, and when not sign extended will get `C3 83 C2 83`, both of which are wrong.
Now the solution depends on whether we wish to respect the user's output encoding choice, in this case `latin1`, or continue to output only utf-8 to the message output.
The message is presently correctly encoded as utf-8, so could be output as a single text stream, and would result in a valid utf-8 message.
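The two failure modes described above can be reproduced outside Tidy with a small sketch. The helper names here are hypothetical and the encoder only handles code points below 0x800, which is enough for this case; it is not Tidy's WriteChar. It shows that treating each byte of `c3 83` as a code point and encoding it again yields `C3 83 C2 83`, and that widening a signed byte yields `0xffffffc3`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini-encoder (code points < 0x800 only).
   Writes the utf-8 form of cp to out and returns the byte count,
   or 0 if cp is outside the range this sketch supports. */
size_t encode_utf8_small(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}

/* The bug pattern: each byte of an already-utf-8 message is treated
   as a separate code point and encoded a second time.
   out must have room for 2 * len bytes. */
size_t reencode_bytewise(const unsigned char *msg, size_t len,
                         unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i)
        n += encode_utf8_small(msg[i], out + n);
    return n;
}

/* What a negative (signed) byte becomes when widened to 32 bits. */
uint32_t widen_signed(signed char c)
{
    return (uint32_t)c;
}
```

Feeding the two bytes `c3 83` to `reencode_bytewise` gives four bytes, `C3 83 C2 83`, and `widen_signed((signed char)-61)` gives `0xffffffc3`, a value no utf-8 encoder can represent as a valid character.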
It gets a little more complicated if we wish to respect the user's encoding choice. Then we would need to pass the output to WriteChar as complete utf-8 character sequences, of up to 4 bytes each, to have them correctly translated, to `latin1` in this case.
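To make that second option concrete, here is a hedged sketch of decoding complete utf-8 sequences before translating them. Both function names are hypothetical, the decoder is simplified (no overlong or surrogate checks), and the `?` fallback for characters outside latin1 is an assumption of this sketch, not what Tidy does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one utf-8 sequence of up to 4 bytes starting at s.
   Stores the code point in *cp and returns the bytes consumed,
   or 0 if the sequence is malformed or truncated. */
size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80)                { *cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 4; }
    else return 0;                  /* stray continuation or invalid lead */

    if (need > len)
        return 0;
    for (size_t i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/* Translate a utf-8 message to latin1, one complete character at a time.
   out must have room for len bytes. Returns the bytes written. */
size_t utf8_to_latin1(const unsigned char *msg, size_t len,
                      unsigned char *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp;
        size_t used = decode_utf8(msg + i, len - i, &cp);

        if (used == 0) {            /* malformed: emit '?' and skip one byte */
            out[n++] = '?';
            i += 1;
            continue;
        }
        /* latin1 covers U+0000..U+00FF; anything else becomes '?' here. */
        out[n++] = (cp <= 0xFF) ? (unsigned char)cp : (unsigned char)'?';
        i += used;
    }
    return n;
}
```

With this approach the stored `c3 83` comes back out as the single latin1 byte `c3`, which is what the user's `char-encoding: latin1` setting would lead them to expect.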
And I am very sure something needs to be done about this to support localization of the message strings. As far as I can see, so far we have only tested sort of translated 'block' messages, and not translated error and warning messages, that will also pass through this byte-by-byte `messagePos` service, with very poor results!
As usual seek comments on which way to jump!