Skip to content

Problem to write multi-byte utf-8 text to message output #383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
geoffmcl opened this issue Mar 17, 2016 · 6 comments
Closed

Problem to write multi-byte utf-8 text to message output #383

geoffmcl opened this issue Mar 17, 2016 · 6 comments

Comments

@geoffmcl
Copy link
Contributor

At present this does not happen often. Most warning and error messages are presently ASCII range 32 to 126. But tests 427664 and 427672 in the tidy-html5-tests repo are reporting a warning about a bad attribute name. One that does not start with a 'letter', a-z or A-Z. In the case of these two tests, the name starts with an extended ASCII value of 195 (0xc3), so is an invalid attr name.

In the running of these two tests, the config_default.conf is used, which contains, among other things, char-encoding: latin1. So when Tidy reads this c3 from the stream, it is valid latin1, mime ISO-8859-1, but Tidy converts it to utf-8, c3 83, Tidy's internal default encoding, to stores in the lexer. So far no problem.

In reporting this invalid attribute name, the now utf-8 name is correctly copied to the warning message string using vsnprintf or vsprintf, which also have no problem copying the utf-8 to the message.

The problem comes when outputting that formatted message to either stderr, or the user's message file. The service messagePos presently outs the message on a byte-by-byte basis using TY_(WriteChar)( *cp, doc->errout );! There are several problems with this!

  1. In most system the use of *cp does not protect against sign extending the character to 0xffffffc3
  2. WriteChar is an encoding service, so can not be used on a byte-by-byte basis to out multi-byte utf-8
  3. The user has configured latin1 output, but this doc->errout is presently set to utf-8

The result is that we get the wrong output! If in a system which sign extends the byte, we will get EF BF BF EF BF BF for each utf-8 byte, and when not sign extended will get C3 83 C2 83, both of which are wrong.

Now the solution depends on whether we wish to respect the users output encoding choice, in this case latin1, or continue to output only utf-8 to the message output.

The message is presently correctly encoded as utf-8, so could be output as a single text stream, and would result in a valid utf-8 message.

It gets a little more complicated if we wish to respect the users encoding choice. Then we would need to pass the output to WriteChar as complete, up to 4 byte utf-8 character sequences, to have them correctly translated, to `latin1' in this case.

And I am very sure something needs to be done about this to support localization of the message strings. As far as I can see, so far we have only tested sort of translated 'block' messages, and not translated error and warning messages, that will also pass through this byte-by-byte messagePos service, with very poor results!

As usual seek comments on which way to jump!

@geoffmcl
Copy link
Contributor Author

PS: See Issue 3 in the test repo for further details...

@geoffmcl
Copy link
Contributor Author

Well, if we accept, agree the message output will always be utf-8 text, then the translation service WriteChar can be simply replaced by the TidyOutputSink *outp->putByte()! As stated, the stream is already valid utf-8.

This can be achieved by the following relatively small patch. Please ignore the removal of the Debug output. This is already now handled in the putByte, and this avoids a duplication in my debug log. -

diff --git a/src/message.c b/src/message.c
index 2328515..6de12bc 100755
--- a/src/message.c
+++ b/src/message.c
@@ -260,30 +260,32 @@ static void messagePos( TidyDocImpl* doc, TidyReportLevel level, uint code,
     if ( go )
     {
         enum { sizeBuf=1024 };
-        char *buf = TidyDocAlloc(doc,sizeBuf);
+        TidyOutputSink *outp = &doc->errout->sink;
+        char *buf = (char *)TidyDocAlloc(doc,sizeBuf);
         const char *cp;
+        byte b;
         if ( line > 0 && col > 0 )
         {
             ReportPosition(doc, line, col, buf, sizeBuf);
-#if !defined(NDEBUG) && defined(_MSC_VER)
-            SPRTF("%s",buf);
-#endif
             for ( cp = buf; *cp; ++cp )
-                TY_(WriteChar)( *cp, doc->errout );
+            {
+                b = (*cp & 0xff);
+                outp->putByte( outp->sinkData, b );
+            }
         }

         LevelPrefix( level, buf, sizeBuf );
-#if !defined(NDEBUG) && defined(_MSC_VER)
-            SPRTF("%s",buf);
-            SPRTF("%s\n",messageBuf);
-#else
         for ( cp = buf; *cp; ++cp )
-            TY_(WriteChar)( *cp, doc->errout );
-
+        {
+            b = (*cp & 0xff);
+            outp->putByte( outp->sinkData, b );
+        }
         for ( cp = messageBuf; *cp; ++cp )
-            TY_(WriteChar)( *cp, doc->errout );
+        {
+            b = (*cp & 0xff);
+            outp->putByte( outp->sinkData, b );
+        }
         TY_(WriteChar)( '\n', doc->errout );
-#endif
         TidyDocFree(doc, buf);
     }
     TidyDocFree(doc, messageBuf);

Of course the last newline output must still use WriteChar to effect the newline translation that is in effect. And maybe the extraction of the value and and-ing with 0xff is over-kill, since b is a byte already, but reads cleanly this way...

Further, the acceptance of always utf-8 for the message output seems the best choice, since if the user does not give an output file name then this is to standard stderr...

Yes, I can read around, and see in my own systems, that Windows can have some problems correctly displaying multi-byte utf-8, despite using SetConsoleOutputCP(CP_UTF8);, and other changes..., I now think it depends on some deeper registry values, or something, which I do not particularly want to effect on my basic 'english' system... but should be fine on systems that already use multi-byte utf-8, like Chinese, Japanese, etc...

And this should be fine for most *nix systems...

Of course it will also means changing the expects in tests for these tests...

To facilitate testing in other OSes, have pushed this fix to the issue-380 branch.

In the absence of further comments here, or directly on #379, or #380, will merge this branch shortly to master.

@balthisar
Copy link
Member

My expectation is that the output-encoding only effect the actual HTML output; it never occurred to me that it affect the reporting. I would argue the following:

  • HTML output still has its use cases for different encoding formats.
  • For reporting this is the 20th century and perhaps we should encourage the adoption of UTF universally, other than for HTML output.

I.e., I'm fine with your suggestion.

@geoffmcl
Copy link
Contributor Author

This needs more work... all message file output is already valid utf-8, and does not need to be converted...

geoffmcl added a commit that referenced this issue Mar 23, 2016
As in the previous case these messages are already valid utf-8 text, and
thus, if output on a byte-by-byte basis, must not use WriteChar, except
for the EOL char.

Of course this output can be to either a user ouput file, if configured,
otherwise stderr.
geoffmcl added a commit that referenced this issue Mar 23, 2016
@geoffmcl
Copy link
Contributor Author

Created a fr.po branch with -

  • another fix for message output
  • a language_fr.h WIP for testing

Pleas checkout, and test this branch... thanks...

geoffmcl added a commit that referenced this issue Mar 24, 2016
As in the previous case these messages are already valid utf-8 text, and
thus, if output on a byte-by-byte basis, must not use WriteChar, except
for the EOL char.

Of course this output can be to either a user ouput file, if configured,
otherwise stderr.
geoffmcl added a commit that referenced this issue Mar 24, 2016
geoffmcl added a commit that referenced this issue Mar 27, 2016
@geoffmcl
Copy link
Contributor Author

Now merged into master and think all issues here fixed, version 5.1.48, so closing this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants