Skip to content
This repository was archived by the owner on Jun 30, 2021. It is now read-only.

Problem with tests 427664 and 427672 on certain OSes! #3

Closed
geoffmcl opened this issue Mar 13, 2016 · 8 comments
Closed

Problem with tests 427664 and 427672 on certain OSes! #3

geoffmcl opened this issue Mar 13, 2016 · 8 comments

Comments

@geoffmcl
Copy link
Contributor

There is a difference in output message text using ARM / Raspberry Pi 2, RPI, first reported as Issue 258, Issue 266, Issue 269, by @vielmetti, back in Sep 13, 2015. Thank you for that report. Maybe others...

One or both may also be a problem on the MAC OS X, reported by @balthisar. To be verified.

First to try to examine the exact reason for these two test...

Test 427664 - now https://sourceforge.net/p/tidy/bugs/4/
#4 Missing attr values cause NULL segfault
Created: 2001-05-27 Creator: Terry Teague
Test 427672 - now https://sourceforge.net/p/tidy/bugs/10/
#10 Non-std attrs w/multibyte names segfault
Created: 2001-05-27 Creator: Terry Teague

Both these test inputs existed in SF CVS source, without a special config file. Both reported a segfault at that time! And both input files seems exactly the same as in this github tests repo. And in a binary compare, namely in_427664.html == case-427664.html and in_427672.html == case-427672.html! So no change has been made in the inputs. Of course, the SF CVS has no `testbase-expects' output to compare with...

However, re-running tidy04aug00, even adding the suggested -utf8 option, on each file, does NOT produce a segfault, as far as I can see...

But running tidy04aug00, for which I do not have the source, on both inputs, using DrMemory, does show it has -

 Error #1: UNADDRESSABLE ACCESS beyond top of stack: reading 4 byte(s)

But this is not exacly a segfault due to a NULL pointer! And repeating the tests using tidy2000, for which we do have the source, does not show any problems...

And while, for some reason I can not yet run DrMemory using the current tidy 5.1.45++, it also appears to not have a segfault! And need to also try in linux using valgrind, ASAN, testing...

But, for sure, that segfault seems to have been solved, the reason for the two tests.

So there remains this mystery of the character encoding differences in the message output in certain OS environments, which still need to be solved.

What is in the testbase input, and testbase-expects?

Essentially both input file have <body name="xx">. A comment in the files says the name is supposed to be 2 bytes hex c3 87, but it is not! Now maybe this is a corruption from a long way back, but even in SF CVS source the name is a 4 byte sequence of C3 31 2F 32.

Thus, in their present state, both inputs do not verify as valid utf-8 text. They would if changed back to the c3 87 given in the comment, and yet to test if that changes the situation.

In parsing this document, tidy finds this 4 byte sequence is not a valid attribute name, and outputs a warning. Now it is the value output for that name in the warning message differs in RPI OS. And maybe in OS X, still to be verified.

Tidy in Windows, and Ubuntu linux consistently outputs a 9 byte sequence EF BF BF EF BF BF 31 2F 32, and this is what is in testbase-expects, so the compare is exact. No problem.

While Tidy in RPI outputs, in testbase-results a 7 byte sequence c3 83 c2 83 31 2f 32, so the diff fails. A problem.

What can we do?

  • Reduce the attribute name to just 1/2, which is still invalid, so keeps the tests meaning.
  • Change the file back to valid utf-8 c3 87, change the expected accordingly - to be tested.
  • Maybe a fix in Tidy code could force RPI to use EF BF BF output.
  • If also a problem in OS X, maybe exclude the 2 tests.
  • If only in RPI, be ready to explain that this difference exists in these 2 tests.
  • Other choices?

I seek ideas and comments on what would be best?

As previously expressed, I think it is important that we have a consistent set of tests across all OSes, and to not have to try and explain a difference every time someone stumbles across it.

Help Needed

@geoffmcl
Copy link
Contributor Author

Ok, I just added a test 427664-1 to the tests repo, in which I tried the 2nd option I listed above, namely to correct the utf-8 back to the value in the comment in the file. That is cases/testbase/case-427664-1.html now contains valid utf-8 of c3 87, which is actually an uppercase C cedilla, a valid utf-8 character.

Then added a cases/testbase-results, case-427664-1.html and case-427664-1.txt, generated in win32, and added the test to cases/testbase/_manifest.txt, and pushed... remember to the refactor branch only at this time...

Now this new test passes both win32 and Ubuntu linux testing, no problems... but...

BUT, RPI still insists on flagging a difference in the message output! ;=((

The testbase-expects/case-427664-1.txt contains, what I now think of as a bad sequence EF BF BF EF BF BF EF BF BF EF BF BF EF BF BF value, mirrored in the results generated by WIN32 and Ubuntu, so no difference, while RPI generates c3 83 c2 83 c3 a2 c2 80 c2 a1, so shows a difference. UGH!

Now the way I read it, neither of test message outputs are correct!!!

Why is the expects output not exactly the valid utf-8 as per the input, of c3 87? And in this case, RPI seems to be a little closer than win32 and ubuntu, but also still very wrong???

This points to a real character encoding problem in tidy, for message output!

Of course the invalid attribute is not output to the html, so we can not see what tidy would do in that case... another test to be done...

But it found a valid utf-8 input, but flagged it as a non-valid attribute name, ok, no problem, but why does it not repeat the same valid utf-8 in the message output? Or have I done something wrong in the tests... read, analysed the wrong file, etc... something stupid like that...

Anyway, continuing to explore this problem... but any help, pointers, ideas, very welcome...

@geoffmcl
Copy link
Contributor Author

Back to looking more closely at the attribute name as a 4 byte sequence of C3 31 2F 32 in the original case-427664.html input... and considering exactly what should be in the expects...

When the c3 is decoded from the stream, tidy is in char-encode mode of latin1, so this is returned un-translated from the stream, but to store it in the lexer, it is stored as a 2 byte utf-8 sequence, tidy's internal default -

    else if (c <= 0x7FF)  /* 110X XXXX  two bytes */
    {
        buf[0] = (tmbchar) ( 0xC0 | (c >> 6) );
        buf[1] = (tmbchar) ( 0x80 | (c & 0x3F) );
        bytes = 2;
    }

Namely, c3 83, then the next 3 bytes as 31 2f 32, so that is good so far. So this now 5 byte utf-8 sequence is duplicated as the attr name on finding the =.

Now the attribute name fails in IsValidAttrName(attribute), in that the first character is not a letter, and tidy begins to emit TY_(ReportAttrError)(TidyDocImpl* doc, Node *node, AttVal *av, uint code).

And tidy uses messageNode(doc, TidyWarning, code, node, fmt, tagdesc, name, value);, which uses retval = vsnprintf(buffer, count - 1, format, args);, correctly copies the 5 byte utf-8 sequence to the message.

Ok, there's the problem!

While the warning message is correctly encoded as utf-8, the messagePos( TidyDocImpl* doc, TidyReportLevel level, uint code, int line, int col, ctmbstr msg, va_list args ) service uses TY_(WriteChar)( *cp, doc->errout ); on a byte-by-byte basis! UGH!

The StreamOut* out->encoding is correctly set to out->encoding == UTF8, which is what the stream is, but such a write can not be on a byte-by-byte basis, because it will use TY_(EncodeCharToUTF8Bytes)( c, NULL, &out->sink, &count );, so given the single first byte, c3, it will find this as bad utf-8, and insert a substitute bad character sequence.

    /* replacement char 0xFFFD encoded as UTF-8 */
    PutByte(0xEF, out); PutByte(0xBF, out); PutByte(0xBF, out);

Also during this careful analysis I found at least two instances where loading this c3 into a uint will produce a sign extended value of 0xffffffc3. In most cases that would not particularly matter, but here it does matter!

And ensuring there is no sign extention helps a little in this case, the message output still contains the bad sequence of bytes C3 83 C2 83 31 2F 32! So that fully answers why RPI is different. In that OS/architecture the byte does not get sign extended!

So that clearly explains why there is a difference, but does nothing about fixing the problem, which is there in all OSes/architectures!

Now, presumably, if this was not an attribute name, but just some text, then in outputting the utf-8 tidy would translate it back to latin1 in the html. That is still be be tested and verified.

But that open another question. Should the message file be utf-8 encoded? In this case the user has a configuration of char-encoding: latin1. Should not the message file also obey the same request.

Anyway, next is to explore carefully how tidy correctly outputs utf-8 to html. Obviously it can not be using a byte-by-byte output! There must be a way to correctly sequence utf-8...

Or maybe instead of writing out these message byte-by-byte, they are just written whole?

In my debug mode, I do that to the log file, and correctly get C3 83 31 2F 32 in the log file. Maybe it is wrong in principle to put this message output through an encoding service. The messages are always valid utf-8 already, unless of course we want to make that second change. Namely to output the message file per the users output-encoding request...

And to consider filing a tidy issue on tidy-html5...

Some things to think about here! Some feedback would be very encouraging...

@geoffmcl
Copy link
Contributor Author

Have opened an Issue 383 in the tidy-html5 repo to address this problem...

geoffmcl added a commit that referenced this issue Mar 18, 2016
@geoffmcl
Copy link
Contributor Author

After applying a 'fix' in tidy-html5, in the issue-380 branch, updated the expects to suit this version 5.1.45-Exp3, and set cases_version.txt accordingly

Please checkout and build tidy-html5, issue-380, v.5.1.45-Exp3, and re-run all the regression tests. So far in Windows get a 100% pass...

@geoffmcl
Copy link
Contributor Author

And it 100% passed in Ubuntu 14.04 LTS - YIPEE ;=)).

And added a convenient run-tests.sh script...

Now to try Raspberry Pi 2, running Raspbian GNU/Linux 8.0 (jessie), with ARMv7 Processor rev 5 (v7l)...

@geoffmcl
Copy link
Contributor Author

Well that was a little slower! Had to modify the scripts to get things to run at all... seems again the bash is not exactly the same, or something else was wrong... so created a run-testsG.sh, testallG.sh, _environG.sh and testoneG.sh to get it all working smoothly...

And boy do I hate using pico as an editor in my remote putty connection to RPI... feels like back in the dark ages ;=))

AND of course had to remember to add another tidy config of -lang en_us, since in that system, due to the locale, it selects en_gb, which gives a difference on all message outputs...

But eventually got it all working...

And it PASSED 100% - all results matched exacly to all expects... PHEW!

Have left the new case-427664-1.html there, but had to create its own config, case-427664-1.conf, since that test contains only valid utf-8, so can not use the default config which uses latin1 as the encoding...

If others could test and report in the MAC OS X, and/or in every other system, maybe this issue can be closed ;=))

@balthisar
Copy link
Member

@geoffmcl, sorry I've been slow to get to this, but I was able to do this much on Mac OS X:

testall.sh: Done 225 tests - see /Users/jderry/Development/htacg/tidy-html5-tests/cases/testbase-results.txt
run-tests.sh: Running 'diff -ua /Users/jderry/Development/htacg/tidy-html5-tests/cases/testbase-expects /Users/jderry/Development/htacg/tidy-html5-tests/cases/testbase-results'
run-tests.sh: Appear to have PASSED test 2
run-tests.sh: See full results in /Users/jderry/Development/htacg/tidy-html5-tests/cases/testbase-results.txt

@geoffmcl
Copy link
Contributor Author

@balthisar thanks for testing in OS X, and advising the compare fully PASSES ;=))

I have now also conducted the tests in 2 more systems -

  • Ubuntu 14.04.3 LTS trusty - Linux 3.13.0-77-generic i686 - 32-bit - PASSED
  • Windows XP - Media Center Edition - V2002 Service Pack 3 - 32-bit - PASSED

That is now a full pass in six (6) systems!

As a git test, pushed some small changes from RPI... seems all working now...

  • _environment.sh - use relative paths, but left code there if users really want this
  • testone.sh - add default -lang en_us config

Perhaps it is time to merge this refactor branch into master?

Will close this and shortly merge issue-380 to tidy-html5 master...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants