Skip to content

Regression in 5.1.32 #342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
serjepatoff opened this issue Jan 5, 2016 · 13 comments
Closed

Regression in 5.1.32 #342

serjepatoff opened this issue Jan 5, 2016 · 13 comments
Labels
Milestone

Comments

@serjepatoff
Copy link

Hello,

We have found a regression that shows up when processing some htmls. When using version 5.1.32 (commit 9942856) content is corrupted. In ealier version 5.1.14 (commit 7e69ceb) everything is OK.

Options used are as follows:

tidyOptSetBool(tdoc, TidyDropEmptyElems, no);
tidyOptSetInt(tdoc, TidyWrapLen, 0);
tidyOptSetBool(tdoc, TidyXhtmlOut, yes);
tidyOptSetBool(tdoc, TidyMark, no);
tidyOptSetInt(tdoc, TidyDoctypeMode, TidyDoctypeUser);
tidyOptSetBool(tdoc, TidyNumEntities, yes);
tidySetCharEncoding(tdoc, "utf8");
tidyOptSetBool(tdoc, TidyForceOutput, yes);
tidyOptSetBool(tdoc, TidyShowWarnings, no);
tidyOptSetInt(tdoc, TidyShowErrors, 0);

There are 2 html content samples in attachment. For each sample there are
(1) original html,
(2) html after tidying with 5.1.14,
(3) html after tidying with 5.1.32,
(4) screenshot of content after tidying with 5.1.14 in iOS UIWebView,
(5) screenshot of content after tidying with 5.1.32 in iOS UIWebView.

Here is tree listing of files in attachment, hope that naming would be self-explanatory to you.

.
├── html
│   ├── Content1_original
│   ├── Content1_after_5_1_14
│   ├── Content1_after_5_1_32
│   ├── Content2_original
│   ├── Content2_after_5_1_14
│   └── Content2_after_5_1_32
└── png
    ├── Content1_after_5_1_14.png
    ├── Content2_after_5_1_14.png
    ├── Content1_after_5_1_32.png
    └── Content2_after_5_1_32.png

Attachment.zip

Thank you very much for your attention.

@balthisar
Copy link
Member

@serjepatoff, I'll track it down when I can. Thanks for the report.

@geoffmcl
Copy link
Contributor

geoffmcl commented Jan 7, 2016

@serjepatoff, yes thanks for the report... took me some time to set up a repeatable test using a 5.1.14 exe... compare the outputs... etc...

And can confirm there is a change, even a loss, in output, that happened some time after 5.1.14... and is still there in the current 5.1.33Exp... will try to narrow that down... maybe by building successives versions, commit by commit, to see when this specifically occurred...

It would be really good if you could trim down the contents of say Content1_original, now my input5\in_342-1.html, to the minimum, and still show just one of the problems, like say the missing name Michael Kuzmin...

This would really assist in debugging... helps narrow down the what exactly is going wrong...

About that time there was quite a change, see #65, but may not be related. Just to be sure could you try with --skip-nested no, or tidyOptSetBool(tdoc, TidySkipNested, no);. Does this change anything? Probably not... just guessing...

Will try to set up the same test, and other tests... and get back soonest...

@geoffmcl geoffmcl added the Bug label Jan 7, 2016
@geoffmcl geoffmcl added this to the 5.2 milestone Jan 7, 2016
@geoffmcl
Copy link
Contributor

geoffmcl commented Jan 7, 2016

@serjepatoff, have narrowed it down to a change after 5.1.24, commit 496c81c, Date: Wed Nov 18 20:02:54 2015 +0100

That is to say, it appears Tidy 5.1.24 outputs the same as 5.1.14...

Tidy version 5.1.25, commit 2388fb0, starts to be different... and it looks like there were two, or more, stages in the changes...

You can checkout a source for each commit, like this -

$ git checkout 496c81c48d

This gets the source at that moment in time, 5.1.24 in this case, and you can check the version.txt file, and build a tidy for testing...

I am certainly developing a collect of versions - 5.1.14, 5.1.17, 5.1.19, 5.1.22, 5.1.23, 5.1.24, 5.1.25, 5.1.27, 5.1.33Exp ;=))

Maybe something to do with Issue #307, #167, #169 - regression of nestd anchors - your sample 1 document certainly appears to have some anchors not closed... but the hrefs are large and hard to follow...

Will look deeper as time permits, but maybe this will assist in tracking this down... and putting together a simple sample for debug...

@geoffmcl
Copy link
Contributor

geoffmcl commented Jan 9, 2016

@serjepatoff, yes, it had everything to do with the above patches, which began allowing mixed content in an anchor, as per the HTML5 spec...

So your first sample does not have a doctype, so tidy would stay in html5 mode... and treat the page as html5...

But your config does set the doctype - tidyOptSetInt(tdoc, TidyDoctypeMode, TidyDoctypeUser); - but this was not being looked at...

So it is a simple case, if no doctype, and the TidyDoctypeMode is not 'auto', the default, nor 'html5', then flip tidy back into html4-- mode...

In this mode, tidy will again force an anchor closed as soon as it meets another element, not allowed in an anchor... and we are back to 5.1.14 anchor like parsing...

The simple patch so far is -

diff --git a/src/parser.c b/src/parser.c
index e751130..c2d78b0 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -4744,12 +4744,21 @@ void TY_(ParseDocument)(TidyDocImpl* doc)
         /*\
          *  #72, avoid MISSING_DOCTYPE if show-body-only. 
          *  #191, also if --doctype omit, that is TidyDoctypeOmit
+         *  #342, adjust tags to html4-- if not 'auto' or 'html5'
         \*/
-        if (!TY_(FindDocType)(doc) && !showingBodyOnly(doc)) 
+        if (!TY_(FindDocType)(doc)) 
         {
             ulong dtmode = cfg( doc, TidyDoctypeMode );
-            if (dtmode != TidyDoctypeOmit)
+            if ((dtmode != TidyDoctypeOmit) && !showingBodyOnly(doc))
                 TY_(ReportError)(doc, NULL, NULL, MISSING_DOCTYPE);
+            if ((dtmode != TidyDoctypeAuto) && (dtmode != TidyDoctypeHtml5))
+            {
+                /*\
+                 *  Issue #342 - if not doctype 'auto', or 'html5'
+                 *  then reset mode htm4-- parsing
+                \*/
+                TY_(AdjustTags)(doc); /* Dynamically modify the tags table to html4-- mode */
+            }
         }
         TY_(InsertNodeAtEnd)( &doc->root, html);
         TY_(ParseHTML)( doc, html, IgnoreWhitespace );

If you are testing with command line tidy, then you need to add say the command --doctype 'user', or add doctype: "user" to the config...

I have still to test it on your second sample, but expect it to work... I will get around to adding that patch to the repo, but that may be a few days, unless someone beats me to it with a PR ;=))

It would be appreciated if you could add this patch, compile from source, test and report... thanks...

@geoffmcl
Copy link
Contributor

geoffmcl commented Jan 9, 2016

Oops, that patch is against the issue-329 branch, which is where I was working at the time...

To get it -

$ git checkout issue-329

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 1, 2016

@serjepatoff eventually got around to pushing this to master...

Would be appreciated if you could pull master 5.1.35 plus, and test...

And if this fixes this issue close it... thanks...

@geoffmcl
Copy link
Contributor

No further comment for over a week, so closing this for now...

Please fell free to reopen, or raise another issue... thanks...

@serjepatoff
Copy link
Author

Thank you very much Geoff. And sorry for absence. We had no time to test the fixes because of tight schedule of our product release, so we lived with stable 4.9.10 until now. Now we have a little time and I'm starting to test new Tidy.

@serjepatoff
Copy link
Author

I've just integrated 5.1.36issue-373(9cf97d5), checked up many samples and everything became OK, except for one html where tidying breaks table layout.
report2.zip

@geoffmcl
Copy link
Contributor

@serjepatoff no problem, and thank you for retesting...

I have now updated the issue-373 branch with all the latest from master, and changed the version to 5.1.41issue-373 which I used for re-testing...

Concerning your zip, I assume there was some mistake in the doc1_before.html since it started with an img tag... that would cause a real mess in parsing so removed it... Also since the patch made was for when there was no doctype, but a user doctype given, also removed the html5 doctype...

Then to match your config used a config file below. Note commented out the tidy-mark since I wanted to be sure I was using the correct 5.1.41issue-373 version

// cfg_342-4.txt
drop-empty-elements: no
wrap: 0
output-xhtml: yes
// tidy-mark: no
doctype: "user"
numeric-entities: yes
char-encoding: utf8
indent: yes

And yes something is wrong with the second? table width, and the final text is right aligned, so have re-opened this issue...

Will see if I can find what is wrong... but it would be appreciated if you could update to this latest and re-check your samples are all ok, except the case you posted... thanks...

@geoffmcl geoffmcl reopened this Feb 18, 2016
@serjepatoff
Copy link
Author

Geoff, there is no mistake in doc1_before.html

It is real-world case found "in the wild".
Unfortunately some content makers (especially e-mail html template designers) do not respect rules and standards, abusing web-engines as much as they can.

As far as I can guess, it is 1x1 pseudo image used to track html email opening.

@serjepatoff
Copy link
Author

#And yes something is wrong with the second?

Yes, bottom part of table falls out of bounds.
screen shot 2016-02-18 at 21 06 09
screen shot 2016-02-18 at 21 02 42

@balthisar
Copy link
Member

@serjepatoff, did you mean to close this? This means you agree Tidy is behaving correctly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants