Segfault on nasty deep webpage #633

Closed
usamec opened this issue Oct 18, 2017 · 7 comments

usamec commented Oct 18, 2017

The Tidy tool (version 5.4.0) segfaults on a nasty deep webpage (sv.stargate.wikia.com/wiki/M2J).
Command line: tidy -ashtml bad4.html -output wat.html --char-encoding "utf8"

@balthisar
Member

Duplicate of #343 and others. I'll leave this one open this time in case someone wants to contribute a max stack depth option; otherwise, refactoring the recursive process isn't really in the cards right now.
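
For anyone tempted to pick that up, the guard itself is simple in principle. Here is a minimal sketch in Python rather than Tidy's actual C (the names MAX_DEPTH and parse are hypothetical, not Tidy's API) of a recursion counter that fails cleanly at a configurable limit instead of overflowing the stack:

```python
# Minimal sketch of a configurable recursion guard, in Python for brevity.
# Tidy's ParseBlock is C; MAX_DEPTH and parse are hypothetical names.

MAX_DEPTH = 500  # the imagined --max-parse-depth option; Python's own
                 # recursion limit (~1000) plays the role of the C stack here

class TooDeep(Exception):
    """Raised instead of letting the call stack blow up."""

def parse(s, i=0, depth=0):
    """Parse a nest of bracket pairs ('(' standing in for <dl><dd>).
    Returns the index just past the group that starts at i."""
    if depth > MAX_DEPTH:
        raise TooDeep(f"markup nested deeper than {MAX_DEPTH}; aborting cleanly")
    while i < len(s) and s[i] == "(":
        i = parse(s, i + 1, depth + 1)  # recurse into the child group
    if i < len(s) and s[i] == ")":
        i += 1                          # consume the matching close
    return i

parse("()" * 10)      # shallow input: fine
parse("(" * 100_000)  # raises TooDeep instead of crashing the process
```

In C the failure mode without such a counter is the segfault reported above; the guard just converts it into a diagnostic and an early exit.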

@geoffmcl
Contributor

@usamec wow, using wget, that is something like a 2.3 MB file, over 100,000 lines, some of them really long...

Tidy exhausts the default Windows stack after some 3,551 iterations into ParseBlock, parsing a massive, meaningless, seemingly empty <dl><dd><dl><dd><dl><dd><dl><dd><dl><dd><dl><dd>...etc, etc, etc sequence...
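
If anyone wants to reproduce this locally without fetching the page, a file of the same shape is easy to synthesize. A sketch (the depth of 100,000 and the filename are illustrative, chosen to roughly match the sequence described):

```python
# Synthesize a test file shaped like the offending page: one enormous line
# of repeated <dl><dd> pairs with no closing tags. Depth and filename are
# illustrative; adjust the depth to your platform's stack size.
depth = 100_000
with open("bad4.html", "w") as f:
    f.write("<html><body>")
    f.write("<dl><dd>" * depth + "Hi")
    f.write("</body></html>\n")
```

Feeding that to the same tidy command line as above should hit the same out-of-stack crash on 5.4.0.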

Even my default Notepad++ editor has great difficulty with this file... and it seemed a near-endless load in Chrome, which did not end until I closed the tab...

As @balthisar has pointed out, this is a repeat of #343 and others, and no work is being done either on a full rewrite of Tidy to remove the recursive process or on adding some kind of configurable recursion counter that would stop the process before the inevitable out-of-stack CRASH...

Accordingly, marking this as Won't Fix and closing it... but as usual, thanks for pointing this out again... we just hope there are not too many such nasty deep beasts in the wild... thanks...

geoffmcl added this to the 5.5 milestone Oct 25, 2017
@geoffmcl
Contributor

@usamec FWIW, of that file's 111,563 lines (some 83 blank, and 46 containing only spaces), one line is some 881,372 characters long, which I think is what gave my editor such a headache. It is made up entirely of the repeated sequence <dl><dd><dl><dd>...etc, etc, etc...<dl><dd>Hi... meaningless, stupid HTML...


usamec commented Oct 27, 2017

Agreed on the stupid. The problem is that sometimes you hit this page (e.g. when doing some crawl of the web) and you want to parse it without getting stuck on it. And every tool I have tried gets stuck on it. I can patch many of them to stop after some time passes, or when the stack gets too deep, or whatever, but it would be nice to have something that works in time linear in the page size, without dirty hacks.
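
For the crawling use case, an event-driven (SAX-style) parser sidesteps the problem: it tracks nesting as a counter rather than as call-stack frames, so pathological depth costs almost nothing. A minimal sketch with Python's standard-library html.parser, whose tokenizer is a loop rather than a recursion (the DepthCounter class is just for illustration):

```python
from html.parser import HTMLParser

class DepthCounter(HTMLParser):
    """Event-driven parse: nesting depth costs one integer, not stack frames."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

p = DepthCounter()
p.feed("<dl><dd>" * 100_000 + "Hi")  # runs in linear time, flat memory
print(p.max_depth)                   # 200000
```

The trade-off is that html.parser does not build a tree or imply missing end tags the way Tidy does, which is why its memory stays flat on input like this.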

@geoffmcl
Contributor

@usamec yeah, sorry, that tool cannot be Tidy...

I do not exactly understand why you are crawling the web and parsing random pages, but again, FWIW, my simple personal Perl HTML parser script was able to handle that file, and gave me the stats... HTML parsing in Perl is not all that difficult. Maybe you could write a script to do that... just an idea...


usamec commented Oct 29, 2017

@geoffmcl it was part of the Common Crawl (which includes some really crazy pages). I have not tried Perl, since I do not use it, but thanks for the idea.

@balthisar
Member

5.9.9 fixes this.
