Segfault on nasty deep webpage #633

Closed
usamec opened this issue Oct 18, 2017 · 7 comments

usamec commented Oct 18, 2017

The Tidy tool (version 5.4.0) segfaults on a nasty deep webpage (sv.stargate.wikia.com/wiki/M2J).
Command line: tidy -ashtml bad4.html -output wat.html --char-encoding "utf8"

@balthisar
Member

Duplicate of #343 and others. I'll leave this one open this time in case someone wants to contribute a max stack depth option; otherwise, refactoring the recursive process isn't really in the cards right now.
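
For anyone tempted to pick that up, the guard itself is simple in principle. Here is a minimal sketch in Python rather than Tidy's actual C (the names MAX_DEPTH and parse are hypothetical, not Tidy's API) of a recursion counter that fails cleanly at a configurable limit instead of overflowing the stack:

```python
# Minimal sketch of a configurable recursion guard, in Python for brevity.
# Tidy's ParseBlock is C; MAX_DEPTH and parse are hypothetical names.

MAX_DEPTH = 500  # the imagined --max-parse-depth option; Python's own
                 # recursion limit (~1000) plays the role of the C stack here

class TooDeep(Exception):
    """Raised instead of letting the call stack blow up."""

def parse(s, i=0, depth=0):
    """Parse a nest of bracket pairs ('(' standing in for <dl><dd>).
    Returns the index just past the group that starts at i."""
    if depth > MAX_DEPTH:
        raise TooDeep(f"markup nested deeper than {MAX_DEPTH}; aborting cleanly")
    while i < len(s) and s[i] == "(":
        i = parse(s, i + 1, depth + 1)  # recurse into the child group
    if i < len(s) and s[i] == ")":
        i += 1                          # consume the matching close
    return i

parse("()" * 10)      # shallow input: fine
parse("(" * 100_000)  # raises TooDeep instead of crashing the process
```

In C the failure mode without such a counter is the segfault reported above; the guard just converts it into a diagnostic and an early exit.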

@geoffmcl
Contributor

@usamec wow, using wget, that is something like a 2.3 MB file, over 100,000 lines, some of them really long...

Tidy exhausts the default Windows stack after some 3,551 iterations into ParseBlock, parsing a massive, meaningless, seemingly empty <dl><dd><dl><dd><dl><dd><dl><dd><dl><dd><dl><dd>...etc, etc, etc sequence...
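
If anyone wants to reproduce this locally without fetching the page, a file of the same shape is easy to synthesize. A sketch (the depth of 100,000 and the filename are illustrative, chosen to roughly match the sequence described):

```python
# Synthesize a test file shaped like the offending page: one enormous line
# of repeated <dl><dd> pairs with no closing tags. Depth and filename are
# illustrative; adjust the depth to your platform's stack size.
depth = 100_000
with open("bad4.html", "w") as f:
    f.write("<html><body>")
    f.write("<dl><dd>" * depth + "Hi")
    f.write("</body></html>\n")
```

Feeding that to the same tidy command line as above should hit the same out-of-stack crash on 5.4.0.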

Even my default Notepad++ editor has great difficulty with this file... and it seemed a near-endless load in Chrome, which did not end until I closed the tab...

As @balthisar has pointed out, this is a repeat of #343 and others, and no work is being done either on a full rewrite of Tidy to remove the recursive process or on adding some kind of configurable recursion counter that would stop the process before the inevitable out-of-stack CRASH...

Accordingly, marking this as Won't Fix and closing it... but as usual, thanks for pointing this out again... we just hope there are not too many such nasty deep beasts in the wild... thanks...

geoffmcl added this to the 5.5 milestone Oct 25, 2017
@geoffmcl
Contributor

@usamec FWIW, of that file's 111,563 lines (some 83 blank, and 46 containing only spaces), one line is some 881,372 characters long, which I think is what gave my editor such a headache. It is made up entirely of the repeated sequence <dl><dd><dl><dd>...etc, etc, etc...<dl><dd>Hi... meaningless, stupid HTML...


usamec commented Oct 27, 2017

Agreed on the stupid. The problem is that sometimes you hit this page (e.g. when doing some crawl of the web) and you want to parse it without getting stuck on it. And every tool I have tried gets stuck on it. I can patch many of them to stop after some time passes, or when the stack gets too deep, or whatever, but it would be nice to have something that works in time linear in the page size, without dirty hacks.
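
For the crawling use case, an event-driven (SAX-style) parser sidesteps the problem: it tracks nesting as a counter rather than as call-stack frames, so pathological depth costs almost nothing. A minimal sketch with Python's standard-library html.parser, whose tokenizer is a loop rather than a recursion (the DepthCounter class is just for illustration):

```python
from html.parser import HTMLParser

class DepthCounter(HTMLParser):
    """Event-driven parse: nesting depth costs one integer, not stack frames."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

p = DepthCounter()
p.feed("<dl><dd>" * 100_000 + "Hi")  # runs in linear time, flat memory
print(p.max_depth)                   # 200000
```

The trade-off is that html.parser does not build a tree or imply missing end tags the way Tidy does, which is why its memory stays flat on input like this.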

@geoffmcl
Contributor

@usamec yeah, sorry, that tool cannot be Tidy...

I do not exactly understand why you are crawling the web and parsing random pages, but again, FWIW, my simple personal Perl HTML parser script was able to handle that file, and gave me the stats... HTML parsing in Perl is not all that difficult. Maybe you could write a script to do that... just an idea...


usamec commented Oct 29, 2017

@geoffmcl it was part of the Common Crawl (which includes some really crazy pages). I have not tried Perl, since I do not use it, but thanks for the idea.

@balthisar
Member

5.9.9 fixes this.
