Segfault on nasty deep webpage #633
Comments
Duplicate of #343 and others. I'll leave this one open this time in case someone wants to contribute a max stack depth option; otherwise, refactoring the recursive process isn't really in the cards right now.
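A minimal sketch of what such a max stack depth option could look like, in C. This is not Tidy's actual code: parse_node, MAX_PARSE_DEPTH, and the Node struct are hypothetical stand-ins for a configurable recursion counter that bails out instead of exhausting the stack.

/* Hypothetical recursion-depth guard; not Tidy internals. */
#include <stdio.h>

#define MAX_PARSE_DEPTH 2048   /* in a real option this would come from the config */

typedef struct Node Node;
struct Node {
    Node *first_child;
    Node *next_sibling;
};

/* Returns 0 on success, -1 if the tree is nested too deeply. */
static int parse_node(const Node *node, int depth)
{
    if (depth > MAX_PARSE_DEPTH) {
        fprintf(stderr, "giving up: nesting deeper than %d\n", MAX_PARSE_DEPTH);
        return -1;                          /* bail out instead of blowing the stack */
    }
    for (const Node *child = node ? node->first_child : NULL;
         child != NULL;
         child = child->next_sibling) {
        if (parse_node(child, depth + 1) != 0)
            return -1;                      /* propagate the failure upward */
    }
    return 0;
}

int main(void)
{
    Node leaf = { NULL, NULL };
    Node root = { &leaf, NULL };
    return parse_node(&root, 0) == 0 ? 0 : 1;
}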
@usamec wow, using wget, that is something like a 2.3MB file, over 100,000 lines, some really long... Tidy exhausts the default Windows stack after some 3,551 iterations of the recursive parse. Even my default Notepad++ editor has big difficulty with this file, and it seemed a near endless load in Chrome, which did not end until I closed it... As @balthisar has pointed out, this is a repeat of #343, and others, and no work is being done either on a full rewrite of Tidy to remove the recursive process, nor on possibly adding some type of configurable recursion counter to stop the process before the inevitable stack overflow. Accordingly, marking this as a duplicate.
@usamec FWIW, that file, of its 111,563 lines (some 83 of them blank and 46 containing only spaces), has one line some 881,372 characters long, which I think is what gave my editor a big headache, and it is entirely made up of one repeated sequence.
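For reproducing the crash without the original page, a single-line, deeply nested file of roughly this shape can be generated with a few lines of C. The repeated <div> tag, the deep.html filename, and the count of 100,000 are assumptions for illustration, not the actual sequence from the wiki page.

/* Generate a deeply nested, single-line test file. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("deep.html", "w");
    if (!f)
        return 1;
    fputs("<html><body>", f);
    for (int i = 0; i < 100000; i++)    /* deep enough to exhaust a default stack */
        fputs("<div>", f);
    fputs("x</body></html>\n", f);      /* unclosed divs: the parser must recover */
    fclose(f);
    return 0;
}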
Agreed that the page is stupid. The problem is that sometimes you hit such a page (e.g. when doing a crawl of the web) and you want to parse it without getting stuck on it. And every tool I have tried gets stuck on it. I can patch many of them to stop after some time passes, or when the stack gets too deep, or whatever, but it would be nice to have something that works in time linear in the page size, without dirty hacks.
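A minimal sketch of the "stop after some time passes" workaround, assuming a POSIX system: run tidy in a child process with a CPU limit so one pathological page cannot wedge a crawl. The 10-second cap is an arbitrary choice; the tidy arguments mirror the command line reported in this issue.

/* Run tidy under a CPU limit; report if it crashed or was killed by the limit. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: cap CPU time at 10 seconds, then exec tidy. */
        struct rlimit lim = { .rlim_cur = 10, .rlim_max = 10 };
        setrlimit(RLIMIT_CPU, &lim);
        execlp("tidy", "tidy", "-ashtml", "bad4.html",
               "-output", "wat.html", "--char-encoding", "utf8", (char *)NULL);
        _exit(127);                        /* exec failed */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        fprintf(stderr, "tidy killed by signal %d (crash or CPU limit)\n",
                WTERMSIG(status));
    return 0;
}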
@usamec yeah, sorry, that tool cannot be Tidy at the moment. I do not exactly understand why you are trying to tidy such crazy pages, but maybe a simple Perl script could filter them out first.
@geoffmcl it was part of the Common Crawl (which includes really crazy pages). I did not try Perl, since I do not use it, but thanks for the idea.
5.9.9 fixes this. |
Original report: the Tidy tool (version 5.4.0) segfaults on a nasty deep webpage (sv.stargate.wikia.com/wiki/M2J).
Command line:
tidy -ashtml bad4.html -output wat.html --char-encoding "utf8"