Skip to content

stack exhaustion in tidy-html5 5.1.25 #343

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gaa-cifasis opened this issue Jan 7, 2016 · 8 comments
Closed

stack exhaustion in tidy-html5 5.1.25 #343

gaa-cifasis opened this issue Jan 7, 2016 · 8 comments

Comments

@gaa-cifasis
Copy link

Hello,

We found a stack exhaustion in tidy-html5 (version: 5.1.25). You can find a test case to reproduce it here [1.4MB]. Technical details are here:

$ gdb -ex 'tty /dev/null' --args ./tidy exhaustion.html
(gdb) run
Starting program: ./tidy exhaustion.html
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4e53b2b in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.0
(gdb)
(gdb) bt
#0 0x00007ffff4e53b2b in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.0
#1 0x00007ffff4e60443 in malloc () from /usr/lib/x86_64-linux-gnu/libasan.so.0
#2 0x000000000047c581 in defaultAlloc (allocator=0x721de0 <prvTidyg_default_allocator>, size=2048)

at /home/vagrant/repos/tidy-html5-5.1.25/src/alloc.c:59

#3 0x000000000040ecf6 in messagePos (doc=0x607c00018900, level=TidyWarning, line=6169, col=44, msg=0x497f60 "nested emphasis %s",

args=0x7fffff7ff9f0) at /home/vagrant/repos/tidy-html5-5.1.25/src/localize.c:1225

#4 0x000000000040f9da in messageNode (doc=0x607c00018900, level=TidyWarning, node=0x601600f086d0, msg=0x497f60 "nested emphasis %s")

at /home/vagrant/repos/tidy-html5-5.1.25/src/localize.c:1343

#5 0x00000000004113d5 in prvTidyReportWarning (doc=0x607c00018900, element=0x601600f086d0, node=0x601600f08620, code=9)

at /home/vagrant/repos/tidy-html5-5.1.25/src/localize.c:1681

#6 0x000000000043bcdd in prvTidyParseInline (doc=0x607c00018900, element=0x601600f086d0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:1757

#7 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f086d0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#8 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08780, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#9 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08780, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#10 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08830, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#11 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08830, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#12 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f088e0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#13 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f088e0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#14 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08990, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#15 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08990, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#16 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08a40, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#17 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08a40, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#18 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08f10, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#19 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08f10, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#20 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f08fc0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#21 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f08fc0, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:775

#22 0x000000000043e07a in prvTidyParseInline (doc=0x607c00018900, element=0x601600f09070, mode=MixedContent)

at /home/vagrant/repos/tidy-html5-5.1.25/src/parser.c:2202

#23 0x0000000000436ecf in ParseTag (doc=0x607c00018900, node=0x601600f09070, mode=MixedContent)

---Type to continue, or q to quit---q

(.. really long back trace)

In my opinion, tidy-html5 shouldn't crash because you can force it to execute a really long sequence of functions calls (tail recursion maybe can help?).

Regards,
Gus.

@geoffmcl
Copy link
Contributor

geoffmcl commented Jan 7, 2016

@gaa-cifasis, hi Gus, yes this is a Tidy Feature ;=))

I guess if I searched back far enough in the bugs, in SF, I would find it maybe mentioned, maybe several times... It is really easy to create a rediculously long html sequence that will exhaust the stack... but it is usually not a very pratical html file...

You are welcome to commence a re-write of library tidy and do this another way. The current library uses recursive calls as the DOM like stack/tree... I too could think of other, maybe better ways, like all of us probably, but this would be a major re-write of the parser!!!

I remember reading at that time, way back in history, that if you do run across such a rediculous beast in the wild there are maybe ways to increase the stack allocation of the executable, if you really must use libtidy to handle this file... but there are probably some limits on that...

But hey, thanks for generating that beautiful 1.4 MB sample... and look forward to ideas on a tidy re-write to handle ANYTHING ;=))

@geoffmcl geoffmcl added this to the Indefinite future milestone Jan 7, 2016
@benkasminbullock
Copy link
Contributor

How can this be a "Feature"?

@geoffmcl
Copy link
Contributor

@benkasminbullock, well only in the sense that it has existed since the first tidy I have, Dave Raggett's tidy04aug00, so this stack exhaustion is built into the fabric of tidy code... built into the way it works... so it is a feature? And a feature request to change it...

I do not want to label it a bug, although that is what it also is, as I want to keep that label here for fixable bugs! Things being worked on...

But this big-bad-bug, or limitation of tidy has no solution other than a full re-write, a redesign, of the parser code. It is maybe a design flaw, built into the code, but I do not have labels like this? Nor do I particularly want to create them... so chose feature from what I have... no big idea...

If I was in your repo I could add labels like enhancement, and certainly help wanted, ...

I have added Technical Support, meaning a discussion on tidy workings. As stated, look forward to ideas on a tidy re-write, re-design, to handle ANYTHING ;=)) and do something about this perpetual bug...

@geoffmcl
Copy link
Contributor

geoffmcl commented Apr 2, 2016

@gaa-cifasis, @benkasminbullock in combing through, and closing all the old SF bugs, found this was mentioned as far back as 2005-12-01, by Lee Jensen, Bug 742!

Now what Lee mentioned, and we discussed, was that maybe there could be configuration options like --max-stack-depth NNNN. The default could be 0, which means there is no limit...

Already in my MSVC Debug code I have a static counter for the depth of say the calls to ParseInline, and running this sample, using my debug version, it exhaused the stack at a depth of 5747, as indicated by the last entries in my debug output log, just before the windows tidy-stopped-working dialog -

Entering ParseInline 5746...
R=682 C=64: Returning starttag node <bdo> stream
Entering ParseInline 5747...
R=682 C=69: Returning starttag node <bdo> stream

A second run stopped at 5753, then 5750,... but, for sure it stops!

Naturally each exit from ParseInline also decrements this static counter -

R=681 C=69: Returning starttag node <bdo> stream
Entering ParseInline 5740...
R=682 C=1: Returning endtag node <bdo> stream
Exit ParseInline 2 5739...
R=682 C=7: Returning endtag node <bdo> stream
Exit ParseInline 2 5738...
R=682 C=13: Returning endtag node <bdo> stream
Exit ParseInline 2 5737...
R=682 C=19: Returning starttag node <bdo> stream
Entering ParseInline 5738...
R=682 C=24: Returning starttag node <bdo> stream
Entering ParseInline 5739...

Of course using a static counter is not the best, since we would also have to ensure it is cleared for each document, but it could be a simple variable as part of the TidyDocument lexer structure, which is created new for each document... And we might need to count more than just ParseInline... that would need to be checked, verified...

Anyway, this seems like a viable possibility...

What do other think? Should we bother? Thoughts, comments, even a PR welcome... thanks...

@balthisar
Copy link
Member

Would the static counter be configurable? Or we need to have a really good sample from different processors, as this would certainly give different results on x86 versus 68020 (assuming we could compile it).

@geoffmcl
Copy link
Contributor

geoffmcl commented May 4, 2017

@balthisar good idea to close this... since it would be quite an effort to do something meaningful in a cross platform, multi-cpu way...

But really the stacked calls are the heart of tidy functionality... the best way would be a total rewrite! Employing an entirely different methodology...

Did not know we had a Won't Fix label... ;=)) Will keep it in mind...

@geoffmcl
Copy link
Contributor

Hi Lilly,

Re: libtidy stack overflow

As discussed in this issue - circa 2016 -
#343
this is a feature of how tidy was written,
and was perhaps first mentioned back in 2005 -
https://sourceforge.net/p/tidy/bugs/742/
although there could be other references...

As can be seen, there has been some discussion
about adding an option to count, and say limit
the stack depth, but no code has been presented
so far...

So at this time, there is NO SOLUTION pending...
sorry...

We, at Tidy, would certainly appreciate any further
thoughts or code, on this, but please share it on
issue #343 so others can respond, rather than just
me... thanks...

Regards,
Geoff.

PS: Will cross post this there...

On 30/07/17 23:53, Lilly Random wrote:
Lilly Random <incognito.lilly at gmail dot com>

Hi,

I am planning to use your lib tidy for html parsing and traverse. However, it
seems that if I generate a sample html file with a lot of nested lists it
crashes because of stack overflow. I have to take into account any possible
input without crashing, so is this something that can be added in the near
future? Eg. having the depth when adding nodes (parent depth +1), and then stop
processing if certain depth is exceeded?
Thanks a lot.

@balthisar
Copy link
Member

5.9.9 fixes this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants