Skip to content

Ambiguity in parsing inside of <script> blocks #348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
KevinCarhart opened this issue Jan 16, 2016 · 16 comments
Closed

Ambiguity in parsing inside of <script> blocks #348

KevinCarhart opened this issue Jan 16, 2016 · 16 comments
Milestone

Comments

@KevinCarhart
Copy link

Hi Geoff and co,

For the following HTML:
<script type="text/javascript">
ua=/</g;va=/>/g;
</script>
libtidy (as called by edbrowse) reports:
'<' + '/' + letter not allowed here

So I think it's interpreting this
</g
as the first three characters of a closing HTML tag.
(FYI if you don't recognize it, what it's actually doing in the JS world is assigning a regular expression criterion to a variable. The criterion just happens to be a <, so that's how the ambiguity arises.)

This is deeply related to the famous and brain-bending issue #65 but since #65 is closed, I am posting a new one.

thanks!!
Kevin

@geoffmcl
Copy link
Contributor

@KevinCarhart hi Kevin... I saw this mentioned on the eb dev list a few times... I guess I try to handle too many emails... so this is a better place to put it... thanks...

This is not really related to #65... that was more about the script parser keeping count of nested open elements, and not exiting until the count went zero, hence the idea of too greedy... At first I tried to solve this by parsing the actual javascript, but tidy should not get into that! That is a veritable mine field!

It was eventually solved by adding the --skip-nested option, which defaults to on, but the old behaviour can be used by setting it off.

But virtually since the beginning of time, tidy has been adding an escape character if it found '<' + '/' + letters, and transforms that into <\/letters, in every case, and issues the warning... I have always wondered why! Now you have found a case where this is a bad idea!

My chrome browser readily accepts ua = /</g;, but marks the assignment as an error if this is changed to ua = /<\/g;! That is understandable, since the addition of the escape character makes the second / literal, and thus the regex is no longer closed, / + regex + /g. A javascript parsing error!

Now it would be a simple hack to not do such escaping, but I wonder about the consequences in all other cases!!! See lexer.c:2154

As you may know, as part of our tidy regression testing we keep the known output in a testbase folder, and similarly for the developing ruby regression testing, so at the very least to not do such escaping would probably throw up some changes... well all of that can be handled...

But it is a simple question and decision.

Should tidy be involved in always escaping '<' + '/' + letters in javascript? What is the case for doing this? Here is one example where it is clearly bad! But what about all the other examples that can be constructed?

Note I have added an s to letter. Tidy will always get to the end of the letter sequence before doing what it does. And it will always have compared the letter sequence to the container string, looking for, in this case </script>, and sets a matched variable...

What about a simple document.write('<p>para</p>');? Tidy would change this to <\/p>. Why is it doing that? What is gained? Is it important to continue to do that? Is there a problem if it does not? What is the problem? It seems the javascript parser does not see a problem.

I am always reluctant to change such old, established behaviour without discussion, understanding and good reasons!

So really need some help in deciding... especially concerning the original purpose! What did such escaping solve that would now not be solved if removed? Is this going to be different between html4-- and html5++ modes? Maybe it should be a new option, say escape-javascript yes/no? etc, etc, etc...

Please help with comments, especially W3C reference specs on <script...> content... maybe that would help... has it changed?

@geoffmcl geoffmcl added this to the 5.2 milestone Jan 17, 2016
@CMB
Copy link
Contributor

CMB commented Jan 18, 2016

@geoffmcl,
We also believe this discussion applies to style and textarea
as well as script.

The definition of style and script say that the allowed content is
non-replaceable character data. The definition of textarea says that
the allowed content is replaceable character data.

https://www.w3.org/TR/html-markup/script.html (definition of script element)

https://www.w3.org/TR/html-markup/style.html (definition of style element)

https://www.w3.org/TR/html-markup/textarea.html (definition of textarea element)

Definitions for replaceable and non-replaceable character data are given
in the HTML 5 spec section 4.5:
https://www.w3.org/TR/html-markup/syntax.html#text-syntax

As far as I can tell, the big difference between nonreplaceable
and replaceable character data is that non-replaceable may contain
ambiguous ampersands and may not contain character references.
After all, it's used for things like JavaScript source, where
ampersand is an operator.

In all of these cases, it appears that less-than slash inside of the
(replaceable or non-replaceable) character data should be left unescaped.

Regards,

@KevinCarhart
Copy link
Author

Thank you @geoffmcl, yes, I concur about reluctance to change something that is established.
not really related to 65 - oh, okay.
Thank you @CMB. Those are extremely salient to the problem and I sure wasn't familiar with the replaceable/nonreplaceable distinction.
Honestly I don't know. It is a good question. Why now? Why has it been escaping for many years?
Is the description of replaceable and nonreplaceable as part of the HTML5 spec sufficent rationale for doing the change?
Maybe there's something to be gleaned about this in old tidy listserv archives.

@zcorpan
Copy link

zcorpan commented Jan 21, 2016

https://www.w3.org/TR/html-markup/script.html (definition of script element)

This is not the HTML spec. Don't read this.

See https://html.spec.whatwg.org/multipage/scripting.html#restrictions-for-contents-of-script-elements

What about a simple document.write('<p>para</p>');?

Tidy does this because it was invalid SGML back in the pre-HTML5 days when the people writing HTML specs pretended that HTML was an application of SGML. But that is not relevant anymore, and there's no point in escaping the above.

It is relevant to escape <!--, </script, etc., but since browsers parse scripts exactly the same now, it seems to me it's not really necessary for Tidy to change the contents of scripts at all. Without knowledge of the JS syntax, it seems to me it can't do a good job of escaping the problematic strings.

@geoffmcl
Copy link
Contributor

@CMB, @KevinCarhart, @zcorpan, thank you for the feedback and information on this... I am slowly becoming convinced that Tidy should get out of the script escaping business!, especially after reading another error example on the edbrowse list, js on page... but always come back to it has been there for so long...

I did try trudging back in the list achives, and found quite a lot said about 'script parsing', like this one back in 2004, that suggests an option to control this. And even then you can see Klaus is challenging with sort of "show what harm is done?". We certainly now do have some examples where this is clearly wrong.

I would certainly appreciate being pointed to other relevant posts...

This option idea is certainly how I will likely proceed... perhaps something like --escape-scripts yes, initially the default, with a description like -

TidyEscapeScripts,
This option specifies that Tidy should escape script elements that look like closing tags. 
That is '</p>' would become '<\/p>'. Set to 'no' to prevent this action.

And I am keeping in mind that this may also apply to style and textarea, but will concentrate on script parsing first...

As always appreciate more comments on this quite, what feels like, dramatic change in Tidy's behaviour...

And if you want to try your hand at coding this, and presenting a PR, then please create an issue-348 branch to hold this change... thanks...

@KevinCarhart
Copy link
Author

Thank you @zcorpan @CMB @geoffmcl
I also found this list thread between Björn and Denver in 2007.
https://sourceforge.net/p/tidy/mailman/message/6803282/
The subject is "JavaScript regular expression (regexp) parsed incorrectly"
I think what Denver is reporting here has a similar cause to the example problems that we have hit recently.
Is it persuasive? I'm not sure, but it seems to be one of the most relevant moments in old traffic, in addition to the one from Klaus.

@geoffmcl
Copy link
Contributor

@KevinCarhart thanks for the additional sf803 link, and yes it seems it is the same </g regex as your case...

During that epoch, Björn was a major contributor to Tidy, and you can see him arguing against changing Tidy, and gives a clear html4 reference, which enforces elements must not include '<' + '/' + some letter. And introduced me to a new word, ETAGO for the </...

To which Denver only replies "The appendix states: 'The following notes are informative, not normative.'", but none the less closes the bug, even though he presented a sample which passes the validator.

However, Björn was right, it only passes because 'You are using XHTML and an awful lot of CDATA and other magic to make the Validator not see this, but Tidy isn't as "clever"'. Remove that magic and it will fail!

But we are now well into the html5 era, and I can easily construct a similar example, without CDATA or magic, that passes validation, which tidy will mess up, and then javascript will flag it as an error, eg -

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>#348-5 - Tidy problem</title>
<script>
function htmlEncode(s){return (s+'').replace(/\&/g,'&amp;').replace(/\</g,'&lt;').replace(/\>/g,'&gt;')}
</script>
</head>
<body>
<p>
This document is not parsed correctly and the proposed clean up by Tidy is erroneous.
Document passes https://validator.w3.org/check - 20160213</p>
</body>
</html>

So yes, this is persuasive!

I am convinced there should be an option the user can use to prevent this. Will work on it when I get a chance. Just prevent this block of code from running. Note it is already restricted to a javascript container.

Once the new option is added, say --escape-scripts yes, to tidyenum.h, config.c, and language_en.h, then the if becomes say -

 if ((TY_(IsJavaScript)(container)) && cfgBool(doc, TidyEscapeScripts))
 {
    /* Issue #348 - only escape the ETAGO if option is Yes */
    ... do warning and escape the ETAGO ...

As always would appreciate patches, or a PR...

@zcorpan
Copy link

zcorpan commented Feb 13, 2016

I thought this was tidy-html5, not tidy-html4? Also see my earlier comment above...

@geoffmcl
Copy link
Contributor

@zcorpan, well not exactly. Despite the repo name being tidy-html5, and the 5 being used in many places, this is actually classical tidy4 to handle legacy documents and modern tidy5 to handle the new, rolled into one...

So the option idea is to allow the user to choose between those modes ;=))

It is true we may later choose to default such an option to off... as we have done with other options that establish the mode...

@CMB
Copy link
Contributor

CMB commented Feb 13, 2016

Geoff McLane notifications@github.com writes:

Once the new option is added, say --escape-scripts yes, to tidyenum.h,
config.c, and language_en.h, then the if becomes say -

if ((TY_(IsJavaScript)(container)) && cfgBool(doc, TidyEscapeScripts))

@geoffmcl Yes, sounds good, except that we will also need to worry about

<style> and <textarea>, which have a similar issue.

@geoffmcl
Copy link
Contributor

@CMB sorry for the delay... I actually encoded this many weeks ago, but could not find the right time to push it...

I am not sure we need to be concerned about the <style> and <textarea> tags because, as far as I can see this escaping only happened if it was a <script...> tag. Note the (TY_(IsJavaScript)(container) as part of the if, but please advise if I am wrong...

Now this has been pushed to the master branch, but the option defaults to yes, so you need to add --escape-scripts no to your config...

This is in the 5.1.47++ version, so you need to pull, and build that. Please advise if this does not fix the problem. I have only done minimal testing...

At the same time I have added an README/OPTIONS.md, describing the steps required to add a new option to Tidy.

Appreciated if you get the chance to test this version, and the new option, and if ok, close this... thanks...

@KevinCarhart
Copy link
Author

All right! Thank you @geoffmcl. I will try this now.

@KevinCarhart
Copy link
Author

@geoffmcl @CMB Thank you for coding this and making the readme.md, Geoff. I compiled libtidy with the new option set to true. Then I tried (in edbrowse with libtidy) a test page like what I posted on 1/15. And, it does the trick. Then I tested on a complex case. We've been working with a couple of complex examples recently such as https://www.oakgov.com/sheriff/Pages/Inmates-Current.aspx (raised by @eklhad) and https://groups.google.com/forum/#!topic/boost-compute/xJS05dkQEJk (raised by Sebastian.) I'm investigating the Google example and it appears that having the new parse option set to true means that the page can proceed instead of getting out of sync, which means the run of the page then exposes the next bottleneck, some other unrelated problem. So this is progress! If something improved on a Google page, that may be about as tricky as it gets and we may have improvements in a variety of pages that users want.

Chris, what do you think about closing this now? I am not clear on whether <style> and <textarea> are outstanding, because Chris said, I think we will also need to worry about <style> and <textarea> which have a similar issue, but then Geoff said a few days ago that he doesnt think thats the case. If there was more to do, we could start a new issue. If youre ready, I will press the button.

@CMB
Copy link
Contributor

CMB commented Mar 23, 2016 via email

@geoffmcl
Copy link
Contributor

@KevinCarhart, @CMB thanks for testing and reporting...

And searching deeper into the if check - (TY_(IsJavaScript)(container) - I can now see that if the container contains no atributes, that service returns yes? Strange! But if it does have attributes, then only if there is a type or name attribute which then contains the string javascript will it return yes... In other words it is not doing exactly as I thought!

The file tags.h does have a define of what I expected that to be, namely nodeIsSCRIPT( node )... but...

But now I have added an AND to the if - && cfgBool(doc, TidyEscapeScripts), so compiling libtidy with that with set no, or adding a permanent config item --escape-scripts no, or in your case, in html-tidy.c, in html2node() service, around line 122, simply adding the code tidyOptSetBool( tdoc, TidyEscapeScripts, no );, I think you will get the desired bahaviour.

In any case, would prefer this closed and a new issue opened if you run across a use cases, hopefully with minimal sample html, where this blocking of the escaping still fails... thanks...

@KevinCarhart
Copy link
Author

Thanks! Ok, I will bring the suggested code tidyOptSetBool( tdoc, TidyEscapeScripts, no ); back to the edbrowse list and likely do a patch with that. Closing 348 here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants