Ambiguity in parsing inside of <script> blocks #348
@KevinCarhart hi Kevin... I saw this mentioned on the eb dev list a few times... I guess I try to handle too many emails... so this is a better place to put it... thanks... This is not really related to #65... that was more about the script parser keeping count of nested open elements, and not exiting until the count went zero, hence the idea of It was eventually solved by adding the But virtually since the beginning of time, tidy has been adding an escape character if it found '<' + '/' + letters, and transforms that into My chrome browser readily accepts Now it would be a simple hack to not do such escaping, but I wonder about the consequences in all other cases!!! See lexer.c:2154 As you may know, as part of our tidy regression testing we keep the But it is a simple question and decision. Should tidy be involved in always escaping '<' + '/' + letters in javascript? What is the case for doing this? Here is one example where it is clearly bad! But what about all the other examples that can be constructed? Note I have added an What about a simple I am always reluctant to change such old, established behaviour without discussion, understanding and good reasons! So really need some help in deciding... especially concerning the original purpose! What did such escaping solve that would now not be solved if removed? Is this going to be different between html4-- and html5++ modes? Maybe it should be a new option, say Please help with comments, especially W3C reference specs on |
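To make the pattern under discussion concrete, here is a minimal sketch of a scan for '<' + '/' + letter inside a script body. It is not Tidy's code from lexer.c; the helper name `findCloseTagLookalikes` is hypothetical and only illustrates which character sequences trigger the behaviour described above.

```javascript
// Hypothetical helper, NOT Tidy's actual implementation: report every offset
// in a script body where '<' '/' is immediately followed by an ASCII letter.
function findCloseTagLookalikes(scriptText) {
  const offsets = [];
  const pattern = /<\/[A-Za-z]/g;
  let match;
  while ((match = pattern.exec(scriptText)) !== null) {
    offsets.push(match.index);
  }
  return offsets;
}

// The script body from the original report: a regex literal that matches '<'.
const reportedScript = "ua=/</g;va=/>/g;";
const lookalikeOffsets = findCloseTagLookalikes(reportedScript);
console.log(lookalikeOffsets); // the "</g" starts at offset 4
```

The scan flags the regex literal even though no closing tag is present, which is the ambiguity this issue is about.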
@geoffmcl, The definition of style and script say that the allowed content is https://www.w3.org/TR/html-markup/script.html (definition of script element) https://www.w3.org/TR/html-markup/style.html (definition of style element) https://www.w3.org/TR/html-markup/textarea.html (definition of textarea element) Definitions for replaceable and non-replaceable character data are given As far as I can tell, the big difference between nonreplaceable In all of these cases, it appears that less-than slash inside of the Regards, |
Thank you @geoffmcl, yes, I concur about reluctance to change something that is established. |
This is not the HTML spec. Don't read this. See https://html.spec.whatwg.org/multipage/scripting.html#restrictions-for-contents-of-script-elements
Tidy does this because it was invalid SGML back in the pre-HTML5 days when the people writing HTML specs pretended that HTML was an application of SGML. But that is not relevant anymore, and there's no point in escaping the above. It is relevant to escape |
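For comparison, an HTML5 parser only ends a script element at `</script` (case-insensitive) followed by whitespace, `/`, or `>`. The sketch below is a simplification under that assumption: it ignores the `<!--` escaped states described in the linked WHATWG section, and `findScriptEnd` is a hypothetical name, not part of Tidy or any browser API.

```javascript
// Simplified sketch (assumption: no "<!--" escaped states are involved).
// Returns the offset of the first sequence that would end a script element
// in an HTML5 parser, or -1 if the text contains none.
const SCRIPT_END_PATTERN = /<\/script[\t\n\f\r \/>]/i;

function findScriptEnd(scriptText) {
  return scriptText.search(SCRIPT_END_PATTERN);
}

console.log(findScriptEnd("ua=/</g;va=/>/g;"));      // -1: "</g" does not end the script
console.log(findScriptEnd("var s = '</script>';"));  // 9: this would end the script early
```

Under this rule only the `</script` case needs attention inside a script body; other '<' + '/' + letter sequences are plain script text.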
@CMB, @KevinCarhart, @zcorpan, thank you for the feedback and information on this... I am slowly becoming convinced that Tidy should get out of the script escaping business, especially after reading another error example on the edbrowse list, js on page... but I always come back to the fact that it has been there for so long... I did try trudging back through the list archives, and found quite a lot said about 'script parsing', like this one back in 2004, that suggests an option to control this. And even then you can see Klaus is challenging with a sort of "show what harm is done?". We certainly now do have some examples where this is clearly wrong. I would certainly appreciate being pointed to other relevant posts... This option idea is certainly how I will likely proceed... perhaps something like
And I am keeping in mind that this may also apply to style and textarea, but will concentrate on script parsing first... As always appreciate more comments on this quite, what feels like, dramatic change in Tidy's behaviour... And if you want to try your hand at coding this, and presenting a PR, then please create an |
Thank you @zcorpan @CMB @geoffmcl |
@KevinCarhart thanks for the additional sf803 link, and yes it seems it is the same During that epoch, Björn was a major contributor to Tidy, and you can see him arguing against changing Tidy, and giving a clear html4 reference, which enforces To which Denver only replies "The appendix states: 'The following notes are informative, not normative.'", but nonetheless closes the bug, even though he presented a sample which passes the validator. However, Björn was right, it only passes because 'You are using XHTML and an awful lot of CDATA and other magic to make the Validator not see this, but Tidy isn't as "clever"'. Remove that But we are now well into the html5 era, and I can easily construct a similar example, without CDATA or magic, that passes validation, which tidy will mess up, and then javascript will flag it as an error, eg -
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>#348-5 - Tidy problem</title>
<script>
function htmlEncode(s){return (s+'').replace(/\&/g,'&amp;').replace(/\</g,'&lt;').replace(/\>/g,'&gt;')}
</script>
</head>
<body>
<p>
This document is not parsed correctly and the proposed clean up by Tidy is erroneous.
Document passes https://validator.w3.org/check - 20160213</p>
</body>
</html>
So yes, this is persuasive! I am convinced there should be an option the user can use to prevent this. Will work on it when I get a chance. Just prevent this block of code from running. Note it is already restricted to a javascript container. Once the new option is added, say
As always would appreciate patches, or a PR... |
I thought this was tidy-html5, not tidy-html4? Also see my earlier comment above... |
@zcorpan, well not exactly. Despite the repo name being tidy-html5, and the So the option idea is to allow the user to choose between those modes ;=)) It is true we may later choose to default such an option to |
@geoffmcl Yes, sounds good, except that we will also need to worry about <style> and <textarea>, which have a similar issue. |
@CMB sorry for the delay... I actually encoded this many weeks ago, but could not find the right time to push it... I am not sure we need to be concerned about the Now this has been pushed to the This is in the 5.1.47++ version, so you need to pull, and build that. Please advise if this does not fix the problem. I have only done minimal testing... At the same time I have added an Appreciated if you get the chance to test this version, and the new option, and if ok, close this... thanks... |
All right! Thank you @geoffmcl. I will try this now. |
@geoffmcl @CMB Thank you for coding this and making the readme.md, Geoff. I compiled libtidy with the new option set to true. Then I tried (in edbrowse with libtidy) a test page like what I posted on 1/15. And, it does the trick. Then I tested on a complex case. We've been working with a couple of complex examples recently such as https://www.oakgov.com/sheriff/Pages/Inmates-Current.aspx (raised by @eklhad) and https://groups.google.com/forum/#!topic/boost-compute/xJS05dkQEJk (raised by Sebastian.) I'm investigating the Google example and it appears that having the new parse option set to true means that the page can proceed instead of getting out of sync, which means the run of the page then exposes the next bottleneck, some other unrelated problem. So this is progress! If something improved on a Google page, that may be about as tricky as it gets and we may have improvements in a variety of pages that users want. Chris, what do you think about closing this now? I am not clear on whether |
@geoffmcl Thank you for the fix!
@KevinCarhart Yes, I'm pretty sure textarea and style are still problem children, but not to the degree that script was.
|
@KevinCarhart, @CMB thanks for testing and reporting... And searching deeper into the if check - The file tags.h does have a define of what I expected that to be, namely But now I have added an AND to the if In any case, would prefer this closed and a new issue opened if you run across a use case, hopefully with minimal sample html, where this blocking of the escaping still fails... thanks... |
Thanks! Ok, I will bring the suggested code |
Hi Geoff and co,
For the following HTML:
<script type="text/javascript">
ua=/</g;va=/>/g;
</script>
libtidy (as called by edbrowse) reports:
'<' + '/' + letter not allowed here
So I think it's interpreting this </g as the first three characters of a closing HTML tag.
(FYI if you don't recognize it, what it's actually doing in the JS world is assigning a regular expression to a variable. The pattern just happens to be a <, so that's how the ambiguity arises.)
This is deeply related to the famous and brain-bending issue #65, but since #65 is closed, I am posting a new one.
thanks!!
Kevin
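As a quick check that the reported script is ordinary JavaScript, the sketch below runs the two assignments and uses them the way such patterns are typically used. The input string and the replacement entities are illustrative assumptions, not taken from the page that triggered the report.

```javascript
// The two assignments from the report: regex literals matching '<' and '>'.
const ua = /</g;
const va = />/g;

// Illustrative use (assumed, not from the original page): escape angle brackets.
const encoded = "a<b>c".replace(ua, "&lt;").replace(va, "&gt;");
console.log(encoded); // a&lt;b&gt;c
```

Nothing here is a closing tag; the `</g` sequence is just the end of a regex literal followed by its `g` flag.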