Skip to content
This repository was archived by the owner on Apr 12, 2024. It is now read-only.

fix: text that looks like an html tag but is not causes [$sanitize:badparse] error #8193

Closed

Conversation

sylvain-hamel
Copy link

The provided unit test fails with this error.

Error: [$sanitize:badparse] The sanitizer was unable to parse the following block of html: <
nonEndingTag href <nonEndingTag href

The provided change fixes it.

Can you please backport the fix into v1.2.x as this is a show stopper for us. Thanks

@mary-poppins
Copy link

Thanks for the PR! Please check the items below to help us merge this faster. See the contributing docs for more information.

  • Uses the issue template (#8193)

If you need to make changes to your pull request, you can update the commit with git commit --amend.
Then, update the pull request with git push -f.

Thanks again for your help!

@caitp
Copy link
Contributor

caitp commented Jul 14, 2014

Hi, thanks for re-opening this. Could you also add another unit test to show how this will work when this happens in text within a tag?

For instance, <p>10 < 100</p> --- we should have a test to ensure that that works correctly.

Another case like <p>10 < <span>100</span></p> would also be good.

@sylvain-hamel
Copy link
Author

I added the first test case but I cannot add the second one because of the way the tests are written. The internal handler used in the tests was written to handle only one tag; not nested tags.

@caitp
Copy link
Contributor

caitp commented Jul 14, 2014

You can simply run sanitize() and check if we get the expected value.

@sylvain-hamel
Copy link
Author

Thanks. Test added.

@@ -81,6 +82,16 @@ describe('HTML', function() {
expect(text).toEqual('text');
});

it('should parse unterminated tags as regular content', function() {
htmlParser('<a text1 text2 <a text1 text2', handler);
expect(text).toEqual('<a text1 text2 <a text1 text2');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong because it breaks html parsing rules. browsers either ignore this and throw it away or try to autocorrect the html.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is correct --- I guess we need a non-alpha character to precede the < (in testing, <ê, <3, <- are all fine). I don't think we can come even close to being as crazy as the actual parsing rules though, that would be way too much code.

  • A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)
    1. Advance the position pointer so that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII >) byte.
    2. Repeatedly get an attribute until no further attributes can be found, then jump to the step below labeled next byte.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should sanitize convert the < to &lt; in text that looks like unterminated tags?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (for example, http://jsfiddle.net/fMDyg/)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take it from here and do the necessary change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can have a go at it :>

@IgorMinar
Copy link
Contributor

cait, can you own this?

we should be careful about letting unescaped < and > through as that could result in security holes.

@caitp
Copy link
Contributor

caitp commented Jul 15, 2014

will do

@caitp
Copy link
Contributor

caitp commented Jul 15, 2014

@IgorMinar we have tests in the tree which verify that we correctly parse invalid HTML (

expectHTML('a< SCRIPT >A< SCRIPT >evil< / scrIpt >B< / scrIpt >c.').toEqual('ac.');
) --- but what we're doing there isn't technically correct as far as the HTML spec is concerned (at least modern browsers won't treat this as a start tag at all, and will instead just encode the < before script)

Since I'm fixing up this CL, should I change that too? Or would that be too much of a breaking change. (testing on jsfiddle, even ie8 behaves correctly here)

caitp added a commit to caitp/angular.js that referenced this pull request Jul 16, 2014
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes angular#8193
@caitp caitp closed this in f6681d4 Jul 16, 2014
caitp added a commit that referenced this pull request Jul 16, 2014
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes #8212
Closes #8193
@sylvain-hamel
Copy link
Author

@caitp Great news! thanks. Would you please backport the fix to v1.2 as this bug greatly affects us.Thanks.

@caitp
Copy link
Contributor

caitp commented Jul 16, 2014

@sylvain-hamel good news, it was backported 36d2658

@sylvain-hamel
Copy link
Author

@caitp Great. I'll test the fix tomorrow.

@caitp
Copy link
Contributor

caitp commented Jul 16, 2014

@sylvain-hamel actually I broke our ci-checks task porting that into v1.2.x, I think I need to revert that and sort that out, but if you try with that sha it should work anyways

caitp added a commit to caitp/angular.js that referenced this pull request Jul 16, 2014
… text content

ngSanitize will now permit opening braces in text content, provided they are not followed by either
an unescaped backslash, or by an ASCII letter (u+0041 - u+005A, u+0061 - u+007A), in compliance with
rules of the parsing spec, without taking insertion mode into account.

BREAKING CHANGE

Previously, $sanitize would "fix" invalid markup in which a space preceded alphanumeric characters
in a start-tag. Following this change, any opening angle bracket which is not followed by either a
forward slash, or by an ASCII letter (a-z | A-Z) will not be considered a start tag delimiter, per
the HTML parsing spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html).

Closes angular#8212
Closes angular#8193
@sylvain-hamel
Copy link
Author

Hi @caitp, I tired your fix (from 36d2658) and it does not work as I expected.

The following markup still causes the same failure ([$sanitize:badparse] The sanitizer was unable to parse the following block of html: <a href).

<div ng-init="value='<a href'"></div>
<div ng-bind-html="value"></div>

My expectation is that ng-bind-html should treat unterminated tags as regular content. In this case I expected <a href to be converted to &lt;a href as discussed previously in this thread:

Me:

Should sanitize convert the < to &lt; in text that looks like unterminated tags?

You:

Yes (for example, http://jsfiddle.net/fMDyg/)

Note that the fix I had submitted did work, however it left the < unencoded which is not clean.

@caitp
Copy link
Contributor

caitp commented Jul 23, 2014

@sylvain-hamel this is how the HTML parsing rules work: if you have a <, (optionally) followed by a /, followed by a character from the set [a-zA-Z], it is an opening tag, and the next step is to collect attributes.

$sanitize was changed (in that patch) to match this behaviour --- obviously it's a bit more complicated in a real HTML parser, but it's really the best we can do without a huge file size. So basically, that is working exactly how it should, and you should encode the < manually if you need a letter or slash to follow it.

@sylvain-hamel
Copy link
Author

@caitp

this is how the HTML parsing rules work [...] next step is to collect attributes.

that's fine if the thing being parsed is a element. But here it is not, IMO sanitize should detect that and not treat it as HTML and instead just make it safe by encoding it.

What would be a good reason not to add this feature to sanitize?

@caitp
Copy link
Contributor

caitp commented Jul 23, 2014

Because we have no concept of insertion modes in the $sanitize parser, it would be too complex to support that.

Basically just escape your < the way you would in regular HTML if you need a < to precede a / or [a-zA-Z] --- if there's a space after, or any other character, it doesn't need to be encoded.

@sylvain-hamel
Copy link
Author

In my case this is user provided content. If the user entered some valid html, I want the rendered result. But if he just entered some text that looks like html then I need it to be escaped and rendered as-is.

Basically just escape your < the way you would in regular HTML

In order for me to only encode the text that looks like html, I basically need to implement the feature I'm asking sanitize to support.

Do you agree that no input should ever cause sanitize to crash like this? It should be able to transform anything into something safe.

@caitp
Copy link
Contributor

caitp commented Jul 23, 2014

what exactly do you get out of binding the html <a ... anyway? what are you expecting to get out of that? Maybe it would be worth your while just creating a text node that looks like that manually, it would be trivial to write a directive to do that.

@sylvain-hamel
Copy link
Author

given a view like this:

<textarea ng-model="value"/>
<hr />
<div ng-bind-html="value"/>

if the user enters this in the textarea:

b is <b>smaller than</b>  a. I can even say: b<a and that's shorter.

I want the output to be this:


b is smaller than a. I can even say: b<a and that's shorter.


But with ng-bind-html this input causes sanitize to crash.

@sylvain-hamel
Copy link
Author

@caitp Does my use case make sense to you?

@caitp
Copy link
Contributor

caitp commented Jul 28, 2014

No, I think this is something that we shouldn't support in ngSanitize. There is no shortage of ways around it, though. As far as I'm concerned, ngSanitize should honour < a, but not <a beause <a is a tag, and if we can't find the end of the tag, then we have a problem

@caitp
Copy link
Contributor

caitp commented Jul 28, 2014

You could write a version of ngBindHtml which doesn't throw, or just creates a text node with the contents if it does throw, but I don't think ngBindHtml should support non-html

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants