Fixes in HTML Lexer to support HTML empty comment statements #327

jfbyers · 2024-03-01T10:38:21Z

Summary and intro

The HTML lexer fails to detect empty HTML comment declarations, so the HTML that follows one in the input is not tokenized as what it really is.

Basically, for an input such as <!><img src=1 onError=alert(1)>, the lexer considers the whole blob as an HTML comment instead of an empty comment declaration (<!>) and then an img tag (which is what browsers would do).

This lexer bug causes wrong information to be reported to the HtmlChangeListener.

HTML RFC context

In the HTML RFC, comments are defined as:

To include comments in an HTML document, use a comment declaration. A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'.

This means that <!> is a valid comment declaration with zero comments inside, so the following HTML code would trigger the alert(1) if rendered in a browser:

<html><body>
<!><script>alert(1)</script>
</body></html>
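The grammar quoted from the RFC can be sketched as a small recognizer. This is only an illustration of the quoted rule ("<!" followed by zero or more "--...--" comments, each optionally followed by white space, followed by ">"); the class and method names are hypothetical and are not part of the sanitizer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the OWASP sanitizer: recognizes a
// comment declaration as described in the quoted RFC text.
class Rfc1866CommentDeclaration {
    // "<!"  then zero or more comments, then ">".
    // Each comment is "--", text that does not contain "--", then "--",
    // optionally followed by white space.
    private static final Pattern DECLARATION =
        Pattern.compile("<!(?:--(?:(?!--).)*--\\s*)*>", Pattern.DOTALL);

    static boolean isCommentDeclaration(String s) {
        return DECLARATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"<!>", "<!---->", "<!-- a -- -- b -->", "<!-->"};
        for (String s : samples) {
            System.out.println(s + " -> " + isCommentDeclaration(s));
        }
    }
}
```

Under this reading, <!> (zero comments) and <!----> (one empty comment) are both declarations, while <!--> is not, because its single "--" never finds a closing "--".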

In addition, browsers also treat the pattern <!--> as a closed comment (not sure why yet, but tested in all major browsers). This means the following HTML code would also trigger the alert(1) if rendered in a browser:

<html><body>
<!--><script>alert(1)</script>
</body></html>

The bug

The lexer treats neither <!> nor <!--> as a complete comment declaration; it considers the last character of both (>) to still be part of the comment.
As a result, when the sanitizer reads the input <!><img src=1 onError=alert(1)>, the lexer interprets the whole input as an HTML comment.
The expected behavior is to detect <!> as an HTML comment declaration and <img src=1 onError=alert(1)> as an img HTML tag.

This is easy to see with the HtmlChangeListener class. For example, the following code provides the following output:

package org.owasp.html;

import java.util.ArrayList;
import java.util.List;

class BugTest
{
	public static void main(String[] args)
	{
		String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!><img src=1>qwe4", "qwe5<!--><img src=1>qwe6"};
		List<String> results = new ArrayList<String>();
		for (String s : test)
		{
			MyListener htmlChangeListener = new MyListener();
			PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
			String safeString = sanitizePolicy.sanitize(s, htmlChangeListener, results);
			System.out.println( "safeString :"  + safeString + " "+htmlChangeListener.getOutOfPolicyObjects());
		}

	}


}

... 

package org.owasp.html;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.util.LinkedList;
import java.util.List;

public class MyListener implements HtmlChangeListener<List<String>>
{
	private final List<String> outOfPolicyObjects = new LinkedList<String>();

	@Override
	public void discardedTag(@Nullable List<String> context, @Nonnull String elementName)
	{
		outOfPolicyObjects.add(elementName);
	}

	@Override
	public void discardedAttributes(@Nullable List<String> context, @Nonnull String tagName, @Nonnull String... attributeNames)
	{
		outOfPolicyObjects.add(String.format("%s (%s)", tagName, String.join(",", attributeNames)));
	}

	public List<String> getOutOfPolicyObjects()
	{
		return outOfPolicyObjects;
	}
}

The output:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 []
safeString :qwe5 []

However, the expected correct output should be:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 [img]
safeString :qwe5qwe6 [img]

The fix

We added a new state called COMMENT_DASH_AFTER_BANG to the HtmlInputSplitter inside HtmlLexer to handle a dash character that follows the bang (!-).
We also added special condition checks to the BANG and BANG_DASH states of the lexer's state machine to handle <!> and <!--> comments.

Authors

Bug discovery: Carlos Villa (@carlosvillasanchez)
Code fix: Eduardo Aguado (@jfbyers)

subbudvk · 2024-03-05T05:24:08Z

@jfbyers Can you please provide a test case where, with a minimal policy, the vector is not sanitized post sanitization?

jfbyers · 2024-03-05T09:10:44Z

Hi @subbudvk , with the current implementation the following strings are not properly sanitized:

class BugTest
{
	public static void main(String[] args)
	{
		String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!>qwe4", "qwe5<!-->qwe6"};
		for (String s : test)
		{
			PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
			String safeString = sanitizePolicy.sanitize(s, null, null);
			System.out.println( "safeString :"  + safeString);
		}

	}
}

This outputs:

safeString :qwe1qwe2
safeString :qwe3
safeString :qwe5

And it should be:

safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6

As I mentioned in my first message (apologies if I did not explain myself clearly), this is even more misleading if you use a listener to collect tags/attributes in discardedTag() or discardedAttributes() from the string being sanitized. For example, the string qwe3<!><img src=1 onerror=alert(1)> would not report <img src=1 onerror=alert(1)> as a discarded tag, so any library consumer might make incorrect decisions about the contents of the string.

jfbyers · 2024-03-12T10:51:44Z

Thanks @melloware, can you approve / run the workflows, or do you want me to add more unit tests?

melloware · 2024-03-12T11:15:06Z

No you are good. I don't have commit privs so I can't run your workflow only @mikesamuel has permissions to this repo.

mikesamuel · 2024-03-21T15:44:15Z

Which HTML RFC are you quoting?

https://html.spec.whatwg.org/multipage/syntax.html#comments seems to disagree. https://html.spec.whatwg.org/multipage/parsing.html#parse-errors does class <! under incorrectly opened comment, but why would the incorrectly opened comment end at > though instead of -->?

And it should be:

safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6

Why?

src/main/java/org/owasp/html/HtmlLexer.java

src/test/java/org/owasp/html/HtmlLexerTest.java

subbudvk · 2024-03-22T11:10:24Z

@mikesamuel : Can you kindly release a new version at least with whatever changes that are already merged to master?

melloware · 2024-03-22T12:03:20Z

+1 to @subbudvk just get any new version out there that gets rid of Guava.

jfbyers · 2024-03-22T12:57:56Z

Added support for 2 comment parse errors (https://html.spec.whatwg.org/multipage/parsing.html#parse-errors):

abrupt-closing-of-empty-comment: an empty comment that is abruptly closed by > (i.e., <!--> or <!--->). The parser behaves as if the comment is closed correctly.
incorrectly-closed-comment: a comment that is closed by the "--!>" code point sequence. The parser treats such comments as if they are correctly closed by the "-->" code point sequence.
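As an illustration only, the two parse errors can be told apart with a small classifier. It assumes it is handed one complete comment token whose only closing sequence is at its end; the class, enum, and method names are hypothetical and do not exist in HtmlLexer.

```java
// Hypothetical sketch, not the PR's code: labels a complete comment token
// with the WHATWG parse error (if any) that its closing sequence triggers.
class CommentCloseClassifier {
    enum Kind {
        WELL_FORMED,
        ABRUPT_CLOSING_OF_EMPTY_COMMENT,
        INCORRECTLY_CLOSED_COMMENT,
        NOT_A_CLOSED_COMMENT
    }

    static Kind classify(String token) {
        // The two abruptly closed empty comments named by the spec.
        if (token.equals("<!-->") || token.equals("<!--->")) {
            return Kind.ABRUPT_CLOSING_OF_EMPTY_COMMENT;
        }
        if (!token.startsWith("<!--")) {
            return Kind.NOT_A_CLOSED_COMMENT;
        }
        // Look only at the text after the opener so that the opener's own
        // dashes are never counted as part of the closing sequence.
        String rest = token.substring(4);
        if (rest.endsWith("--!>")) {
            return Kind.INCORRECTLY_CLOSED_COMMENT;
        }
        if (rest.endsWith("-->")) {
            return Kind.WELL_FORMED;
        }
        return Kind.NOT_A_CLOSED_COMMENT;
    }

    public static void main(String[] args) {
        String[] samples = {"<!-->", "<!--->", "<!-- a --!>", "<!-- a -->", "<!--!>"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```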

carlosvillasanchez · 2024-03-22T13:59:41Z

Which HTML RFC are you quoting?
....

Hey @mikesamuel

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

To include comments in an HTML document, use a comment declaration. A comment declaration consists of <! followed by zero or more comments followed by >. Each comment starts with -- and includes all text up to and including the next occurrence of --. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored

Based on this, <!> is indeed a valid self-closed comment declaration, without any comment inside.

It is true that, based on this, <!--> is not itself conformant; the conformant form is <!---->, which is indeed a valid self-closed comment declaration with one empty comment.

I can build some PoC for both scenarios detailed above, let me know if that is needed or if this is enough information.

Thanks!!

mikesamuel · 2024-03-25T18:42:02Z

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

mikesamuel · 2024-03-25T18:49:10Z

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changed since it was published in 1995.
The WhatWG steering group consists of representatives from the main browser vendors, and html.spec.whatwg.org is the best approximation of what HTML is.

mikesamuel · 2024-03-25T18:52:19Z

src/main/java/org/owasp/html/HtmlLexer.java

@@ -665,6 +679,8 @@ && canonicalElementName(start + 2, end)
                  if ('>' == ch) {
                    state = State.DONE;
                    type = HtmlTokenType.COMMENT;
+                  } else if ('!' == ch) {  // --!> is also valid closing sequence
+                    state = State.COMMENT_DASH_DASH;


This would seem to suggest that <!-- --!-> is a whole comment tag.
iiuc, after <!-- --!- we should be in the comment_dash state.

Perhaps we need ! to transition to COMMENT_DASH_DASH_BANG here which transitions as does case COMMENT above.

Does this look right?

flowchart TD
  BANG -- "-" --> BANG_DASH;
  BANG_DASH -- "-" --> COMMENT_DASH_AFTER_BANG;
  BANG_DASH -- "else" --> DIRECTIVE;
  COMMENT_DASH_AFTER_BANG -- "-" --> COMMENT_DASH_AFTER_BANG;
  COMMENT_DASH_AFTER_BANG -- ">" --> DONE;
  COMMENT_DASH_AFTER_BANG -- "else" --> COMMENT;
  COMMENT -- "-" --> COMMENT_DASH;
  COMMENT -- "else" --> COMMENT;
  COMMENT_DASH -- "-" --> COMMENT_DASH_DASH;
  COMMENT_DASH -- "else" --> COMMENT;
  COMMENT_DASH_DASH -- ">" --> DONE;
  COMMENT_DASH_DASH -- "-" --> COMMENT_DASH_DASH;
  COMMENT_DASH_DASH -- "!" --> COMMENT_DASH_DASH_BANG;
  COMMENT_DASH_DASH_BANG -- ">" --> DONE;
  COMMENT_DASH_DASH_BANG -- "-" --> COMMENT_DASH;
  COMMENT_DASH_DASH_BANG -- "else" --> COMMENT;


Yes, this seems correct. The missing part is the COMMENT_DASH_AFTER_BANG case and the <!> scenario too. But the new COMMENT_DASH_DASH_BANG suggestion seems correct to me.
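For reference, the transitions in the flowchart can be sketched as a standalone state machine. This is not the code in HtmlLexer; it is a minimal sketch whose state names follow the flowchart. Four transitions are assumptions because the flowchart does not show them: BANG on ">" goes to DONE (the <!> scenario), BANG on anything else goes to DIRECTIVE, DIRECTIVE ends at the next ">", and COMMENT_DASH_DASH on any other character returns to COMMENT. The class and method names are hypothetical.

```java
// Hypothetical sketch of the flowchart; not the actual HtmlLexer code.
class CommentLexerSketch {
    enum State {
        BANG, BANG_DASH, COMMENT_DASH_AFTER_BANG, DIRECTIVE, COMMENT,
        COMMENT_DASH, COMMENT_DASH_DASH, COMMENT_DASH_DASH_BANG, DONE
    }

    static State step(State state, char ch) {
        switch (state) {
            case BANG:
                if (ch == '-') return State.BANG_DASH;
                if (ch == '>') return State.DONE;      // assumption: "<!>" closes at once
                return State.DIRECTIVE;                // assumption
            case BANG_DASH:
                return ch == '-' ? State.COMMENT_DASH_AFTER_BANG : State.DIRECTIVE;
            case COMMENT_DASH_AFTER_BANG:
                if (ch == '-') return State.COMMENT_DASH_AFTER_BANG;
                if (ch == '>') return State.DONE;      // "<!-->" and "<!--->"
                return State.COMMENT;
            case DIRECTIVE:
                return ch == '>' ? State.DONE : State.DIRECTIVE;  // assumption
            case COMMENT:
                return ch == '-' ? State.COMMENT_DASH : State.COMMENT;
            case COMMENT_DASH:
                return ch == '-' ? State.COMMENT_DASH_DASH : State.COMMENT;
            case COMMENT_DASH_DASH:
                if (ch == '>') return State.DONE;      // "-->"
                if (ch == '-') return State.COMMENT_DASH_DASH;
                if (ch == '!') return State.COMMENT_DASH_DASH_BANG;
                return State.COMMENT;                  // assumption
            case COMMENT_DASH_DASH_BANG:
                if (ch == '>') return State.DONE;      // "--!>"
                if (ch == '-') return State.COMMENT_DASH;
                return State.COMMENT;
            default:
                return State.DONE;
        }
    }

    /**
     * Returns the index just past the token that starts with "<!" at
     * {@code start}, or -1 if the input ends before the token is closed.
     */
    static int endOfBangToken(String html, int start) {
        if (!html.startsWith("<!", start)) {
            throw new IllegalArgumentException("expected \"<!\" at index " + start);
        }
        State state = State.BANG;
        for (int i = start + 2; i < html.length(); i++) {
            state = step(state, html.charAt(i));
            if (state == State.DONE) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] samples = {"<!><img src=1>", "<!--><img src=1>", "<!-- --!->", "<!-- --!>x"};
        for (String s : samples) {
            System.out.println(s + " -> " + endOfBangToken(s, 0));
        }
    }
}
```

With these transitions, <!-- --!-> is left unterminated, which is the case raised in the review, while <!>, <!-->, and <!-- --!> each end at their own ">".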

Thanks @mikesamuel , this makes sense. I implemented your proposed changes and added new tests. All the use cases discussed so far pass the tests. Can you please validate / approve the MR? Thanks!

@jfbyers Changes look good to me, let's see what Mike's thoughts are :)

@subbudvk @mikesamuel are you ok moving forward with this PR ? What are the next steps? Thank you.

subbudvk · 2024-03-26T01:07:12Z

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

Thanks @mikesamuel ! Appreciate your contribution to the open source community.

carlosvillasanchez · 2024-04-10T15:35:57Z

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changed since it was published in 1995.
The WhatWG steering group consists of representatives from the main browser vendors, and html.spec.whatwg.org is the best approximation of what HTML is.

About this, please forgive me in advance... I am not very used to referencing RFCs and/or browser standards.

That being said, in https://html.spec.whatwg.org/multipage/syntax.html#comments they state that a comment must start with <!--, which, as we are seeing, is not true in practice. Browsers clearly identify a pattern like <!> as a comment section.

Not sure if this means this HTML spec is wrong, or it is just not the one followed by browsers.

Also, the pattern <!--> is not specified in the spec (I guess this one rightfully so).

Again, let me know if it is useful to provide a live PoC for this.

jtmelton · 2025-05-27T18:16:04Z

@mikesamuel and @jmanico just wanted to bump this and see if we can get this merged? It's been a couple months since the last comments.

Fixes in parser to address bypass of the library and XSS attacks

9b25635

jfbyers marked this pull request as ready for review March 1, 2024 10:40

melloware approved these changes Mar 11, 2024


melloware approved these changes Mar 11, 2024


mikesamuel reviewed Mar 21, 2024

src/main/java/org/owasp/html/HtmlLexer.java

src/test/java/org/owasp/html/HtmlLexerTest.java

Aguado, Eduardo added 2 commits March 22, 2024 13:51

Fixes suggested by Mike Samuel

1ae3478

Spacing

e8b68ef

mikesamuel reviewed Mar 25, 2024


Aguado, Eduardo added 4 commits September 24, 2024 11:13

Merge remote-tracking branch 'origin' into bugfix/lexerHtmlComments

0d2f071

Added new test

ce28bc9

Implemented change suggested by mikesamuel

7408e06

Spacing

ee341c9
