Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes in HTML Lexer to support HTML empty comment statements #327

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jfbyers
Copy link

@jfbyers jfbyers commented Mar 1, 2024

Summary and intro

The HTML Lexer fails to detect empty HTML comment declarations leading into the next piece of HTML object in the input to not be detected as such.

Basically, for an input such as <!><img src=1 onError=alert(1)>, the lexer considers the whole blob as an HTML comment instead of an empty comment declaration (<!>) and then an img tag (which is what browsers would do).

This bug in the lexer, provides wrong information to the HtmlChangeListener

HTML RFC context

In the HTML RFC, comments are defined as:

To include comments in an HTML document, use a comment declaration. A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'.

This means that <!> is a valid comment declaration with zero comments inside. Which means the following HTML code would trigger the alert(1) if rendered in a browser:

<html><body>
<!><script>alert(1)</script>
</body></html>

In addition, the pattern <!--> is also transformed by browsers to <!-- --> (not sure why yet, but tested in all major browsers). This means the following HTML code would also trigger the alert(1) if rendered in a browser:

<html><body>
<!--><script>alert(1)</script>
</body></html>

The bug

The Lexer does not consider neither <!> nor <!--> as valid comment declaration statements, considering the last character of both statements (>) still as part of the comment.
This means, when the sanitizer reads the following input <!><img src=1 onError=alert(1)>, the lexer will interpret the whole input as an HTML comment.
However, the expected behavior would be to detect <!> as an HTML comment declaration, and <img src=1 onError=alert(1)> as an img HTML tag.

This is easy to see with the HtmlChangeListener class. For example, the following code provides the following output:

package org.owasp.html;

import java.util.ArrayList;
import java.util.List;

class BugTest
{
	public static void main(String[] args)
	{
		String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!><img src=1>qwe4", "qwe5<!--><img src=1>qwe6"};
		List<String> results = new ArrayList<String>();
		for (String s : test)
		{
			MyListener htmlChangeListener = new MyListener();
			PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
			String safeString = sanitizePolicy.sanitize(s, htmlChangeListener, results);
			System.out.println( "safeString :"  + safeString + " "+htmlChangeListener.getOutOfPolicyObjects());
		}

	}


}

... 

package org.owasp.html;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.util.LinkedList;
import java.util.List;

public class MyListener implements HtmlChangeListener<List<String>>
{
	private final List<String> outOfPolicyObjects = new LinkedList<String>();

	@Override
	public void discardedTag(@Nullable List<String> context, @Nonnull String elementName)
	{
		outOfPolicyObjects.add(elementName);
	}

	@Override
	public void discardedAttributes(@Nullable List<String> context, @Nonnull String tagName, @Nonnull String... attributeNames)
	{
		outOfPolicyObjects.add(String.format("%s (%s)", tagName, String.join(",", attributeNames)));
	}

	public List<String> getOutOfPolicyObjects()
	{
		return outOfPolicyObjects;
	}
}

The output:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 []
safeString :qwe5 []

However, the expected correct output should be:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 [img]
safeString :qwe5qwe6 [img]

The fix

We added a new state called COMMENT_DASH_AFTER_BANG into the HtmlInputSplitter inside HtmlLexer to handle dash character after the bang one (!-)
Also we created special condition checks in the BANG and BANG_DASH states inside the lexer's state machine to handle <!> and <!--> comments.

Authors

Bug discovery: Carlos Villa (@carlosvillasanchez)
Code fix: Eduardo Aguado (@jfbyers)

@jfbyers jfbyers marked this pull request as ready for review March 1, 2024 10:40
@subbudvk
Copy link
Contributor

subbudvk commented Mar 5, 2024

@jfbyers Can you pl provide a test case, where with a minimal policy, post sanitization, the vector is not sanitized?

@jfbyers
Copy link
Author

jfbyers commented Mar 5, 2024

Hi @subbudvk , with the current implementation the following strings are not properly sanitized:

class BugTest
{
	public static void main(String[] args)
	{
		String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!>qwe4", "qwe5<!-->qwe6"};
		for (String s : test)
		{
			PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
			String safeString = sanitizePolicy.sanitize(s, null, null);
			System.out.println( "safeString :"  + safeString);
		}

	}
}

This outputs:

safeString :qwe1qwe2
safeString :qwe3
safeString :qwe5

And it should be:

safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6

As I mentioned in my first message (apologies if I did not explain myself clearly) this is even more misleading if you use a listener to add tags/attributes in

discardedTag() 

or

discardedAttribute()

from the string being sanitized e.g string qwe3<!><img src=1 onerror=alert(1)> would not add <img src=1 onerror=alert(1)> as a discarded tag so any library consumer might take incorrect decisions about the contents of the string.

melloware

This comment was marked as outdated.

@jfbyers
Copy link
Author

jfbyers commented Mar 12, 2024

Thanks @melloware , can you approve / run the workflows or you want me to add more unit tests?

@melloware
Copy link

No you are good. I don't have commit privs so I can't run your workflow only @mikesamuel has permissions to this repo.

@mikesamuel
Copy link
Contributor

Which HTML RFC are you quoting?

https://html.spec.whatwg.org/multipage/syntax.html#comments seems to disagree. https://html.spec.whatwg.org/multipage/parsing.html#parse-errors does class <! under incorrectly opened comment, but why would the incorrectly opened comment end at > though instead of -->?

And it should be:

safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6

Why?

@subbudvk
Copy link
Contributor

@mikesamuel : Can you kindly release a new version at least with whatever changes that are already merged to master?

@melloware
Copy link

+1 to @subbudvk just get any new version out there that gets rid of Guava.

@jfbyers
Copy link
Author

jfbyers commented Mar 22, 2024

Added support for 2 comment parser errors https://html.spec.whatwg.org/multipage/parsing.html#parse-errors :

  • abrupt-closing-of-empty-comment (i.e., or ). The parser behaves as if the comment is closed correctly.
  • incorrectly-closed-comment comment that is closed by the "--!>" code point sequence. The parser treats such comments as if they are correctly closed by the "-->" code point sequence.

@carlosvillasanchez
Copy link

Which HTML RFC are you quoting?
....

Hey @mikesamuel

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

To include comments in an HTML document, use a comment declaration. A comment declaration consists of <! followed by zero or more comments followed by >. Each comment starts with -- and includes all text up to and including the next occurrence of --. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored

Based on this, <!> it is indeed a valid self closed comment declaration. Without any comment.

It is true, that based on this, <!--> it is not a valid self closed comment declaration. However, we have seen in the most used browsers (Chrome, Firefox, Edge, Safari), the behavior is to modify this blob into <!----> which is indeed a valid self closed comment section, with one empty comment.

I can build some PoC for both scenarios detailed above, let me know if that is needed or if this is enough information.

Thanks!!

@mikesamuel
Copy link
Contributor

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

@mikesamuel
Copy link
Contributor

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995.
The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.

@@ -665,6 +679,8 @@ && canonicalElementName(start + 2, end)
if ('>' == ch) {
state = State.DONE;
type = HtmlTokenType.COMMENT;
} else if ('!' == ch) { // --!> is also valid closing sequence
state = State.COMMENT_DASH_DASH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would seem to suggest that<!-- --!-> is a whole comment tag.
iiuc, after <!-- --!- we should be in the comment_dash state.

Perhaps we need ! to transition to COMMENT_DASH_DASH_BANG here which transitions as does case COMMENT above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this look right?

flowchart TD
  BANG -- "-" --> BANG_DASH;
  BANG_DASH -- "-" --> COMMENT_DASH_AFTER_BANG;
  BANG_DASH -- "else" --> DIRECTIVE;
  COMMENT_DASH_AFTER_BANG -- "-" --> COMMENT_DASH_AFTER_BANG;
  COMMENT_DASH_AFTER_BANG -- ">" --> DONE;
  COMMENT_DASH_AFTER_BANG -- "else" --> COMMENT;
  COMMENT -- "-" --> COMMENT_DASH;
  COMMENT -- "else" --> COMMENT;
  COMMENT_DASH -- "-" --> COMMENT_DASH_DASH;
  COMMENT_DASH -- "else" --> COMMENT;
  COMMENT_DASH_DASH -- ">" --> DONE;
  COMMENT_DASH_DASH -- "-" --> COMMENT_DASH_DASH;
  COMMENT_DASH_DASH -- "!" --> COMMENT_DASH_DASH_BANG;
  COMMENT_DASH_DASH_BANG -- ">" --> DONE;
  COMMENT_DASH_DASH_BANG -- "-" --> COMMENT_DASH;
  COMMENT_DASH_DASH_BANG -- "else" --> COMMENT;
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this seems correct. The missing part is the COMMENT_DASH_AFTER_BANG case and the <!> scenario too. But the new COMMENT_DASH_DASH_BANG suggestion seems correct to me.

Copy link
Author

@jfbyers jfbyers Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mikesamuel , this makes sense. I implemented your proposed changes and added new tests. All the use cases discussed so far pass the tests. Can you please validate / approve the MR? Thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfbyers Changes look good to me, let see what mike's thoughts are :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@subbudvk @mikesamuel are you ok moving forward with this PR ? What are the next steps? Thank you.

@subbudvk
Copy link
Contributor

subbudvk commented Mar 26, 2024

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

Thanks @mikesamuel ! Appreciate your contribution to the open source community.

@carlosvillasanchez
Copy link

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995.
The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.

About this, please forgive me in advance... I am not super used to referencing in RFCs and/or browsers standards.

That being said, in https://html.spec.whatwg.org/multipage/syntax.html#comments they state a comment must start with <!-- and finish with -->, which as we are seeing it is not true. Browsers clearly identify a pattern like <!> as a comment section.

Not sure if this means this HTML spec is wrong, or it is just not the one followed by browsers.

Also the pattern <!--> that browsers turn into <!-- --> its not specified in the spec (I guess this one rightfully so).

Again, let me know if it is useful to provide a live PoC for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants