Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide a way to specify a blocklist of protocols #2047

Closed
robstoll opened this issue Nov 10, 2023 · 5 comments
Closed

provide a way to specify a blocklist of protocols #2047

robstoll opened this issue Nov 10, 2023 · 5 comments

Comments

@robstoll
Copy link

robstoll commented Nov 10, 2023

following use case:
I want to restrict the protocol in href of <a>. allow everything with the exception of javascript (and all its encoded variants etc.)

currently, one can only either allow all or safelist.

@jhy jhy changed the title provide a way to specify a blacklist of protocols provide a way to specify a blocklist of protocols Nov 10, 2023
@jhy
Copy link
Owner

jhy commented Nov 10, 2023

What protocols do you want to support additionally than HTTP and HTTPS? Gopher?

My fundamental concern with blocklists is that it is very easy to forget something -- or the world changes and introduces something -- that will then cause unexpected results. That's why IMO the Cleaner Safelist implementation has proven largely resilient to XSS over time. Only allowing known safe content.

What is the motivating use case here? Are there that many protocols that people are actually using that it is a real chore to add them explicitly, if you know they're safe? Are you confident that in a few years somebody won't add gpt:// as a protocol? Etc.

@robstoll
Copy link
Author

robstoll commented Nov 10, 2023

I actually totally see your point and usually whitelist is what I use. this time I wasn't able to convince the client otherwise. the client want to be able to use all kind of protocols such as tel, callto, mailto, all kind of ftp etc. the client basically doesn't want to restrict itself to use new protocols in the future -- without the need to adjust the software. So it is rather the opposite, the client would want to use gpt: in the future. The only thing I was able to convince is that javascript needs to be forbidden.

Anyway, I see your concerns and IMO it's OK to keep it that way. Maybe another feature could be helpful which might exist already within jsoup and I did just not find it yet. A way to get the protocol from an href normalized in the sense of all kind of variants of href="javascript:", href=" javascript:", href="java\tscript:" etc. boil down to "javascript"

(side question. if I use addProtocols(..."https") is then something like "//google.com" allowed?)

@jhy
Copy link
Owner

jhy commented Nov 10, 2023

A way to get the protocol from an href normalized in the sense of all kind of variants of href="javascript:", href=" javascript:", href="java\tscript:" etc. boil down to "javascript"

Yes, node.absUrl(attribute) returns the normalized and absolute URL form, if it can be coerced into one. Some of those forms will not get parsed as a full URL and will return blank, so you could have code that tests for .startswith("javascript") or is .equals(""), and handles the link appropriately.

String html = """
  <a href="  javascript:alert(1)">1</a>
  <a href="javascript:alert(1)">2</a>
  <a href="java\tscript:alert(1)">3</a>
  """;
Document doc = Jsoup.parse(html, "https://example.com/");
Elements links = doc.select("a");
links.forEach(el -> print(el.text() + ": " + el.absUrl("href")));

Gives:

1: 
2: javascript:alert(1)
3: javascript:alert(1)

Note that I'm showing this as an example of using absUrl. I 100% don't recommend trying to defang HTML input by using a blocklist approach. The only sane approach is to use a positive known-good assertion test (as the Cleaner / Safelist does).

The Safelist isSafeTag and isSafeAttribute methods are overridable, so you could extend it to allow any protocol if you want, or perform other extended assertions.

the client basically doesn't want to restrict itself to use new protocols in the future -- without the need to adjust the software

My suggestion would be to make it admin configurable - just pipe in a list of accepted, known safe protocols.


if I use addProtocols(..."https") is then something like "//google.com" allowed?

The Safelist.basic already includes https protocols et al. And yes, links that are protocol relative are supported as long as you pass in the baseUri when parsing (either into the clean method or the parse method).

String html = "<a href='//google.com/'>Link</a>";
String clean = Jsoup.clean(html, "https://example.com/", Safelist.basic());
System.out.println(clean);

Produces:

<a href="https://google.com/" rel="nofollow">Link</a>

@jhy jhy closed this as completed Nov 10, 2023
@robstoll
Copy link
Author

My suggestion would be to make it admin configurable - just pipe in a list of accepted, known safe protocols.

was my suggestion as well.
thanks for the suggestions

jhy added a commit that referenced this issue Dec 3, 2023
That was preventing the valid decode of &#55357;&#56495; to 💯.

This rule must have been in the spec when initially implemented but I can't find a reference to it now. I'm assuming that the range had since been added, but can't immediately identify why it was explicitly excluded originally.

Fixes #2047
@jhy
Copy link
Owner

jhy commented Dec 4, 2023

(The above commit is not related to this issue, and should point to #2074 instead)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants