-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error with greater and lesser than tags #83
Comments
This is probably the most-reported issue in Sanitize, but I don't think people realize what a complex problem it is. So far, I haven't been able to come up with a solution that wouldn't make Sanitize potentially unsafe. The root of the issue is that "1 > 2 and 2 < 1" simply isn't syntactically valid HTML. In attempting to parse it, Nokogiri (which Sanitize uses under the hood) turns it into "1 > 2 and 2", as you can see here:
This is technically correct behavior. As far as Nokogiri is concerned, "< 1" indicates an unclosed tag, which is invalid, so Nokogiri removes the unclosed tag and its contents (" 1"). So, could Sanitize pre-parse the input and identify "<" chars that are meant literally rather than as the beginning of an unclosed tag, and then escape them before passing the input on to Nokogiri? Maybe. But so far I haven't been able to come up with a safe and effective way of doing this, and nobody else has either. If you have any clever ideas, I'm all ears! |
"Maybe. But so far I haven't been able to come up with a safe and effective way of doing this, and nobody else has either. If you have any clever ideas, I'm all ears!" I thought about this: Of course, this is not completely right. Sometimes a mathematician would like to write "x<a", but any issues could then be prevented by typing an extra spacebar, like "x < a". I'm not a programmer, so I can't help you with scripting this if you think it is a good solution. |
Unfortunately, those rules are too ambiguous. The mathematical less-than operator isn't the only use case for "<". For example, it's often used in plain text to demarcate a URL:
Some people use it in emoticons:
And some people even use it when discussing literal HTML tags:
These are just a few examples. In my experience, when Sanitize is used to sanitize comments or forum posts in which users aren't explicitly conscious of writing HTML (such as when they're actually writing Markdown or some other format that gets converted to HTML), users tend to expect "<" to just work magically in any arbitrary situation. |
Well, I undestand your point I think. To write a script where "<" works magically in arbitrary situation will be probably impossible. Especially your example of discussing literal HTML tags. What about converting |
There are two basic problems with this:
Right now, Sanitize consists of a battle-tested parser (Nokogiri) and a set of simple transformations that get applied to the DOM that Nokogiri generates. Nokogiri can't be configured to apply custom invalid tag escaping criteria, so we would have to implement this before passing HTML to Nokogiri. The last thing I want to do is implement another HTML parser on top of Nokogiri, because that means adding complexity, which means adding bugs. But what's the alternative? Regular expressions? What if the criteria (which have yet to be determined) can't be expressed using regular expressions? Or what if the regexes required are complex, and thus likely to introduce bugs? These are questions I can't answer right now because we lack the criteria needed to answer them, but they worry me. One of the things that makes Sanitize safe is its simplicity. Right now there is one regex in Sanitize, and it's relatively easy to understand, and is used for a well-defined task (validating URI schemes). The rest of Sanitize's logic is implemented in the form of easy to follow DOM transformations. Sanitize doesn't try to do too much, and what it does do, it tries to do with as little complexity as possible, because complexity leads to bugs. I'd like to stick to this principle as much as possible. So the ultimate question here is: is there a set of simple, unambiguous criteria that can be used to identify If you can answer that, I want to hear from you. If not, I appreciate your input, but there's no way to move forward until we first have this criteria. |
Perhaps fail harder, rather than being more accommodating? If I pass malformed HTML like @kenichi made reference to using |
I don't like the idea of failing hard, given that Sanitize is used for parsing blog comments, form posts, this actual GitHub comment, etc. While it would prevent unexpected output, I fear application developers would route around the failures by doing something less safe and more forgiving. Ideally, I'd like Sanitize to be safe and forgiving, erring on the side of safety when forgiveness isn't possible.
At this point I'm pretty sure we can completely rule out a regex-based solution, simply because there's no sane way to construct a regex that can reliably identify a non-tag I've considered writing a SAX-based pre-parser, but I plan to investigate https://github.com/rubys/nokogumbo, since HTML5's parsing rules are at least a little more lenient (they would allow "1 > 2 and 2 < 1", for example), but Gumbo and its Ruby wrappers are still very young and Google doesn't yet recommend using it with untrusted input. |
Well, isn't the HTML converter supposed to escape <? After all, it does work in Github comments even though Github uses sanitize. So in all those cases, it seems sanitize is being fed garbage and is doing the sane thing, that is, damage control on bugs of the HTML converter. (And I guess that's the case with Discourse, as I've written on the original thread). At most, you want to add a FAQ entry explaining this to users - and given this:
I guess you should. To me, if you can't answer "RTFM (read the fine manual)", you have a bug. Maybe you have other use cases than converters, such as users manually writing HTML — but is < supposed to just work there? |
I think I'll be happy if I can get Sanitize to the point of recovering from an invalid In the meantime, @Blaisorblade is probably right that there needs to be a FAQ. |
If Nokogiri parses the document fragment The only bulletproof solution is for Nokogiri to implement the HTML5 parsing algorithm, just like browsers do. |
Not necessarily. I don't think Nokogiri claims to support all of HTML5. I think accepting that fragment as “valid” HTML depends heavily on HTML5's error handling specification.
That's what @rgrove was getting at with his reference to https://github.com/rubys/nokogumbo#readme which backs onto https://github.com/google/gumbo-parser#readme for full HTML5 support. But Gumbo lists under its “Non-goals”:
|
I asked Gumbo's maintainer for clarification on Gumbo's security status. You can see his response at google/gumbo-parser#53. This gist of it is that Gumbo is currently known to be susceptible to at least one buffer overflow bug, and may have other issues, but they hope to fix these. Given this info, I'm definitely not comfortable using Gumbo in Sanitize yet, but I did create an experimental branch of Sanitize that uses Gumbo's parser via nokogumbo. It works very well, and I like its output a lot more than Nokogiri's. I won't switch until and unless Gumbo can safely handle untrusted input, but so far I like what I see. |
Good news everyone! Google now considers Gumbo robust enough for use with untrusted input, and I'm steaming ahead on Sanitize 3.0.0, which uses Gumbo to parse HTML and thus resolves this issue. There's a fair amount of work and testing left to do before 3.0.0 will be ready for release, but you can follow its progress in the dev-3.0.0 branch if you're interested. |
Excellent :) |
Fixed in Sanitize 3.0.0. |
This bug was initially discovered here: http://meta.discourse.org/t/after-lt-all-the-text-is-hidden/9807
A call to
Sanitize.clean("1 > 2 and 2 < 1")
produces1 > 2 and 2
.Everything after the
<
symbol is chopped off. Sorry I'm not familiar with your code base but I'm wondering if there's anything I can do to help this get fixed?The text was updated successfully, but these errors were encountered: