-
Notifications
You must be signed in to change notification settings - Fork 249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Certain characters in attribute values are not being escaped resulting in invalid HTML #198
Comments
I have also encountered this issue with quotes and angle brackets. Before: <span class="email" data-href="mailto:"John Doe" <john.doe@example.com>"> After: <span class="email" data-href='mailto:"John Doe" <john.doe@example.com>'> As you can see, the entity references disappeared and the quote style switched. The angle brackets in the attribute cause it to no longer be valid. |
The angle brackets are quoted, they don't need to transformed. That validates fine. The & is a problem. |
I am using Prince to convert from XHTML to PDF and it gives me an error saying the Unescaped '<' not allowed in attribute values. So, it may be they are more strict than the validator. My question is, why are the entities being converted in the first place? If they weren't, it would fix both issues. |
CsQuery is a parser.. Though this is really HtmlParser at work here for the But HtmlParser is not broken. The problem is just in how I render it. & I am using Prince http://www.princexml.com to convert from XHTML to PDF — |
Ideally, should be pluggable how these are handled when they occur in attributes. |
The code is here
It handles the quote context, but not the ampersand. The ampersand is the only character that requires encoding, and technically it only needs to be encoded if it is ambiguous -- so I'm actually surprised that validator.nu fails the example above. I agree, there should be some control over how the attributes are generated. It might be desirable to use escaping for even legitimate characters for the same reason as elsewhere in the document (e.g. just for readability of unprintable characters in the source code). I can't think of a reason why the same translation used for the rest of the HTML wouldn't work in the attribute context. There handling chooses apostrophe or double-quote in order to minimize the need for escaping quotes, but seems like this should be choosable (e.g. single quote, double quote, or "best fit") anyway. |
Of course, if the document is XHTML, then there is no flexibility on ampersands. In the case of working with the XML serialization of HTML5, you can't even tell from the doctype whether it is supposed to be XML, so I guess if you have you have to either play safe: look for the http://www.w3.org/1999/xhtml namespace, or provide options for the user to tell you want they want to do. So flexibility on this is good... |
There are three characters that have to be escaped for attributes in XHTML it looks like from the XML specification. They are |
Yes, escaping < is actually a must in XML spec; & and whichever of " or ' is not the delimiter, (of course.)
|
Characters that should be escaped are not escaped when they occur in attribute values.
That is, a valid HTML5 document (e.g., per https://html5.validator.nu/) can be passed to CsQuery and come out as an invalid HTML5 document.
I have tested this 1.3.3249 (NuGet) and CsQuery master. The problem reproduces in both.
Not sure, but suspect that because this affects escaping, this may be a security problem where undesired HTML can be injected.
Program:
Output:
Message from validator:
The text was updated successfully, but these errors were encountered: