Certain characters in attribute values are not being escaped resulting in invalid HTML #198

jasonkres · 2015-07-31T17:06:23Z

Characters that should be escaped are not escaped when they occur in attribute values.
That is, a valid HTML5 document (e.g., per https://html5.validator.nu/) can be passed to CsQuery and come out as an invalid HTML5 document.

I have tested this 1.3.3249 (NuGet) and CsQuery master. The problem reproduces in both.

Not sure, but suspect that because this affects escaping, this may be a security problem where undesired HTML can be injected.

Program:

using System;
using CsQuery;

namespace CsQueryTestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            string h = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"><title>test</title></head><body><a href=\"http://www.example.com?a=1&amp;b=2\">Hello &amp; World</a></body></html>";
            Console.WriteLine("Valid per https://html5.validator.nu/:");
            Console.WriteLine(h);
            Console.WriteLine("Invalid per https://html5.validator.nu/:");
            Console.WriteLine(CQ.CreateDocument(h).Render());
        }
    }
}

Output:

Valid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&amp;b=2">Hello &amp; World</a></body></html>
Invalid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&b=2">Hello &amp; World</a></body></html>

Message from validator:

Error: & did not start a character reference. (& probably should have been escaped as &amp;.)
At line 1, column 117
example.com?a=1&b=2">Hello &am

The text was updated successfully, but these errors were encountered:

schmalls · 2015-08-19T18:22:29Z

I have also encountered this issue with quotes and angle brackets.

Before:

<span class="email" data-href="mailto:&quot;John Doe&quot; &lt;john.doe@example.com&gt;">

After:

<span class="email" data-href='mailto:"John Doe" <john.doe@example.com>'>

As you can see, the entity references disappeared and the quote style switched. The angle brackets in the attribute cause it to no longer be valid.

jamietre · 2015-08-19T18:37:05Z

The angle brackets are quoted, they don't need to transformed. That validates fine. The & is a problem.

schmalls · 2015-08-19T18:45:21Z

I am using Prince to convert from XHTML to PDF and it gives me an error saying the Unescaped '<' not allowed in attribute values. So, it may be they are more strict than the validator. My question is, why are the entities being converted in the first place? If they weren't, it would fix both issues.

jamietre · 2015-08-19T18:58:21Z

CsQuery is a parser.. Though this is really HtmlParser at work here for the
input When it loads a document it stores the actual value and not whatever
representation created it from the source. In a pre tag for example an & is
really an &. If we just stored the ASCII that came from the source then it
would be impossible to treat text data as something that is usable in any
context.

But HtmlParser is not broken. The problem is just in how I render it. &
needs to be escaped in attribute values always.

I am using Prince http://www.princexml.com to convert from XHTML to PDF
and it gives me an error saying the Unescaped '<' not allowed in attribute
values. So, it may be they are more strict than the validator. My question
is, why are the entities being converted in the first place? If they
weren't, it would fix both issues.

—
Reply to this email directly or view it on GitHub
#198 (comment).

jasonkres · 2015-08-20T16:20:28Z

So, it may be they are more strict than the validator.

Ideally, should be pluggable how these are handled when they occur in attributes.
?? Similar to how you handle other encoding issues with the HtmlEncoders ??
For this to be most useful, the encoder routine needs some context (especially: whether ' or " is the currently expected attribute delimiter).

jamietre · 2015-08-20T17:49:12Z

The code is here

public static string AttributeEncode(string text, bool alwaysQuote, out string quoteChar)...

It handles the quote context, but not the ampersand. The ampersand is the only character that requires encoding, and technically it only needs to be encoded if it is ambiguous -- so I'm actually surprised that validator.nu fails the example above.

I agree, there should be some control over how the attributes are generated. It might be desirable to use escaping for even legitimate characters for the same reason as elsewhere in the document (e.g. just for readability of unprintable characters in the source code). I can't think of a reason why the same translation used for the rest of the HTML wouldn't work in the attribute context.

There handling chooses apostrophe or double-quote in order to minimize the need for escaping quotes, but seems like this should be choosable (e.g. single quote, double quote, or "best fit") anyway.

jasonkres · 2015-08-20T18:04:21Z

and technically it only needs to be encoded if it is ambiguous

Of course, if the document is XHTML, then there is no flexibility on ampersands. In the case of working with the XML serialization of HTML5, you can't even tell from the doctype whether it is supposed to be XML, so I guess if you have you have to either play safe: look for the http://www.w3.org/1999/xhtml namespace, or provide options for the user to tell you want they want to do. So flexibility on this is good...

schmalls · 2015-08-20T19:06:56Z

There are three characters that have to be escaped for attributes in XHTML it looks like from the XML specification. They are <, &, and ".

jasonkres · 2015-08-20T19:26:14Z

Yes, escaping < is actually a must in XML spec; & and whichever of " or ' is not the delimiter, (of course.)
The relevant production from the spec is (where ^ indicates negation):

AttValue ::= '"' ([^<&"] | Reference)* '"'  
             | "'" ([^<&'] | Reference)* "'"

jasonkres mentioned this issue Jul 31, 2015

Certain characters in attribute values are not being escaped resulting in invalid HTML milkshakesoftware/PreMailer.Net#68

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

jasonkres commented Jul 31, 2015

schmalls commented Aug 19, 2015

jamietre commented Aug 19, 2015

schmalls commented Aug 19, 2015

jamietre commented Aug 19, 2015

jasonkres commented Aug 20, 2015

jamietre commented Aug 20, 2015

jasonkres commented Aug 20, 2015

schmalls commented Aug 20, 2015

jasonkres commented Aug 20, 2015

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

Comments

jasonkres commented Jul 31, 2015

schmalls commented Aug 19, 2015

jamietre commented Aug 19, 2015

schmalls commented Aug 19, 2015

jamietre commented Aug 19, 2015

jasonkres commented Aug 20, 2015

jamietre commented Aug 20, 2015

jasonkres commented Aug 20, 2015

schmalls commented Aug 20, 2015

jasonkres commented Aug 20, 2015