Tokeniser: Tag attributes that follow '<' character in attribute name are lost #1483

jmeckman · 2021-01-28T17:46:29Z

The Tokeniser logic for parsing attribute names treats a '<' character as the end of the tag. This is inconsistent with how the browser engines I tested on macOS (Brave/Chrome, Safari, Firefox) handle this case.

As demonstrated here: http://try.jsoup.org/~X8uusGL-o4nn_aiT4XVefMuXW0Q

Consider the tag <a before="foo" <junk after="page.html">.

In this case, jsoup will associate the before attribute with the a tag. It will then process <junk as a new tag and associate the after attribute with it.

Handling this consistently with browsers would mean assigning the valueless attribute "<junk" to the a tag and continuing to process the remaining attributes.
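To make the browser-style outcome concrete, here is a minimal sketch of an attribute scanner that keeps a '<' as part of the attribute name instead of ending the tag. It is a toy for a single start tag, not jsoup's Tokeniser; the class name `LenientAttributeScanner` and its `scan` method are hypothetical, and details such as character references and full whitespace rules are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy scanner for the attributes of ONE start tag; this is not jsoup's Tokeniser.
final class LenientAttributeScanner {

    // Scans e.g. <a before="foo" <junk after="page.html"> and returns name -> value.
    // A '<' inside the tag is kept as an ordinary attribute-name character.
    // Valueless attributes map to the empty string.
    static Map<String, String> scan(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = tag.length();
        int i = 1; // skip the opening '<'
        while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++; // skip the tag name
        while (i < n) {
            // skip whitespace and '/' between attributes
            while (i < n && (isSpace(tag.charAt(i)) || tag.charAt(i) == '/')) i++;
            if (i >= n || tag.charAt(i) == '>') break;
            int nameStart = i++; // always consume at least one character
            while (i < n && !isSpace(tag.charAt(i)) && "=/>".indexOf(tag.charAt(i)) < 0) i++;
            String name = tag.substring(nameStart, i).toLowerCase(Locale.ROOT);
            while (i < n && isSpace(tag.charAt(i))) i++;
            String value = "";
            if (i < n && tag.charAt(i) == '=') {
                i++;
                while (i < n && isSpace(tag.charAt(i))) i++;
                if (i < n && (tag.charAt(i) == '"' || tag.charAt(i) == '\'')) {
                    char quote = tag.charAt(i++);
                    int valueStart = i;
                    while (i < n && tag.charAt(i) != quote) i++;
                    value = tag.substring(valueStart, i);
                    if (i < n) i++; // step over the closing quote
                } else {
                    int valueStart = i;
                    while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                    value = tag.substring(valueStart, i);
                }
            }
            attrs.putIfAbsent(name, value); // first occurrence of a name wins
        }
        return attrs;
    }

    private static boolean isSpace(char c) {
        return Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(scan("<a before=\"foo\" <junk after=\"page.html\">"));
    }
}
```

With the example tag, all three attributes (before, <junk, after) end up on the a element, which is the result described above.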


jmeckman · 2021-01-28T18:08:51Z

I found that this logic was added to address an earlier issue that my issue search missed: #797

A configurable option would be helpful to allow consumers to handle this as needed.

jmeckman · 2021-01-29T15:59:26Z

An imperfect workaround for those who want this behavior. It requires a second parse of the input, but could be useful to others in the meantime:

        Parser parser = new Parser(new HtmlTreeBuilder());
        parser.setTrackErrors(100);
        Document d = parser.parseInput(html, "");

        if (!parser.getErrors().isEmpty()) {
            StringBuilder sb = null;
            for (ParseError error : parser.getErrors()) {
                // Look for specific message produced by org.jsoup.parser.Tokeniser#error(TokeniserState state) when
                // it encounters a < as the start of an attribute name
                if ("Unexpected character '<' in input state [BeforeAttributeName]".equals(error.getErrorMessage())) {
                    if (html.charAt(error.getPosition()) == '<') {
                        if (sb == null) {
                            sb = new StringBuilder(html);
                        }
                        sb.setCharAt(error.getPosition(), ' ');
                    }
                }
            }
            if (sb != null) {
                // re-parse the corrected input
                d = new Parser(new HtmlTreeBuilder()).parseInput(sb.toString(), "");
            }
        }

jhy · 2021-01-29T22:17:47Z

Yes, this is something I'm not super happy with -- personally from any example of HTML like this that I have seen, it's been clear that the author missed including a closing > on the previous tag, and it's better to behave like that than to assume they wanted an attribute named like <img. But a primary goal of jsoup is to parse consistently to current browsers.

So far I've been reluctant to add parse options to jsoup as it makes the API more complicated and harder to learn. Generally I'd prefer jsoup to just get things right.

Maybe one approach would be to, when encountering an errant <, check if the following string matches a known tag name (exactly, not starts-with). If so, act as today (start a new element). If not, use it as part of an attribute name.
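A minimal sketch of that check, assuming the position of the errant '<' is already known. The class `ErrantLessThanHeuristic`, the method `startsKnownTag`, and the short `KNOWN_TAGS` set are hypothetical and for illustration only; a real implementation would consult the parser's full list of known tags and would run inside the tokeniser rather than on the raw string.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the "exact known tag name" check; not part of jsoup.
final class ErrantLessThanHeuristic {

    // Illustrative subset only; a real implementation would use the parser's full tag list.
    private static final Set<String> KNOWN_TAGS =
            Set.of("a", "b", "div", "h1", "i", "img", "p", "script", "span", "table");

    // Returns true if the '<' at pos is followed by a complete known tag name.
    // The match is exact: "<img" qualifies, "<imgx" and "<junk" do not.
    static boolean startsKnownTag(String html, int pos) {
        if (pos < 0 || pos >= html.length() || html.charAt(pos) != '<') {
            return false;
        }
        int end = pos + 1;
        // Read the run of letters and digits that would form the tag name.
        while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
            end++;
        }
        String name = html.substring(pos + 1, end).toLowerCase(Locale.ROOT);
        return KNOWN_TAGS.contains(name);
    }

    public static void main(String[] args) {
        String html = "<a before=\"foo\" <junk after=\"page.html\">";
        int errant = html.indexOf('<', 1); // the second '<' in the input
        // true  -> act as today and start a new element
        // false -> treat the '<' as part of an attribute name
        System.out.println(startsKnownTag(html, errant));
    }
}
```

Under this heuristic `<junk` would stay an attribute of the a tag, while an errant `<img` would still start a new element.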

What HTML are you encountering where the latter is better? Would like to understand it to consider other approaches.

jmeckman · 2021-02-01T22:31:39Z

My use case is parsing HTML whose author intends to hide attributes from scanners, crawlers, and parsers. In the specific case where the actor succeeded, newlines were added in front of the attribute in a clear attempt to fool the parser into treating it as a new tag. The tag being parsed was an <a, and the bogus attribute was not a legitimate HTML tag but a random-looking string like <hdasq. The intent was to hide the href to a malicious site by putting it after the oddball attribute.

In testing this HTML out in different browsers, I found that even if a real tag name was used, the browsers still ignored it. My goal is to parse the HTML as a browser would display it to an end user, so I would not benefit from correcting a careless HTML author's errant missing > character.

jhy · 2021-02-03T08:41:48Z

Thanks, makes sense

jhy added the discussion label Jan 29, 2021

panthony mentioned this issue Apr 18, 2023

JSoup differs from browsers around commented HTML attributes #1938

Open
