Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokeniser: Tag attributes that follow '<' character in attribute name are lost #1483

Open
jmeckman opened this issue Jan 28, 2021 · 5 comments
Labels
discussion Discussion for a new feature, or other change proposal

Comments

@jmeckman
Copy link

jmeckman commented Jan 28, 2021

The Tokeniser logic for parsing attribute names considers a '<' character to be the end of the tag. This is not consistent with the way the browsers engines that I tested on MacOS (Brave/Chrome, Safari, Firefox) handle this case.

As demonstrated here: http://try.jsoup.org/~X8uusGL-o4nn_aiT4XVefMuXW0Q

Consider the tag<a before="foo" <junk after="page.html">.

In this case, jsoup will associate the before attribute with the a tag. It will then process <junk as a new tag and associate the after attribute with it.

Handling more consistently with browsers might assign the unvalued attribute "<junk" to the a tag and continue processing additional attributes.

@jmeckman
Copy link
Author

I found that this logic was added as a way to address a previous issue I did not find in my issue search: #797

A configurable option would be helpful to allow consumers to handle this as needed.

@jmeckman
Copy link
Author

Imperfect workaround for those that want this behavior. It requires 2nd parsing of input, but could be useful for others in the meantime:

        Parser parser = new Parser(new HtmlTreeBuilder());
        parser.setTrackErrors(100);
        Document d = parser.parseInput(html, "");

        if (!parser.getErrors().isEmpty()) {
            StringBuilder sb = null;
            for (ParseError error : parser.getErrors()) {
                // Look for specific message produced by org.jsoup.parser.Tokeniser#error(TokeniserState state) when
                // it encounters a < as the start of an attribute name
                if ("Unexpected character '<' in input state [BeforeAttributeName]".equals(error.getErrorMessage())) {
                    if (html.charAt(error.getPosition()) == '<') {
                        if (sb == null) {
                            sb = new StringBuilder(html);
                        }
                        sb.setCharAt(error.getPosition(), ' ');
                    }
                }
            }
            if (sb != null) {
                // re-parse the corrected input
                d = new Parser(new HtmlTreeBuilder()).parseInput(sb.toString(), "");
            }
        }

@jhy
Copy link
Owner

jhy commented Jan 29, 2021

Yes, this is something I'm not super happy with -- personally from any example of HTML like this that I have seen, it's been clear that the author missed including a closing > on the previous tag, and it's better to behave like that than to assume they wanted an attribute named like <img. But a primary goal of jsoup is to parse consistently to current browsers.

So far I've been reluctant to add parse options to jsoup as it makes the API more complicated and harder to learn. Generally I'd prefer jsoup to just get things right.

Maybe one approach would be to, when encountering an errant <, check if the following string matches a known tag name (exactly, not starts-with). If so, act as today (start a new element). If not, use it as part of an attribute name.

What HTML are you encountering where the latter is better? Would like to understand it to consider other approaches.

@jhy jhy added the discussion Discussion for a new feature, or other change proposal label Jan 29, 2021
@jmeckman
Copy link
Author

jmeckman commented Feb 1, 2021

The use case I have is that I am parsing HTML where the intent of the author is to hide attributes from scanners/crawlers/parsers. The specific case where the actor was successful added new lines in front of the attribute as a clear attempt to fool it into assuming it was a new tag. The tag being parsed was an <a and the bogus attribute was not a legitimate HTML tag, but a random looking string like <hdasq. The intent was to hide the href to a malicious site by putting it after the oddball attribute.

In testing this HTML out in different browsers, I found that even if a real tag name was used, the browsers still ignored it. My goal is to parse the HTML as a browser would display it to an end user, so I would not benefit from correcting a careless HTML author's errant missing > character.

@jhy
Copy link
Owner

jhy commented Feb 3, 2021

Thanks, makes sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussion for a new feature, or other change proposal
Projects
None yet
Development

No branches or pull requests

2 participants