Tokeniser: Tag attributes that follow '<' character in attribute name are lost #1483
Comments
I found that this logic was added to address a previous issue that did not turn up in my issue search: #797. A configurable option would be helpful to allow consumers to handle this as needed.
Imperfect workaround for those that want this behavior. It requires a second parse of the input, but could be useful for others in the meantime:
Yes, this is something I'm not super happy with -- personally from any example of HTML like this that I have seen, it's been clear that the author missed including a closing

So far I've been reluctant to add parse options to jsoup as it makes the API more complicated and harder to learn. Generally I'd prefer jsoup to just get things right. Maybe one approach would be to, when encountering an errant

What HTML are you encountering where the latter is better? Would like to understand it to consider other approaches.
The use case I have is that I am parsing HTML where the intent of the author is to hide attributes from scanners/crawlers/parsers. The specific case where the actor was successful added new lines in front of the attribute as a clear attempt to fool it into assuming it was a new tag. The tag being parsed was an

In testing this HTML out in different browsers, I found that even if a real tag name was used, the browsers still ignored it. My goal is to parse the HTML as a browser would display it to an end user, so I would not benefit from correcting a careless HTML author's errant missing
Thanks, makes sense
The Tokeniser logic for parsing attribute names considers a '<' character to be the end of the tag. This is not consistent with the way the browser engines that I tested on macOS (Brave/Chrome, Safari, Firefox) handle this case.
As demonstrated here: http://try.jsoup.org/~X8uusGL-o4nn_aiT4XVefMuXW0Q
Consider the tag `<a before="foo" <junk after="page.html">`.

In this case, jsoup will associate the `before` attribute with the `a` tag. It will then process `<junk` as a new tag and associate the `after` attribute with it.

Handling this more consistently with browsers might assign the unvalued attribute `"<junk"` to the `a` tag and continue processing additional attributes.
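To make the browser-consistent handling concrete, here is a minimal sketch of an attribute scanner that follows the attribute-name rule in the WHATWG HTML tokenizer, where a '<' inside an attribute name is a parse error but is still appended to the name. This is not jsoup code: the class `BrowserLikeAttributes` and the method `parseAttributes` are hypothetical names used only for illustration, and the scanner handles a single start tag with no character references or error reporting.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of jsoup.
public class BrowserLikeAttributes {

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    /** Scans the attributes of one start tag, keeping '<' as part of an attribute name. */
    public static Map<String, String> parseAttributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = tag.length();
        int i = 0;
        if (i < n && tag.charAt(i) == '<') i++;
        // Skip the tag name: it ends at whitespace, '/' or '>'.
        while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '/' && tag.charAt(i) != '>') i++;

        while (i < n) {
            // Before attribute name: skip whitespace and stray '/'.
            while (i < n && (isSpace(tag.charAt(i)) || tag.charAt(i) == '/')) i++;
            if (i >= n || tag.charAt(i) == '>') break;

            // Attribute name: runs until whitespace, '/', '>' or '='.
            // A '<' does NOT end the name here, which is the point of this sketch.
            int nameStart = i++;
            while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '/'
                    && tag.charAt(i) != '>' && tag.charAt(i) != '=') i++;
            // Simplification: lowercases all letters, not only ASCII ones.
            String name = tag.substring(nameStart, i).toLowerCase(Locale.ROOT);

            while (i < n && isSpace(tag.charAt(i))) i++;

            String value = ""; // unvalued attributes get an empty value
            if (i < n && tag.charAt(i) == '=') {
                i++;
                while (i < n && isSpace(tag.charAt(i))) i++;
                if (i < n && (tag.charAt(i) == '"' || tag.charAt(i) == '\'')) {
                    char quote = tag.charAt(i++);
                    int valueStart = i;
                    while (i < n && tag.charAt(i) != quote) i++;
                    value = tag.substring(valueStart, i);
                    if (i < n) i++; // consume the closing quote
                } else {
                    int valueStart = i;
                    while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                    value = tag.substring(valueStart, i);
                }
            }
            attrs.putIfAbsent(name, value); // the first occurrence of a name wins
        }
        return attrs;
    }

    public static void main(String[] args) {
        // prints {before=foo, <junk=, after=page.html}
        System.out.println(parseAttributes("<a before=\"foo\" <junk after=\"page.html\">"));
    }
}
```

With the tag from this report, all three attributes stay on the `a` element: `before`, the unvalued `<junk`, and `after`.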