Could handle missing tag ends (>) better #797

Closed
bodiam opened this issue Dec 11, 2016 · 4 comments

bodiam commented Dec 11, 2016

We are using Jsoup to parse HTML documents from external websites that are not under our control. A few days ago, one of these sites was updated and a bug was introduced, which caused our crawling to fail spectacularly. The broken HTML looked a bit like this:

<td class="my-cell"
   <div class="great-formatting">100</div>
</td>

As you can see, the opening td tag is missing its closing >. Our code calls document.select("div.great-formatting"), and this failed because Jsoup couldn't parse the document correctly anymore.

I understand this is very much an edge case, and maybe very hard to fix. However, for us it was a production issue, and it caused us quite a few headaches. Right now, we run a sort of preprocessor over the HTML to close all elements which should be closed, but it would be much nicer if Jsoup handled this out of the box.
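
The preprocessor itself is not shown in this thread. As a rough illustration only, a pre-pass of that kind might look like the sketch below, which inserts a `>` whenever a new tag starts before the previous one has ended. The class and method names (`TagEndRepair`, `closeUnterminatedTags`, `startsTag`) are hypothetical, and the scan is deliberately naive: it does not understand comments, script or style contents, or an unterminated attribute quote.

```java
class TagEndRepair {

    /**
     * Hypothetical, naive pre-pass: if a '<' that begins a new tag appears
     * while an earlier tag is still open (outside attribute quotes),
     * a '>' is inserted in front of it to end the earlier tag.
     */
    static String closeUnterminatedTags(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        boolean inTag = false; // true between a tag's '<' and its '>'
        char quote = 0;        // active attribute quote character, or 0 if none

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (quote != 0) {
                    if (c == quote) quote = 0;           // closing quote
                } else if (c == '"' || c == '\'') {
                    quote = c;                           // opening quote
                } else if (c == '>') {
                    inTag = false;                       // tag ended normally
                } else if (c == '<') {
                    out.append('>');                     // earlier tag never ended
                    inTag = startsTag(html, i);
                }
            } else if (c == '<' && startsTag(html, i)) {
                inTag = true;
            }
            out.append(c);
        }
        return out.toString();
    }

    /** True if the '<' at index i is followed by a letter or '/'. */
    static boolean startsTag(String html, int i) {
        if (i + 1 >= html.length()) return false;
        char next = html.charAt(i + 1);
        return Character.isLetter(next) || next == '/';
    }

    public static void main(String[] args) {
        String broken = "<td class=\"my-cell\"\n"
                + "   <div class=\"great-formatting\">100</div>\n"
                + "</td>";
        System.out.println(closeUnterminatedTags(broken));
    }
}
```

With the sample above, the only change is a `>` inserted directly before `<div`; input whose tags are already terminated passes through unchanged.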

jhy changed the title from "Jsoup cannot parse broken HTML" to "Could handle missing tag ends (>) better" on Dec 22, 2016
jhy (Owner) commented Dec 22, 2016

Thanks for the report. This is implemented per the HTML5 spec and matches how browsers behave. (That sounds like an excuse or cop-out, but I just mean it for context.)

I think if we see a < where an attribute name would start, we could assume the intent was to start a new tag, not to create an attribute with a < in its name. Even more so if there was a newline between them (although the state machine doesn't know the latter at the moment).

jhy added this to the 1.12.1 milestone on Apr 29, 2018
jhy closed this as completed in bdf1df7 on Apr 29, 2018
bodiam (Author) commented Apr 29, 2018

Hi, thanks for fixing this issue, much appreciated!

akaakuk commented Nov 13, 2018

Hi,

Just a quick one regarding this issue, as I can see it's now closed. We are also using jsoup to parse HTML documents from external websites, but we run a testing tool and need to know when something about a website is incorrect. Is this behaviour configurable?

Thanks.

jhy (Owner) commented Dec 24, 2018

@akaakuk you can enable error tracking if you want to catch errors during parsing.
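
For reference, a minimal sketch of what enabling error tracking might look like, assuming jsoup is on the classpath. `Parser.htmlParser()`, `setTrackErrors`, `getErrors` and `Jsoup.parse(html, baseUri, parser)` are jsoup calls; the class name `ParseErrorReport`, the helper `collectErrors` and the limit of 10 are only illustrative. Which errors are reported for a given input, and their messages, can vary between jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

import java.util.List;

class ParseErrorReport {

    /** Parses the HTML with error tracking on and returns up to maxErrors parse errors. */
    static List<ParseError> collectErrors(String html, int maxErrors) {
        // Error tracking is off by default; a positive limit turns it on.
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        Jsoup.parse(html, "", parser); // the resulting Document is not needed here
        return parser.getErrors();
    }

    public static void main(String[] args) {
        String broken = "<td class=\"my-cell\"\n"
                + "   <div class=\"great-formatting\">100</div>\n"
                + "</td>";
        for (ParseError error : collectErrors(broken, 10)) {
            // Position is the character offset in the input where the error was noticed.
            System.out.println(error.getPosition() + ": " + error.getErrorMessage());
        }
    }
}
```

A testing tool could treat a non-empty error list as a sign that the page's markup needs a closer look, while still using the parsed document as usual.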
