Add html5gum as alternative link extractor #480
Conversation
Wow, that was unexpected!

Specimen 1: Sentry docs
It finds three broken links that were not discovered before:
I double-checked and they indeed all contain an apostrophe character like this:
So I'd say that's indeed a bug in the Sentry docs. 🎉

Specimen 2: My blog
It found a previously undetected link to http://www.w3.org/2000/svg in this SVG blob 🎉 :
Specimen 3: Rudimentary performance check

Was curious about the performance, so I checked with hyperfine. Version with html5ever:
Version with html5gum:
That's a very nice boost! Another one with the Sentry docs:

html5ever: 2.131 s ± 0.048 s

So, great work! 🥳 I'll do some more testing when I get the time and add updates here.
i'm suspicious of timing things that in principle should be IO-bound. is it possible to reverse the order of those two benchmarks and get inverted results? I wonder if it's some cache on endler.dev playing a trick on you. is the sentry-docs benchmark with
btw i was impressed with how fast lychee is nowadays; last time I checked it was on-par with hyperlink due to how aggressively it parallelizes parsing, and the network I/O seems to go very fast too (you can get faster performance from hyperlink on a beefy laptop by increasing the threadcount with
cc @lebensterben as well, i think i saw you propose using tree-sitter for parsing, so I do wonder what you think of this
Yeah it's not a great benchmark. 😆 Just something ad hoc I tried.
Yup.
Yeah, it took a while to come up with an abstraction that feels right, and there's still some tweaking we can do to reduce allocations, but overall I'm happy with the mix between parallel parsing and async I/O.
I recommended the tree-sitter parser mainly for
I recommend using https://wiki.archlinux.org/title/List_of_applications for benchmarks. It is long enough and contains a good amount of bad links.
Great idea. Added it to the benchmarks in #481.
is there a tree-sitter parser for html5 that follows the whatwg spec? https://github.com/tree-sitter/tree-sitter-html appears to have outstanding issues for basic things like html entities, which is a problem because those are valid in attributes
I don't think so. But for the purpose of link detection, do we really need full compliance with the specification? Or to put it another way, can we have a more specialized parser that just supports the features we need?
you do at the very minimum need:
at least I believe you do, to somewhat conform to whatwg; you can always try it out without. without adhering to whatwg you risk not matching browser behavior, and so not matching what the user sees.

i'm generally not sure what lychee's guidance there is: does it check broken links that might break the site for a browser-using user (broken css/js, broken clickable links, links that are visible in plaintext and therefore reachable by copy-pasting into the browser bar), or does it want to do more than that? it appears to attempt to extract random URLs from all sorts of attributes, but then skips over html comments.

plaintext url extraction is a risky business, and so is overreliance on it. there is no particularly narrow set of characters allowed in a URL in practice, there's no single followed URL spec, and relative links can have pretty much any sort of character in them at any position. i wouldn't rely on that alone too much, so you do still need to match on attribute names a bit imo
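For concreteness, matching on element/attribute names could look like the sketch below. The allowlist is illustrative only, not lychee's actual table of link-bearing attributes:

```rust
// illustrative allowlist; lychee's real set of tag/attribute pairs differs
fn is_link_attr(element: &str, attribute: &str) -> bool {
    matches!(
        (element, attribute),
        ("a", "href") | ("link", "href") | ("img", "src") | ("script", "src")
    )
}
```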
we need some clarification here...
right now? currently lychee has a predefined set of tag names and attributes that it directly treats as links, and beyond that it attempts to apply
no. I mean: what is lychee designed to check? Does it check all URLs (except comments) in an HTML document? We need a clear design decision on this for both users and developers.
Oh i overlooked
that's fine, html5ever/html5gum also only give you that (html5gum even less). what matters is how you deal with malformed input, mis-nested tags, and edge cases in parsing.
I think we had this discussion before, and the conclusion was that we could have a way to filter out invisible links, i.e. things which aren't inside an `a` tag inside an html document, for example.
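A sketch of what such a filter could look like, assuming each extracted link carries its element and attribute names; the `RawLink` struct here is a stand-in, not lychee's actual type:

```rust
// stand-in for whatever struct carries element/attribute info per link
struct RawLink {
    element: String,
    attribute: String,
    url: String,
}

// keep only links a browser user could actually click
fn visible_only(links: Vec<RawLink>) -> Vec<RawLink> {
    links
        .into_iter()
        .filter(|l| l.element == "a" && l.attribute == "href")
        .collect()
}
```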
this pr produces exactly the same structs as the html5ever version did, if you mean that each link knows its tag name and attribute name (as defined in
Oh that's nice. Also, here are some results of running the benchmark on my machine:
html5ever:

html5gum blows html5ever out of the water. A 3x perf improvement is quite impressive!
huh, when I run the benchsuite in this repo (elvis.html, right?) it gives me a 10% speedup. i'm fine with merging as-is, but i would like to ensure somebody else reviews this PR and sort of understands what's going on there. this sort of low-level API that html5gum exposes instead of a token iterator is kind of brittle, in that it's easy to mis-implement the trait
@untitaker We can make all of them from_utf8_unchecked as long as the invariant that the strings are UTF-8 is valid.
interesting, elvis.html on my machine was only improving by 10%
runtime switch sounds doable, keep in mind that binary size will suffer a bit because of the duplicated entities table across parsers, but istm a linux build of lychee is already quite large
yes, it should be safe. I can add a patch so @mre can see how much this gains; it might be measurable
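For reference, the invariant being discussed, sketched as hypothetical helpers (not lychee's code): the unchecked conversion is sound only because the tokenizer was fed a `&str`, while the checked variant pays an extra validation pass:

```rust
fn as_str_unchecked(buf: &[u8]) -> &str {
    // safety: sound only because the tokenizer input was a &str,
    // so every buffer it hands back is valid UTF-8
    unsafe { std::str::from_utf8_unchecked(buf) }
}

fn as_str_checked(buf: &[u8]) -> &str {
    // the safe alternative costs a UTF-8 validation pass per call
    std::str::from_utf8(buf).expect("tokenizer was fed valid UTF-8")
}
```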
this was benchmarked on an M1 Max in my case.
If the html engine is behind a feature flag, only the code for one engine will be compiled into the library. So I think we could have a
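A minimal sketch of the feature-flag idea; the feature name is hypothetical, not an actual lychee flag. With mutually exclusive features, only one engine's code ends up in the binary:

```rust
// hypothetical feature name; with mutually exclusive Cargo features,
// only one of these bodies is compiled in
#[cfg(feature = "html5gum-extractor")]
fn extract_links(_html: &str) -> Vec<String> {
    // html5gum-based extraction would live here
    unimplemented!()
}

#[cfg(not(feature = "html5gum-extractor"))]
fn extract_links(_html: &str) -> Vec<String> {
    // html5ever-based extraction would live here
    unimplemented!()
}
```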
Sure! I'd be delighted to try. Thanks for your work.
I was just writing a comment on the new Tokio 1.16 release, with a ~20% perf improvement for unevenly distributed workloads, here, and that got me thinking: could we improve performance by making link extraction entirely async? html5gum could add support for returning a stream of URIs, and the request workers could start checking links right away -- while the document is not even fully parsed yet. html5ever doesn't provide such an API. It might be possible to add there as well, but it would be fancy if html5gum could provide that out of the box. Do you folks think that this could improve performance, or would the futures overhead dwarf the gains?
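A rough sketch of how such a stream could be wired up on the consumer side, assuming a tokio runtime is running; this is not an existing html5gum or lychee API, and the tokenizer loop is a stand-in:

```rust
use tokio::sync::mpsc;

// must be called from within a tokio runtime; the synchronous tokenizer
// runs on a blocking thread and forwards each link as soon as it is found
fn extract_streaming(body: String) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel(128);
    tokio::task::spawn_blocking(move || {
        // stand-in for the real tokenizer loop
        for link in body.split_whitespace().filter(|w| w.starts_with("http")) {
            // if the receiver is dropped, stop parsing early
            if tx.blocking_send(link.to_owned()).is_err() {
                break;
            }
        }
    });
    rx
}
```

Request workers could then consume the receiver while the document is still being parsed.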
This is all done in an effort to reduce allocations, right? Another approach would be to return a stream of links with

```rust
struct Link<'a> {
    element: Element,
    attribute: Attribute,
    link: Cow<'a, str>,
}
```

This would also be zero-copy but with the added advantage of being straightforward to use. Users could drive the stream to completion and filter out anything they don't need. The stream supports checking links while the document is still getting parsed. I'm sure you considered that, but I wanted to get your opinion. It doesn't really affect this PR, so we can also move the discussion over to html5gum.
Right, this PR effectively gets rid of all per-useless-token allocations. A "useless token" is any token that we don't extract a link from (doctype, comment, most tags). You're thinking of how to reduce the amount of allocations per useful token. This is also valid, just probably harder to pull off. I'm somewhat sure the next optimization in this area with the best cost/benefit ratio is reusing the extractor across html documents -- this gets rid of all the per-document allocations, which doesn't even come close to per-link allocations but is relatively easy to pull off (just put the extractor on a thread local)
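A minimal sketch of the thread-local idea, assuming a simplified `LinkExtractor` with reusable buffers (not lychee's actual type):

```rust
use std::cell::RefCell;

// stand-in extractor whose scratch buffer keeps its capacity across documents
#[derive(Default)]
struct LinkExtractor {
    current_string: Vec<u8>, // reused scratch buffer
    links: Vec<String>,
}

impl LinkExtractor {
    fn extract(&mut self, doc: &str) -> Vec<String> {
        self.current_string.clear(); // drop contents, keep capacity
        // stand-in for real tokenization
        self.links.extend(
            doc.split_whitespace()
                .filter(|w| w.starts_with("http"))
                .map(String::from),
        );
        std::mem::take(&mut self.links) // hand results to the caller
    }
}

thread_local! {
    static EXTRACTOR: RefCell<LinkExtractor> = RefCell::new(LinkExtractor::default());
}

fn extract(doc: &str) -> Vec<String> {
    // per-document allocations amortize away after the first document per thread
    EXTRACTOR.with(|e| e.borrow_mut().extract(doc))
}
```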
I think the main question here is where

It cannot borrow for too long from any of the strings living on the extractor, as those need to be mutated while the parser is still running, and it cannot borrow from the original input

I can definitely see this being useful if you expect a lot of links to be filtered out (by user settings or deduplication): if you do that very close to the extractor, within the same thread, you can probably borrow strings for "a little while" from the extractor, run the filtering logic, and then only allocate for the links that made it through the filter.
Thanks for the explanations. I think it's fine the way it is, then. At least I can't come up with a different design that would be worth the refactoring effort right now. In any case, we can revisit this in the future. From my side we can merge it after the todo list above.
Ah right, to go back to your original question: we can defer this. Once we get around to improving the situation, the externally visible behavior should be the same here; if it's not, we can treat it as a bug per semver anyway.
@mre it appears that we run into TeXitoi/structopt#305 -- i.e. you can't have a flag with an envvar and still have it behave like a flag. ISTM upgrading to clap 3 should fix it, but i'm tempted to just remove either the envvar or the option, or even leave it as is (
Regarding the discrepancy, I believe I have found the reason html5ever behaves differently from html5gum, and I believe the html5gum behavior is fundamentally more correct. Apply this patch:

```diff
diff --git a/lychee-lib/src/extract/html.rs b/lychee-lib/src/extract/html.rs
index 86e4599..bb868dc 100644
--- a/lychee-lib/src/extract/html.rs
+++ b/lychee-lib/src/extract/html.rs
@@ -18,7 +18,7 @@ impl TokenSink for LinkExtractor {
     #[allow(clippy::match_same_arms)]
     fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
         match token {
-            Token::CharacterTokens(raw) => self.links.extend(extract_plaintext(&raw)),
+            Token::CharacterTokens(raw) => self.links.extend(extract_plaintext(&dbg!(raw))),
             Token::TagToken(tag) => {
                 let Tag {
                     kind: _kind,
diff --git a/lychee-lib/src/extract/html5gum.rs b/lychee-lib/src/extract/html5gum.rs
index a2b6520..d93b524 100644
--- a/lychee-lib/src/extract/html5gum.rs
+++ b/lychee-lib/src/extract/html5gum.rs
@@ -85,6 +85,7 @@ impl LinkExtractor {
     fn flush_current_characters(&mut self) {
         // safety: since we feed html5gum tokenizer with a &str, this must be a &str as well.
         let raw = unsafe { from_utf8_unchecked(&self.current_string) };
+        dbg!(raw);
         self.links.extend(extract_plaintext(raw));
         self.current_string.clear();
     }
```

Then you'll see that on a html file like this:

```html
<span class="token string">u'https://sentry.io/api/0/issues/{}/'</span>
```
While
In other words, html5ever will emit multiple consecutive character tokens, while in html5gum this sort of behavior is controlled by the emitter. html5ever probably emits multiple tokens because html entity decoding goes down a special path. Arguably the final result of html5ever is more desirable, but i think this is more of a case for making plaintext extraction better. Given this behavior I can construct HTML files like this where the behavior of html5gum is more desirable:

http://google.com

Here html5ever will emit multiple character tokens as well, breaking the URL in half, while html5gum will construct one string that is passed to plaintext extraction
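A toy illustration of why the buffering matters; this is neither parser's real API, and the split point of the fragments is assumed for the example:

```rust
// naive plaintext extraction, standing in for the real extract_plaintext
fn extract_plaintext_naive(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|w| w.starts_with("http"))
        .map(String::from)
        .collect()
}

fn main() {
    let fragments = ["http://goo", "gle.com"]; // URL split across two tokens

    // per-token extraction (html5ever-style): sees a truncated URL
    let per_token: Vec<String> = fragments
        .iter()
        .flat_map(|t| extract_plaintext_naive(t))
        .collect();
    assert_eq!(per_token, vec!["http://goo".to_string()]);

    // buffer-then-flush (html5gum-style): the URL survives intact
    let buffered = fragments.concat();
    assert_eq!(
        extract_plaintext_naive(&buffered),
        vec!["http://google.com".to_string()]
    );
}
```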
Is the intent to hide this flag from the readme that way? or do you also want a short option for this setting?

EDIT: Actually i don't understand the request,
I thought it would print the help description but not the long versions, which currently cause the additional line breaks. Probably removing either the option or the env var is good enough, and in this case I'd remove the option, because I consider the extractor to be an implementation detail. We might still need a way to switch between engines quickly if we have to, but then using an env var is easier to do "globally", e.g. in the GitHub Action, and this would affect most users.
I just came across a fork of html5gum: https://docs.rs/html5tokenizer/0.4.0/html5tokenizer/ Any idea what it is?
a user forked html5gum and added their own features: untitaker/html5gum#14
@untitaker |
@lebensterben My understanding is that that user would not like to have their code merged into html5gum under the current coding/commit style conventions, so out of courtesy I won't do that (I have been in this situation myself, where somebody made me regret my choice of a liberal license, so I understand to some degree). Legally I am able to... as we have recently learned ;)

It's a tricky feature, and the crate has changed a bunch since that fork. Most importantly, it now operates on a byte stream rather than a character stream, for performance (before that, html5gum was slower than html5ever). However, this doesn't make the feature easier to implement if the spans should still be character offsets rather than byte offsets. So in the end it may require a clean-slate implementation anyway. The corresponding issue where we discussed this feature is still open. I understand that lychee would like to have something like that, so patches welcome
True, a user requested that. #268 (comment)
I think it will depend on the exact feature we want to enable. Martin Fischer wanted exact locations of tokens as character offsets, but you seem to be happy with even just pointing out the line or approximate offset where some link is. That can be implemented with any parser: just stream the input to it, and observe at which position the stream is when a link is emitted. Seems like we want to do streaming anyway sometime. At least that would come with less potential for perf overhead.
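A sketch of that approach (names illustrative, not an existing html5gum API): wrap the input reader, count bytes as the tokenizer consumes them, and read the counter whenever a link is emitted:

```rust
use std::io::{self, Read};

// wraps any reader and records how many bytes the tokenizer has consumed
struct CountingReader<R> {
    inner: R,
    offset: u64, // bytes handed to the tokenizer so far
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.offset += n as u64;
        Ok(n)
    }
}
```

When a link is emitted, `offset` gives an upper bound on where it occurred in the input, which is enough for line or approximate-offset reporting.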
probably separate PR
@MichaIng did you get around to benchmarking this, will you once this hits master, or do you not have time at all?
Unless we want to do more benchmarks I think this is ready to
benches/src/extract.rs (Outdated)

```diff
@@ -6,7 +6,7 @@ use std::path::PathBuf;
 fn extract(paths: &[PathBuf]) {
     for path in paths {
         let content: InputContent = path.try_into().unwrap();
-        let extracted = Extractor::extract(&content);
+        let extracted = Extractor::extract(&content, false);
```
Came back for the final review and had a hard time remembering what `false` was for. An alternative would be to have `Extract::with_html5gum` and `Extract::with_html5ever`. More explicit without using a config struct. What do you think?
i did something like that now.. it's a bit repetitive in both Extractor and usages of it, but in return we have no breaking api change
```rust
}

impl LinkExtractor {
    pub(crate) const fn new() -> Self {
```
I think we could `#[derive(Default)]` for `LinkExtractor` and this code goes away?
I did that now, and it makes `new` non-const, as `Default` is not const.
```rust
let attr = unsafe { from_utf8_unchecked(&self.current_attribute_name) };
let value = unsafe { from_utf8_unchecked(&self.current_attribute_value) };

let urls = LinkExtractor::extract_urls_from_elem_attr(attr, name, value);
```
`element` is also called `tag` or `name` in this function. Maybe we should stick to one word?

```rust
let elem = unsafe { from_utf8_unchecked(&self.current_element_name) };
```
it's html5gum (html spec) naming clashing with lychee naming. other candidates for rename are `last_start_tag`, but they are not used by lychee. it's unclear to me where to draw the line, because you clearly can't pick different method names.

i chose to do just the rename you proposed (as that's the attribute we use for link extraction), but now the discrepancy is visible in the struct definition. the best i can do is a mass rename, accept that in things like `set_last_start_tag` the variable names are inconsistent, and document that elem and tag are the same thing
Added a few more comments, but just cosmetic ones. This is okay to merge from my side.
Merged now. I did some final tests on my machine and all looks good. 🤞
html5gum is a html parser i wrote that offers lower-level control over which tokens actually get created and tracked. as such, the extractor in this PR never allocates anything for tokens it doesn't care about, and does essentially no per-token allocation at all. from a rudimentary test it appears to be faster than html5ever, though i'm currently not equipped to run proper benchmarks, and the cost in terms of linecount is quite high (html5gum needs to improve on that front), though i guess if you count the lines in Cargo.lock it evens out again ;)

i've been using this parser in hyperlink successfully for a while already, but the userbase is much smaller and hyperlink doesn't really deal with external links at all.

i don't want to merge this PR as-is, i just want to know if it's slower for any of y'all or whether you're running into any bugs. html5gum is still a work-in-progress; though it should parse html correctly (in fact it is more up-to-date with the whatwg spec than html5ever atm), it's still an entirely new html parser
cc @MichaIng @mre @pawroman