Add html5gum as alternative link extractor #480

Merged · 25 commits · Feb 7, 2022

Conversation

@untitaker (Collaborator) commented Jan 27, 2022

html5gum is an HTML parser I wrote that offers lower-level control over which tokens actually get created and tracked. As such, the extractor in this PR never allocates anything for tokens it doesn't care about; in fact, it does no per-token allocation at all. From a rudimentary test it appears to be faster than html5ever, though I'm currently not equipped to run proper benchmarks, and the cost in terms of line count is quite high (html5gum needs to improve on that front). Though I guess if you count the lines in Cargo.lock, it evens out again ;)
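
To make "lower-level control" concrete, here is a schematic contrast of the two API shapes. These are hypothetical, simplified types for illustration, not html5gum's actual signatures:

// Token-iterator style (roughly what html5ever exposes): every token
// arrives fully materialized, so doctypes, comments, and unused tags
// all cost allocations even if the caller immediately discards them.
#[allow(dead_code)]
enum Token {
    StartTag { name: String, attributes: Vec<(String, String)> },
    EndTag { name: String },
    Text(String),
    Comment(String),
    Doctype(String),
}

// Emitter/callback style (the approach this PR takes): the tokenizer
// feeds bytes into an emitter, and an emitter that doesn't care about,
// say, comments can simply decline to buffer them, so no allocation
// ever happens for them.
#[allow(dead_code)]
trait Emitter {
    fn push_tag_name(&mut self, bytes: &[u8]);
    fn push_attribute_name(&mut self, bytes: &[u8]);
    fn push_attribute_value(&mut self, bytes: &[u8]);
    /// Default implementation: drop comment bytes on the floor.
    fn push_comment(&mut self, _bytes: &[u8]) {}
}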

I've been using this parser in hyperlink successfully for a while already, but its userbase is much smaller, and hyperlink doesn't deal with external links at all.

I don't want to merge this PR as-is; I just want to know whether it's slower for any of y'all or whether you're running into any bugs. html5gum is still a work in progress. Though it should parse HTML correctly (in fact, it is more up-to-date with the WHATWG spec than html5ever at the moment), it's still an entirely new HTML parser.

cc @MichaIng @mre @pawroman

@mre (Member) commented Jan 27, 2022

Wow, that was unexpected!
I don't have much time to look at it right now, but I ran a few quick tests and the results are amazing.

Specimen 1: Sentry docs

lychee --no-progress ~/Downloads/sentry-docs/public --base ~/Downloads/sentry-docs/public --format detailed --offline --exclude "\.(png|jpg)" "host:port"

It finds three broken links that were not discovered before:

✗ file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/issues/%7B%7D: Cannot find file file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/issues/%7B%7D
✗ file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/sentry-app-installations/%7B%7D: Cannot find file file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/sentry-app-installations/%7B%7D
✗ file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/sentry-app-installations/%7B%7D/authorizations: Cannot find file file:///Users/mendler/Downloads/sentry-docs/public/product/integrations/integration-platform/u'https:/sentry.io/api/0/sentry-app-installations/%7B%7D/authorizations

I double-checked and they indeed all contain an apostrophe character like this:

<span class="token string">u&#x27;https://sentry.io/api/0/issues/{}/&#x27;</span>

So I'd say that's indeed a bug in the Sentry docs. 🎉

Specimen 2: My blog

lychee https://endler.dev --verbose

It found a previously undetected link to http://www.w3.org/2000/svg in this SVG blob 🎉:

<link href="data:image/svg+xml;utf8,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 456 456'><circle cx='228' cy='228' r='220' fill='white' stroke='black' stroke-width='16'/><path d='M301 266l-70 37-77-37z'/><path fill='gold' d='M301 266l-70 31-77-31z'/><path d='M193 208a22 21 85 00-2 0 22 21 85 00-19 24 22 21 85 0023 19 22 21 85 0019-23 22 21 85 00-7-14 5 6 0 012 4 5 6 0 01-6 7 5 6 0 01-5-7 5 6 0 016-6 5 6 0 013 1 22 21 85 00-14-5zm71 1a22 21 85 00-2 0 22 21 85 00-19 23 22 21 85 0022 20 22 21 85 0019-24 22 21 85 00-20-19zm10 4a5 6 0 015 6 5 6 0 01-5 6 5 6 0 01-5-6 5 6 0 015-6z'/><path d='M262 58c-4 9-12 19-19 24 1-8 2-16 1-22-4 8-12 18-20 24 2-7 2-15 1-21-7 13-15 24-29 28-70 20-131 113-152 213 11 28 29 53 51 73 8-158 45-267 112-208 10 9 16 13 20 13 5 0 10-4 21-13 67-59 105 50 112 208 22-20 40-45 51-73-20-95-77-184-144-209l-10-4c4-8 7-23 5-33z'/></svg>" rel=icon>

Specimen 3: Rudimentary performance check

I was curious about the performance, so I checked with hyperfine.

Version with html5ever:

~/C/p/l/lychee ❯❯❯ hyperfine 'lychee --dump https://endler.dev --verbose'
Benchmark 1: lychee --dump https://endler.dev --verbose
  Time (mean ± σ):     250.1 ms ± 128.9 ms    [User: 12.9 ms, System: 10.5 ms]
  Range (min … max):   156.3 ms … 618.1 ms    14 runs

Version with html5gum:

~/C/p/l/lychee ❯❯❯ hyperfine 'lychee --dump https://endler.dev --verbose'
Benchmark 1: target/release/lychee --dump https://endler.dev --verbose
  Time (mean ± σ):     180.8 ms ±  27.5 ms    [User: 11.6 ms, System: 8.9 ms]
  Range (min … max):   146.2 ms … 250.6 ms    15 runs

That's a very nice boost!

Another one with the Sentry docs:

html5ever: 2.131 s ± 0.048 s
html5gum: 2.076 s ± 0.035 s

So, great work! 🥳 I'll do some more testing when I get the time and add updates here.

@untitaker
Copy link
Collaborator Author

untitaker commented Jan 27, 2022

I'm suspicious of timing things that in principle should be I/O-bound. Is it possible to reverse the order of those two benchmarks and get inverted results? I wonder if some cache on endler.dev is playing a trick on you.

Is the sentry-docs benchmark run with --offline?

@untitaker (Collaborator, Author) commented Jan 27, 2022

By the way, I was impressed with how fast lychee is nowadays. Last time I checked, it was on par with hyperlink, due to how aggressively it parallelizes parsing as well, and the network I/O seems to go very fast too. (You can get faster performance from hyperlink on a beefy laptop by increasing the thread count with -j, but that's no use if your CI only has two cores.) Also, using the simple/high-level API of html5gum (see its README) actually regressed performance over what you have in master, IIRC.

@untitaker (Collaborator, Author)

cc @lebensterben as well. I think I saw you propose using tree-sitter for parsing, so I wonder what you think of this.

@mre (Member) commented Jan 27, 2022

> I'm suspicious of timing things that in principle should be I/O-bound. Is it possible to reverse the order of those two benchmarks and get inverted results? I wonder if some cache on endler.dev is playing a trick on you.

Yeah, it's not a great benchmark. 😆 Just something ad hoc I tried.
I did reverse the order a few times to make sure, but the result was about the same each time.
A better benchmark would be to isolate link extraction itself, and we actually have a benchmark for that:
https://github.com/lycheeverse/lychee/tree/master/benches
If you have some time you can try it; a sketch of the idea follows below. I'll do the same when I get a chance.
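
A minimal criterion harness in that spirit (the fixture path and the extractor entry point are hypothetical stand-ins, not the actual bench code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_extract(c: &mut Criterion) {
    // Read the fixture once, outside the measured loop, so only
    // parsing/extraction is timed; no disk or network I/O.
    let html = std::fs::read_to_string("fixtures/elvis.html").unwrap();
    c.bench_function("extract_links", |b| {
        b.iter(|| extract_links(black_box(&html)))
    });
}

// Hypothetical stand-in for the extractor entry point.
fn extract_links(html: &str) -> usize {
    html.matches("href").count()
}

criterion_group!(benches, bench_extract);
criterion_main!(benches);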

> Is the sentry-docs benchmark run with --offline?

Yup.

hyperfine 'lychee --no-progress ~/Downloads/sentry-docs/public --base ~/Downloads/sentry-docs/public --format detailed --offline --exclude "\.(png|jpg)" "host:port"'

> By the way, I was impressed with how fast lychee is nowadays. Last time I checked, it was on par with hyperlink, due to how aggressively it parallelizes parsing as well, and the network I/O seems to go very fast too. (You can get faster performance from hyperlink on a beefy laptop by increasing the thread count with -j, but that's no use if your CI only has two cores.) Also, using the simple/high-level API of html5gum (see its README) actually regressed performance over what you have in master, IIRC.

Yeah, it took a while to come up with an abstraction that feels right, and there's still some tweaking we can do to reduce allocations, but overall I'm happy with the mix of parallel parsing and async I/O.
Walking globs could be made much faster with globwalk, but there are some issues, and work on a more general solution has stalled.
We also want to exclude folders like .git and node_modules by default, which should give us another perf boost.
Allocation-wise, I'd love to get to a point where we don't copy the input sources and requests around and make better use of Cows, but overall we're in a decent position. html5gum seems to do fewer allocations than html5ever, which is great!

@lebensterben (Member)

> cc @lebensterben as well. I think I saw you propose using tree-sitter for parsing, so I wonder what you think of this.

I recommended the tree-sitter parser mainly because:

  • It outputs column and line numbers when encountering an error, which could enhance our error reporting.
  • It's insanely fast and it's used by many people constantly (tree-sitter is deeply integrated into Neovim). Thus it's much more reliable than html5ever.

@lebensterben (Member)

@mre

I recommend using https://wiki.archlinux.org/title/List_of_applications for benchmarks.

It's long enough and contains a good amount of bad links.

@mre (Member) commented Jan 27, 2022

> I recommend using https://wiki.archlinux.org/title/List_of_applications for benchmarks.

Great idea. Added it to the benchmarks in #481.

@untitaker (Collaborator, Author)

Is there a tree-sitter parser for HTML5 that follows the WHATWG spec? https://github.com/tree-sitter/tree-sitter-html appears to have outstanding issues for basic things like HTML entities, which is a problem because entities are valid in attributes.

@lebensterben (Member)

> Is there a tree-sitter parser for HTML5 that follows the WHATWG spec?

I don't think so. tree-sitter is pretty low-level; it gives you a view of an HTML document's AST.

But for the purpose of link detection, do we really need full compliance with the specification?

Or, to put it another way: can we have a more specialized parser that supports just the features we need?

@untitaker (Collaborator, Author) commented Jan 27, 2022

You need, at the very minimum:

  • HTML entity parsing, since entities can be used to escape arbitrary bytes of a URL (so you need to unescape them);
  • some variant of attribute value parsing, so you can actually figure out where an unquoted attribute value starts and ends (those rules are well-defined in WHATWG and require you to know where the attribute starts);
  • something to detect comments, so you can skip them (or you don't, and simply scrape their contents like plaintext, which lychee currently doesn't do); those rules are actually pretty complicated;
  • something to skip over CDATA and doctypes (including malformed doctypes).

At least, I believe you need those to somewhat conform to WHATWG; you can always try without. Without adhering to WHATWG, though, you risk not matching browser behavior, and therefore not matching what the user sees. The sketch below shows why the first bullet alone already matters.
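
Here is a decoder that handles only numeric character references (the real WHATWG algorithm additionally covers roughly two thousand named references like &amp;, plus legacy quirks); it's a sketch, not lychee or html5gum code:

/// Decode numeric character references such as &#47; and &#x2f;.
fn decode_numeric_entities(input: &str) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(pos) = rest.find("&#") {
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 2..];
        let (digits, radix) = match after.strip_prefix('x').or_else(|| after.strip_prefix('X')) {
            Some(hex) => (hex, 16),
            None => (after, 10),
        };
        if let Some(end) = digits.find(';') {
            if let Some(c) = u32::from_str_radix(&digits[..end], radix)
                .ok()
                .and_then(char::from_u32)
            {
                out.push(c);
                rest = &digits[end + 1..];
                continue;
            }
        }
        // Not a well-formed reference: keep the "&#" literally and move on.
        out.push_str("&#");
        rest = after;
    }
    out.push_str(rest);
    out
}

fn main() {
    // Without entity decoding, a naive scanner never sees this URL.
    assert_eq!(
        decode_numeric_entities("http:&#x2f;&#x2f;example.com"),
        "http://example.com"
    );
}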

I'm generally not sure what lychee's guidance is there: does it check broken links that might break the site for a browser user (broken CSS/JS, broken clickable links, links that are visible in plaintext and therefore reachable by copy-pasting into the browser bar), or does it want to do more than that? It appears to attempt to extract random URLs from all sorts of attributes, but then skips over HTML comments.

Plaintext URL extraction is a risky business, and so is over-reliance on it. In practice there is no particularly narrow set of characters allowed in a URL, there is no single URL spec that everybody follows, and relative links can contain pretty much any character at any position. I wouldn't rely on it alone too much, so you still need to match on attribute names a bit, IMO.

@lebensterben (Member)

> I'm generally not sure what lychee's guidance is there: does it check broken links that might break the site for a browser user (broken CSS/JS, broken clickable links, links that are visible in plaintext and therefore reachable by copy-pasting into the browser bar), or does it want to do more than that?

We need some clarification here: what does lychee actually check?

@untitaker (Collaborator, Author)

Right now? Currently lychee has a predefined set of tag names and attributes that it directly treats as links; beyond that, it attempts to apply extract_plaintext to every single attribute, regardless of which attribute it is or which tag it belongs to. Between start and end tags it applies extract_plaintext as well. It skips comments and doctypes; I'm not sure about CDATA content. Schematically, it works like the sketch below.
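
A simplified sketch of that policy (the real tag/attribute table and extract_plaintext live in lychee-lib; the stand-in below is made up):

// Known link-bearing tag/attribute pairs are taken verbatim;
// everything else is scanned for URL-looking substrings.
fn urls_in_attribute(elem: &str, attr: &str, value: &str) -> Vec<String> {
    match (elem, attr) {
        ("a", "href") | ("link", "href") | ("img", "src") | ("script", "src") => {
            vec![value.to_string()]
        }
        _ => extract_plaintext(value),
    }
}

// Hypothetical stand-in for lychee's plaintext extractor.
fn extract_plaintext(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|w| w.starts_with("http://") || w.starts_with("https://"))
        .map(String::from)
        .collect()
}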

@lebensterben (Member)

No, I mean: what is lychee designed to check?

Does it check all URLs (except comments) in an HTML document?
Or does it only check HTML elements that interact with the user, such as text, buttons, and hyperlinks?

We need a clear design decision on this, for both users and developers.

@untitaker (Collaborator, Author)

Oh, I overlooked this:

> tree-sitter is pretty low-level; it gives you a view of an HTML document's AST.

That's fine; html5ever/html5gum also only give you that (html5gum even less). What matters is how you deal with malformed input, mis-nested tags, and edge cases in parsing.

@mre (Member) commented Jan 27, 2022

> Does it check all URLs (except comments) in an HTML document?
> Or does it only check HTML elements that interact with the user, such as text, buttons, and hyperlinks?

I think we had this discussion before, and the conclusion was that we could have a way to filter out invisible links, i.e. things that aren't inside an a tag in an HTML document, for example.
In fact, the current extractor already tracks the element and attribute of each link; it's just not exposed in the CLI yet. We could change that now. Does html5gum support that, too?

@untitaker (Collaborator, Author) commented Jan 27, 2022

> In fact, the current extractor already tracks the element and attribute of each link; it's just not exposed in the CLI yet. We could change that now. Does html5gum support that, too?

This PR produces exactly the same structs as the html5ever version did. If you mean that each link knows its tag name and attribute name (as defined in RawUri), then yes.
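
For readers following along, RawUri looks roughly like this (paraphrased from lychee-lib; check the source for the authoritative definition):

/// A raw URI as extracted from a document, before parsing/validation.
pub struct RawUri {
    /// The unparsed link text, e.g. "https://example.com".
    pub text: String,
    /// The element the link was found in, e.g. "a"; None for plaintext.
    pub element: Option<String>,
    /// The attribute the link was found in, e.g. "href"; None for plaintext.
    pub attribute: Option<String>,
}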

@mre (Member) commented Jan 27, 2022

Oh that's nice.

Also, here are some results of running the benchmark on my machine:

cd benches
cargo bench -p benches

html5ever: 26.975 ms 27.020 ms 27.092 ms
html5gum: 9.3890 ms 9.3987 ms 9.4054 ms

html5gum blows html5ever out of the water; a roughly 3x perf improvement is quite impressive!
Given that it's faster, actively maintained, and detects more links, I'd have no issue with making html5gum the default HTML engine for lychee. We could put html5ever behind a feature flag in case people want it, at least until we're confident that the new engine truly handles all the edge cases. @untitaker, would you be interested in doing the work for that?

@untitaker (Collaborator, Author)

Huh, when I run the bench suite in this repo (elvis.html, right?) it gives me a 10% speedup.

I'm fine with merging as-is, but I would like to ensure that somebody else reviews this PR and more or less understands what's going on in it. The sort of low-level API that html5gum exposes instead of a token iterator is kind of brittle, in that it's easy to mis-implement the trait.

@lebensterben (Member)

@untitaker
I saw a few from_utf8 calls whose result is unwrapped right away.

We can make all of them from_utf8_unchecked, as long as the invariant that the strings are UTF-8 holds.
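
A sketch of the swap, assuming the invariant holds because the tokenizer is fed a &str (the helper name is made up):

use std::str::{from_utf8, from_utf8_unchecked};

/// Hypothetical helper. Sound only if `bytes` was filled by a tokenizer
/// that consumed a &str and is therefore guaranteed to be valid UTF-8.
fn as_str(bytes: &[u8]) -> &str {
    // Belt and braces: validate in debug builds, skip the cost in release.
    debug_assert!(from_utf8(bytes).is_ok());
    // SAFETY: the tokenizer input was a &str; see the invariant above.
    unsafe { from_utf8_unchecked(bytes) }
}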

@untitaker (Collaborator, Author) commented Jan 27, 2022

@mre

> Also, here are some results of running the benchmark on my machine:

Interesting; elvis.html on my machine was only improving by 10%.

> We could put html5ever behind a feature flag in case people want it, at least until we're confident that the new engine truly handles all the edge cases.

A runtime switch sounds doable. Keep in mind that binary size will suffer a bit because of the duplicated entity tables across parsers, but it seems to me a Linux build of lychee is already quite large.

@lebensterben

> We can make all of them from_utf8_unchecked, as long as the invariant that the strings are UTF-8 holds.

Yes, it should be safe. I can add a patch so @mre can see how much this gains; it might be measurable.

@mre (Member) commented Jan 28, 2022

> Interesting; elvis.html on my machine was only improving by 10%.

This was benchmarked on an M1 Max in my case.

> A runtime switch sounds doable. Keep in mind that binary size will suffer a bit because of the duplicated entity tables across parsers, but it seems to me a Linux build of lychee is already quite large.

If the HTML engine is behind a feature flag, only the code for one engine will be compiled into the library. So I think we could have an html5ever flag for the alternative engine, and html5gum would be in the default set. Not sure if I expressed that clearly. 😊 Something like the sketch below.
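
A sketch of the wiring (the feature setup and function names are assumptions, not the actual lychee code):

// In Cargo.toml, html5ever becomes optional; an optional dependency
// implicitly defines a feature of the same name:
//
//   [dependencies]
//   html5ever = { version = "*", optional = true }
//
// The extractor then picks one engine at compile time, so only one
// engine (and one entity table) is compiled into the binary:

#[cfg(feature = "html5ever")]
use crate::extract::html as engine; // html5ever-backed extractor

#[cfg(not(feature = "html5ever"))]
use crate::extract::html5gum as engine; // default engine

pub fn extract_html(input: &str) -> Vec<RawUri> {
    engine::extract(input) // hypothetical shared entry point
}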

> Yes, it should be safe. I can add a patch so @mre can see how much this gains; it might be measurable.

Sure! I'd be delighted to try. Thanks for your work.

@mre (Member) commented Jan 28, 2022

I was just writing a comment on the new Tokio 1.16, with its ~20% perf improvement for unevenly distributed workloads, and that got me thinking: could we improve performance by making link extraction entirely async? html5gum could add support for returning a stream of URIs, and the request workers could start checking links right away, while the document is not even fully parsed yet. html5ever doesn't provide such an API; it might be possible to add one there as well, but it would be fancy if html5gum could provide that out of the box. Do you folks think this could improve performance, or would the futures overhead dwarf the gains? Roughly, I'm imagining the shape sketched below.
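
Here extract_links and check are hypothetical stand-ins, and whether this beats the current parallel design is exactly the open question:

use tokio::sync::mpsc;
use tokio::task;

// Hypothetical stand-in for the extractor (a real streaming version
// would yield links from inside the tokenizer, not return a Vec).
fn extract_links(html: &str) -> Vec<String> {
    html.split_whitespace()
        .filter(|w| w.starts_with("http"))
        .map(String::from)
        .collect()
}

// Hypothetical stand-in for the link checker.
async fn check(link: String) {
    println!("checking {link}");
}

// The CPU-bound parse runs on a blocking thread and feeds links through
// a channel; the async side starts checking as soon as the first link
// arrives, before the document is fully parsed.
async fn check_streaming(html: String) {
    let (tx, mut rx) = mpsc::channel::<String>(128);

    task::spawn_blocking(move || {
        for link in extract_links(&html) {
            if tx.blocking_send(link).is_err() {
                break; // receiver dropped; stop parsing early
            }
        }
    });

    while let Some(link) = rx.recv().await {
        task::spawn(check(link));
    }
}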

@mre changed the title from "remove html5ever in favor of html5gum" to "Add html5gum as alternative link extractor" on Feb 1, 2022
@mre (Member) commented Feb 1, 2022

This is all done in an effort to reduce allocations, right? Another approach would be to return a stream of links with

struct Link<'a> {
  element: Element,
  attribute: Attribute,
  link: Cow<'a, str>,
}

This would also be zero-copy, but with the added advantage of being straightforward to use. Users could drive the stream to completion and filter out anything they don't need. The stream supports checking links while the document is still being parsed. I'm sure you considered that, but I wanted to get your opinion. It doesn't really affect this PR, so we can also move the discussion over to html5gum.

@untitaker (Collaborator, Author) commented Feb 1, 2022

Right: this PR effectively gets rid of all per-useless-token allocations, where a "useless token" is any token that we don't extract a link from (doctypes, comments, most tags). You're thinking about how to reduce the number of allocations per useful token. That's also valid, just probably harder to pull off. I'm fairly sure the next optimization in this area with the best cost/benefit ratio is reusing the extractor across HTML documents: that gets rid of all the per-document allocations, which doesn't come close to per-link allocations in impact but is relatively easy to pull off (just put the extractor in a thread-local, as sketched below).
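
A minimal sketch of that idea; clear and extract are hypothetical method names on a stand-in extractor:

use std::cell::RefCell;

// Stand-in for the PR's extractor; imagine its internal byte buffers here.
#[derive(Default)]
struct LinkExtractor {
    current_string: Vec<u8>,
}

impl LinkExtractor {
    // Hypothetical: truncate buffers but keep their capacity.
    fn clear(&mut self) {
        self.current_string.clear();
    }

    // Hypothetical: tokenize `html` and return the links found.
    fn extract(&mut self, html: &str) -> Vec<String> {
        self.current_string.extend_from_slice(html.as_bytes());
        Vec::new()
    }
}

thread_local! {
    // One extractor per worker thread, reused across documents, so the
    // per-document buffer allocations happen only once per thread.
    static EXTRACTOR: RefCell<LinkExtractor> = RefCell::new(LinkExtractor::default());
}

fn extract_reusing_buffers(html: &str) -> Vec<String> {
    EXTRACTOR.with(|cell| {
        let mut extractor = cell.borrow_mut();
        extractor.clear();
        extractor.extract(html)
    })
}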

> [potential new Link definition]

I think the main question here is what link would borrow its string from, and at which point you'd convert it to a Link<'static> (i.e. using Cow::Owned) so that you can realistically pass it to the main queue, or to other threads (no idea how that works).

It cannot borrow for long from any of the strings living on the extractor, as those need to be mutated while the parser is still running, and it cannot borrow from the original input &str, because that one doesn't live very long either, and also because html5gum doesn't let you do that. That's not a great reason, but I believe it would be infeasible to fix in a spec-conformant parser (for that discussion, let's definitely move to the html5gum repo). The only parser I know of that successfully pulls this off is xmlparser, but that one is far from being an HTML5 parser.

I can definitely see this being useful if you expect a lot of links to be filtered out (by user settings or deduplication): if you do that very close to the extractor, within the same thread, you can probably borrow strings for "a little while" from the extractor, run the filtering logic, and then only allocate for the links that made it through the filter.

@mre (Member) commented Feb 1, 2022

Thanks for the explanations. I think it's fine the way it is, then. At least I can't come up with a different design that would be worth the refactoring effort right now. In any case, we can revisit this in the future. From my side, we can merge it once the todo list above is done.

@untitaker (Collaborator, Author)

Ah right, to go back to your original question: we can defer this. Once we get around to improving the situation, the externally visible behavior should stay the same here; if it doesn't, we can treat it as a bug as per semver anyway.

@untitaker (Collaborator, Author) commented Feb 1, 2022

@mre it appears that we run into TeXitoi/structopt#305, i.e. you can't have a flag with an env var and still have it behave like a flag.

It seems to me that upgrading to clap 3 should fix it, but I'm tempted to just remove either the env var or the option, or even leave it as-is (--use-html5ever true).

@untitaker (Collaborator, Author) commented Feb 1, 2022

Regarding the discrepancy: I believe I have found the reason html5ever behaves differently from html5gum, and I believe the html5gum behavior is fundamentally more correct.

Apply this patch:

diff --git a/lychee-lib/src/extract/html.rs b/lychee-lib/src/extract/html.rs
index 86e4599..bb868dc 100644
--- a/lychee-lib/src/extract/html.rs
+++ b/lychee-lib/src/extract/html.rs
@@ -18,7 +18,7 @@ impl TokenSink for LinkExtractor {
     #[allow(clippy::match_same_arms)]
     fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
         match token {
-            Token::CharacterTokens(raw) => self.links.extend(extract_plaintext(&raw)),
+            Token::CharacterTokens(raw) => self.links.extend(extract_plaintext(&dbg!(raw))),
             Token::TagToken(tag) => {
                 let Tag {
                     kind: _kind,
diff --git a/lychee-lib/src/extract/html5gum.rs b/lychee-lib/src/extract/html5gum.rs
index a2b6520..d93b524 100644
--- a/lychee-lib/src/extract/html5gum.rs
+++ b/lychee-lib/src/extract/html5gum.rs
@@ -85,6 +85,7 @@ impl LinkExtractor {
     fn flush_current_characters(&mut self) {
         // safety: since we feed html5gum tokenizer with a &str, this must be a &str as well.
         let raw = unsafe { from_utf8_unchecked(&self.current_string) };
+        dbg!(raw);
         self.links.extend(extract_plaintext(raw));
         self.current_string.clear();
     }

Then you'll see that, on an HTML file like this:

<span class="token string">u&#x27;https://sentry.io/api/0/issues/{}/&#x27;</span>

lychee --use-html5ever true will yield:

[lychee-lib/src/extract/html.rs:21] raw = Tendril<UTF8>(inline: "u")
[lychee-lib/src/extract/html.rs:21] raw = Tendril<UTF8>(inline: "'")
[lychee-lib/src/extract/html.rs:21] raw = Tendril<UTF8>(shared: "https://sentry.io/api/0/issues/{}/")
[lychee-lib/src/extract/html.rs:21] raw = Tendril<UTF8>(inline: "'")
[lychee-lib/src/extract/html.rs:21] raw = Tendril<UTF8>(inline: "\n")

While lychee without the argument (using html5gum) will yield:

[lychee-lib/src/extract/html5gum.rs:88] raw = ""
[lychee-lib/src/extract/html5gum.rs:88] raw = "u'https://sentry.io/api/0/issues/{}/'"
[lychee-lib/src/extract/html5gum.rs:88] raw = "\n"

In other words, html5ever will emit multiple consecutive character tokens, while in html5gum this sort of behavior is controlled by the emitter. html5ever probably emits multiple tokens because HTML entity decoding goes down a special code path.

Arguably the final result of html5ever is more desirable here, but I think this is more of a case for making plaintext extraction better. Given this behavior, I can construct HTML files where the html5gum behavior is more desirable, for example:

http:&#x2f;/google.com

Here html5ever will emit multiple character tokens as well, breaking the URL in half, while html5gum will construct one string that is passed to plaintext extraction.

@untitaker (Collaborator, Author) commented Feb 1, 2022

> Use -h flag for the help section in README.md and fix the unit test, which checks the rendered text

Is the intent to hide this flag from the README that way? Or do you also want a short option for this setting?

EDIT: Actually, I don't understand the request; -h is not an option on lychee at the moment.

@mre (Member) commented Feb 1, 2022

lychee -h works, but it doesn't print what I thought it would.

lychee -h
error: The following required arguments were not provided:
    <inputs>...

USAGE:
    lychee <inputs>... --config <config-file> --format <format> --github-token <github-token> --headers <headers>... --max-concurrency <max-concurrency> --max-redirects <max-redirects> --max-retries <max-retries> --method <method> --timeout <timeout> --user-agent <user-agent>

I thought it would print the help description but not the long versions, which currently cause the additional line breaks.

Probably removing either the option or the env var is good enough, and in that case I'd remove the option, because I consider the extractor to be an implementation detail. We might still need a way to switch between engines quickly if we have to, but then an env var is easier to set "globally", e.g. in the GitHub Action, and that would cover most users.

@lebensterben (Member)

I just came across a fork of html5gum: https://docs.rs/html5tokenizer/0.4.0/html5tokenizer/

Any idea what it is?

@untitaker (Collaborator, Author)

A user forked html5gum and added their own features: untitaker/html5gum#14

@lebensterben (Member)

@untitaker
Do you plan to add those features?
Even though that user closed the PR and forked the project, you can still pull in their code and add them as co-author in the commit message. There's also no copyright violation, since both repositories have the same license.

@untitaker (Collaborator, Author) commented Feb 1, 2022

@lebensterben My understanding is that the user would not like to have their code merged into html5gum under the current coding/commit style conventions, so out of courtesy I won't do that (I have been in this situation myself, where somebody made me regret my choice of a liberal license, so I understand to some degree). Legally I am able to... as we have recently learned ;)

It's a tricky feature, and the crate has changed a bunch since that fork. Most importantly, it now operates on a byte stream rather than a character stream, for performance (before that change, html5gum was slower than html5ever). However, this doesn't make the feature easier to implement if the spans should still be character offsets rather than byte offsets. So in the end it may require a clean-slate implementation anyway. The corresponding issue where we discussed this feature is still open.

I understand that lychee would like to have something like that, so patches welcome.

@mre (Member) commented Feb 1, 2022

> I understand that lychee would like to have something like that, so patches welcome.

True, a user requested that. #268 (comment)
Checking the original issue, it looks like you came up with the generics proposal, and you maintain the main repo, so it would be fair game to me. You wouldn't merge it in the current state anyway, but it would be helpful if it could land at some point, and the PR could serve as a blueprint. It's completely up to you, though, and you've explained your position already.

@untitaker (Collaborator, Author) commented Feb 1, 2022

I think it will depend on the exact feature we want to enable. Martin Fischer wanted exact locations of tokens as character offsets, but you seem to be happy with just pointing out the line or approximate offset where a link appears. That can be implemented with any parser: just stream the input into it and observe where the stream is when a link is emitted. It seems we want to do streaming anyway at some point. At the very least, that approach comes with less potential for perf overhead.

@untitaker (Collaborator, Author)

> Decide if async link extraction would be in the scope of this PR or a separate PR

Probably a separate PR.

@untitaker (Collaborator, Author) commented Feb 1, 2022

@MichaIng did you get around to benchmarking this? Will you once this hits master, or do you not have time at all?

@untitaker (Collaborator, Author) commented Feb 4, 2022

Unless we want to do more benchmarks, I think this is ready: review it one last time and merge. We'll just have to see whether people complain about changed behavior (hopefully not) and deal with the fallout sooner (pre-1.0) rather than later.

@@ -6,7 +6,7 @@ use std::path::PathBuf;
 fn extract(paths: &[PathBuf]) {
     for path in paths {
         let content: InputContent = path.try_into().unwrap();
-        let extracted = Extractor::extract(&content);
+        let extracted = Extractor::extract(&content, false);
Member:
Came back for the final review and had a hard time remembering what false was for. An alternative would be to have Extractor::with_html5gum and Extractor::with_html5ever: more explicit, without using a config struct. What do you think?

@untitaker (Collaborator, Author):

I did something like that now. It's a bit repetitive, both in Extractor and in its usages, but in return we have no breaking API change.

}

impl LinkExtractor {
    pub(crate) const fn new() -> Self {
Member:

I think we could #[derive(Default)] for LinkExtractor and this code goes away?

@untitaker (Collaborator, Author):

I did that now; it makes new non-const, as Default is not const.
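
For reference, the two variants side by side (field list abbreviated; a sketch, not the actual diff):

#[derive(Default)]
pub(crate) struct LinkExtractor {
    current_string: Vec<u8>,
    links: Vec<String>, // Vec<RawUri> in the real code
}

impl LinkExtractor {
    // The hand-written constructor the derive replaces. It can be
    // `const` because Vec::new() is const, while Default::default()
    // is not; that is the trade-off mentioned above.
    pub(crate) const fn new() -> Self {
        Self {
            current_string: Vec::new(),
            links: Vec::new(),
        }
    }
}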

let attr = unsafe { from_utf8_unchecked(&self.current_attribute_name) };
let value = unsafe { from_utf8_unchecked(&self.current_attribute_value) };

let urls = LinkExtractor::extract_urls_from_elem_attr(attr, name, value);
Member:

element is also called tag or name in this function. Maybe we should stick to one word?

let elem = unsafe { from_utf8_unchecked(&self.current_element_name) };

@untitaker (Collaborator, Author) commented Feb 4, 2022

It's html5gum (HTML spec) naming clashing with lychee naming.

Other candidates for a rename are last_start_tag and friends, but those are not used by lychee.

It's unclear to me where to draw the line, because you clearly can't pick different method names.

I chose to do just the rename you proposed (as that's the attribute we use for link extraction), but now the discrepancy is visible in the struct definition.

The best I can do is a mass rename, accept that in things like set_last_start_tag the variable names are inconsistent, and document that elem and tag mean the same thing.

@mre (Member) commented Feb 4, 2022

Added a few more comments, but just cosmetic ones. This is okay to merge from my side.
@MichaIng did you want to benchmark this before merging?

@mre merged commit 68d09f7 into lycheeverse:master on Feb 7, 2022
@mre (Member) commented Feb 7, 2022

Merged now. I did some final tests on my machine and all looks good. 🤞
Thanks a lot for this great effort @untitaker. 🥳
