Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't check preconnect links #1187

Merged
merged 18 commits into from
Oct 7, 2024
Merged

Don't check preconnect links #1187

merged 18 commits into from
Oct 7, 2024

Conversation

mre
Copy link
Member

@mre mre commented Jul 29, 2023

Preconnect links are used to establish a server connection without loading a specific resource yet.
Not always do these links point to a URL that should return a 200, and they are not user-facing, i.e. they don't show up in the final rendered version of a page.

Therefore, I think we should exclude them at all; not even in --include-verbatim
mode, as they might not point to a valid resource.

Fixes #897

@mre
Copy link
Member Author

mre commented Aug 17, 2023

Huh, I don't know why this breaks a test. Somehow the parser state gets tripped up by my change. Maybe I need to reset some state.
https://github.com/lycheeverse/lychee/actions/runs/5872452302/job/15923983895?pr=1187

@untitaker, any ideas?

@mre
Copy link
Member Author

mre commented Aug 22, 2023

I'm thinking I need to improve the attribute handling in general to make the tests pass.

I would have to cache the attributes and only extract links once I'm fully done with reading all attributes for a single tag.
Quite frankly, driving the HTML parsing state-machine is a bit finicky with html5gum. Is there a way to change the parsing such that I can get a fully parsed set of attributes for a tag? It would probably come with a performance penalty, but it would reduce the complexity of the implementation.

(That's the one reason why I haven't removed html5ever support yet. I like how simple the parsing is, although it's slower of course.)

@untitaker
Copy link
Collaborator

instead of writing your own emitter you can always just read through the DefaultEmitter (like how tags are being processed in the README), though it is a lot slower and I think also not competitive with html5ever

@mre
Copy link
Member Author

mre commented Aug 24, 2023

@untitaker, the way I would model it (without switching to the DefaultEmitter) would be to create a HashMap with key being the attribute and value being a list/set of links, e.g. { key: "href", "value": [https://example.com] }. When I reach the end of a tag, I check the attributes and emit the list of links if there are no excluded attributes (like rel=prefetch).

Does that make sense? Is there an easier way in html5gum (without DefaultEmitter)? It feels like the perf could suffer from the additional bookkeeping.

@untitaker
Copy link
Collaborator

untitaker commented Aug 24, 2023 via email

@wackget
Copy link
Contributor

wackget commented Sep 11, 2024

This commit doesn't exclude all types of preconnect links e.g. dns-prefetch is not excluded even though it should be.

mre and others added 4 commits September 28, 2024 16:37
Preconnect links are used to establish a server connection without loading a
specific resource yet. Not always do these links point to a URL that should
return a 200, and they are not user-facing, i.e. they don't show up in the
final rendered version of a page.

Therefore, I think we should them at all; not even in `--include-verbatim`
mode, as they might not point to a valid resource.

Fixes #897
…lity

- Replace Vec<u8> with String for better readability and manipulation
- Introduce Element struct to encapsulate element-related data
- Use HashMap<String, String> for current_attributes for efficient lookups
- Add verbatim_stack to properly handle nested verbatim elements
- Remove unsafe code where possible, using String::from_utf8_lossy
- Improve attribute handling with HashMap entry API and prioritize srcset
- Simplify logic and consolidate verbatim element handling
- Enhance encapsulation in LinkExtractor struct
- Improve overall performance with more efficient data structures
- Increase flexibility for future feature additions or modifications

This refactor maintains core functionality while making the code more
idiomatic Rust, easier to read and maintain, and more robust in handling
edge cases. The new structure is better suited for future extensions or
modifications.
@mre
Copy link
Member Author

mre commented Oct 7, 2024

After more than a year, I finally got around to cleaning up this PR and fixing the remaining attribute ordering issues. Along the way, I refactored (and hopefully improved) the html5gum parser.

The new verbatim_stack also handles nested verbatim elements now.

Other changes include refactoring the HTML link extractor for improved performance and maintainability, extending the documentation. I also removed unsafe code where possible, using String::from_utf8_lossy. (We can switch back if it becomes a performance bottleneck, but I doubt it will be the case.)

Thanks, @untitaker, for the review.

@mre mre merged commit 11adc09 into master Oct 7, 2024
7 checks passed
@mre mre deleted the preconnect branch October 7, 2024 20:36
// Check and exclude rel=preconnect. Other than prefetch and preload,
// preconnect only does DNS lookups and might not be a link to a resource
if let Some(rel) = attrs.iter().find(|attr| &attr.name.local == "rel") {
if rel.value.contains("preconnect") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would changing this conditional to the following resolve issue #1499?

if rel.value.contains("preconnect") || rel.value.contains("prefetch")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Should we be more explicit and check for dns-prefetch here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry; don't know why I didn't use that.

if rel.value.contains("preconnect") || rel.value.contains("dns-prefetch")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanna send a pull request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks :)

#1520

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't check preconnect links if not in verbatim mode
3 participants