Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html #4957

Merged
merged 3 commits into from
Aug 7, 2024

Conversation

phiresky
Copy link
Collaborator

@phiresky phiresky commented Aug 3, 2024

Currently, for metadata extraction the full data is fetched, converted lossily to a string and then parsed as HTML.

This is expensive: It takes 10-20s of 100% CPU to parse a 20MB response. 1+MB responses are common for image, gif, video URLs.

This changes that method, it tries to fetch only the first 64kB of data and then only runs the expensive HTML parsing if there is no null byte in the data. After this, the whole API request only takes ~0.1 s (including the external fetch)

Fixes #4956.

@Nothing4You
Copy link
Collaborator

what about validating the returned content type before reading any data?

@phiresky
Copy link
Collaborator Author

phiresky commented Aug 3, 2024

It seemed like the code was intentionally modified in 2021-2022 to not do that, maybe due to wrong server responses, there's a comment in the code linking #1964 . But yes, that reason might not be valid or no longer valid or not the same.

@phiresky phiresky force-pushed the opengraph-optimize branch from 2d1a1bc to f5bbc65 Compare August 3, 2024 12:35
@Nothing4You
Copy link
Collaborator

the html parser seems to be doing pretty good even with heavily truncated data, so it's probably fine to just do it that way.
if we already truncate, do we still need to worry about a binary check though?
the library seems to be doing fine with null bytes in the input as well:

Evaluating "<html><head><title>Hello</title><hea"
HTML parsed ok: true
HTML title: Some("Hello")
Evaluating "<html><head><title>Hell\0o World"
HTML parsed ok: true
HTML title: Some("Hell�o World")

@phiresky
Copy link
Collaborator Author

phiresky commented Aug 3, 2024

I've looked at the linked issue(s) again and don't see a reason to not use the content-type response. I've updated the PR to use the content-type header instead of checking for null bytes.

@dessalines
Copy link
Member

dessalines commented Aug 3, 2024

This is probably due to me stupidly removing the doctype check in 55f84dd#diff-0cd13a4a101c05a54d9ab789fa06c9d48ab61c0e1f5c9b9a178440e9ba3a5ef5L145 . I think adding that back in would be enough.

@phiresky phiresky changed the title fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary. fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html Aug 3, 2024
Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I see we already have tests below, so we should be good.

@dessalines
Copy link
Member

CI is messing up, I'll try to see what's going on.

crates/api_common/src/request.rs Show resolved Hide resolved
crates/api_common/src/request.rs Outdated Show resolved Hide resolved
@dessalines dessalines merged commit 606545c into main Aug 7, 2024
2 checks passed
@ticoombs
Copy link
Contributor

ticoombs commented Aug 9, 2024

image
Might be too early to say, but this has reduced my CPU by at-least half. Thanks

Nutomic pushed a commit to sunaurus/lemmy that referenced this pull request Sep 20, 2024
…tent-Type html (LemmyNet#4957)

* fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary.

* use mime type for determination

* chore: simplify collect function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants