fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html #4957

phiresky · 2024-08-03T12:22:14Z

Currently, metadata extraction fetches the full response, converts it lossily to a string, and then parses it as HTML.

This is expensive: parsing a 20MB response takes 10-20s at 100% CPU, and responses over 1MB are common for image, GIF, and video URLs.

This PR changes that method: it tries to fetch only the first 64kB of data, and then runs the expensive HTML parsing only if the data contains no null byte. With this change, the whole API request takes only ~0.1s (including the external fetch).

Fixes #4956.
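
For illustration, the approach described above (read a bounded prefix, then skip parsing when the data looks binary) could be sketched with the standard library alone. This is not the code from the PR: `MAX_PREFIX_BYTES`, `read_prefix`, and `looks_binary` are hypothetical names, and a real implementation would read from the HTTP response stream rather than a generic `Read`.

```rust
use std::io::{self, Read};

/// Hypothetical cap, mirroring the 64kB figure from the description.
const MAX_PREFIX_BYTES: u64 = 64 * 1024;

/// Reads at most `MAX_PREFIX_BYTES` from `source` and ignores the rest.
fn read_prefix<R: Read>(source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `take` wraps the reader so it reports EOF once the limit is reached.
    source.take(MAX_PREFIX_BYTES).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Cheap heuristic (an assumption, not a guarantee): data that contains
/// a NUL byte is treated as binary and is not handed to the HTML parser.
fn looks_binary(data: &[u8]) -> bool {
    data.contains(&0)
}
```

With a 200,000-byte input, `read_prefix` would return only the first 65,536 bytes, so the parsing cost no longer grows with the size of the response.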

Nothing4You · 2024-08-03T12:29:54Z

what about validating the returned content type before reading any data?

phiresky · 2024-08-03T12:32:59Z

It seemed like the code was intentionally modified in 2021-2022 to not do that, maybe due to wrong server responses; there's a comment in the code linking #1964. But yes, that reason might never have been valid, might no longer be valid, or might not apply here.

Nothing4You · 2024-08-03T13:08:19Z

the html parser seems to be doing pretty good even with heavily truncated data, so it's probably fine to just do it that way.
if we already truncate, do we still need to worry about a binary check though?
the library seems to be doing fine with null bytes in the input as well:

Evaluating "<html><head><title>Hello</title><hea"
HTML parsed ok: true
HTML title: Some("Hello")
Evaluating "<html><head><title>Hell\0o World"
HTML parsed ok: true
HTML title: Some("Hell�o World")
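
A related detail when truncating: cutting at a fixed byte offset can split a multi-byte UTF-8 sequence. The following is a small sketch, assuming the prefix is converted with the standard library's `String::from_utf8_lossy`; `lossy_prefix` is a hypothetical helper, not a function from the PR. A split sequence becomes U+FFFD, while a NUL byte is valid UTF-8 and passes through this step unchanged, so the replacement character in the title above presumably comes from the HTML parser itself.

```rust
/// Converts at most `limit` bytes of `data` to a `String`, replacing
/// invalid or incomplete UTF-8 sequences with U+FFFD.
fn lossy_prefix(data: &[u8], limit: usize) -> String {
    let end = data.len().min(limit);
    String::from_utf8_lossy(&data[..end]).into_owned()
}
```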

phiresky · 2024-08-03T13:11:30Z

I've looked at the linked issue(s) again and don't see a reason to not use the content-type response. I've updated the PR to use the content-type header instead of checking for null bytes.
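
As a rough sketch of what a Content-Type check can look like (not the PR's actual code, which may rely on a MIME library): `is_html_content_type` is a hypothetical helper that accepts only `text/html`, ignoring parameters such as `charset`. Whether other types such as `application/xhtml+xml` should also pass is left open here.

```rust
/// Returns true if the Content-Type header value names `text/html`.
/// Parameters after `;` (for example `charset=utf-8`) are ignored.
fn is_html_content_type(header: Option<&str>) -> bool {
    // A missing header is treated as "not HTML" in this sketch.
    let Some(value) = header else { return false };
    let essence = value.split(';').next().unwrap_or("").trim();
    essence.eq_ignore_ascii_case("text/html")
}
```

Because the header arrives before the body, a check like this lets the caller skip image, GIF, and video responses without reading any of their data.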

dessalines · 2024-08-03T13:14:41Z

This is probably due to me stupidly removing the doctype check in 55f84dd#diff-0cd13a4a101c05a54d9ab789fa06c9d48ab61c0e1f5c9b9a178440e9ba3a5ef5L145 . I think adding that back in would be enough.
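
For reference, a doctype check along these lines might look like the sketch below. It is an assumption about the general shape of such a check, not the code removed in the linked commit; `starts_with_html_doctype` is a hypothetical name.

```rust
/// Returns true if `data`, after leading ASCII whitespace, begins with
/// `<!doctype html` (compared case-insensitively).
fn starts_with_html_doctype(data: &[u8]) -> bool {
    const DOCTYPE: &[u8] = b"<!doctype html";
    // Index of the first non-whitespace byte, or the end if there is none.
    let start = data
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(data.len());
    let rest = &data[start..];
    rest.len() >= DOCTYPE.len() && rest[..DOCTYPE.len()].eq_ignore_ascii_case(DOCTYPE)
}
```

One trade-off of this kind of check is that pages served without a doctype would be skipped even if they carry OpenGraph tags.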

dessalines

Nice, I see we already have tests below, so we should be good.

dessalines · 2024-08-03T16:17:08Z

CI is messing up, I'll try to see what's going on.

crates/api_common/src/request.rs

ticoombs · 2024-08-09T00:27:00Z

Might be too early to say, but this has reduced my CPU usage by at least half. Thanks

…tent-Type html (LemmyNet#4957)
* fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary.
* use mime type for determination
* chore: simplify collect function

phiresky requested review from Nutomic, dessalines, dullbananas and SleeplessOne1917 as code owners August 3, 2024 12:22

fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary.

f5bbc65

phiresky force-pushed the opengraph-optimize branch from 2d1a1bc to f5bbc65 Compare August 3, 2024 12:35

use mime type for determination

b60e511

phiresky changed the title ~~fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary.~~ fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html Aug 3, 2024

phiresky mentioned this pull request Aug 3, 2024

Adding back in doctype check. #4958

Closed

dessalines approved these changes Aug 3, 2024

SleeplessOne1917 approved these changes Aug 3, 2024

dullbananas requested changes Aug 3, 2024

crates/api_common/src/request.rs (resolved)

crates/api_common/src/request.rs (outdated, resolved)

dullbananas reviewed Aug 3, 2024

crates/api_common/src/request.rs (resolved)

chore: simplify collect function

34631a9

dullbananas approved these changes Aug 6, 2024

dessalines merged commit 606545c into main Aug 7, 2024
2 checks passed

This was referenced Nov 18, 2024

Metadata fetching fails on Youtube #5208

Open

Guess image mime type from file extension (fixes #5196) #5212

Merged
