fix: Run extract_opengraph_data only on first 64kB of data and if Content-Type html #4957
Conversation
What about validating the returned content type before reading any data?
It seems the code was intentionally changed in 2021-2022 not to do that, possibly because of incorrect server responses; there's a comment in the code linking #1964. That reason may not be valid, may no longer be valid, or may not apply to this case.
2d1a1bc to f5bbc65
The HTML parser seems to do pretty well even with heavily truncated data, so it's probably fine to just do it that way.
I've looked at the linked issue(s) again and don't see a reason not to use the Content-Type response header. I've updated the PR to check the Content-Type header instead of checking for null bytes.
This is probably due to me stupidly removing the doctype check in 55f84dd#diff-0cd13a4a101c05a54d9ab789fa06c9d48ab61c0e1f5c9b9a178440e9ba3a5ef5L145. I think adding that back in would be enough.
Nice, I see we already have tests below, so we should be good.
CI is messing up, I'll try to see what's going on.
…tent-Type html (LemmyNet#4957)
* fix: Run extract_opengraph_data only on first 64kB of data and if data is not binary.
* use mime type for determination
* chore: simplify collect function
Currently, metadata extraction fetches the full response, converts it lossily to a string, and then parses it as HTML.
This is expensive: parsing a 20 MB response takes 10-20 s at 100% CPU, and responses larger than 1 MB are common for image, GIF, and video URLs.
This PR changes that method: it tries to fetch only the first 64 kB of data and then runs the expensive HTML parsing only if there is no null byte in the data. After this, the whole API request takes only ~0.1 s (including the external fetch).
Fixes #4956.
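
For illustration, here is a minimal sketch of the idea, following the Content-Type variant discussed in the conversation rather than the null-byte check. It is not the code from this PR: the helper names `read_prefix`, `is_html` and `html_prefix` and the constant `MAX_METADATA_BYTES` are hypothetical, it uses only the standard library, and it assumes the response body is available as a `std::io::Read` and that only `text/html` counts as HTML.

```rust
use std::io::{self, Read};

/// Upper bound on bytes inspected for metadata (64 kB, as in the description).
/// Hypothetical name, not taken from the Lemmy code base.
const MAX_METADATA_BYTES: u64 = 64 * 1024;

/// Hypothetical helper: reads at most `MAX_METADATA_BYTES` from `body`
/// and leaves the rest of the stream unread.
fn read_prefix<R: Read>(body: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    body.take(MAX_METADATA_BYTES).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Hypothetical helper: true if the Content-Type header value names `text/html`.
/// Parameters such as `; charset=utf-8` are ignored.
fn is_html(content_type: Option<&str>) -> bool {
    content_type
        .and_then(|ct| ct.split(';').next())
        .map(|mime| mime.trim().eq_ignore_ascii_case("text/html"))
        .unwrap_or(false)
}

/// Returns the text to hand to the HTML parser, or `None` when the
/// response is not HTML and the expensive parsing should be skipped.
fn html_prefix<R: Read>(content_type: Option<&str>, body: R) -> io::Result<Option<String>> {
    if !is_html(content_type) {
        // Skip without reading any of the body.
        return Ok(None);
    }
    let bytes = read_prefix(body)?;
    // Lossy conversion, as in the existing extraction path.
    Ok(Some(String::from_utf8_lossy(&bytes).into_owned()))
}
```

With this shape, an image or video response is rejected on the header alone, and an HTML response contributes at most 64 kB to the parser no matter how large the full body is.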