Add query for LCP image formats and file sizes #97
Thanks @swissspidy, I left some feedback on the queries. Mostly I think they can be simplified using a few best practices.
A separate overarching question: it's hard for me to tell at this point, but at first glance the queries look very similar. Could we get both data points with a single query, rather than having two separate queries with a lot of duplicated parts?
Let me go back to the beginning here, after separate discussions with @felixarntz and @adamsilverstein. For context/background, I want to come up with queries supporting the hypothesis that the client-side media processing work is beneficial for LCP. At first, I was trying to come up with a comparison of median file size per image format. In hindsight that's not very relevant, as WebP and AVIF are usually smaller than JPEG, so such a query isn't necessarily useful. Given the LCP focus, what's more interesting is:
This way we should be able to get some useful data. To take this even further, we can look at all images on a page (from I only briefly looked at that so far but haven't dug deep yet. But I think it should be possible. Adam also made a good point that it's possible to detect usage of the MozJPEG encoder. So we can also track adoption there and the differences between MozJPEG and other encoders. This can again be done on all images on a page
@adamsilverstein @felixarntz I just added a new version of the query that is almost there, I think. The raw data I get from
@swissspidy There's nothing that immediately jumps out to me as problematic for query performance, except that there's a good chance you're simply processing so much data that the JOIN becomes too expensive. I left two recommendations where I think we can optimize that.
@swissspidy I'm not exactly sure why you get 3 rows for mobile and desktop in your query, but when grouping correctly you still get those 3 rows. I think the main problem is that you also group by url.
It looks like image/avif does not show up at all. Maybe that's because no LCP image uses AVIF - possible but at scale still unlikely, so potentially there's something in the HTTP Archive pipeline or underlying tooling that treats AVIF differently? 🤔
Agreed. Interestingly, it seems WebP is present if you adjust the query to be for Something that might help figure out what's going on (and I'd argue generally good to include in the result for a quantitative assessment) is how many images were considered in each group. Can you add that to the query?
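To illustrate the two suggestions above (don't group by url, and report how many images fall into each group), here is a minimal Python sketch. It is not the query from this PR: the rows, the `summarize` helper, and the `num_images` / `median_size` field names are all made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows: (client, mime_type, url, resp_size in bytes).
rows = [
    ("desktop", "image/jpeg", "https://example.com/a.jpg", 120_000),
    ("desktop", "image/jpeg", "https://example.com/b.jpg", 80_000),
    ("desktop", "image/webp", "https://example.com/c.webp", 40_000),
    ("mobile", "image/jpeg", "https://example.com/a.jpg", 60_000),
    ("mobile", "image/webp", "https://example.com/d.webp", 30_000),
    ("mobile", "image/webp", "https://example.com/e.webp", 50_000),
]


def summarize(rows):
    """Group by (client, mime_type) only; url is deliberately not a key."""
    groups = defaultdict(list)
    for client, mime_type, _url, resp_size in rows:
        groups[(client, mime_type)].append(resp_size)
    # Report the group size next to the median so small groups are easy to spot.
    return {
        key: {"num_images": len(sizes), "median_size": median(sizes)}
        for key, sizes in groups.items()
    }


summary = summarize(rows)
for key in sorted(summary):
    print(key, summary[key])
```

If url were part of the grouping key, every distinct image would form its own group and the medians would say nothing about the format as a whole. The count column also makes it visible when a format (AVIF, for instance) is missing or rests on only a handful of images.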
@swissspidy I may have found one of the main problems.
Yep, makes sense 👍 Cleaned up version looks more like this:
Makes it very obvious how much smaller WebP and AVIF are than JPEG, especially on mobile with similar dimensions.
Co-authored-by: Weston Ruter <westonruter@google.com>
The median width for GIF images as LCP is 3,500 pixels on desktop? Why is that so much larger than the other image formats on desktop? Such huge GIFs would surely have a much larger median file size, since GIFs are terribly heavy, assuming they are animated. So apparently they are not animated, since the median file size is not huge. I wonder whether this indicates an issue in the LCP detection logic. Specifically, Chrome is supposed to ignore low-entropy images when identifying the LCP element. Low entropy is defined by a threshold of "0.05 bits of image data per displayed pixel". Since the GIF format is the heaviest of all the formats, could it be that placeholder GIF images are erroneously being identified as LCP? I tried modifying the query to list the desktop LCP GIF images with an image width of at least 3,500 pixels:
Query:
DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

CREATE TEMPORARY FUNCTION
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
    categories ARRAY<STRING>,
    info ARRAY<STRING>>>,
    cms STRING,
    version STRING)
  RETURNS BOOL AS ( EXISTS(
    SELECT
      *
    FROM
      UNNEST(technologies) AS technology,
      UNNEST(technology.info) AS info
    WHERE
      technology.technology = cms
      AND ( version = ""
        OR ENDS_WITH(version, ".x")
        AND (STARTS_WITH(info, RTRIM(version, "x"))
          OR info = RTRIM(version, ".x"))
        OR info = version ) ) );

CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
  pagesWithLcpImages AS (
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
        'WordPress',
        '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),
  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )
SELECT
  url,
  image_width,
  resp_size
FROM
  pagesWithLcpImages
JOIN
  imageRequests
USING
  ( date,
    page,
    client,
    url )
WHERE
  image_width >= 3500
ORDER BY
  image_width DESC
I got only 86 results for this, which I didn't expect if the median of 32,816 images is 3,500 pixels wide: a median of 3,500 would mean at least half of them (about 16,400) are at least that wide. I did another query for all GIF LCP images that are less than 3,500 pixels wide, and I got 32,731 images in that case.
The top 10 GIF images in width (all SFW):
And the last 10:
None of these look like low-entropy placeholders. They're all just massive GIFs. So these check out. I also did a query for all LCP GIF images on desktop that have a bits-per-pixel value less than 0.1 (double the 0.05 threshold):
Query:
DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

CREATE TEMPORARY FUNCTION
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
    categories ARRAY<STRING>,
    info ARRAY<STRING>>>,
    cms STRING,
    version STRING)
  RETURNS BOOL AS ( EXISTS(
    SELECT
      *
    FROM
      UNNEST(technologies) AS technology,
      UNNEST(technology.info) AS info
    WHERE
      technology.technology = cms
      AND ( version = ""
        OR ENDS_WITH(version, ".x")
        AND (STARTS_WITH(info, RTRIM(version, "x"))
          OR info = RTRIM(version, ".x"))
        OR info = version ) ) );

CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
  pagesWithLcpImages AS (
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
        'WordPress',
        '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),
  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )
SELECT
  url,
  image_width,
  image_height,
  resp_size,
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) AS bpp
FROM
  pagesWithLcpImages
JOIN
  imageRequests
USING
  ( date,
    page,
    client,
    url )
WHERE
  image_width * image_height > 0 AND
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) < 0.1
ORDER BY
  image_width DESC
I got 759 results. Of those, 314 had a BPP less than 0.05. These seem to be incorrectly identified as LCP images, according to the low-entropy constraint. A few examples: https://eatbeat.jp/wp-content/themes/the-thor/img/dummy.gif
Anyway, this doesn't seem to impact the correctness of your query at all. I was just curious why the GIF image format seemed to be an outlier in its median width.
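For reference, the bits-per-pixel check used in the query above can be sketched in a few lines of Python. This only mirrors the arithmetic of `SAFE_DIVIDE(resp_size * 8, image_width * image_height)`; the function names and sample numbers are hypothetical. Note also that the query divides by the natural image dimensions, whereas the quoted threshold refers to displayed pixels, so this is an approximation of Chrome's heuristic rather than a reimplementation of it.

```python
# Threshold quoted in the comment above: 0.05 bits of image data per pixel.
LOW_ENTROPY_THRESHOLD_BPP = 0.05


def bits_per_pixel(resp_size_bytes, width, height):
    """Return bits of response data per pixel, or None when the area is zero."""
    area = width * height
    if area <= 0:
        return None  # same outcome as SAFE_DIVIDE returning NULL
    return resp_size_bytes * 8 / area


def is_low_entropy(resp_size_bytes, width, height,
                   threshold=LOW_ENTROPY_THRESHOLD_BPP):
    """True when the image carries fewer bits per pixel than the threshold."""
    bpp = bits_per_pixel(resp_size_bytes, width, height)
    return bpp is not None and bpp < threshold


# Hypothetical examples (assumed numbers, not taken from the query results):
# a 43-byte placeholder GIF reported as 1000x1000, and a 500 KB photo at 2000x1000.
print(is_low_entropy(43, 1000, 1000))
print(is_low_entropy(500_000, 2000, 1000))
```

A tiny file stretched over a large pixel area lands far below the threshold, which is the pattern a dummy.gif-style placeholder would produce.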
@westonruter @adamsilverstein I'd like to merge this one. Just need another approval. Ideally we get rid of the 2-approvals-required rule though.
Thanks @tunetheweb!
Screenshot due to GitHub's container being too narrow: