Add query for LCP image formats and file sizes #97

swissspidy · 2024-02-27T13:33:40Z

date	client	mime_type	num_lcp_images	median_width	median_height	median_file_size_kb
2024-03-01	desktop	image/avif	8582	250	440	35.66308594
2024-03-01	desktop	image/webp	655366	300	448	50.3515625
2024-03-01	desktop	image/jpeg	2081970	2000	427	110.8388672
2024-03-01	desktop	image/png	911693	300	397	133.7001953
2024-03-01	desktop	image/gif	32816	3500	338	149.8212891
2024-03-01	mobile	image/svg+xml	64097	300	25	5.446289063
2024-03-01	mobile	image/avif	7044	360	300	29.12792969
2024-03-01	mobile	image/webp	638423	360	270	42.3984375
2024-03-01	mobile	image/png	1087469	360	299	65.90722656
2024-03-01	mobile	image/jpeg	2002148	360	300	95.97558594
2024-03-01	mobile	image/gif	35545	360	300	121.1601563

Screenshot due to GitHub's container being too narrow:

sql/2024/02/median-image-bytes.sql

sql/2024/02/median-image-lcp.sql

sql/README.md

felixarntz

Thanks @swissspidy, I left some feedback on the queries. Mostly I think they can be simplified using a few best practices.

A separate overarching question: It's hard for me to tell at this point, but at first glance the queries look very similar. Could we get both data points with a single query, rather than having two separate queries with a lot of duplicated parts?

sql/2024/02/median-image-bytes.sql

sql/2024/02/median-image-lcp.sql

swissspidy · 2024-04-03T21:45:54Z

Let me go back to the beginning here, after separate discussions with @felixarntz and @adamsilverstein.

For context/background, I want to come up with queries supporting the hypothesis that the client-side media processing work is beneficial for LCP.

At first, here I was trying to come up with a comparison of median file size per image format. In hindsight that's not super relevant, as usually WebP and AVIF are smaller than JPEG. So such a query isn't necessarily useful.

Given the LCP focus, what's more interesting is:

Get all WordPress pages...
where the LCP element is an `<img>` element...
get the LCP image's dimensions and URL (from custom_metrics)...
Then, join with requests... on the URL
and get the image's response size and mime type

This way we should be able to get some useful data.
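
As a rough illustration of the steps listed above (not the actual BigQuery query), the join-and-aggregate logic can be sketched in Python on toy data. All rows, field names, and the `summarize` helper are hypothetical; the real data lives in the HTTP Archive tables.

```python
from collections import defaultdict
from statistics import median

# Hypothetical stand-ins for LCP image data taken from custom_metrics.
lcp_images = [
    {"page": "https://a.example/", "url": "https://a.example/hero.jpg", "width": 1200, "height": 600},
    {"page": "https://b.example/", "url": "https://b.example/hero.webp", "width": 1200, "height": 600},
    {"page": "https://c.example/", "url": "https://c.example/hero.jpg", "width": 800, "height": 400},
]

# Hypothetical stand-ins for rows from the requests table.
requests = [
    {"page": "https://a.example/", "url": "https://a.example/hero.jpg", "mime_type": "image/jpeg", "resp_size": 122880},
    {"page": "https://b.example/", "url": "https://b.example/hero.webp", "mime_type": "image/webp", "resp_size": 51200},
    {"page": "https://c.example/", "url": "https://c.example/hero.jpg", "mime_type": "image/jpeg", "resp_size": 81920},
    # A request that is not an LCP image; the join drops it.
    {"page": "https://c.example/", "url": "https://c.example/logo.png", "mime_type": "image/png", "resp_size": 4096},
]


def summarize(lcp_images, requests):
    """Join LCP images with requests on (page, url) and aggregate per MIME type."""
    # Note: keying a dict keeps one request per (page, url), whereas a SQL JOIN
    # would emit one row per matching request.
    by_key = {(r["page"], r["url"]): r for r in requests}

    groups = defaultdict(list)
    for img in lcp_images:
        req = by_key.get((img["page"], img["url"]))
        if req is None:
            continue  # inner join: skip LCP images without a matching request
        groups[req["mime_type"]].append((img["width"], img["height"], req["resp_size"]))

    result = {}
    for mime_type, rows in groups.items():
        widths, heights, sizes = zip(*rows)
        result[mime_type] = {
            "num_lcp_images": len(rows),
            "median_width": median(widths),
            "median_height": median(heights),
            "median_file_size_kb": median(sizes) / 1024,
        }
    return result


summary = summarize(lcp_images, requests)
for mime_type, stats in sorted(summary.items()):
    print(mime_type, stats)
```

Reporting the number of images per group alongside the medians makes it easier to judge how much weight each row deserves.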

To take this even further, we can look at all images on a page (from custom_metrics) and do the same for them, getting dimensions, response size and mime type.

I only briefly looked at that so far but haven't dug deep yet. But I think it should be possible.

Adam also made a good point that it's possible to detect usage of the MozJPEG encoder. So we can also track adoption there and the differences between MozJPEG and other encoders. This can again be done on all images on a page.

swissspidy · 2024-04-05T13:50:47Z

@adamsilverstein @felixarntz I just added a new version of the query that is almost there I think. The raw data I get from pagesWithImages is correct (I get all the individual images and their dimensions), but then I am doing something wrong when joining the tables as the query is timing out 😅 What am I missing?

felixarntz

@swissspidy Nothing immediately jumps out to me as problematic for query performance, except that there's a good chance you're simply processing so much data that the JOIN becomes too expensive. I left two recommendations where I think we can optimize that.

sql/2024/04/image-formats.sql

felixarntz

@swissspidy I'm not exactly sure why you get 3 rows for mobile and desktop in your query, but when grouping correctly you still get those 3 rows. I think the main problem is that you also group by url.

It looks like image/avif does not show up at all. Maybe that's because no LCP image uses AVIF - possible but at scale still unlikely, so potentially there's something in the HTTP Archive pipeline or underlying tooling that treats AVIF differently? 🤔

sql/2024/04/image-formats.sql

felixarntz · 2024-04-12T19:15:33Z

@swissspidy

> AVIF not being in that list is one thing, but that even WebP is missing is a bit surprising.

Agreed. Interestingly, it seems WebP is present if you adjust the query to be for 2024-03-01 (which I think we should do now anyway, given that the data is more recent and now available). But it's strange that the results would differ between the two months. 🤔

Something that might help figure out what's going on (and I'd argue generally good to include in the result for a quantitative assessment) is how many images were considered in each group. Can you add that to the query?

sql/2024/04/image-formats.sql

felixarntz

@swissspidy I may have found one of the main problems.

sql/2024/04/image-formats.sql

swissspidy · 2024-04-12T22:05:03Z

Yep, makes sense 👍

Cleaned up version looks more like this:

date	client	mime_type	num_lcp_images	median_width	median_height	median_file_size_kb
2024-03-01	desktop	image/avif	8582	250	440	35.66308594
2024-03-01	desktop	image/webp	655366	300	448	50.3515625
2024-03-01	desktop	image/jpeg	2081970	2000	427	110.8388672
2024-03-01	desktop	image/png	911693	300	397	133.7001953
2024-03-01	desktop	image/gif	32816	3500	338	149.8212891
2024-03-01	mobile	image/svg+xml	64097	300	25	5.446289063
2024-03-01	mobile	image/avif	7044	360	300	29.12792969
2024-03-01	mobile	image/webp	638423	360	270	42.3984375
2024-03-01	mobile	image/png	1087469	360	299	65.90722656
2024-03-01	mobile	image/jpeg	2002148	360	300	95.97558594
2024-03-01	mobile	image/gif	35545	360	300	121.1601563

Makes it very obvious how much smaller WebP and AVIF are than JPEG, especially on mobile with similar dimensions.

sql/2024/04/image-formats.sql

Co-authored-by: Weston Ruter <westonruter@google.com>

westonruter · 2024-04-26T20:49:01Z

The median width for GIF images as LCP is 3,500 pixels on desktop? Why is that so much larger than the other image formats on desktop?

GIFs that huge would surely be expected to have a much larger median file size, since GIFs are terribly heavy, assuming they are animated. So apparently they are not animated, since the median is not huge. I wonder whether this indicates an issue in the LCP detection logic. Specifically, Chrome is supposed to exclude low-entropy images from being identified as LCP. The low-entropy threshold is "0.05 bits of image data per displayed pixel". Since the GIF format is the heaviest of all the formats, could it be that placeholder GIF images are erroneously being identified as LCP?

I tried modifying the query to list out the desktop LCP GIF images which had an image width of at least 3,500 pixels:

Query

DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

CREATE TEMPORARY FUNCTION
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
                                   categories ARRAY<STRING>,
                                   info ARRAY<STRING>>>,
         cms STRING,
         version STRING)
  RETURNS BOOL AS ( EXISTS(
  SELECT
    *
  FROM
    UNNEST(technologies) AS technology,
    UNNEST(technology.info) AS info
  WHERE
      technology.technology = cms
    AND ( version = ""
    OR ENDS_WITH(version, ".x")
            AND (STARTS_WITH(info, RTRIM(version, "x"))
        OR info = RTRIM(version, ".x"))
    OR info = version ) ) );

CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
  pagesWithLcpImages AS (
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
             'WordPress',
             '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),

  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )

SELECT
  url,
  image_width,
  resp_size
FROM
  pagesWithLcpImages
JOIN
  imageRequests
USING
  (date, page, client, url)
WHERE
  image_width >= 3500
ORDER BY
  image_width DESC

I only got 86 results for this, which I didn't expect if the median width of 32,816 images is 3,500 pixels. I ran another query for all GIF LCP images that are less than 3,500 pixels wide, and got 32,731 images in that case. The top 10 GIF images by width (all SFW):

url	image_width	resp_size	Animated?
https://i0.wp.com/el-foto.nl/wp-content/uploads/2018/04/cropped-Ellen-van-den-Doel-white-high-res2-andere-verhouding-2.gif?fit=14400%2C1756&ssl=1	14400	75657	No
https://leszekcichonski.pl/wp-content/uploads/2015/02/logo-LC-big-white-1.gif	14211	179550	No
https://mrschews.co.uk/wp-content/uploads/2021/11/mcck-red-ping-pong-bat-row-compressed.gif	11235	1146997	Yes
https://www.polkstanleywilcox.com/wp-content/uploads/2014/10/homepage2.gif	10011	158571	No
https://bizsense.co.il/wp-content/uploads/2016/11/%D7%A8%D7%90%D7%A9%D7%99-3.gif	8268	647296	Yes
https://brachealthcare.com/wp-content/uploads/2024/03/Popup01-1.gif	8001	10443000	Yes
https://i0.wp.com/phenixxgaming.com/wp-content/uploads/2019/03/cropped-Logo-Rainbow.gif?fit=6857%2C1522&ssl=1	6857	684209	No
https://fams-skin.com/wp-content/themes/fams_baby_202009/assets/img/p_wakuwaku/sec_1_1.gif	6000	796152	Yes
https://maisdoquecasas.arq.up.pt/wp-content/themes/maisdoquecasas/img/landing1.gif	5817	223862	Yes
http://www.twoman.co.th/wp-content/uploads/2017/02/new2manLogo.gif	5630	177134	No

And the last 10:

url	image_width	resp_size	Animated?
https://www.kaneko-cord.com/2021renewal/wp-content/themes/kaneko/img/top/mv.gif	3640	7092872	Yes
https://kobexstal.pl/wp-content/themes/kobex/img/anim.gif	3559	2277309	Yes
https://cdn-dicle.nitrocdn.com/MlEwlJPhJBrqahFHrWHyjMidvpNMUioA/assets/images/optimized/rev-5359b84/www.flightstairways.com.au/wp-content/uploads/2020/02/Flight-stairways-logo_Blue-Step.gif	3540	91540	Yes
https://yagumo.co.jp/wp-content/themes/yagumo/assets/img/top/hero_anim.gif	3540	361761	No
https://christopherhaanes.com/wp-content/uploads/2021/07/Homepage_redstripe_CH-1.gif	3508	676600	No
https://www.lamassanostra.com/wp-content/uploads/2022/08/GIF-PIZZA.gif	3507	35032423	Yes
https://arapetroleum.com/wp-content/uploads/2022/11/world-map-8.gif	3500	155804	No
https://35.bienal.org.br/wp-content/themes/fluxo/images/home-en-2.gif	3500	5130686	Yes
https://l-is-b.com/ja/wp-content/themes/lisb/common/img/top/life_is_being.gif	3500	40266	Yes
https://regalpack.com/wp-content/uploads/2019/06/Slider_BG_withWall_3500x511.gif	3500	1740943	No
https://35.bienal.org.br/wp-content/themes/fluxo/images/home-en-2.gif	3500	5130686	Yes

None of these look like low-entropy placeholders. They're all just massive GIFs. So these check out.
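
The two counts reported above (86 images at least 3,500 pixels wide versus 32,731 narrower ones) can be checked against the reported median with a few lines of Python. This is only a sanity check of the arithmetic, assuming both counts describe the same population of desktop GIF LCP images; the `can_be_median` helper is hypothetical and not part of the PR.

```python
def can_be_median(count_at_or_above: int, count_below: int) -> bool:
    """Return True if a value could be the median of a list, given how many
    items are >= the value and how many are < the value.

    At least half of all items must be greater than or equal to the median.
    """
    total = count_at_or_above + count_below
    return 2 * count_at_or_above >= total


# Counts taken from the comment above.
at_or_above_3500 = 86
below_3500 = 32_731

share = at_or_above_3500 / (at_or_above_3500 + below_3500)
print(f"Share of images at least 3500 px wide: {share:.2%}")
print("3500 px plausible as median:", can_be_median(at_or_above_3500, below_3500))
```

With well under half of the images at or above 3,500 pixels, a true median of 3,500 would not be expected for this set, which matches the surprise expressed above. The sketch does not explain where the 3,500 figure in the results table comes from.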

I also did a query for all LCP GIF images on desktop that have a bits-per-pixel value less than 0.1 (double the 0.05 threshold):

Query

DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

CREATE TEMPORARY FUNCTION
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
                                   categories ARRAY<STRING>,
                                   info ARRAY<STRING>>>,
         cms STRING,
         version STRING)
  RETURNS BOOL AS ( EXISTS(
  SELECT
    *
  FROM
    UNNEST(technologies) AS technology,
    UNNEST(technology.info) AS info
  WHERE
      technology.technology = cms
    AND ( version = ""
    OR ENDS_WITH(version, ".x")
            AND (STARTS_WITH(info, RTRIM(version, "x"))
        OR info = RTRIM(version, ".x"))
    OR info = version ) ) );

CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
  pagesWithLcpImages AS (
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
             'WordPress',
             '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),

  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )

SELECT
  url,
  image_width,
  image_height,
  resp_size,
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) AS bpp
FROM
  pagesWithLcpImages
JOIN
  imageRequests
USING
  (date, page, client, url)
WHERE
  image_width * image_height > 0 AND
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) < 0.1
ORDER BY
  image_width DESC

I got 759 results. Of those, 314 had a BPP less than 0.05. These seem to be incorrectly identified as LCP images, according to the low-entropy constraint. A few examples:

https://eatbeat.jp/wp-content/themes/the-thor/img/dummy.gif
https://www.sleeky.co.uk/wp-content/themes/sleeky-theme/imgs/sr1-development-background.gif
https://storage.googleapis.com/absolute-comuniones/images/textos/loading.gif
https://redwagonfarmboulder.com/wp-content/uploads/2013/02/headerfix1.gif
https://porctheberge.com/wp-content/themes/tb/img/vignette-dummy.gif
https://www.creagro.nl/wp-content/uploads/2018/06/cropped-header-zelf.gif
https://promare-movie.com/wp-content/themes/promare-theme_v1/asset/img/top/kv-blank.gif
https://pilgrimagespaces.co.za/wp-content/themes/savoy/img/placeholder.gif
https://yoor.at/wp-content/uploads/2017/11/yoor_slider_bg_5120x540_grey_201711-scaled.gif
https://www.inax.com.vn/wp-content/themes/inax/assets/images/productholder.gif
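
The bits-per-pixel check used in the query above can be reproduced outside BigQuery with a small Python sketch. It mirrors the query's `SAFE_DIVIDE(resp_size * 8, image_width * image_height)` expression and the 0.05 threshold quoted earlier. Note that, like the query, it uses the response size and the image's natural dimensions, whereas the quoted rule refers to image data per displayed pixel, so it is only an approximation. The function names and sample values are hypothetical.

```python
from typing import Optional

# Threshold quoted in this thread: 0.05 bits of image data per displayed pixel.
LOW_ENTROPY_THRESHOLD_BPP = 0.05


def bits_per_pixel(resp_size_bytes: int, width: int, height: int) -> Optional[float]:
    """Approximate bits per pixel from response size and natural dimensions.

    Returns None when the pixel count is not positive, similar to how
    SAFE_DIVIDE returns NULL on division by zero.
    """
    pixels = width * height
    if pixels <= 0:
        return None
    return resp_size_bytes * 8 / pixels


def is_low_entropy(resp_size_bytes: int, width: int, height: int) -> bool:
    """Return True if the approximate bits per pixel falls below the threshold."""
    bpp = bits_per_pixel(resp_size_bytes, width, height)
    return bpp is not None and bpp < LOW_ENTROPY_THRESHOLD_BPP


# Hypothetical examples: a large, mostly flat background GIF and a typical photo.
print(is_low_entropy(10_000, 5120, 540))   # wide, tiny file
print(is_low_entropy(150_000, 1200, 600))  # ordinary hero image
```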

Anyway, this doesn't seem to impact the correctness of your query at all. I was just curious why the GIF image format seemed to be an outlier in its median width.

swissspidy · 2024-05-07T12:00:02Z

@westonruter @adamsilverstein I'd like to merge this one. Just need another approval. Ideally we get rid of the 2-approvals-required rule though.

swissspidy · 2024-05-07T12:51:08Z

Thanks @tunetheweb!

Add queries for image formats

ebd1f03