Add query for LCP image formats and file sizes #97

Merged
merged 21 commits into from
May 7, 2024

Collaborator

@swissspidy swissspidy commented Feb 27, 2024

date client mime_type num_lcp_images median_width median_height median_file_size_kb
2024-03-01 desktop image/avif 8582 250 440 35.66308594
2024-03-01 desktop image/webp 655366 300 448 50.3515625
2024-03-01 desktop image/jpeg 2081970 2000 427 110.8388672
2024-03-01 desktop image/png 911693 300 397 133.7001953
2024-03-01 desktop image/gif 32816 3500 338 149.8212891
2024-03-01 mobile image/svg+xml 64097 300 25 5.446289063
2024-03-01 mobile image/avif 7044 360 300 29.12792969
2024-03-01 mobile image/webp 638423 360 270 42.3984375
2024-03-01 mobile image/png 1087469 360 299 65.90722656
2024-03-01 mobile image/jpeg 2002148 360 300 95.97558594
2024-03-01 mobile image/gif 35545 360 300 121.1601563

Screenshot due to GitHub's container being too narrow:

[Screenshot: github com_GoogleChromeLabs_wpp-research_pull_97]

Collaborator

@felixarntz felixarntz left a comment

Thanks @swissspidy, I left some feedback on the queries. Mostly I think they can be simplified using a few best practices.

A separate overarching question: It's hard for me to spot at this point, but at first glance the queries look very similar. Could we get both data points with a single query, rather than having two separate queries with a lot of duplicated parts?

@swissspidy
Collaborator Author

Let me go back to the beginning here, after separate discussions with @felixarntz and @adamsilverstein.

For context/background, I want to come up with queries supporting the hypothesis that the client-side media processing work is beneficial for LCP.

At first, here I was trying to come up with a comparison of median file size per image format. In hindsight that's not super relevant, as usually WebP and AVIF are smaller than JPEG. So such a query isn't necessarily useful.

Given the LCP focus, what's more interesting is:

  • Get all WordPress pages...
  • where the LCP element is an <img> element...
  • get the LCP image's dimensions and URL (from custom_metrics)...
  • Then, join with requests... on the URL
  • and get the image's response size and mime type

This way we should be able to get some useful data.
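
The steps above could be sketched roughly as follows. This is only an illustration: it uses made-up inline rows instead of the real HTTP Archive tables, and the CTE names, column names, and sample values are assumptions chosen to keep the example self-contained, not the query from this PR.

```sql
-- Hypothetical inline rows stand in for the real pages and requests tables.
WITH
  lcp_images AS (
    -- One row per page: the LCP image URL and its natural dimensions.
    SELECT * FROM UNNEST([
      STRUCT('https://example.com/' AS page, 'https://example.com/hero.jpg' AS url, 1200 AS image_width, 800 AS image_height),
      STRUCT('https://example.org/', 'https://example.org/banner.webp', 1200, 800)
    ])
  ),
  image_requests AS (
    -- One row per request: MIME type and response size in bytes.
    SELECT * FROM UNNEST([
      STRUCT('https://example.com/' AS page, 'https://example.com/hero.jpg' AS url, 'image/jpeg' AS mime_type, 150000 AS resp_size),
      STRUCT('https://example.org/', 'https://example.org/banner.webp', 'image/webp', 60000),
      STRUCT('https://example.org/', 'https://example.org/logo.png', 'image/png', 9000)
    ])
  )
SELECT
  page,
  url,
  image_width,
  image_height,
  mime_type,
  resp_size
FROM
  lcp_images
JOIN
  image_requests
USING
  (page, url)
```

Joining on both page and url keeps only the request that matches each page's LCP image, so the non-LCP logo.png row drops out of the result.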

To take this even further, we can look at all images on a page (from custom_metrics) and do the same for them, getting dimensions, response size and mime type.

I only briefly looked at that so far but haven't dug deep yet. But I think it should be possible.

Adam also made a good point that it's possible to detect usage of the MozJPEG encoder. So we can also track adoption there and differences between MozJPEG and other encoders. This can again be done on all images on a page.

@swissspidy
Collaborator Author

@adamsilverstein @felixarntz I just added a new version of the query that is almost there I think. The raw data I get from pagesWithImages is correct (I get all the individual images and their dimensions), but then I am doing something wrong when joining the tables as the query is timing out 😅 What am I missing?

Collaborator

@felixarntz felixarntz left a comment

@swissspidy Nothing immediately jumps out to me as problematic for query performance, except that there's a good chance you're simply processing so much data that the JOIN becomes too expensive. I left two recommendations where I think we can optimize that.

@swissspidy

This comment was marked as outdated.

Collaborator

@felixarntz felixarntz left a comment

@swissspidy I'm not exactly sure why you get 3 rows for mobile and desktop in your query, but when grouping correctly you still get those 3 rows. I think the main problem is that you also group by url.

It looks like image/avif does not show up at all. Maybe that's because no LCP image uses AVIF - possible but at scale still unlikely, so potentially there's something in the HTTP Archive pipeline or underlying tooling that treats AVIF differently? 🤔

@swissspidy

This comment was marked as resolved.

@swissspidy swissspidy marked this pull request as ready for review April 12, 2024 09:40
@felixarntz
Collaborator

@swissspidy

> AVIF not being in that list is one thing, but that even WebP is missing is a bit surprising.

Agreed. Interestingly, it seems WebP is present if you adjust the query to be for 2024-03-01 (which I think we should do now anyway, given that data is more recent and now available). But it's strange that this would differ between two months. 🤔

Something that might help figure out what's going on (and I'd argue generally good to include in the result for a quantitative assessment) is how many images were considered in each group. Can you add that to the query?
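
A per-group count like the one requested here can sit right next to the median in the same aggregation. The following is a minimal sketch on made-up inline rows; the num_lcp_images and median_file_size_kb column names follow the results table in this thread, while the CTE name, the sample values, and the choice of APPROX_QUANTILES for the median are assumptions rather than the PR's actual query.

```sql
-- Hypothetical rows: one per LCP image, with its response size in bytes.
WITH lcp_images AS (
  SELECT * FROM UNNEST([
    STRUCT('desktop' AS client, 'image/jpeg' AS mime_type, 150000 AS resp_size),
    STRUCT('desktop', 'image/jpeg', 90000),
    STRUCT('desktop', 'image/jpeg', 120000),
    STRUCT('desktop', 'image/webp', 60000)
  ])
)
SELECT
  client,
  mime_type,
  -- How many images went into each group.
  COUNT(*) AS num_lcp_images,
  -- Approximate median: the 50th of 100 quantile boundaries, converted to KB.
  APPROX_QUANTILES(resp_size, 100)[OFFSET(50)] / 1024 AS median_file_size_kb
FROM
  lcp_images
GROUP BY
  client,
  mime_type
ORDER BY
  client,
  median_file_size_kb
```

Having the count in the output makes it easier to judge whether a surprising median comes from a large group or from a handful of images.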

@swissspidy

This comment was marked as resolved.

Collaborator

@felixarntz felixarntz left a comment

@swissspidy I may have found one of the main problems.

@swissspidy

This comment was marked as resolved.

@felixarntz

This comment was marked as resolved.

@swissspidy
Collaborator Author

swissspidy commented Apr 12, 2024

Yep, makes sense 👍

Cleaned up version looks more like this:

date client mime_type num_lcp_images median_width median_height median_file_size_kb
2024-03-01 desktop image/avif 8582 250 440 35.66308594
2024-03-01 desktop image/webp 655366 300 448 50.3515625
2024-03-01 desktop image/jpeg 2081970 2000 427 110.8388672
2024-03-01 desktop image/png 911693 300 397 133.7001953
2024-03-01 desktop image/gif 32816 3500 338 149.8212891
2024-03-01 mobile image/svg+xml 64097 300 25 5.446289063
2024-03-01 mobile image/avif 7044 360 300 29.12792969
2024-03-01 mobile image/webp 638423 360 270 42.3984375
2024-03-01 mobile image/png 1087469 360 299 65.90722656
2024-03-01 mobile image/jpeg 2002148 360 300 95.97558594
2024-03-01 mobile image/gif 35545 360 300 121.1601563

Makes it very obvious how much smaller WebP and AVIF are than JPEG, especially on mobile with similar dimensions.

@swissspidy swissspidy requested a review from westonruter April 16, 2024 10:22
Co-authored-by: Weston Ruter <westonruter@google.com>
@swissspidy swissspidy requested a review from westonruter April 24, 2024 12:24
@westonruter
Collaborator

The median width for GIF images as LCP is 3,500 pixels on desktop? Why is that so much larger than the other image formats on desktop?

A GIF that large would surely have a much larger median file size, since GIFs are terribly heavy, assuming they are animated. So apparently they are not animated, since the median is not huge. Does this indicate an issue in the LCP detection logic, I wonder? Specifically, Chrome is supposed to exclude low-entropy images from being identified as LCP. The low-entropy threshold is "0.05 bits of image data per displayed pixel". Since GIF is the heaviest of all the formats, could it be that placeholder GIF images are erroneously being identified as LCP?
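
As a rough illustration of that threshold, bits per pixel can be approximated as the response size in bits divided by the pixel area. The sketch below uses made-up rows and hypothetical names, and it divides by the natural image dimensions; Chrome's heuristic is defined on displayed pixels, so this is only an approximation of what the browser computes.

```sql
-- Made-up rows: natural dimensions in pixels, response size in bytes.
WITH sample_images AS (
  SELECT * FROM UNNEST([
    STRUCT('photo.gif' AS name, 1200 AS image_width, 800 AS image_height, 150000 AS resp_size),
    STRUCT('placeholder.gif', 1200, 800, 2000)
  ])
)
SELECT
  name,
  -- Bits of image data per pixel of natural area.
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) AS bpp,
  -- TRUE for rows below the 0.05 threshold quoted above.
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) < 0.05 AS below_threshold
FROM
  sample_images
```

Here the first row works out to 1.25 bits per pixel and the second to roughly 0.017, so only the second falls below the threshold.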

I tried modifying the query to list out the desktop LCP GIF images which had an image width of at least 3,500 pixels:

Query
DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

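-- Returns TRUE if the detected technologies include `cms`, optionally matching a version (exact, or a prefix such as "6.x").
CREATE TEMPORARY FUNCTION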
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
                                   categories ARRAY<STRING>,
                                   info ARRAY<STRING>>>,
         cms STRING,
         version STRING)
  RETURNS BOOL AS ( EXISTS(
  SELECT
    *
  FROM
    UNNEST(technologies) AS technology,
    UNNEST(technology.info) AS info
  WHERE
      technology.technology = cms
    AND ( version = ""
    OR ENDS_WITH(version, ".x")
            AND (STARTS_WITH(info, RTRIM(version, "x"))
        OR info = RTRIM(version, ".x"))
    OR info = version ) ) );

-- Returns TRUE if the request summary reports an image/gif MIME type.
CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
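  -- Desktop WordPress pages whose LCP element is an <img>, with the image URL and natural dimensions.
  pagesWithLcpImages AS (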
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
             'WordPress',
             '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),

  -- Desktop GIF requests with their MIME type and response size.
  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )

SELECT
  url,
  image_width,
  resp_size
FROM
  pagesWithLcpImages
    JOIN
  imageRequests
  USING
    ( date,
      page,
      client,
      url )
WHERE
  image_width >= 3500
ORDER BY
  image_width DESC

I only got 86 results for this, which I didn't expect if the median width of 32,816 images is 3,500 pixels: a median that high would mean about half of the images are at least that wide. I did another query for all GIF LCP images less than 3,500 pixels wide, and got 32,731 images in that case. The top 10 GIF images by width (all SFW):

url image_width resp_size Animated?
https://i0.wp.com/el-foto.nl/wp-content/uploads/2018/04/cropped-Ellen-van-den-Doel-white-high-res2-andere-verhouding-2.gif?fit=14400%2C1756&ssl=1 14400 75657 No
https://leszekcichonski.pl/wp-content/uploads/2015/02/logo-LC-big-white-1.gif 14211 179550 No
https://mrschews.co.uk/wp-content/uploads/2021/11/mcck-red-ping-pong-bat-row-compressed.gif 11235 1146997 Yes
https://www.polkstanleywilcox.com/wp-content/uploads/2014/10/homepage2.gif 10011 158571 No
https://bizsense.co.il/wp-content/uploads/2016/11/%D7%A8%D7%90%D7%A9%D7%99-3.gif 8268 647296 Yes
https://brachealthcare.com/wp-content/uploads/2024/03/Popup01-1.gif 8001 10443000 Yes
https://i0.wp.com/phenixxgaming.com/wp-content/uploads/2019/03/cropped-Logo-Rainbow.gif?fit=6857%2C1522&ssl=1 6857 684209 No
https://fams-skin.com/wp-content/themes/fams_baby_202009/assets/img/p_wakuwaku/sec_1_1.gif 6000 796152 Yes
https://maisdoquecasas.arq.up.pt/wp-content/themes/maisdoquecasas/img/landing1.gif 5817 223862 Yes
http://www.twoman.co.th/wp-content/uploads/2017/02/new2manLogo.gif 5630 177134 No

And the last 10:

url image_width resp_size Animated?
https://www.kaneko-cord.com/2021renewal/wp-content/themes/kaneko/img/top/mv.gif 3640 7092872   Yes
https://kobexstal.pl/wp-content/themes/kobex/img/anim.gif 3559 2277309   Yes
https://cdn-dicle.nitrocdn.com/MlEwlJPhJBrqahFHrWHyjMidvpNMUioA/assets/images/optimized/rev-5359b84/www.flightstairways.com.au/wp-content/uploads/2020/02/Flight-stairways-logo_Blue-Step.gif 3540 91540   Yes
https://yagumo.co.jp/wp-content/themes/yagumo/assets/img/top/hero_anim.gif 3540 361761   No
https://christopherhaanes.com/wp-content/uploads/2021/07/Homepage_redstripe_CH-1.gif 3508 676600   No
https://www.lamassanostra.com/wp-content/uploads/2022/08/GIF-PIZZA.gif 3507 35032423   Yes
https://arapetroleum.com/wp-content/uploads/2022/11/world-map-8.gif 3500 155804   No
https://35.bienal.org.br/wp-content/themes/fluxo/images/home-en-2.gif 3500 5130686   Yes
https://l-is-b.com/ja/wp-content/themes/lisb/common/img/top/life_is_being.gif 3500 40266   Yes
https://regalpack.com/wp-content/uploads/2019/06/Slider_BG_withWall_3500x511.gif 3500 1740943   No
https://35.bienal.org.br/wp-content/themes/fluxo/images/home-en-2.gif 3500 5130686   Yes

None of these look like low-entropy placeholders. They're all just massive GIFs. So these check out.
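
To cross-check a median against row counts like the 86 results mentioned earlier, one option is to count how many rows sit at or above the candidate value: for a true median, that share should be about one half or more. A small sketch on made-up widths (the values and names are illustrative, not data from this query):

```sql
-- Hypothetical image widths in pixels.
WITH widths AS (
  SELECT image_width
  FROM UNNEST([300, 320, 360, 400, 3500, 4000]) AS image_width
)
SELECT
  COUNT(*) AS num_images,
  -- Rows at or above the candidate median.
  COUNTIF(image_width >= 3500) AS num_at_least_3500,
  SAFE_DIVIDE(COUNTIF(image_width >= 3500), COUNT(*)) AS share_at_least_3500,
  -- Approximate median for comparison.
  APPROX_QUANTILES(image_width, 100)[OFFSET(50)] AS median_width
FROM
  widths
```

With the counts reported in this comment, 86 of roughly 32,800 images is well under 1%, which is hard to reconcile with a median width of 3,500.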

I also did a query for all LCP GIF images on desktop that have a bits-per-pixel value less than 0.1 (double the 0.05 threshold):

Query
DECLARE
  DATE_TO_QUERY DATE DEFAULT '2024-03-01';

CREATE TEMPORARY FUNCTION
  IS_CMS(technologies ARRAY<STRUCT<technology STRING,
                                   categories ARRAY<STRING>,
                                   info ARRAY<STRING>>>,
         cms STRING,
         version STRING)
  RETURNS BOOL AS ( EXISTS(
  SELECT
    *
  FROM
    UNNEST(technologies) AS technology,
    UNNEST(technology.info) AS info
  WHERE
      technology.technology = cms
    AND ( version = ""
    OR ENDS_WITH(version, ".x")
            AND (STARTS_WITH(info, RTRIM(version, "x"))
        OR info = RTRIM(version, ".x"))
    OR info = version ) ) );

CREATE TEMPORARY FUNCTION
  IS_GIF_IMAGE (summary STRING)
  RETURNS BOOLEAN AS (LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) = "image/gif");

WITH
  pagesWithLcpImages AS (
    SELECT
      date,
      client,
      page,
      JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.url') AS url,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalWidth') AS INT64) AS image_width,
      CAST(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.naturalHeight') AS INT64) AS image_height,
    FROM
      `httparchive.all.pages`
    WHERE
      IS_CMS(technologies,
             'WordPress',
             '')
      AND LOWER(JSON_EXTRACT_SCALAR(custom_metrics, '$.performance.lcp_elem_stats.nodeName')) = 'img'
      AND date = DATE_TO_QUERY
      AND client = "desktop" ),

  imageRequests AS (
    SELECT
      date,
      client,
      page,
      url,
      LOWER(CAST(JSON_EXTRACT_SCALAR(summary, "$.mimeType") AS STRING)) AS mime_type,
      CAST( JSON_EXTRACT_SCALAR(summary, "$.respSize") AS NUMERIC) AS resp_size,
    FROM
      `httparchive.all.requests`
    WHERE
      IS_GIF_IMAGE(summary)
      AND date = DATE_TO_QUERY
      AND client = "desktop" )

SELECT
  url,
  image_width,
  image_height,
  resp_size,
  -- Bits per pixel: response size in bits divided by the natural pixel area.
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) AS bpp
FROM
  pagesWithLcpImages
    JOIN
  imageRequests
  USING
    ( date,
      page,
      client,
      url )
WHERE
  image_width * image_height > 0 AND
  SAFE_DIVIDE(resp_size * 8, image_width * image_height) < 0.1
ORDER BY
  image_width DESC

I got 759 results. Of those, 314 had a BPP less than 0.05. These seem to be incorrectly identified as LCP images, according to the low-entropy constraint. A few examples:

https://eatbeat.jp/wp-content/themes/the-thor/img/dummy.gif
https://www.sleeky.co.uk/wp-content/themes/sleeky-theme/imgs/sr1-development-background.gif
https://storage.googleapis.com/absolute-comuniones/images/textos/loading.gif
https://redwagonfarmboulder.com/wp-content/uploads/2013/02/headerfix1.gif
https://porctheberge.com/wp-content/themes/tb/img/vignette-dummy.gif
https://www.creagro.nl/wp-content/uploads/2018/06/cropped-header-zelf.gif
https://promare-movie.com/wp-content/themes/promare-theme_v1/asset/img/top/kv-blank.gif
https://pilgrimagespaces.co.za/wp-content/themes/savoy/img/placeholder.gif
https://yoor.at/wp-content/uploads/2017/11/yoor_slider_bg_5120x540_grey_201711-scaled.gif
https://www.inax.com.vn/wp-content/themes/inax/assets/images/productholder.gif

Anyway, this doesn't seem to impact the correctness of your query at all. I was just curious why the GIF image format seemed to be an outlier in its median width.

@swissspidy swissspidy changed the title Add queries for image formats Add query for LCP image formats and file sizes May 7, 2024
@swissspidy
Collaborator Author

@westonruter @adamsilverstein I'd like to merge this one. Just need another approval. Ideally we get rid of the 2-approvals-required rule though.

@swissspidy
Collaborator Author

Thanks @tunetheweb!

@swissspidy swissspidy merged commit 8163be1 into main May 7, 2024
3 checks passed
@swissspidy swissspidy deleted the add/query-image-formats branch May 7, 2024 12:51