Add query to count content-type headers used for WordPress pages #74

Merged
merged 10 commits into from
Oct 26, 2023
62 changes: 62 additions & 0 deletions sql/2023/10/page-content-types.sql
@@ -0,0 +1,62 @@
# HTTP Archive query to get counts of content-types used for WordPress pages.
#
# WPP Research, Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# See query results here: https://github.com/GoogleChromeLabs/wpp-research/pull/74
WITH pages AS (
SELECT
client,
page AS url
FROM
`httparchive.all.pages`,
UNNEST(technologies) AS t
WHERE
date = '2023-08-01' AND
is_root_page AND
t.technology = 'WordPress'
),

# h/t https://discuss.httparchive.org/t/help-finding-list-of-home-pages-with-specific-http-response-header/2567/2
requests AS (
SELECT
client,
url,
# Strip any parameters (e.g. "; charset=UTF-8") so only the media type remains.
REGEXP_REPLACE(resp_headers.value, ' *;.*$', '') AS content_type
FROM
`httparchive.all.requests`,
UNNEST(response_headers) AS resp_headers
WHERE
date = '2023-08-01' AND
is_root_page AND
LOWER(resp_headers.name) = 'content-type' AND
is_main_document
)

SELECT
client,
content_type,
COUNT(url) AS count
FROM
requests
INNER JOIN
pages
USING
(client, url)
GROUP BY
client,
content_type
ORDER BY
client,
count DESC
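
The `REGEXP_REPLACE` call in the `requests` CTE can be checked in isolation before running the full query against the HTTP Archive tables. The following is a minimal sketch in BigQuery Standard SQL; the sample header values and the `header_value` alias are made up for illustration and are not taken from the dataset.

```sql
# Standalone check of the content-type normalization used above.
# The input values below are hypothetical examples, not HTTP Archive data.
SELECT
  header_value,
  # Drop everything from the first ";" onward, plus any spaces before it.
  REGEXP_REPLACE(header_value, ' *;.*$', '') AS content_type
FROM
  UNNEST([
    'text/html; charset=UTF-8',
    'text/html ;charset=utf-8',
    'application/json'
  ]) AS header_value
```

The first two values reduce to `text/html` and the third is returned unchanged. The pattern does not change letter case, so `Text/HTML` and `text/html` would still be counted as separate content types.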
1 change: 1 addition & 0 deletions sql/README.md
@@ -25,6 +25,7 @@ For additional considerations for writing BigQuery queries against HTTP Archive,
* [Counts for bfcache being enabled and disabled](./2023/10/bfcache-score-counts.sql)
* [Counts for failure reasons for which bfcache is disabled](./2023/10/bfcache-failure-reasons.sql)
* [Counts for how many pages have the Heartbeat script](./2023/10/heartbeat-script-presence.sql)
* [Counts for Content-Types used for WordPress pages](./2023/10/page-content-types.sql)

### 2023/08
