Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query to count content-type headers used for WordPress pages #74

Merged
merged 10 commits into from
Oct 26, 2023

Conversation

westonruter
Copy link
Collaborator

@westonruter westonruter commented Oct 4, 2023

In discussing with @dmsnell in the #core-html-api Slack channel about whether XHTML support should be removed from core, I wanted to do a query to see how many WordPress pages are actually served with the Content-Type of application/xhtml+xml. Even when the response body contains valid XML/XHTML markup, browsers will still use the HTML parser if the Content-Type is text/html. So we can determine whether attempting to maintain valid XHTML syntax is important for WordPress by checking how often pages are returned as application/xhtml+xml. Presuming my query is correct, in short, the answer is very few: specifically four and they are all limited to mobile. In fact, there are more pages returned as RSS or plain text than XHTML.

client content_type count
desktop text/html 4082654
desktop application/rss+xml 32
desktop text/plain 11
desktop application/x-perl 2
desktop text 2
desktop   2
desktop x-httpd-php 2
desktop texthtml 2
desktop twentyseventeen 1
desktop html_type 1
desktop charset=utf-8 1
desktop text/xml 1
desktop text/css 1
desktop another text/html 1
desktop 1 1
desktop application/atom+xml 1
desktop application/json 1
mobile text/html 5459935
mobile application/rss+xml 61
mobile text/xml 7
mobile text/plain 7
mobile application/xhtml+xml 4
mobile texthtml 3
mobile   3
mobile text 2
mobile text/HTML 2
mobile charset=utf-8 2
mobile x-httpd-php 2
mobile html_type 1
mobile application/x-perl 1
mobile 1 1
mobile Array 1
mobile text/css 1
mobile twentyseventeen 1
mobile another text/html 1
mobile utf-8 1
mobile text/htmltext/html 1
mobile application/atom+xml 1
mobile application/json 1

(Query elapsed time: 53 sec)

I checked the four URLs that were being returned as application/xhtml+xml when crawled, and they are all one is returning text/html now. The other three are returning application/xhtml+xml only when emulating a mobile device.

For reference, these are the URLs which were serving valid XHTML at crawl time:

  1. http://www.kenkomura.jp/ (XHTML only for mobile)
  2. http://www.nukumori-fukumitsu.com/ (XHTML only for mobile)
  3. http://kittokito-ichiba.co.jp/ (XHTML only for mobile)
  4. https://yakaze.jp/ (not XHTML for either desktop or mobile now)

Interestingly the sites are all Japanese.

So I think it is safe to say that WordPress can safely abandon any attempt to serve valid XML pages.

Copy link
Collaborator

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter The query looks great, only a few minor recommendations below for improvement.

There is only one quirk which I think is crucial to fix: I think we should limit results to only those where client = 'mobile', otherwise I believe we get duplicate URLs, and I assume that is currently the case. Whether mobile or desktop is used doesn't matter what we're trying to achieve here, so then I'd say we should use mobile just because the dataset is larger.

It obviously shouldn't change the ratio of the results much, but I think the numbers would be lower and more accurate.

sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
westonruter and others added 3 commits October 5, 2023 14:11
@westonruter
Copy link
Collaborator Author

There is only one quirk which I think is crucial to fix: I think we should limit results to only those where client = 'mobile', otherwise I believe we get duplicate URLs, and I assume that is currently the case. Whether mobile or desktop is used doesn't matter what we're trying to achieve here, so then I'd say we should use mobile just because the dataset is larger.

@felixarntz I've made this change to the query. I've updated the description to reflect the changes. However, doing so exposed something surprising: I re-checked the URLs when emulating a mobile device in DevTools, and actually they are only serving valid XHTML to mobile devices. For desktop, they serve HTML. So I wonder actually if these should be considered duplicates given the possibility of content negotiation based on client type?

@felixarntz felixarntz changed the title Add query to count content-types used for WordPress pages Add query to count content-type headers used for WordPress pages Oct 26, 2023
Copy link
Collaborator

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter

I re-checked the URLs when emulating a mobile device in DevTools, and actually they are only serving valid XHTML to mobile devices. For desktop, they serve HTML. So I wonder actually if these should be considered duplicates given the possibility of content negotiation based on client type?

Great point, I think in that case we should probably request data for both clients and group by that so we have the full dataset (see suggestions below). Could you please apply these changes and rerun the query to update the PR description when you get a chance?

sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
westonruter and others added 3 commits October 26, 2023 10:50
Co-authored-by: Felix Arntz <felixarntz@users.noreply.github.com>
Co-authored-by: Felix Arntz <felixarntz@users.noreply.github.com>
Co-authored-by: Felix Arntz <felixarntz@users.noreply.github.com>
@westonruter
Copy link
Collaborator Author

Could you please apply these changes and rerun the query to update the PR description when you get a chance?

@felixarntz Query re-run and results added to PR description. Indeed the results show that only mobile clients are getting served the 4 XHTML responses.

Copy link
Collaborator

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter LGTM!

Is there a Trac ticket to further discuss removing non-HTML5 support? Based on this data, certainly seems reasonable.

sql/2023/10/page-content-types.sql Outdated Show resolved Hide resolved
@felixarntz felixarntz requested a review from swissspidy October 26, 2023 19:42
@westonruter
Copy link
Collaborator Author

Is there a Trac ticket to further discuss removing non-HTML5 support? Based on this data, certainly seems reasonable.

@felixarntz No, not that I'm aware of. However, @dmsnell may be on the cusp of doing so as part of the HTML API effort, since it is somewhat the motivator for doing this in the first place.

@felixarntz felixarntz merged commit 2af7990 into main Oct 26, 2023
@westonruter westonruter deleted the add/content-type-query branch October 27, 2023 00:52
@dmsnell
Copy link

dmsnell commented Oct 29, 2023

After chatting with @azozz I think I'm going to create a Trac ticket for it, but the next few days I'll be away from my work. It's all driven by the desire to know what we're allowed to change and what we aren't. For example, WordPress arbitrarily breaks most HTML named character references because it wants to limit the allowed set to those that were in the HTML4 spec plus a few hand-picked names. For example, if you save &RoundImplies; in the database, it will be corrupted by kses and display as &RoundImplies; instead of as , because kses turns it into &amp;RoundImplies;, being unaware of the HTML5 named character references. I'd like to fix these things, but it's unclear if we can make this call since it's required by HTML5 but not by HTML4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants