Add query to count `content-type` headers used for WordPress pages #74
@westonruter The query looks great, only a few minor recommendations below for improvement.
There is only one quirk which I think is crucial to fix: I think we should limit results to only those where `client = 'mobile'`, otherwise I believe we get duplicate URLs, and I assume that is currently the case. Whether mobile or desktop is used doesn't matter for what we're trying to achieve here, so I'd say we should use mobile just because the dataset is larger.
It obviously shouldn't change the ratio of the results much, but I think the numbers would be lower and more accurate.
Co-authored-by: Felix Arntz <felixarntz@users.noreply.github.com>
@felixarntz I've made this change to the query. I've updated the description to reflect the changes. However, doing so exposed something surprising: I re-checked the URLs when emulating a mobile device in DevTools, and actually they are only serving valid XHTML to mobile devices. For desktop, they serve HTML. So I wonder actually if these should be considered duplicates given the possibility of content negotiation based on client type?
> I re-checked the URLs when emulating a mobile device in DevTools, and actually they are only serving valid XHTML to mobile devices. For desktop, they serve HTML. So I wonder actually if these should be considered duplicates given the possibility of content negotiation based on client type?
Great point, I think in that case we should probably request data for both clients and group by that so we have the full dataset (see suggestions below). Could you please apply these changes and rerun the query to update the PR description when you get a chance?
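As a rough illustration of the grouping suggested here, the following sketch counts pages per client and content type. The `rows` data and the `count_by_client` helper are hypothetical stand-ins, not the actual query or crawl data from this PR.

```python
from collections import Counter

# Hypothetical (client, url, content_type) records; not real crawl data.
rows = [
    ("desktop", "https://a.example/", "text/html"),
    ("mobile", "https://a.example/", "application/xhtml+xml"),
    ("desktop", "https://b.example/", "text/html"),
    ("mobile", "https://b.example/", "text/html"),
]


def count_by_client(rows):
    """Count pages per (client, content_type) pair instead of per content type alone."""
    return Counter((client, content_type) for client, _url, content_type in rows)


counts = count_by_client(rows)
for (client, content_type), n in sorted(counts.items()):
    print(client, content_type, n)
```

Grouping by client keeps the same URL visible once per client, so a site that serves different content types to mobile and desktop shows up in both groups rather than being collapsed into one.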
Co-authored-by: Felix Arntz <felixarntz@users.noreply.github.com>
@felixarntz Query re-run and results added to PR description. Indeed the results show that only mobile clients are getting served the 4 XHTML responses.
@westonruter LGTM!
Is there a Trac ticket to further discuss removing non-HTML5 support? Based on this data, certainly seems reasonable.
@felixarntz No, not that I'm aware of. However, @dmsnell may be on the cusp of doing so as part of the HTML API effort, since it is somewhat the motivator for doing this in the first place.
After chatting with @azozz I think I'm going to create a Trac ticket for it, but the next few days I'll be away from my work. It's all driven by the desire to know what we're allowed to change and what we aren't. For example, WordPress arbitrarily breaks most HTML named character references because it wants to limit the allowed set to those that were in the HTML4 spec plus a few hand-picked names. For example, if you save
In discussing with @dmsnell in the #core-html-api Slack channel whether XHTML support should be removed from core, I wanted to run a query to see how many WordPress pages are actually served with a `Content-Type` of `application/xhtml+xml`. Even when the response body contains valid XML/XHTML markup, browsers will still use the HTML parser if the `Content-Type` is `text/html`. So we can determine whether attempting to maintain valid XHTML syntax is important for WordPress by checking how often pages are returned as `application/xhtml+xml`. Presuming my query is correct, in short, the answer is very few: specifically four, and they are all limited to mobile. In fact, there are more pages returned as RSS or plain text than XHTML.

(Query elapsed time: 53 sec)
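To make the parser distinction concrete, here is a minimal sketch of classifying a response by its `Content-Type` header. The `media_type` and `parser_for` helpers are hypothetical, and the mapping is deliberately simplified: it only distinguishes the two media types discussed here and ignores other XML types and content sniffing.

```python
def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def parser_for(content_type_header: str) -> str:
    """Simplified guess at which parser a browser would pick for a page."""
    mt = media_type(content_type_header)
    if mt == "application/xhtml+xml":
        return "xml"
    if mt == "text/html":
        # Used even when the body happens to be well-formed XHTML.
        return "html"
    return "other"


print(parser_for("text/html; charset=UTF-8"))
print(parser_for("Application/XHTML+XML"))
```

The point of the sketch is that the decision depends on the header alone, which is why counting header values is enough to estimate how many pages actually rely on XML parsing.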
I checked the four URLs that were being returned as `application/xhtml+xml` when crawled, and one is returning `text/html` now. The other three are returning `application/xhtml+xml` only when emulating a mobile device.

For reference, these are the URLs which were serving valid XHTML at crawl time:
Interestingly the sites are all Japanese.
So I think it is safe to say that WordPress can abandon any attempt to serve valid XML pages.