WIP: HTML API: Add an XML serializer. #7408

Draft: wants to merge 15 commits into base: trunk


@dmsnell dmsnell commented Sep 21, 2024

Trac ticket: Core-62091
Built from #7331

Provides a mechanism to serialize an HTML fragment to the XML syntax. YOU PROBABLY SHOULDN'T USE THIS!!!!

REMEMBER that so-called "XHTML" served without a path ending in .xml or without the Content-type: application/xhtml+xml HTTP header will render as HTML, and ONE SHOULD NOT SERVE XML/XHTML as HTML!!!

php > var_dump( ( WP_HTML_Processor::create_fragment( '<p>an <img> is worth &AElig thousand words' ) )->serialize_to_xml() );
string(43) "<p>an <img /> is worth Æ thousand words</p>"
php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"

Extremely rare cases when it's appropriate to use this

  • Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped like <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>, but if the document can be serialized to XML it can instead be embedded directly, as in <content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>.
  • When attempting to directly embed HTML content into any other XML document without escaping it.

HTML generally cannot be expressed in XML, and according to the HTML specification, "Using the XML syntax is not recommended." Prefer escaping the HTML to avoid corruption and data loss.
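
As a rough illustration of the "prefer escaping" advice, the sketch below embeds an HTML string as escaped text in an Atom-style `content` element and reads it back unchanged. It uses Python's standard `xml.etree.ElementTree` rather than anything from this PR; the variable names are only for illustration, and the element is built in isolation rather than inside a full feed.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 namespace; the element below stands alone, outside any real feed.
ATOM_NS = "http://www.w3.org/2005/Atom"

# Deliberately sloppy HTML: unclosed tags and a semicolon-less character reference.
html_fragment = "<p>an <img> is worth &AElig thousand words"

# type="html" means the element's text is escaped HTML source.
content = ET.Element(f"{{{ATOM_NS}}}content", {"type": "html"})
content.text = html_fragment

# ElementTree escapes &, < and > in text while serializing.
xml_text = ET.tostring(content, encoding="unicode")

# Parsing the XML again yields the exact original HTML string.
restored = ET.fromstring(xml_text).text

print(xml_text)
print(restored == html_fragment)
```

Because the HTML travels as opaque text, the XML layer never has to agree with the HTML parser about how the markup should be interpreted.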

dmsnell and others added 15 commits September 11, 2024 09:37
The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces `WP_HTML_Processor::normalize( $html )` as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

Core-62036
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
If code later in the processing pipeline adds unquoted attributes
and doesn't add the requisite space following that, then another
parser might find that the solidus is part of the attribute value
instead of serving as a self-closing flag.
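
The hazard described in this commit message can be seen with any tolerant HTML tokenizer. The sketch below uses Python's standard `html.parser` as a stand-in for "another parser"; it is not the HTML API, and the class and helper names are made up for the example. In the HTML tokenizer's unquoted attribute value state a `/` is ordinary value data, so only the separating space keeps it out of the value.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records every start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <tag ... /> via the default handle_startendtag().
        self.seen.append((tag, attrs))


def first_tag(markup):
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen[0]


# No space before the solidus: it is swallowed into the unquoted value.
no_space = first_tag("<img src=photo.png/>")

# A space before the solidus keeps the attribute value intact.
with_space = first_tag("<img src=photo.png />")

print(no_space)
print(with_space)
```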

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>

github-actions bot commented Sep 21, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, siliconforks.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@hubot hubot deleted the html-api/normalize-to-xml branch September 21, 2024 00:53

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@siliconforks

php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"

Are the above examples actually right? Is the xmlns="http://www.w3.org/1999/xhtml" supposed to be on the foreignObject element like that?

Compare the above code to the example here:

https://developer.mozilla.org/en-US/docs/Web/SVG/Element/foreignObject

  • Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped like <content type="html">&lt;p&gt;yay&lt;/&gt;</content>, but if the document can be serialized into <content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>.

The above Atom example has basically the same issue - is the xmlns="http://www.w3.org/1999/xhtml" supposed to be on the content element?

Compare to the example here:

https://en.wikipedia.org/wiki/Atom_(web_standard)#Example_of_an_Atom_1.0_feed

@dmsnell
Member Author

dmsnell commented Sep 21, 2024

Thanks @siliconforks.

You're right, in that the new default namespace applies to the foreignObject itself, which isn't correct. This PR is a big WIP though - honestly I would be just as happy if it always raised an exception 🙃

But I'm still exploring and trying to understand what needs to occur and how it can be done in order to transform as safely as possible. I'll add WIP to the title.
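
The namespace problem acknowledged here can be checked with a namespace-aware XML parser. This is a minimal sketch using Python's `xml.etree.ElementTree`, not the serializer in this PR; the markup is a shortened form of the fragment output quoted above.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
XHTML = "http://www.w3.org/1999/xhtml"

# Shortened form of the serializer output under discussion.
doc = (
    f'<svg xmlns="{SVG}">'
    f'<foreignObject xmlns="{XHTML}"><p>Test</p></foreignObject>'
    "</svg>"
)

# ElementTree reports each element name as "{namespace}local-name".
tags = [element.tag for element in ET.fromstring(doc).iter()]

for tag in tags:
    print(tag)
```

The unprefixed `foreignObject` picks up the default namespace declared on its own start tag, so an XML parser sees it as an XHTML element rather than an SVG one.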

@dmsnell dmsnell changed the title HTML API: Add an XML serializer. WIP: HTML API: Add an XML serializer. Sep 21, 2024
@dmsnell dmsnell marked this pull request as draft September 21, 2024 23:21
@dmsnell
Member Author

dmsnell commented Sep 27, 2024

@siliconforks after reviewing the XML Names spec, I believe that it's still ideal to change the default namespace on the foreignObject element, but prefix that element name in the corresponding namespace. For example,

<svg><foreignObject><p>Hi</p></foreignObject></svg>

Should translate into this

<svg xmlns="http://www.w3.org/2000/svg"><svg:foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Hi</p></svg:foreignObject></svg>

and this is actually closer to why I originally reset the default namespace on the foreignObject - it lets us avoid adding the HTML namespace prefix on every tag within its contents. however, what I overlooked was that changing the default namespace affects the qualified name for the element on which it's found. this is resolved by ensuring that this element itself is prefixed.

it gets more complicated with attribute names, but only because they work differently than the element does. attribute names usually don't have a namespace, and html:title is different than a title attribute on an element in the HTML namespace. the good news is that it means we don't have to touch attributes. they are read directly as null-namespace attribute names in most circumstances. if they needed to be in a different namespace, then there's no situation in which they could exist without one.
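
The attribute behavior described above can be observed directly. The sketch below is only an illustration with Python's `xml.etree.ElementTree`; the `html` prefix and the attribute values are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# One element carrying both an unprefixed and a prefixed "title" attribute.
doc = (
    f'<p xmlns="{XHTML}" xmlns:html="{XHTML}" '
    'title="plain" html:title="prefixed"/>'
)

element = ET.fromstring(doc)

# The default namespace applies to the element name only.
print(element.tag)

# Unprefixed attributes stay in no namespace; prefixed ones are expanded.
attributes = dict(element.attrib)
print(attributes)
```

The two attributes have different expanded names, which is why the document is well-formed even though both are spelled `title`.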

@siliconforks

In your example, wouldn't you also need to bind the svg: prefix to a namespace?

Like this (adding whitespace to make it more readable):

<svg xmlns="http://www.w3.org/2000/svg">
  <svg:foreignObject xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg">
    <p>Hi</p>
  </svg:foreignObject>
</svg>

...or this:

<svg xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg">
  <svg:foreignObject xmlns="http://www.w3.org/1999/xhtml">
    <p>Hi</p>
  </svg:foreignObject>
</svg>
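
A quick check of the point about binding the prefix, again with Python's `xml.etree.ElementTree` as an assumed stand-in for any namespace-aware XML parser: the unbound form is rejected, and the two bound forms resolve to the same expanded names.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
XHTML = "http://www.w3.org/1999/xhtml"

# The svg: prefix is used but never declared.
unbound = (
    f'<svg xmlns="{SVG}"><svg:foreignObject xmlns="{XHTML}">'
    "<p>Hi</p></svg:foreignObject></svg>"
)
# Prefix declared on the integration point itself.
bound_on_element = (
    f'<svg xmlns="{SVG}"><svg:foreignObject xmlns="{XHTML}" xmlns:svg="{SVG}">'
    "<p>Hi</p></svg:foreignObject></svg>"
)
# Prefix declared once on the root element.
bound_on_root = (
    f'<svg xmlns="{SVG}" xmlns:svg="{SVG}"><svg:foreignObject xmlns="{XHTML}">'
    "<p>Hi</p></svg:foreignObject></svg>"
)


def expanded_names(markup):
    """Return every element name as "{namespace}local-name"."""
    return [element.tag for element in ET.fromstring(markup).iter()]


try:
    expanded_names(unbound)
    unbound_error = None
except ET.ParseError as error:
    # A namespace-aware parser refuses a prefix with no declaration.
    unbound_error = error

print(unbound_error)
print(expanded_names(bound_on_element))
print(expanded_names(bound_on_element) == expanded_names(bound_on_root))
```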

@dmsnell
Member Author

dmsnell commented Sep 27, 2024

@siliconforks yeah but I didn't want to show that in the code snippet. we could, for instance, eagerly add this to the start of any XML output and let it be, or add it piecemeal.

<p>Out<svg><foreignObject><p>Hi</p></foreignObject></svg>
<p
 xmlns="http://www.w3.org/1999/xhtml"
 xmlns:h="http://www.w3.org/1999/xhtml"
 xmlns:s="http://www.w3.org/2000/svg"
 xmlns:m="http://www.w3.org/1998/Math/MathML"
>Out<svg xmlns="http://www.w3.org/2000/svg"><s:foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Hi</p></s:foreignObject></svg></p>

anyway, I think this is a minor detail. the point is mainly that we can rely on resetting the default namespace on the integration points, but will likely want to prefix the integration point itself.
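
One way the rule in this comment might be sketched is shown below, in Python rather than in the PR's PHP. Everything in it is hypothetical: the tuple-based tree, the `serialize` function, the `PREFIXES` table, and treating only SVG `foreignObject` as an integration point are assumptions made for illustration, and attributes are ignored entirely.

```python
import xml.etree.ElementTree as ET
from html import escape

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Hypothetical prefix table, declared eagerly on the root element.
PREFIXES = {XHTML: "h", SVG: "s", MATHML: "m"}

# Assumption: only SVG foreignObject is treated as an integration point here.
INTEGRATION_POINTS = {(SVG, "foreignObject")}


def serialize(node, default_ns=None):
    """Serialize a (namespace, name, children) tuple; strings are text nodes."""
    if isinstance(node, str):
        return escape(node, quote=False)

    namespace, name, children = node
    attrs = ""
    if default_ns is None:
        # Root element: bind every prefix once, up front.
        attrs = "".join(f' xmlns:{p}="{uri}"' for uri, p in PREFIXES.items())

    if (namespace, name) in INTEGRATION_POINTS:
        # Prefix the element itself, reset the default namespace for its content.
        tag = f"{PREFIXES[namespace]}:{name}"
        inner_ns = XHTML
        attrs = f' xmlns="{XHTML}"' + attrs
    else:
        tag = name
        inner_ns = namespace
        if namespace != default_ns:
            attrs = f' xmlns="{namespace}"' + attrs

    body = "".join(serialize(child, inner_ns) for child in children)
    return f"<{tag}{attrs}>{body}</{tag}>"


tree = (XHTML, "p", ["Out", (SVG, "svg", [
    (SVG, "foreignObject", [(XHTML, "p", ["Hi"])]),
])])

output = serialize(tree)
resolved = [element.tag for element in ET.fromstring(output).iter()]
print(output)
print(resolved)
```

Parsing the result back with a namespace-aware parser is a cheap sanity check that each element landed in the namespace the tree asked for.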
