HTML API: CSS class name methods should behave according to quirks mode #7169

Closed

sirreal
Member

@sirreal sirreal commented Aug 9, 2024

Trac ticket: Core-61531

Testing changes

These are for the full HTML API test suites. There are no changes to the html5lib test suite.

- Tests: 2441, Assertions: 4122, Skipped: 423.
+ Tests: 2449, Assertions: 4135, Skipped: 423.

Description

Update the HTML Processor and Tag Processor to handle CSS classes in a case-sensitive way by default.
This aligns with "no-quirks" or "standards" mode behavior for class name handling.

Remove forced lowercasing in ::class_list.

Add a $document_mode argument to WP_HTML_Processor::create_fragment() to enable the fragment parser to be created in quirks mode.

When the HTML Processor is in no-quirks mode, add_class, remove_class and has_class operate in a case-sensitive way:

$processor = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><div class="FOO">' );
$processor->next_tag( 'DIV' );
var_dump( $processor->has_class('foo') );
// bool(false)
var_dump( $processor->has_class('FOO') );
// bool(true)

When the HTML Processor is in quirks mode (only available via the fragment parser at the moment), add_class, remove_class and has_class operate in a case-insensitive way:

$processor = WP_HTML_Processor::create_fragment('<div class="FOO">', '<body>', 'UTF-8', 'quirks-mode');
$processor->next_tag( 'DIV' );
var_dump( $processor->has_class('foo') );
// bool(true)
var_dump( $processor->has_class('FOO') );
// bool(true)

add_class and remove_class get similar treatment in quirks mode. Classes that match remove_class( $class_name ) case-insensitively will be removed. Case-insensitive duplicate classes will not be added when calling add_class( $class_name ).

The case-sensitivity is managed by adding a protected comparable_class_name method to the WP_HTML_Tag_Processor. This method is called internally in the CSS class related methods to allow subclasses to customize comparison behavior. This requires minimal changes to the implementation of class name handling in the Tag Processor.

In the Tag Processor and the HTML Processor in "no-quirks" mode, the comparable_class_name method returns the class name as-is, so case-sensitive comparison is performed. The HTML Processor in quirks mode will return the class name in ASCII lowercase so that case-insensitive comparison is performed.
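
A minimal standalone sketch of the hook described above. `Sketch_Tag_Processor` and `Sketch_Quirks_Processor` are hypothetical stand-ins, not the real WordPress classes, and storing the class names in an array is an assumption made only to keep the example self-contained.

```php
<?php
// Hypothetical stand-in for the Tag Processor; not the real WordPress class.
class Sketch_Tag_Processor {
    /** @var string[] Class names on the currently matched tag (assumed storage). */
    protected $classes;

    public function __construct( array $classes ) {
        $this->classes = $classes;
    }

    // Hook for subclasses: returns the form of a class name used for comparison.
    protected function comparable_class_name( string $class_name ): string {
        return $class_name; // As-is, so comparison is byte-for-byte.
    }

    public function has_class( string $wanted_class ): bool {
        $wanted = $this->comparable_class_name( $wanted_class );
        foreach ( $this->classes as $class_name ) {
            if ( $this->comparable_class_name( $class_name ) === $wanted ) {
                return true;
            }
        }
        return false;
    }
}

// Hypothetical quirks-mode subclass: only the comparison hook changes.
class Sketch_Quirks_Processor extends Sketch_Tag_Processor {
    protected function comparable_class_name( string $class_name ): string {
        // ASCII-only lowercasing: bytes A-Z map to a-z, all other bytes are untouched.
        return strtr( $class_name, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
    }
}

$no_quirks = new Sketch_Tag_Processor( array( 'FOO' ) );
$quirks    = new Sketch_Quirks_Processor( array( 'FOO' ) );

var_dump( $no_quirks->has_class( 'foo' ) ); // bool(false)
var_dump( $quirks->has_class( 'foo' ) );    // bool(true)
```

Because the subclass overrides a single protected method, the matching loop itself stays unchanged, which is the "minimal changes" property the description mentions.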

class_list still yields case-insensitive duplicates in all cases. This deduplication would be easy to perform in quirks mode; however, it's unclear which casing should be yielded. In the case of <div class="aaa AAA AaA">, which of the (equivalent in quirks mode) aaa class names should be yielded?


The Tag Processor and HTML Processor classes have several methods for dealing with CSS class names: class_list, add_class, remove_class and has_class.

These methods are intended to provide a CSS class selector-like interface with the class attribute.

Class name matching (CSS class selectors .className {} or getElementsByClassName( "className" )) is case sensitive in no-quirks mode and case-insensitive in quirks-mode.

The list of elements with class names classNames for a node root is the HTMLCollection returned by the following algorithm:

Let classes be the result of running the ordered set parser on classNames.
If classes is the empty set, return an empty HTMLCollection.
Return an HTMLCollection rooted at root, whose filter matches descendant elements that have all their classes in classes.

The comparisons for the classes must be done in an ASCII case-insensitive manner if root’s node document’s mode is "quirks"; otherwise in an identical to manner.

When matching against a document which is in quirks mode, class names must be matched ASCII case-insensitively; class selectors are otherwise case-sensitive, only matching class names they are identical to.
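
The comparison rule quoted above can be sketched as a single function. `sketch_class_names_match` is a hypothetical helper, not part of the HTML API; it only illustrates "ASCII case-insensitive" versus "identical to".

```php
<?php
// Hypothetical helper illustrating the quoted rule; not part of the HTML API.
function sketch_class_names_match( string $a, string $b, bool $is_quirks_mode ): bool {
    if ( ! $is_quirks_mode ) {
        return $a === $b; // "identical to" comparison.
    }

    // ASCII case-insensitive: fold only the bytes A-Z, leave everything else alone.
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    return strtr( $a, $upper, $lower ) === strtr( $b, $upper, $lower );
}

var_dump( sketch_class_names_match( 'Hero', 'hero', false ) ); // bool(false)
var_dump( sketch_class_names_match( 'Hero', 'hero', true ) );  // bool(true)

// Non-ASCII letters are never folded, even in quirks mode.
var_dump( sketch_class_names_match( "\u{C9}", "\u{E9}", true ) ); // bool(false)
```

The last call shows why the rule is stated as ASCII case-insensitive: "É" and "é" differ only by case in Unicode terms, but they are distinct class names in both modes.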


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

dmsnell and others added 7 commits July 19, 2024 17:11
Allow quirks mode to be set before document processing begins.
This is necessary for has_class to work properly.
This could be put into a protected method or the class sensitivity could
be a parameter if desired.
Subsequent changes introduced document_mode instead of compat_mode
@sirreal sirreal force-pushed the html-api/css-class-name-method-audit branch from a50e262 to 1435cf9 Compare August 13, 2024 09:04

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@@ -4645,7 +4675,7 @@ public function remove_class( $class_name ): bool {
* @return bool|null Whether the matched tag contains the given class name, or null if not matched.
*/
public function has_class( $wanted_class ): ?bool {
- return $this->is_virtual() ? null : parent::has_class( $wanted_class );
+ return $this->is_virtual() ? false : parent::has_class( $wanted_class );
Member Author

This appears to be a small bug and isn't necessary in this PR. I'd be happy to make another PR if desired. is_virtual would suggest we're stopped on a tag, but the tag can't have any attributes. This should return false.

Member

hm. I would consider this more of a bug in the type signature. with active format reconstruction it is possible for virtual nodes to have a class, but the null was meant to convey exactly what you inferred: "This tag can have no classes" rather than "this tag has no classes."

Member

Bug introduced in #6753

Member Author

That seems different from what the tag processor does, where "not on a tag" returns null, "on a tag" always returns false with no special handling for tags with no classes.

/**
 * Returns if a matched tag contains the given ASCII case-insensitive class name.
 *
 * @since 6.4.0
 *
 * @param string $wanted_class Look for this CSS class name, ASCII case-insensitive.
 * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
 */
public function has_class( $wanted_class ): ?bool {
    if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
        return null;
    }
    $wanted_class = strtolower( $wanted_class );
    foreach ( $this->class_list() as $class_name ) {
        if ( $class_name === $wanted_class ) {
            return true;
        }
    }
    return false;
}

Member Author

Bug introduced in #6753

I think this was a confusion, there is no bug with types and #6753 did not introduce a bug here that I'm aware of. This is simply a question of what null and false mean here as return values.

Member

thanks for following up on these comments. null has been used to mean "cannot potentially contain that attribute or any others." maybe what we need is a comment update since we created virtual nodes.

at some point, if we allow reading these this might change, as nodes created during active format reconstruction contain the attributes of their original tags. for now, though, I prefer having a distinction between "this tag does not have this class" and "it's not possible to answer if this tag has this class"
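
A small sketch of the three-way distinction discussed in this thread. `sketch_has_class` and `sketch_describe` are hypothetical helpers; passing `null` for the class list is an assumed way to model "this token cannot carry attributes" and is not how the HTML Processor represents virtual nodes.

```php
<?php
// Hypothetical helper: $classes is null when the current token cannot carry attributes.
function sketch_has_class( ?array $classes, string $wanted_class ): ?bool {
    if ( null === $classes ) {
        return null; // The question cannot be answered.
    }
    return in_array( $wanted_class, $classes, true );
}

// Callers need a strict null check; `! $result` would conflate null with false.
function sketch_describe( ?bool $result ): string {
    if ( null === $result ) {
        return 'cannot say';
    }
    return $result ? 'has class' : 'does not have class';
}

echo sketch_describe( sketch_has_class( null, 'foo' ) ), "\n";           // cannot say
echo sketch_describe( sketch_has_class( array( 'bar' ), 'foo' ) ), "\n"; // does not have class
echo sketch_describe( sketch_has_class( array( 'foo' ), 'foo' ) ), "\n"; // has class
```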

Comment on lines -396 to -397
* `QUIRKS_MODE` impacts many styling-related aspects of an HTML document, but
* none of the other changes modifies how the HTML is parsed or selected.
Member Author

The lines immediately above this about P > TABLE handling in quirks/no-quirks modes seem contradictory. That's directly related to tree-construction.

Member

I've updated the wording. It indeed was self-contradictory

* @return static|null The created processor if successful, otherwise null.
*/
- public static function create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' ) {
+ public static function create_fragment( $html, $context = '<body>', $encoding = 'UTF-8', $document_mode = WP_HTML_Processor_State::NO_QUIRKS_MODE ) {
Member Author

A fragment parser would typically inherit the context element's document's compatibility mode.

I suspect we'll need to change how the context element is passed, but I don't think it will include information about its document mode and this argument will remain helpful.

See #7141.

Member

do you think this is worth supporting right now? do we have enough warrant to include it? in what cases would we want to create a document fragment in quirks mode, and in those cases, how would we know?

I can definitely see value in having this, but also I wonder if this inclusion will help people understand better what they need to be doing or add confusion.

what do you think the consequences would be of simply not supporting a quirks-mode fragment parser? or at least of not having it in the primary function signature for creating the class?

Member Author

This was helpful and important for testing while working on these changes.

In what cases would we want to create a document fragment in quirks mode, and in those cases, how would we know?

The fragment should use the same mode as the document for the context element. This should become clear when we work on set_{inner,outer}_html where we'll be creating fragment parsers that use the parent parser's document mode.

Another option would be to adjust the full parser to handle doctype declarations and set the document_mode correctly according to the full HTML document. Then we could certainly omit these changes.

I'm also reluctant to eagerly add more to this method signature.

Member Author

what do you think the consequences would be of simply not supporting a quirks-mode fragment parser?

If the full parser supports quirks mode and we have set_{inner,outer}_html methods, I think the fragment parser must support quirks mode.

or at least of not having it in the primary function signature for creating the class?

The mode should be based on the context element's document's mode. There's no way to provide that information to the fragment parser right now. The context element is currently passed to the fragment parser as an HTML-like string (only <body> is allowed right now) which seems insufficient to pass all the information the fragment parser requires.

There are other ways to handle this, for example an instance method could create a fragment from a node and set quirks mode appropriately as well as handle things like reading attributes, namespace, etc. This is all best discussed in #7141.

I am going to explore a change to handle quirks mode in the full parser based on the doctype declaration, that would be sufficient and we could remove this change.

Member

I could see some value in having a method like ->set_quirks_mode()

I still suspect that it'll be basically universal that we don't work in the full parser mode within WordPress. almost no HTML-processing code has access to the full document, and so that's what I meant when assuming no-quirks mode. We just don't know even when doing inner_html operations if the parent document is in quirks mode or not. "We can only assume UTF-8, no-quirks, <body> context."

Member Author

There may be value in allowing the compat mode to be changed on a processor. I'd like to leave those changes for consideration in their own PR. There was no way to change the compat mode before and I don't think we need to address that in this PR.

@sirreal sirreal changed the title HTML API: CSS classname WIP HTML API: CSS class name methods should behave according to quirks mode Aug 13, 2024
@sirreal sirreal marked this pull request as ready for review August 13, 2024 15:20

github-actions bot commented Aug 13, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal
Member Author

sirreal commented Aug 21, 2024

I plan to revisit this when #7195 is complete. That will allow us to test with quirks mode using the full parser and not need to change the fragment factory method signature.

Full documents can be created in quirks mode now. There's no need to
introduce quirks mode to the fragment parser or change its signature in
order to test the quirks mode changes.
Quirks mode changes the behavior of CSS class functions, namely whether they
are ASCII case-insensitive class name matches or byte-for-byte
comparisons.

It makes sense to move quirks mode into the tag processor so that it can
deal with this correctly.
@sirreal sirreal requested a review from dmsnell September 2, 2024 18:17
@sirreal
Member Author

sirreal commented Sep 2, 2024

I think the concerns have been addressed and this is ready for another review.

* @todo When reconstructing active formatting elements with attributes, find a way
* to indicate if the virtually-reconstructed formatting elements contain the
* wanted class name.
*
* @param string $wanted_class Look for this CSS class name, ASCII case-insensitive.
* @return bool|null Whether the matched tag contains the given class name, or null if not matched.
*/
Member

@sirreal I've reverted this change so we can consider it separately. I know it will be important during active format reconstruction, but I think that null is a kind of partially-implemented escape hatch that communicates that this isn't supported rather than indicating that a class definitively doesn't exist on the tag.

$this->assertSame( '<span class="UPPER">', $processor->get_updated_html() );

$processor->add_class( 'ANOTHER-UPPER' );
$this->assertSame( '<span class="UPPER another-upper">', $processor->get_updated_html() );
Member

I was hoping that here we could respect the given casing.

@dmsnell
Member

dmsnell commented Sep 4, 2024

@sirreal I've updated some comments and changed the behavior of add_class() and remove_class() so that they preserve the casing of the first-provided class name for all lexical variations. I'll entertain ideas on this, but it's how we handle attributes, and I think it's the most respectful we can be to developers who are setting values when casing doesn't matter.

it's really quite a surprise, and I hope quirks mode is almost never used, given the conflict between CSS selectors matching the class attribute value, and those given as class name selectors.
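
A rough sketch of the casing-preserving behavior described in this comment, under the assumption that the class list can be modeled as a plain array. `sketch_ascii_lower` and `sketch_add_class` are hypothetical helpers, not the methods merged in this PR.

```php
<?php
// Hypothetical helper: ASCII-only lowercasing.
function sketch_ascii_lower( string $name ): string {
    return strtr( $name, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

// Hypothetical helper: appends $new_class unless a comparably-equal name is already present.
function sketch_add_class( array $classes, string $new_class, bool $is_quirks_mode ): array {
    $comparable = $is_quirks_mode ? sketch_ascii_lower( $new_class ) : $new_class;

    foreach ( $classes as $existing ) {
        $existing_comparable = $is_quirks_mode ? sketch_ascii_lower( $existing ) : $existing;
        if ( $existing_comparable === $comparable ) {
            return $classes; // Already present: the first-seen casing wins.
        }
    }

    $classes[] = $new_class; // Stored with the casing the caller provided.
    return $classes;
}

// Quirks mode: the new class keeps its casing, and a case variant is not added again.
$quirks = sketch_add_class( array( 'UPPER' ), 'ANOTHER-UPPER', true );
$quirks = sketch_add_class( $quirks, 'upper', true );
echo implode( ' ', $quirks ), "\n"; // UPPER ANOTHER-UPPER

// No-quirks mode: case variants are distinct class names.
$strict = sketch_add_class( array( 'UPPER' ), 'upper', false );
echo implode( ' ', $strict ), "\n"; // UPPER upper
```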

Member

@dmsnell dmsnell left a comment

@sirreal I'm going to merge this because I anticipate you would prefer it (once the tests all pass), but if you disagree with the changes I made we can revisit.

@dmsnell
Member

dmsnell commented Sep 4, 2024

@sirreal I've also moved the Trac ticket reference to the top of the description to make it easier to find, and used the shorthand notation instead of the full link.

pento pushed a commit that referenced this pull request Sep 4, 2024
The HTML API has been behaving as if CSS class name selectors matched class names in an ASCII case-insensitive manner. This is only true if the document in question is set to quirks mode. Unfortunately most documents processed will be set to no-quirks mode, meaning that some CSS behaviors have been matching incorrectly when provided with case variants of class names.

In this patch, the CSS methods have been audited and updated to adhere to the rules governing ASCII case sensitivity when matching classes. This includes `add_class()`, `remove_class()`, `has_class()`, and `class_list()`. Now, it is assumed that a document is in no-quirks mode unless a full HTML parser infers quirks mode, and these methods will treat class names in a byte-for-byte manner. Otherwise, when a document is in quirks mode, the methods will compare the provided class names against existing class names for the tag in an ASCII case insensitive way, while `class_list()` will return a lower-cased version of the existing class names.

The lower-casing in `class_list()` is performed for consistency, since it's possible that multiple case variants of the same comparable class name exist on a tag in the input HTML.

Developed in #7169
Discussed in https://core.trac.wordpress.org/ticket/61531

Props dmsnell, jonsurrell.
See #61531.


git-svn-id: https://develop.svn.wordpress.org/trunk@58985 602fd350-edb4-49c9-b593-d223f7449a82
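
A hedged sketch of what the commit message describes for `class_list()` in quirks mode. `sketch_class_list` is a hypothetical function operating on a raw attribute string; splitting on whitespace with `preg_split` and skipping repeated names are assumptions made for illustration, not a copy of the merged implementation.

```php
<?php
// Hypothetical generator over the class names in a raw class attribute value.
function sketch_class_list( string $class_attribute, bool $is_quirks_mode ): Generator {
    $seen  = array();
    $names = preg_split( '/[ \t\f\r\n]+/', $class_attribute, -1, PREG_SPLIT_NO_EMPTY );

    foreach ( $names as $name ) {
        if ( $is_quirks_mode ) {
            // ASCII-only lowercasing so that case variants collapse to one name.
            $name = strtr( $name, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
        }
        if ( isset( $seen[ $name ] ) ) {
            continue; // Assumed: each comparable name is yielded once.
        }
        $seen[ $name ] = true;
        yield $name;
    }
}

$quirks    = iterator_to_array( sketch_class_list( 'aaa AAA AaA', true ), false );
$no_quirks = iterator_to_array( sketch_class_list( 'aaa AAA AaA', false ), false );

echo implode( ' ', $quirks ), "\n";    // aaa
echo implode( ' ', $no_quirks ), "\n"; // aaa AAA AaA
```

Lower-casing sidesteps the question raised in the PR description about which casing to yield for `<div class="aaa AAA AaA">`: every variant maps to the same output.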
@dmsnell
Member

dmsnell commented Sep 4, 2024

Merged in [58985]
fb40fe9

@dmsnell dmsnell closed this Sep 4, 2024
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Sep 4, 2024

Built from https://develop.svn.wordpress.org/trunk@58985


git-svn-id: http://core.svn.wordpress.org/trunk@58381 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to gilzow/wordpress-performance that referenced this pull request Sep 4, 2024

Built from https://develop.svn.wordpress.org/trunk@58985


git-svn-id: https://core.svn.wordpress.org/trunk@58381 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@sirreal sirreal deleted the html-api/css-class-name-method-audit branch September 4, 2024 14:26
aslamdoctor pushed a commit to aslamdoctor/wordpress-develop that referenced this pull request Dec 28, 2024