Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ImageCDN parsing to rely on HTML API instead of RegExps #32700

Merged
merged 10 commits into from
Aug 28, 2024

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Aug 26, 2023

Status

This is a work in progress and isn't tested or verified. This has been reviewed, but the filters aren't tested because it's unclear what code might rely on them.

Due to the change in indentation the diff view is more associated with the actual changes if ignoring whitespace.

Proposed changes:

The introduction of the HTML API into WordPress 6.2 offers a new method of matching and modifying HTML. In this patch we're replacing code that attempts to parse the input HTML and extract images that are direct children of an anchor ("A" tag), then read and modify them based on the values of their attributes and computed Photon properties.

In the previous code the Image_CDN class scanned the entire HTML document to generate a list of PREG image match objects, then iterated over those matches and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along the way, and making the appropriate modifications. Extra care is taken to ensure that only images that are the single child of a link are matched.

In this change the values of the tag key in some of the filters has changed from the initial matched HTML snippet to the name of the image tag, which could be IMG or AMP-IMG or AMP-ANIM. An update to the Tag Processor or a custom sub-class thereof could provide the original HTML snippet and match the existing behavior, but that hasn't been done in this patch yet given the author's uncertainty about the use and value of those snippets.

Other information:

  • Have you written new tests for your changes, if applicable?
  • Have you checked the E2E test CI results, and verified that your changes do not break them?
  • Have you tested your changes on WordPress.com, if applicable (if so, you'll see a generated comment below with a script to run)?

Jetpack product discussion

This mandates running on WordPress 6.2. If we prefer being able to run on older versions, we could also include a copy of the Tag Processor in the plugin.

Does this pull request change what data or activity we track or use?

No.

Testing instructions:

Test suite should pass. Manual review is necessary though.
I don't even know what all the function I modified is supposed to do fully, so code auditing and review of the modifications is critical.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 26, 2023

Are you an Automattician? Please test your changes on all WordPress.com environments to help mitigate accidental explosions.

  • To test on WoA, go to the Plugins menu on a WordPress.com Simple site. Click on the "Upload" button and follow the upgrade flow to be able to upload, install, and activate the Jetpack Beta plugin. Once the plugin is active, go to Jetpack > Jetpack Beta, select your plugin, and enable the image-cdn/rely-on-html-api branch.

  • To test on Simple, run the following command on your sandbox:

    bin/jetpack-downloader test jetpack image-cdn/rely-on-html-api
    

Interested in more tips and information?

  • In your local development environment, use the jetpack rsync command to sync your changes to a WoA dev blog.
  • Read more about our development workflow here: PCYsg-eg0-p2
  • Figure out when your changes will be shipped to customers here: PCYsg-eg5-p2

@github-actions github-actions bot added [Feature] Photon aka "Image CDN". Feature developed in the Image CDN package and shipped in multiple plugins [Package] Image CDN [Status] In Progress labels Aug 26, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 26, 2023

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

  • ✅ Include a description of your PR changes.
  • ✅ Add a "[Status]" label (In Progress, Needs Team Review, ...).
  • ✅ Add testing instructions.
  • ✅ Specify whether this PR includes any changes to data or privacy.
  • ✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for cooperation 🤖


The e2e test report can be found here. Please note that it can take a few minutes after the e2e tests checks are complete for the report to be available.


Follow this PR Review Process:

  1. Ensure all required checks appearing at the bottom of this PR are passing.
  2. Choose a review path based on your changes:
    • A. Team Review: add the "[Status] Needs Team Review" label
      • For most changes, including minor cross-team impacts.
      • Example: Updating a team-specific component or a small change to a shared library.
    • B. Crew Review: add the "[Status] Needs Review" label
      • For significant changes to core functionality.
      • Example: Major updates to a shared library or complex features.
    • C. Both: Start with Team, then request Crew
      • For complex changes or when you need extra confidence.
      • Example: Refactor affecting multiple systems.
  3. Get at least one approval before merging.

Still unsure? Reach out in #jetpack-developers for guidance!

@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch 6 times, most recently from ba4718b to 564599c Compare August 26, 2023 01:27
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from b318fbb to dc3a6b1 Compare August 28, 2023 22:42
@github-actions github-actions bot added the Docs label Aug 29, 2023
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch 2 times, most recently from 6b724e7 to 583e9ca Compare August 29, 2023 02:28
@jeherve jeherve mentioned this pull request Aug 29, 2023
16 tasks
Copy link
Member

@jeherve jeherve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for proposing this! This is a nice way to use WP_HTML_Tag_Processor, and something we can do now that we'll be requiring WordPress 6.2 (#31638).

I have not tested your PR, but left a couple of comments below about how things work in the monorepo, if that can help you discover how we usually do things :)

projects/packages/image-cdn/src/class-image-cdn.php Outdated Show resolved Hide resolved
projects/packages/image-cdn/CHANGELOG.md Outdated Show resolved Hide resolved
@oskosk
Copy link
Contributor

oskosk commented Aug 29, 2023

@haqadn @Automattic/heart-of-gold can you make sure you keep an eye on this one?

@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from 484deb0 to ffab63f Compare August 29, 2023 16:09
@dmsnell dmsnell changed the title WIP: Refactor ImageCDN parsing to rely on HTML API instead of RegExps Refactor ImageCDN parsing to rely on HTML API instead of RegExps Aug 29, 2023
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from e51843b to 5e23682 Compare September 1, 2023 20:47
@dmsnell dmsnell marked this pull request as ready for review September 4, 2023 20:54
@dmsnell
Copy link
Member Author

dmsnell commented Sep 5, 2023

@oskosk @jeherve should I try and find someone to review this? are either of you happy with the proposed change?

@dmsnell
Copy link
Member Author

dmsnell commented Sep 5, 2023

One note I should leave here is to ask about pulling in Core's assertEqualMarkup() function. I wanted to add it to the BaseTestCase but then I saw that's an external class and wasn't sure where to add it so that it would be available all around Jetpack.

The issue here is that the test output changes but in a semantically neutral way. I could rewrite the test so that it hardcodes the newer ordering of output, but that would leave this fragility in place for the future.

Copy link
Member

@jeherve jeherve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I try and find someone to review this? are either of you happy with the proposed change?

Maybe @Automattic/heart-of-gold could take a look?

/**
* Allow specific images to be skipped by Photon.
*
* @TODO: Does this need to pass the full HTML of the image tag?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at usage of this filter in other plugins in the WordPress.org plugin directory. While most plugins do not use this third parameter, there are a few that do, sometimes to look for a specific class name in the HTML. So I think we should keep this available if possible.

*
* @see https://developer.wordpress.com/docs/photon/api/
* @TODO: What is the point of passing $images and $index. Are they required?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked around but couldn't figure out where that filter is being used today. It was used in the past, but may not be used anymore today.

haqadn
haqadn previously approved these changes Sep 12, 2023
Copy link
Contributor

@haqadn haqadn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes look good to me and no problems found after testing. 👍🏼

P.S.: I only worked on migrating from a Jetpack module to image-cdn package. I don't have prior knowledge of the implementation details on the core functionality.

@dmsnell
Copy link
Member Author

dmsnell commented Sep 13, 2023

I will reconstruct the existing filter value before merge.
Would still love to have some thoughts on the assertEquivalentMarkup() method from Core, if there's a better place to put that.

@@ -178,6 +178,51 @@ protected function helper_remove_image_sizes() {
remove_image_size( 'jetpack_soft_oversized_after_upload' );
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these functions are okay here.

Personally, I would put them in the parent class at class-image-cdn-attachment-test-case.php but, it looks like this is the only (current) test that extends this particular class.

@kraftbj
Copy link
Contributor

kraftbj commented Sep 14, 2023

Since we're late in the beta testing week, I'd say let's merge this next week after the branch cut so we get the longest amount of time to test it in the wild before the next dotorg release

dmsnell added a commit that referenced this pull request Aug 23, 2024
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell dmsnell reopened this Aug 23, 2024
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from ab212e3 to 424e68a Compare August 23, 2024 19:36
dmsnell added a commit that referenced this pull request Aug 23, 2024
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from 424e68a to 074a797 Compare August 23, 2024 19:44
dmsnell added a commit that referenced this pull request Aug 23, 2024
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from 074a797 to c7493a4 Compare August 23, 2024 20:00
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch from c7493a4 to 227be97 Compare August 24, 2024 04:30
No `assertEquivalentMarkup` exists yet, so this gets around that without
creating one.
@dilirity
Copy link
Member

dilirity commented Aug 26, 2024

I started looking into the failed tests, but I'm not sure what's going on. str_starts_with is returning incorrectly for some reason.

Any idea where that's loaded from?

Fixed broken test.

@dmsnell outside of the failing tests, is this ready for merge? I'm missing context on where it's at 😅

@dmsnell
Copy link
Member Author

dmsnell commented Aug 26, 2024

@dilirity - thanks for fixing those tests - not sure how I copy/pasted the original typo on WP_HTML_Tag_Processor, but I appreciate your help fixing it.

as far as I know it's ready to go, and it was ready to go, but I'm not really sure how to test and vet it to the level of quality at which I would have preferred (since I was jumping in to unfamiliar code and proposing a change in a system I'm not working in).

if you think this is sound (and according to the tests and a code audit it seems ready) then it should be fine to merge. had I been in a place where I could thoroughly test or watch and react to failures post-merge I would have.

Copy link
Member

@dilirity dilirity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I was using two different images. Ignore this. Though I'm not sure why only one of these images' width/height is 150 when both are set to use it.

So I did some testing and found something odd. Markup for resulted images below.

If you add an image via the Image block and use these settings:

CleanShot 2024-08-27 at 17 50 36

On the production version of Boost, it looks like this:

CleanShot 2024-08-27 at 17 51 08

<img decoding="async" width="150" height="150" src="https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash.jpg?resize=150%2C150&amp;ssl=1" alt="" class="wp-image-12" style="aspect-ratio:4/3;object-fit:cover" srcset="https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?resize=150%2C150&amp;ssl=1 150w, https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?zoom=2&amp;resize=150%2C150&amp;ssl=1 300w, https://i0.wp.com/honest-shark.jurassic.ninja/wp-content/uploads/2024/08/donnie-rosie-O7L3MrlSAHA-unsplash-scaled.jpg?zoom=3&amp;resize=150%2C150&amp;ssl=1 450w" sizes="(max-width: 150px) 100vw, 150px" data-recalc-dims="1">

However, this branch running Boost seems to ignore the size:

CleanShot 2024-08-27 at 17 51 39

<img fetchpriority="high" decoding="async" width="1707" height="2560" src="https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=150%2C150&amp;ssl=1" alt="" class="wp-image-14" style="aspect-ratio:4/3;object-fit:cover" srcset="https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?w=1707&amp;ssl=1 1707w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=200%2C300&amp;ssl=1 200w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=683%2C1024&amp;ssl=1 683w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=768%2C1152&amp;ssl=1 768w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=1024%2C1536&amp;ssl=1 1024w, https://i0.wp.com/extraordinary-gazelle.jurassic.ninja/wp-content/uploads/2024/08/filip-zrnzevic-QsWG0kjPQRY-unsplash-scaled.jpg?resize=1365%2C2048&amp;ssl=1 1365w" sizes="(max-width: 1000px) 100vw, 1000px">

I'm not sure if this is Boost related, but I don't think we've made any changes to Image CDN. Boost is running on the Free version with only Critical CSS and Image CDN enabled.

dilirity
dilirity previously approved these changes Aug 27, 2024
Copy link
Member

@dilirity dilirity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I couldn't find anything broken. Code-wise it is good and tests seem to be happy.

I think we can :shipit:

Copy link
Member

@dilirity dilirity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving after merge with trunk.

@dilirity dilirity merged commit 6dd0cf2 into trunk Aug 28, 2024
65 checks passed
gogdzl pushed a commit that referenced this pull request Oct 25, 2024
)

* Refactor ImageCDN parsing to rely on HTML API instead of RegExps (#32700)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>

* Rearrange semantically equivalent test output to avoid false negatives.

No `assertEquivalentMarkup` exists yet, so this gets around that without
creating one.

* Fix broken test

* Remove unnecessary comment

* Fix static analysis issues

* Fix static analysis issue

* Bump project version to 0.4.7-alpha

* Fix project version

---------

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
Co-authored-by: Peter Petrov <peter.petrov89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs [Feature] Photon aka "Image CDN". Feature developed in the Image CDN package and shipped in multiple plugins [Package] Image CDN [Tests] Includes Tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants