[Website] 100% reliable deployments #1821

adamziel · 2024-09-28T18:08:04Z

Playground deployments require clearing the cache and the service worker way too often. Let's ensure a high standard of stability. All deployments should always work on all browsers without having to clear the cache.

[Website] Invalidate all the cache layers, correctly ship the new WordPress releases #1774
Webapp upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. #1822
Refreshless website deployments – load remote.html using the network-first strategy #1849

Done is

We have an E2E suite that tests a Playground website deployment from a very old version to a new version, and ensures the following things work:

New webapp runs on the first page visit.
Offline mode.
Direct visits and iframe embeds.
No cache layer should yield stale data (HTTP, service worker, OfflineCache, other Cache instances, etc.)
The open tabs that can be reloaded without data loss are reloaded. Other open tabs are left alone.
Safari, Chrome, Firefox, mobile browsers.

There should be no intermittent failures, stale fetch() responses, or problems with stale service workers.

Root cause of the problem

Two reasons are at play:

Dependency graphs
Caching

Dependency graphs

Deploying a new Playground version does two things:

Publishes new assets on playground.wordpress.net
Deletes old assets

If the previous version of Playground is still running, it will attempt to fetch the old assets – and fail.

This wasn't a big deal a few months ago, since a page reload would solve it, but then we introduced offline support in #1483.

Caching

The offline support keeps a copy of all the accessed old assets until the new service worker is installed. This might take 24 hours or sometimes longer! During that time, visiting playground.wordpress.net would load the cached index.html file and the rest of the stale dependency graph from the previous Playground release. Since some files are only loaded on demand, we'd get a mixture of cached assets and network errors – effectively putting the app in an undefined state.
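
To make the failure mode concrete, here is a minimal sketch of a stale tab resolving the old dependency graph after a deployment. It is a simplified model, not Playground code: the function name, the file names, and the use of plain sets in place of the offline cache and the server are all assumptions made for illustration.

```typescript
// Hypothetical model of one stale tab after a new deployment.
// All file names below are made up for illustration.

type LoadResult =
  | { path: string; source: "offline-cache" }
  | { path: string; source: "network" }
  | { path: string; source: "network-error" };

/**
 * Resolves every asset the OLD dependency graph asks for.
 * Cached assets are served from the offline cache; everything else goes to
 * the server, which only hosts the NEW release after the deployment.
 */
function loadOldDependencyGraph(
  requestedPaths: string[],
  cachedPaths: Set<string>,
  pathsOnServer: Set<string>
): LoadResult[] {
  return requestedPaths.map((path): LoadResult => {
    if (cachedPaths.has(path)) return { path, source: "offline-cache" };
    if (pathsOnServer.has(path)) return { path, source: "network" };
    return { path, source: "network-error" };
  });
}

// The previous release cached its entry points, but not the lazily loaded file.
const staleOfflineCache = new Set(["/index.html", "/assets/main-old.js"]);
// The deployment deleted the old assets and published new ones.
const deployedPaths = new Set([
  "/index.html",
  "/assets/main-new.js",
  "/assets/php-new.js",
]);

const staleTabLoads = loadOldDependencyGraph(
  ["/index.html", "/assets/main-old.js", "/assets/php-old.js"],
  staleOfflineCache,
  deployedPaths
);
console.log(staleTabLoads);
```

The first two requests are answered from the offline cache, while the on-demand file that was never cached ends in a network error. That is the mixture of cached assets and failures described above.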

The solution

I’m 95% convinced we must always force Playground to switch to the new service worker.

However, this would require refreshing all the open tabs and would trash any temporary Playgrounds.

Therefore, we might have to store all Playgrounds in OPFS, even the temporary ones. To maintain good UX, we'd add a cleanup mechanism to hide the "stored temporary" Playgrounds after a regular page refresh, and we'd keep them visible after a page refresh triggered by a new Playground release. We could also add a “Recently archived” button to recover anything archived during the last 24 hours – cc @jarekmorawski for thoughts.
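
The visibility rules proposed above could look roughly like the sketch below. Every name in it (`StoredSite`, `RefreshReason`, `archivedAt`, the helper functions) is hypothetical. The issue only describes the intended behaviour, so treat this as one possible reading rather than a design.

```typescript
// Hypothetical data model; the issue does not specify one.
type RefreshReason = "user-refresh" | "new-release";

interface StoredSite {
  slug: string;
  temporary: boolean;
  archivedAt?: number; // epoch milliseconds, set when a temporary site is hidden
}

// "Recently archived" covers the last 24 hours, as suggested in the issue.
const RECOVERY_WINDOW_MS = 24 * 60 * 60 * 1000;

/**
 * A regular refresh hides (archives) temporary sites.
 * A refresh triggered by a new release leaves every site untouched.
 */
function afterRefresh(
  sites: StoredSite[],
  reason: RefreshReason,
  now: number
): StoredSite[] {
  if (reason === "new-release") return sites;
  return sites.map((site) =>
    site.temporary && site.archivedAt === undefined
      ? { ...site, archivedAt: now }
      : site
  );
}

/** Sites shown in the regular site list. */
function visibleSites(sites: StoredSite[]): StoredSite[] {
  return sites.filter((site) => site.archivedAt === undefined);
}

/** Sites that the "Recently archived" button could still recover. */
function recentlyArchived(sites: StoredSite[], now: number): StoredSite[] {
  return sites.filter(
    (site) =>
      site.archivedAt !== undefined &&
      now - site.archivedAt <= RECOVERY_WINDOW_MS
  );
}

const storedSites: StoredSite[] = [
  { slug: "my-saved-site", temporary: false },
  { slug: "scratch", temporary: true },
];

const afterRelease = afterRefresh(storedSites, "new-release", 0);
const afterUserRefresh = afterRefresh(storedSites, "user-refresh", 0);

console.log(visibleSites(afterRelease).map((site) => site.slug));
console.log(visibleSites(afterUserRefresh).map((site) => site.slug));
console.log(recentlyArchived(afterUserRefresh, 60_000).map((site) => site.slug));
```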

Solutions without a forced page refresh

I couldn't find any solution that would keep the Playground site working without a forced page refresh:

cc @brandonpayton @bgrgicak

…r tabs to prevent fatal errors after new deployments. (#1822)

## Motivation for the change, related issues

Solves fatal Playground breakages after new version deployments by adopting the following version upgrade protocol:

* Playground version is upgraded as early as possible after a new release
* HTTP cache is skipped when fetching new assets
* Stale Playground tabs are forcibly refreshed

Related to #1821

## The problem

Playground was affected by HTTP caching and ended up loading assets from both the old release and the new release. This broke the app's dependency graph and led to fatal errors.

See the visualisation below. When Playground v184 is released, the app will only work properly if all the loaded assets come from v184:

![371781531-608a780e-60b8-4ed4-969a-d7497c7500a7](https://github.com/user-attachments/assets/605cba58-8eba-4fdb-b527-8c1f6942ce24)

## The solution

This PR ensures the HTTP cache is skipped for assets that are cached offline. This isn't perfect, as the browser will sometimes download the same file twice, but it's much better than breaking the app. We'll explore making the most out of both cache layers in the future.

Here's a rundown of the caching strategy implemented in this PR:

* Playground version is upgraded as early as possible after a new release
* HTTP cache is skipped

### Playground version is upgraded as early as possible after a new release

New service workers call .skipWaiting(), immediately claim all the clients that were controlled by the previous service worker, and forcibly refresh them.

Why? Because Playground fetches new resources asynchronously and on demand. However, deploying a new version of the webapp destroys the resources referenced in the previous version. Therefore, we can't allow the previous version to run when a new version becomes available.

#### Push notifications

It would be supremely useful to proactively notify the webapp after a fresh deployment. Playground doesn't do that yet, but it likely will in the future.

### HTTP cache is skipped

Playground relies on the **Cache only** strategy. It loads assets from the network, caches them, and serves them from the cache. The assumption is that all network requests yield the most recent version of the remote file. This helps us avoid the HTTP cache problem.

#### Cache layers

We're dealing with the following cache layers:

* HTTP cache in the browser
* CacheStorage in the service worker
* Edge Cache on playground.wordpress.net

#### HTTP cache in the browser

This service worker skips the browser HTTP cache for all network requests. This is because the HTTP cache caused a particularly nasty problem in Playground deployments.

Installing a new service worker purged the CacheStorage and requested a new set of assets from the network. However, some of these requests were served from the HTTP cache. As a result, Playground would start loading a mix of old and new assets and quickly error out. What made it worse is that this broken state was cached in CacheStorage, breaking Playground for weeks until the cache was refreshed.

See #1822 for more details.

#### CacheStorage in the service worker

This service worker uses a **Cache only** strategy to ensure all the loaded assets come from the same webapp build.

The **Cache only** strategy means Playground only loads each asset from the network once, caches it, and serves it from the cache from that point on.

The only times Playground reaches out to the network are:

* Before the service worker is installed.
* When the service worker is being activated.
* When a CacheStorage cache miss occurs.

#### Edge Cache on playground.wordpress.net

The remote server (playground.wordpress.net) has an Edge Cache that's populated with all static assets on every webapp deployment. All the assets served by playground.wordpress.net at any point in time come from the same build and are consistent with each other. The deployment process is atomic-ish, so the server should never expose a mix of old and new assets.

However, what if a new webapp version is deployed right when someone has downloaded 10 out of 27 static assets required to boot Playground? Right now, they'd end up in an undefined state and likely see an error. Then, on a page refresh, they'd pick up a new service worker that would purge the stale assets and boot the new webapp version. This is not a big problem for now, but it's also not the best user experience.

This can eventually be solved with push notifications. A new deployment would notify all the active clients to upgrade and pick up the new assets.

## Other changes

In addition, this PR:

* Adds E2E tests for app deployments and offline mode
* Updates CI to run Playwright tests:
  * Firefox
  * Safari
  * Chrome
* Fixes a few paper cuts
  * Fixed: Boot halted when OPFS isn't available due to error/success hooks never running (4e0ef74)
  * Fixed: "Save in this browser" option stays available even when there's no OPFS support (f6225a9)

## Paths not taken

* Relying on build-time hashes in the filenames for all caching. We can't rely on that for the most important routes: `/`, `/index.html`, `/remote.html`, `/sw.js` – they need stable URLs for multiple reasons.
* A different caching strategy, such as [network falling back to cache](https://web.dev/articles/offline-cookbook#network-falling-back-to-cache).

## Caveats and follow-up work

* Let's find a way to leverage HTTP cache without breaking the offline cache.
* There's no way to recover from a deployment happening during a page load – let's fix it.
* A new service worker forcibly reloads other browser tabs and destroys their in-memory context. Let's solve it by storing temporary sites in OPFS.

## Testing Instructions (or ideally a Blueprint)

CI. Yes, it sounds like a lame testing plan for such a profound change. However, almost none of these changes can be tested in a local dev environment, and a large part of this work was about covering app deployment in our E2E tests.

If you want to try these tests locally and see what they do, you'll need this special setup:

```bash
$ npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
$ npx playwright test --config=packages/playground/website/playwright/playwright.config.ts ./packages/playground/website/playwright/e2e/deployment.spec.ts --ui
```

## Related resources

* PR that turned off HTTP caching: #1822
* Exploring all the cache layers: #1774
* Cache only strategy: https://web.dev/articles/offline-cookbook#cache-only
* Service worker caching and HTTP caching: https://web.dev/articles/service-worker-caching-and-http-caching

---------

Co-authored-by: Bero <berislav.grgicak@gmail.com>
Co-authored-by: Brandon Payton <brandon@happycode.net>
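
The HTTP cache problem described in the commit message above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the service worker's actual code: the layers are modelled as plain maps from asset path to build number, the paths are illustrative, and build 183 stands in for "the previous release" (only v184 is named in the text). In a real service worker, one standard way to bypass the HTTP cache is the Fetch API's `cache` request option. Whether the PR uses that option is not stated here.

```typescript
// Hypothetical model: each layer maps an asset path to the build number
// of the copy it holds. Paths and build numbers are illustrative only.
type Layer = Map<string, number>;

/**
 * Simulates a freshly installed service worker filling an empty CacheStorage.
 * With skipHttpCache = false, a copy in the HTTP cache shadows the server.
 * With skipHttpCache = true, every asset comes straight from the server.
 */
function populateCacheStorage(
  paths: string[],
  httpCache: Layer,
  server: Layer,
  skipHttpCache: boolean
): Layer {
  const cacheStorage: Layer = new Map();
  for (const path of paths) {
    const fromHttpCache = skipHttpCache ? undefined : httpCache.get(path);
    const build = fromHttpCache ?? server.get(path);
    if (build !== undefined) cacheStorage.set(path, build);
  }
  return cacheStorage;
}

/** True when every asset in the layer comes from the same build. */
function isConsistent(layer: Layer): boolean {
  return new Set(layer.values()).size <= 1;
}

const assetPaths = ["/remote.html", "/worker.js", "/php.js"];
// The server only exposes the new build.
const serverLayer: Layer = new Map(
  assetPaths.map((path): [string, number] => [path, 184])
);
// The browser's HTTP cache still holds one asset from the previous release.
const httpCacheLayer: Layer = new Map([["/worker.js", 183]]);

const throughHttpCache = populateCacheStorage(assetPaths, httpCacheLayer, serverLayer, false);
const skippingHttpCache = populateCacheStorage(assetPaths, httpCacheLayer, serverLayer, true);

console.log(isConsistent(throughHttpCache)); // false: builds 183 and 184 are mixed
console.log(isConsistent(skippingHttpCache)); // true: everything comes from build 184
```

The inconsistent result in the first case is what then got persisted in CacheStorage. That is why the broken state could outlive a single page load.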

…first strategy (#1849)

## Motivation for the change, related issues

Related to #1821

Changes the webapp upgrade protocol proposed in #1822 to avoid forcibly refreshing the browser tabs with unsaved changes in them.

## Technical implementation

**Before this PR**, the new service worker would clear the offline cache, claim all the active clients, and forcibly refresh them to ensure the latest Playground version is loaded everywhere. This worked, but every webapp upgrade would destroy any work the user may have done in their temporary Playground.

We've explored [storing temporary Playgrounds in OPFS](#1838), but backtracked because 1) it created an uncanny amount of complexity, and 2) some browsers (e.g. Safari in private mode) don't support OPFS and must rely on a temporary in-memory site.

**After this PR**, the service worker clears the offline cache and claims all the active clients, but it doesn't forcibly refresh them. Instead, it uses the network-first strategy for the `remote.html` route and the `/` route. All the other files are still loaded using the cache-first strategy.

Every Playground that's already open, either temporary or stored, will remain functional. The heavy, asynchronously loaded resources such as PHP.wasm and WordPress.zip were already processed – there's no user flow that could lead to `import()`-ing a nonexistent `php.js` file.

Every newly opened Playground will be loaded using a freshly downloaded `remote.html` file containing references to freshly deployed Playground assets.

## Other changes

This PR inlines the reusable service worker utilities from `packages/php-wasm/web/src/lib/register-service-worker.ts` into `@wp-playground` packages. It turns out they weren't that reusable, and keeping them separate was annoying. I'm now convinced the service worker bits are application-specific and splitting them between multiple packages just isn't useful.

## Testing instructions

Confirm that the app deployment E2E tests check what we need to check, and then confirm they are green in CI.
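
The routing rule from the commit message above (network-first for `/` and `remote.html`, cache-first for everything else) can be sketched as follows. This is a simplified, synchronous model rather than the service worker's real code: `strategyFor`, `resolveAsset`, and the map-based stand-ins for CacheStorage and the network are all hypothetical names, and the route is assumed to be served at `/remote.html`.

```typescript
type Strategy = "network-first" | "cache-first";

// Routes named in the PR description. Everything else is cache-first.
const NETWORK_FIRST_ROUTES = new Set(["/", "/remote.html"]);

function strategyFor(pathname: string): Strategy {
  return NETWORK_FIRST_ROUTES.has(pathname) ? "network-first" : "cache-first";
}

// Synchronous stand-ins for CacheStorage and the network (null means offline).
type Store = Map<string, string>;

function resolveAsset(
  pathname: string,
  cache: Store,
  network: Store | null
): string | undefined {
  const fromNetwork = (): string | undefined => {
    const body = network?.get(pathname);
    if (body !== undefined) cache.set(pathname, body); // refresh the offline copy
    return body;
  };
  if (strategyFor(pathname) === "network-first") {
    // Prefer the network; fall back to the cache when offline.
    return fromNetwork() ?? cache.get(pathname);
  }
  // Prefer the cache; only reach for the network on a cache miss.
  return cache.get(pathname) ?? fromNetwork();
}

const offlineStore: Store = new Map([
  ["/remote.html", "cached copy of /remote.html"],
  ["/php.js", "cached copy of /php.js"],
]);
const networkStore: Store = new Map([
  ["/remote.html", "network copy of /remote.html"],
  ["/php.js", "network copy of /php.js"],
]);

console.log(resolveAsset("/remote.html", offlineStore, networkStore)); // network copy of /remote.html
console.log(resolveAsset("/php.js", offlineStore, networkStore)); // cached copy of /php.js
// Offline: the network-first route falls back to the cache, which the first
// call above already refreshed with the network copy.
console.log(resolveAsset("/remote.html", offlineStore, null)); // network copy of /remote.html
```

Because only the entry points go to the network first, a newly opened tab picks up the new `remote.html` and, through it, the new asset references. Tabs that are already running keep using what they have loaded.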

adamziel · 2024-10-09T19:49:31Z

With #1822 and #1849 merged, this seems to be done. Let's reopen if we notice any other problems with website deployments.

adamziel added [Type] Enhancement New feature or request [Type] Reliability Playground uptime, reliability, not crashing [Aspect] Website labels Sep 28, 2024

github-project-automation bot added this to Playground Board Sep 28, 2024

github-project-automation bot moved this to Inbox in Playground Board Sep 28, 2024

adamziel moved this from Inbox to In progress in Playground Board Sep 28, 2024

adamziel added the [Priority] High label Sep 28, 2024

adamziel mentioned this issue Sep 28, 2024

Webapp upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. #1822

Merged

adamziel mentioned this issue Oct 7, 2024

Refreshless website deployments – load remote.html using the network-first strategy #1849

Merged

This was referenced Oct 7, 2024

In Safari, PR viewers for wordpress-develop and gutenberg not loading as expected #1816

Closed

Protocol for deploying playground.wp.net incompatible with the previous service worker #566

Closed

adamziel closed this as completed Oct 9, 2024

github-project-automation bot moved this from In progress to Done in Playground Board Oct 9, 2024

This was referenced Oct 24, 2024

E2E test cache busting on playground.wordpress.net #855

Closed

Run E2E tests on Chromium and Firefox #808

Closed

E2E tests for HTTP error handling #950

Closed
