[Website] 100% reliable deployments #1821

Closed
adamziel opened this issue Sep 28, 2024 · 1 comment
Labels
[Aspect] Website [Priority] High [Type] Enhancement New feature or request [Type] Reliability Playground uptime, reliability, not crashing

Comments

@adamziel
Collaborator

adamziel commented Sep 28, 2024

Playground deployments require clearing the cache and the service worker way too often. Let's ensure a high standard of stability. All deployments should always work on all browsers without having to clear the cache.

Done is

We have an E2E suite that tests a Playground website deployment from a very old version to a new version, and ensures the following things work:

  • New webapp runs on the first page visit.
  • Offline mode.
  • Direct visits and iframe embeds.
  • No cache layer yields stale data (HTTP cache, service worker, OfflineCache, other Cache instances, etc.).
  • Open tabs that can be reloaded without data loss are reloaded. Other open tabs are left alone.
  • Safari, Chrome, Firefox, mobile browsers.

There should be no intermittent failures, stale fetch() responses, or problems with stale service workers.

Root cause of the problem

Two reasons are at play:

  • Dependency graphs
  • Caching

Dependency graphs

Deploying a new Playground version does two things:

  • Publishes new assets on playground.wordpress.net
  • Deletes old assets

If the previous version of Playground is still running, it will attempt to fetch the old assets – and fail:

[Image: Playground Board]

This wasn't a big deal a few months ago, since a page reload would solve it, but then we introduced offline support in #1483.

Caching

The offline support keeps a copy of all the accessed old assets until the new service worker is installed. This might take 24 hours or sometimes longer! During that time, visiting playground.wordpress.net would load the cached index.html file and the rest of the stale dependency graph from the previous Playground release. Since some files are only loaded on demand, we'd get a mixture of cached assets and network errors – effectively putting the app in an undefined state.

The solution

I’m 95% convinced we must always force Playground to switch to the new service worker.

However, this would require refreshing all the open tabs and would trash any temporary Playgrounds.

Therefore, we might have to store all Playgrounds in OPFS, even the temporary ones. To maintain good UX, we'd add a cleanup mechanism to hide the "stored temporary" Playgrounds after a regular page refresh, and we'd keep them visible after a page refresh triggered by a new Playground release. We could also add a “Recently archived” button to recover anything archived during the last 24 hours – cc @jarekmorawski for thoughts.

Solutions without a forced page refresh

I couldn't find any solution that would keep the Playground site working without a forced page refresh:

[Image: Playground Board]

cc @brandonpayton @bgrgicak

@adamziel adamziel added [Type] Enhancement New feature or request [Type] Reliability Playground uptime, reliability, not crashing [Aspect] Website labels Sep 28, 2024
@adamziel adamziel moved this from Inbox to In progress in Playground Board Sep 28, 2024
adamziel added a commit that referenced this issue Oct 2, 2024
…r tabs to prevent fatal errors after new deployments. (#1822)

## Motivation for the change, related issues

Solves fatal Playground breakages after new version deployments by
adopting the following version upgrade protocol:

* Playground version is upgraded as early as possible after a new
release
* HTTP cache is skipped when fetching new assets
* Stale Playground tabs are forcibly refreshed

Related to #1821

## The problem

Playground got affected by HTTP caching and ended up loading assets from
both the old release and the new release. This broke the app's
dependency graph and led to fatal errors.

See the visualisation below. When Playground v184 is released, the app
will only work properly if all the loaded assets come from v184:


![371781531-608a780e-60b8-4ed4-969a-d7497c7500a7](https://github.com/user-attachments/assets/605cba58-8eba-4fdb-b527-8c1f6942ce24)
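
To make the invariant above concrete, here is a minimal TypeScript sketch. It assumes, purely for illustration, that every loaded asset can be labelled with the build it came from. The `LoadedAsset` shape, the `findMismatchedAssets` helper, and the `/assets/...` paths are hypothetical and not part of Playground.

```typescript
// Hypothetical shape: real Playground assets carry no explicit build label.
interface LoadedAsset {
  url: string;
  build: string; // e.g. "v184"
}

// Returns the assets that do not belong to the expected build.
// A non-empty result means the dependency graph mixes releases.
function findMismatchedAssets(
  assets: LoadedAsset[],
  expectedBuild: string
): LoadedAsset[] {
  return assets.filter((asset) => asset.build !== expectedBuild);
}

const loaded: LoadedAsset[] = [
  { url: '/index.html', build: 'v184' },
  { url: '/assets/main.js', build: 'v184' },
  // A stale copy, for example one served from the HTTP cache.
  { url: '/assets/php.js', build: 'v183' },
];

const mismatched = findMismatchedAssets(loaded, 'v184');
```

Here `mismatched` holds the single v183 entry, which is the kind of mix that breaks the app.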

## The solution

This PR ensures HTTP cache is skipped for assets that are cached
offline. This isn't perfect as the browser will sometimes download the
same file twice, but it's much better than breaking the app. We'll
explore making the most out of both cache layers in the future.

Here's a rundown of the caching strategy implemented in this PR:

* Playground version is upgraded as early as possible after a new
release
* HTTP cache is skipped

### Playground version is upgraded as early as possible after a new release

New service workers call `.skipWaiting()`, immediately claim all the
clients that were controlled by the previous service worker, and
forcibly refresh them.

Why?

Because Playground fetches new resources asynchronously and on demand,
while deploying a new version of the webapp destroys the resources
referenced by the previous version. Therefore, we can't allow the
previous version to keep running once a new version becomes available.
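
The upgrade steps described above could look roughly like the sketch below. It is not the code from this PR: `WorkerScopeLike`, `ClientLike`, `onInstall`, and `onActivate` are hypothetical names, and the interfaces only mirror the parts of the standard service worker API the sketch relies on (`skipWaiting()`, `clients.claim()`, `clients.matchAll()`, and `WindowClient.navigate()`).

```typescript
// Minimal structural types so the sketch type-checks outside a worker.
interface ClientLike {
  url: string;
  navigate(url: string): Promise<unknown>;
}

interface WorkerScopeLike {
  skipWaiting(): Promise<void>;
  clients: {
    claim(): Promise<void>;
    matchAll(options?: { type?: 'window' }): Promise<ClientLike[]>;
  };
}

// Meant for the "install" event: activate right away instead of
// waiting for every tab controlled by the old worker to close.
function onInstall(scope: WorkerScopeLike): Promise<void> {
  return scope.skipWaiting();
}

// Meant for the "activate" event: take control of the open tabs,
// then reload each one by navigating it to its current URL.
// Resolves with the number of tabs that were refreshed.
async function onActivate(scope: WorkerScopeLike): Promise<number> {
  await scope.clients.claim();
  const tabs = await scope.clients.matchAll({ type: 'window' });
  await Promise.all(tabs.map((tab) => tab.navigate(tab.url)));
  return tabs.length;
}
```

In a real worker these helpers would presumably be passed to `event.waitUntil(...)` inside the `install` and `activate` listeners.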

#### Push notifications

It would be supremely useful to proactively notify the webapp after a
fresh deployment.
 Playground doesn't do that yet but it likely will in the future.

### HTTP cache is skipped

 Playground relies on the **Cache only** strategy. It loads assets from
the network, caches them, and serves them from the cache. The assumption
is that all network requests yield the most recent version of the remote
file.

 This helps us avoid the HTTP cache problem.

#### Cache layers

 We're dealing with the following cache layers:

 * HTTP cache in the browser
 * CacheStorage in the service worker
 * Edge Cache on playground.wordpress.net

#### HTTP cache in the browser

This service worker skips the browser HTTP cache for all network
requests. This is because
the HTTP cache caused a particularly nasty problem in Playground
deployments.

Installing a new service worker purged the CacheStorage and requested a
new set of assets
from the network. However, some of these requests were served from the
HTTP cache. As a
result, Playground would start loading a mix of old and new assets and
quickly error out.
What made it worse is that this broken state was cached in CacheStorage,
breaking Playground
 for weeks until the cache was refreshed.

See #1822 for more
details.
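
One way to bypass the browser HTTP cache is the `cache` option of the Fetch API. The sketch below wraps a fetch-like function so that every request uses `cache: 'no-store'`. The choice of `'no-store'`, the `withoutHttpCache` name, and the simplified `FetchInit` type are assumptions made for illustration; this PR may use a different mode or mechanism.

```typescript
// Simplified stand-ins for the DOM fetch types.
type CacheMode = 'default' | 'no-store' | 'reload' | 'no-cache' | 'force-cache';

interface FetchInit {
  cache?: CacheMode;
  headers?: Record<string, string>;
}

type FetchLike<T> = (url: string, init?: FetchInit) => Promise<T>;

// Wraps a fetch function so that every request goes to the network
// without reading from or writing to the browser HTTP cache.
// Other options supplied by the caller are preserved.
function withoutHttpCache<T>(fetchFn: FetchLike<T>): FetchLike<T> {
  return (url, init) => fetchFn(url, { ...init, cache: 'no-store' });
}
```

In a service worker the wrapped function would typically be the global `fetch`, and the result would be used wherever assets are requested before being stored in CacheStorage.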

#### CacheStorage in the service worker

This service worker uses a **Cache only** strategy to ensure all the
loaded assets come from the same webapp build.

The **Cache only** strategy means Playground only loads each asset from
the network once, caches it, and serves it from the cache from that
point on.

The only times Playground reaches out to the network are:

 * Before the service worker is installed.
 * When the service worker is being activated.
 * When a CacheStorage cache miss occurs.
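
The lookup order described above (serve from the cache, reach out to the network only on a miss, then store the result) can be sketched as follows. `CacheLike` and `cachedFetch` are hypothetical names; `CacheLike` only imitates the `match()` and `put()` methods of the browser `Cache` interface. With real `Response` objects, the response would need to be cloned before being stored.

```typescript
// A tiny imitation of the parts of the Cache interface used here.
interface CacheLike<T> {
  match(url: string): Promise<T | undefined>;
  put(url: string, response: T): Promise<void>;
}

// Serves `url` from the cache when possible. On a miss, fetches it
// once, stores the result, and returns it.
async function cachedFetch<T>(
  url: string,
  cache: CacheLike<T>,
  fetchFresh: (url: string) => Promise<T>
): Promise<T> {
  const hit = await cache.match(url);
  if (hit !== undefined) {
    return hit;
  }
  const fresh = await fetchFresh(url);
  await cache.put(url, fresh);
  return fresh;
}
```

Repeated calls for the same URL reach the network only once, which is what keeps all the loaded assets on the same build until the cache is purged.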

#### Edge Cache on playground.wordpress.net

The remote server (playground.wordpress.net) has an Edge Cache that's
populated with
all static assets on every webapp deployment. All the assets served by
playground.wordpress.net
at any point in time come from the same build and are consistent with
each other. The
deployment process is atomic-ish so the server should never expose a mix
of old and new
 assets.

However, what if a new webapp version is deployed right when someone
has downloaded 10 out of the 27 static assets required to boot
Playground?

Right now, they'd end up in an undefined state and likely see an error.
Then, on a page refresh,
they'd pick up a new service worker that would purge the stale assets
and boot the new webapp
 version.

This is not a big problem for now, but it's also not the best user
experience. It can eventually be solved with push notifications: a new
deployment would notify all the active clients to upgrade and pick up
the new assets.

 ## Other changes

In addition, this PR:

* Adds E2E tests for app deployments and offline mode
* Updates CI to run Playwright tests:
	* Firefox
	* Safari
	* Chrome
* Fixes a few paper cuts
* Fixed: Boot halted when OPFS isn't available due to error/success
hooks never running (4e0ef74)
* Fixed: "Save in this browser" option stays available even when there's
no OPFS support (f6225a9)

## Paths not taken

* Relying on build-time hashes in the filenames for all caching. We
can't rely on that for the most important routes: `/`, `/index.html`,
`/remote.html`, `/sw.js` – they need stable URLs for multiple reasons.
* A different caching strategy, such as [network falling back to
cache](https://web.dev/articles/offline-cookbook#network-falling-back-to-cache).

## Caveats and follow-up work

* Let's find a way to leverage HTTP cache without breaking the offline
cache.
* There's no way to recover from a deployment happening during a page
load – let's fix it.
* A new service worker forcibly reloads other browser tabs and destroys
their in-memory context. Let's solve it by storing temporary sites in
OPFS.

## Testing Instructions (or ideally a Blueprint)

CI.

Yes, it sounds like a lame testing plan for such a profound change.
However, almost none of these changes can be tested in a local dev
environment and a large part of this work was about covering app
deployment in our E2E tests.

If you want to try these tests locally and see what they do, you'll need
this special setup:

```bash
$ npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
$ npx playwright test --config=packages/playground/website/playwright/playwright.config.ts ./packages/playground/website/playwright/e2e/deployment.spec.ts --ui
```

## Related resources

* PR that turned off HTTP caching:
#1822
* Exploring all the cache layers:
#1774
* Cache only strategy:
https://web.dev/articles/offline-cookbook#cache-only
* Service worker caching and HTTP caching:
https://web.dev/articles/service-worker-caching-and-http-caching

---------

Co-authored-by: Bero <berislav.grgicak@gmail.com>
Co-authored-by: Brandon Payton <brandon@happycode.net>
adamziel added a commit that referenced this issue Oct 4, 2024
…r tabs to prevent fatal errors after new deployments. (#1822)

adamziel added a commit that referenced this issue Oct 7, 2024
…first strategy (#1849)

## Motivation for the change, related issues

Related to #1821

Changes the webapp upgrade protocol proposed in #1822 to avoid forcibly
refreshing the browser tabs with unsaved changes in them.

## Technical implementation

**Before this PR**, the new service worker would clear the offline
cache, claim all the active clients, and forcibly refresh them to ensure
the latest Playground version is loaded everywhere.

This worked, but every webapp upgrade would destroy any work the user
may have done in their temporary Playground. We've explored [storing
temporary Playgrounds in
OPFS](#1838), but
backtracked because 1) it created an uncanny amount of complexity, and
2) some browsers (e.g. Safari in private mode) don't support OPFS and
must rely on a temporary in-memory site.

**After this PR**, the service worker clears the offline cache, claims
all the active clients, but it doesn't forcibly refresh them. Instead,
it uses the network-first strategy for the `remote.html` route and the
`/` route. All the other files are still loaded using the cache-first
strategy.
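
The routing rule described above can be sketched as follows. This is an illustration rather than the code from this PR: `strategyFor`, `networkFirst`, and `StoreLike` are hypothetical names, and matching the two entry points by exact pathname is an assumption.

```typescript
type Strategy = 'network-first' | 'cache-first';

// Assumption: the entry points are matched by exact pathname.
const NETWORK_FIRST_PATHS = ['/', '/remote.html'];

// Entry points are always revalidated against the network.
// Every other file is served from the offline cache first.
function strategyFor(pathname: string): Strategy {
  return NETWORK_FIRST_PATHS.indexOf(pathname) !== -1
    ? 'network-first'
    : 'cache-first';
}

interface StoreLike<T> {
  match(url: string): Promise<T | undefined>;
  put(url: string, response: T): Promise<void>;
}

// Network-first: try the network and refresh the cached copy.
// Fall back to the cached copy only when the network fails,
// for example when the browser is offline.
async function networkFirst<T>(
  url: string,
  cache: StoreLike<T>,
  fetchFresh: (url: string) => Promise<T>
): Promise<T> {
  try {
    const fresh = await fetchFresh(url);
    await cache.put(url, fresh);
    return fresh;
  } catch (error) {
    const cached = await cache.match(url);
    if (cached !== undefined) {
      return cached;
    }
    throw error;
  }
}
```

With this split, a newly opened tab gets a fresh `remote.html` whenever the network is reachable, while offline mode still works from the cached copy.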

Every Playground that's already open, either temporary or stored, will
remain functional. The heavy, asynchronously loaded resources such as
PHP.wasm and WordPress.zip were already processed – there's no user flow
that could lead to `import()`-ing a non-existing `php.js` file.

Every newly opened Playground will be loaded using a freshly downloaded
`remote.html` file containing references to freshly deployed Playground
assets. Thus

## Other changes

This PR inlines the reusable service worker utilities from
`packages/php-wasm/web/src/lib/register-service-worker.ts` into
`@wp-playground` packages. It turns out, they weren't as reusable and
keeping them separate was annoying. I'm now convinced the service worker
bits are application specific and splitting them between multiple
packages just isn't useful.

## Testing instructions

Review whether the app deployment E2E tests check what we need to
check, and then confirm they are green in CI.
@adamziel
Collaborator Author

adamziel commented Oct 9, 2024

With #1822 and #1849 merged, this seems to be done. Let's reopen if we notice any other problems with website deployments.
