
fix(ci): test stability improvements#2466

Merged
lidel merged 35 commits into main from
fix/e2e-test-improvements
Jan 26, 2026

@lidel lidel commented Jan 24, 2026

CI and E2E tests were completely broken for the past 7 months. This PR fixes them and makes the test infrastructure more maintainable.

What changed

  • E2E tests now pass reliably:

    • Modernized all tests to use Playwright's locator API instead of brittle CSS selectors
    • Centralized test fixtures (test/e2e/setup/fixtures.js) for page navigation and peer node management
    • Added semantic locators (test/e2e/setup/locators.js) so tests read more clearly and break less often
    • Upgraded Playwright to 1.58.0
    • Fixed Kubo daemon cleanup in test teardown
  • CI runs faster:

    • Consolidated 4 separate workflow files into one ci.yml
    • Added npm/Playwright cache with smart invalidation (skip browser download on cache hit)
    • Pinned Node.js and Go versions in .tool-versions to avoid surprise breakage from upstream (I remember this happening MORE THAN ONCE in the past 5 years, due to Node.js "nuances")
  • Better docs for contributors:

    • Slimmed down README to focus on what the project does and quick start
    • Moved detailed setup, CORS config, and test debugging to docs/developer-notes.md
    • Added docs/RELEASING.md with release checklist

lidel added 4 commits January 24, 2026 22:00
removed 10-shard matrix that was adding complexity and CI time overhead
without providing proportional benefits. tests now run in a single job
with a 10-minute timeout, which is sufficient given test suite completes
in ~15 seconds locally.

also simplified the two conditional test runs (repeated vs non-repeated)
into a single step that always uses --reporter=list for clearer output.

removed conditional logic that changed behavior based on process.env.CI.
this was causing inconsistency between local and CI test runs, making it
harder to reproduce CI failures locally.

now uses consistent settings: 30s timeout per test, 5-minute global
timeout for entire suite, no retries, and always starts fresh server.
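
The consistent settings listed above could be expressed roughly as the following Playwright config sketch. `timeout`, `globalTimeout`, `retries`, and `webServer.reuseExistingServer` are standard Playwright config options; the server command, port, and URL are placeholders and are not taken from this PR.

```javascript
// playwright.config.js (sketch): identical values locally and on CI,
// with no branching on process.env.CI.
const config = {
  timeout: 30 * 1000, // 30s per test
  globalTimeout: 5 * 60 * 1000, // 5 minutes for the entire suite
  retries: 0, // no retries, so flaky tests fail visibly
  webServer: {
    command: 'npx http-server ./build -p 3001', // placeholder command and port
    url: 'http://127.0.0.1:3001/', // placeholder URL
    reuseExistingServer: false // always start a fresh server
  }
}

module.exports = config
```

Because nothing here depends on the environment, a failure seen on CI should in principle be reproducible with the same command locally.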
added timestamped logging to global-setup.js and ipfs-backend.js to help
diagnose CI hangs when they occur. each major step now logs progress.

added timeout wrappers around async operations that could hang indefinitely:
- ipfs-backend startup: 60s timeout
- kubo daemon spawn: 30s timeout

also fixed two issues:
- disabled DHT bootstrapping (Bootstrap: []) for faster daemon startup
- changed addInitScript to page.evaluate so localStorage values are
  captured by storageState() before browser closes
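
A timeout wrapper of the kind described above can be sketched as follows. The name `withTimeout` matches what a later commit in this PR calls these wrappers, but the signature and error message here are assumptions, not the PR's actual code.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
// `label` is only used to make the error message readable.
function withTimeout (promise, ms, label) {
  let timer
  const timeout = new Promise((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`))
    }, ms)
  })
  // Whichever settles first wins; always clear the timer so Node can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}
```

With this shape, the two limits above would be written as `withTimeout(startBackend(), 60_000, 'ipfs-backend startup')` and `withTimeout(spawnDaemon(), 30_000, 'kubo daemon spawn')`, where `startBackend` and `spawnDaemon` are hypothetical names.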
files.test.js:
- changed file verification to only check the two files we uploaded
  instead of iterating all MFS files. other tests may have added files
  that would cause unexpected matches.

grid-view.test.js:
- added focusGrid() helper that tries multiple approaches to establish
  keyboard focus on the grid container. this fixes intermittent failures
  where arrow key navigation would not work because focus was not set.
- simplified test assertions to use playwright's built-in waiters
  instead of manual count checks.

grid.js helper:
- selectViewMode now waits for files view to be ready before checking
  current mode, preventing race conditions during page load.
lidel added 3 commits January 24, 2026 23:15
the global teardown only removed the JSON config file but never called
ipfsd.stop() on the spawned Kubo daemon. this left orphaned processes
accumulating on CI runners, causing port conflicts and resource exhaustion.

- export stop() function from ipfs-backend.js
- call stop() in global-teardown.js before removing config file
- add logging for teardown progress
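
The teardown order described above (stop the daemon, then remove the config file) might look like this sketch. The module-level `ipfsd` variable and the `globalTeardown(configPath)` signature are assumptions for illustration; only `stop()` and `ipfsd.stop()` are named in the commit message.

```javascript
const fs = require('node:fs')

// Sketch of ipfs-backend.js state: assigned when the daemon is spawned.
let ipfsd = null

// Exported from ipfs-backend.js so teardown can shut the daemon down.
async function stop () {
  if (ipfsd == null) return
  console.log('[teardown] stopping Kubo daemon')
  await ipfsd.stop()
  ipfsd = null
}

// Sketch of global-teardown.js: stop the daemon first, then remove its config.
async function globalTeardown (configPath) {
  await stop()
  fs.rmSync(configPath, { force: true })
  console.log('[teardown] removed', configPath)
}
```

Stopping before deleting matters here: if the file removal threw first, the daemon would be left running, which is the orphaned-process problem this commit addresses.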
trace: 'on-first-retry' was ineffective because retries=0, meaning
traces would never be captured. changed to 'retain-on-failure' so
traces are available when debugging test failures.

migrate from deprecated waitForSelector() pattern to modern locator API
which provides better error messages and auto-waiting behavior.

changes include:
- replace page.waitForSelector() with page.locator().waitFor()
- remove force:true clicks, use proper waits instead
- fix missing await on click operations (files.test.js, ipns.test.js)
- replace custom checkClassWithTimeout polling with waitForFunction
- use .first() where multiple elements match to satisfy strict mode
- use more specific selectors (button#id, [role="menuitem"]) to avoid
  ambiguous matches
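
As a rough illustration of the migration listed above, the helper below shows the locator-based pattern. `page.locator()`, `.first()`, `.waitFor()`, and `.click()` are Playwright APIs; the helper name `clickWhenReady` and the selector in the comments are hypothetical.

```javascript
// Before (older pattern):
//   await page.waitForSelector('button#some-id')
//   page.click('button#some-id', { force: true }) // forced click, missing await
//
// After: wait on a locator, then click it, awaiting every step.
async function clickWhenReady (page, selector) {
  // .first() satisfies strict mode when several elements match
  const target = page.locator(selector).first()
  await target.waitFor({ state: 'visible' })
  await target.click()
}
```

The explicit `waitFor` is kept to mirror the bullet above; `click()` on a locator also waits for the element to be actionable on its own.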
the `promise/param-names` rule requires Promise constructor parameters
to match `^_?resolve$` and `^_?reject$` patterns. changed `_` to
`_resolve` in the timeout wrapper functions.

always run `playwright install --with-deps` regardless of cache status
to ensure system dependencies are present. previously this was only run
on cache miss, which could cause failures if deps were missing.
@lidel lidel force-pushed the fix/e2e-test-improvements branch from 38393f3 to b9ed885 January 24, 2026 23:56
tests pass locally (~17s) but hang on CI until 10-minute timeout.
added comprehensive timestamped logging at every async operation:

- global-setup.js: log each step (port check, daemon spawn, browser launch, navigation)
- ipfs-backend.js: log kubo lifecycle (factory, spawn, identity, config write)
- global-teardown.js: log cleanup operations
- test-e2e.yml: add shell timestamps, enable DEBUG=pw:api for Playwright logging

logs output to both stdout and stderr to ensure CI captures output.
timeout wrappers now warn at 80% of timeout before failing.

after CI run, last log message before timeout identifies the hanging operation.

address CI hang by:
- make webui port configurable via WEBUI_PORT env var
- use dynamic port allocation in CI workflow
- increase webServer timeout from 5s to 30s for CI
- add stdout/stderr piping to capture webServer output
- add build directory check before running tests
- add config logging to track port and cwd
- use 127.0.0.1 instead of localhost for consistency

the CI was hanging with no output because Playwright initialization
was blocking before globalSetup even ran. these changes will help
identify exactly where the hang occurs.
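
A configurable port along the lines described above could be wired into the Playwright `webServer` block like this. `WEBUI_PORT`, `127.0.0.1`, the 30s timeout, and output piping come from the commit message; the fallback port 3001 and the helper name `webServerConfig` are assumptions.

```javascript
// Build the webServer section of the Playwright config from the environment.
function webServerConfig (env = process.env) {
  // Fallback port is a placeholder, not taken from this PR.
  const port = Number(env.WEBUI_PORT) || 3001
  return {
    // -p and -a are http-server's port and bind-address flags
    command: `npx http-server ./build -p ${port} -a 127.0.0.1`,
    url: `http://127.0.0.1:${port}/`, // 127.0.0.1 rather than localhost
    timeout: 30 * 1000, // raised from 5s
    stdout: 'pipe', // surface server output in the test log
    stderr: 'pipe'
  }
}
```

On CI the workflow would then pick a free port and export it as `WEBUI_PORT` before invoking Playwright.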
previous CI runs showed 8-minute hangs with zero output from Playwright.
this indicates cross-env or npm is buffering stdout.

changes:
- run playwright directly in CI instead of via npm script
- add playwright version check before running tests
- use fs.writeSync for config logging to bypass Node.js buffering
- log environment variables to verify they're passed correctly
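
Synchronous, timestamped logging of the kind mentioned above can be sketched with `fs.writeSync`. The helper names `formatLine` and `logSync` and the line format are assumptions, not the PR's actual code.

```javascript
const fs = require('node:fs')

// Build one timestamped log line: "[<ISO timestamp>] [<scope>] <message>".
function formatLine (scope, message, now = new Date()) {
  return `[${now.toISOString()}] [${scope}] ${message}\n`
}

// Write synchronously to stderr (fd 2) so the line is emitted before the
// call returns, even if the process later hangs or is killed.
function logSync (scope, message) {
  fs.writeSync(2, formatLine(scope, message))
}

logSync('config', `cwd=${process.cwd()} WEBUI_PORT=${process.env.WEBUI_PORT ?? '(unset)'}`)
```

Since each line is written before the next operation starts, the last line in the CI log points at the step that hung.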
trying to isolate the CI hang:
- removed DEBUG=pw:api which might cause infinite output buffering
- removed NODE_OPTIONS which might affect behavior
- added direct node test to load playwright.config.js independently
- added timeout wrapper around playwright test command
- added chromium installation dry-run check

if config load test fails, the issue is in config/node setup.
if config loads but playwright test hangs, issue is in playwright runner.

the CI was hanging because npx http-server was not starting.
replaced it with a simple inline node http server that:
- serves files from ./build directory
- handles common MIME types
- starts immediately without npx overhead

also fixed eslint single-quote error.
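
A minimal static server of the kind described above, using only Node.js built-ins, might look like the sketch below. The MIME table, the function name `createStaticServer`, and the traversal check are illustrative assumptions rather than the code from this commit.

```javascript
const http = require('node:http')
const fs = require('node:fs')
const path = require('node:path')

const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.css': 'text/css',
  '.json': 'application/json',
  '.svg': 'image/svg+xml',
  '.png': 'image/png',
  '.wasm': 'application/wasm'
}

// Create (but do not start) a static file server rooted at `rootDir`.
function createStaticServer (rootDir) {
  const root = path.resolve(rootDir)
  return http.createServer((req, res) => {
    let urlPath
    try {
      urlPath = decodeURIComponent(new URL(req.url, 'http://127.0.0.1').pathname)
    } catch {
      res.writeHead(400).end('Bad request')
      return
    }
    let filePath = path.join(root, urlPath)
    // Refuse anything that resolves outside the root directory.
    if (filePath !== root && !filePath.startsWith(root + path.sep)) {
      res.writeHead(403).end('Forbidden')
      return
    }
    // Serve index.html for directory requests such as "/".
    if (fs.existsSync(filePath) && fs.statSync(filePath).isDirectory()) {
      filePath = path.join(filePath, 'index.html')
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404).end('Not found')
        return
      }
      const type = MIME[path.extname(filePath)] || 'application/octet-stream'
      res.writeHead(200, { 'Content-Type': type }).end(data)
    })
  })
}
```

It would be started with something like `createStaticServer('./build').listen(port, '127.0.0.1')`, where `port` is whatever the test config expects.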
replace inline node command with dedicated serve-build.js script
to avoid issues with shell escaping and ES module requirements

- remove timeout wrapper that was hiding failures (exit code 124)
- remove || echo that swallowed error codes
- change webServer stdout/stderr from pipe to inherit for visibility
- clean up unnecessary diagnostic steps
- remove serve-build.js, restore npx http-server
- remove withTimeout() wrappers (not needed with Node.js fix)
- keep stop() export and proper daemon shutdown
- fix grid.js to use 127.0.0.1 (matches playwright config)
- keep: locator API, Bootstrap:[], cache optimization, Node pin
- add data-testid attributes to File, FilesList, FilesGrid, GridFile
- create fixtures.js with worker-scoped peerNode fixture for speed
- create locators.js with centralized selector definitions
- replace brittle CSS selectors with getByRole/getByTestId
- replace waitFor() calls with web-first assertions (toBeVisible, toHaveClass)
- update coverage.js to re-export from fixtures.js for backward compat
- modernize all test files to use shared locators and fixtures
- add test/e2e/test/ to gitignore (artifact from running tests)
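
A centralized locators module like the one described above could be organized as in this sketch. `getByTestId`, `getByRole`, and `filter({ hasText })` are Playwright APIs; the test id values and key names are hypothetical and may not match the attributes this PR added.

```javascript
// locators.js (sketch): one place that knows how to find things on the page.
const locators = {
  // data-testid values below are hypothetical
  filesList: (page) => page.getByTestId('files-list'),
  fileRow: (page, name) => page.getByTestId('file-row').filter({ hasText: name }),
  menuItem: (page, name) => page.getByRole('menuitem', { name })
}

// In a test, with `expect` from @playwright/test, a web-first assertion
// would then replace a manual waitFor() call:
//   await expect(locators.fileRow(page, 'example.txt')).toBeVisible()
```

Keeping selectors in one module means a markup change is fixed in a single place instead of in every test file.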
@lidel lidel force-pushed the fix/e2e-test-improvements branch from 37ee0ab to 1e6740d January 26, 2026 04:40
…ions

- add .tool-versions for nodejs 24.11.0 and golang 1.25
- upgrade actions/checkout v4 to v6
- upgrade actions/cache v4 to v5
- upgrade actions/setup-node v4 to v6
- upgrade actions/upload-artifact v4 to v6
- upgrade actions/download-artifact v4 to v7
- upgrade actions/setup-go v5 to v6
- switch setup-node and setup-go to use go-version-file/node-version-file
- remove NODE_VERSION env and node-version inputs from reusable workflows
- update README to point to .tool-versions for version info
@lidel lidel marked this pull request as ready for review January 26, 2026 05:39
@lidel lidel requested a review from a team as a code owner January 26, 2026 05:39
- forbidOnly now only applies in CI, allowing local debugging with .only
- restored "Bulk import" menu item assertion in files test
- pass secrets to reusable workflows so CODECOV_TOKEN is available
- add node_modules caching with conditional npm ci to skip install on cache hit
- include patches/** and .tool-versions in cache key to invalidate on changes
- add missing npm install step to e2e-coverage job
… hit

- on cache miss: run `playwright install --with-deps` (full install)
- on cache hit: run `playwright install-deps` (only OS deps, ~45s faster)
@lidel lidel force-pushed the fix/e2e-test-improvements branch from 805ba53 to 31014c6 January 26, 2026 17:40
- @playwright/test: 1.48.2 -> 1.58.0
- playwright-chromium: 1.48.2 -> 1.58.0

no breaking changes affect this codebase

- reorganize README with cleaner layout and navigation links
- clarify Web UI is specifically for Kubo nodes
- add features list and "Getting Help" section
- move detailed dev docs to docs/developer-notes.md
- move release instructions to docs/RELEASING.md
- replace Matrix badge with Discourse forum badge
@lidel lidel merged commit d11475a into main Jan 26, 2026
12 checks passed
@lidel lidel deleted the fix/e2e-test-improvements branch January 26, 2026 18:27
This was referenced Jan 26, 2026
ipfs-gui-bot pushed a commit that referenced this pull request Feb 5, 2026
## [4.11.0](v4.10.0...v4.11.0) (2026-02-05)

 CID `bafybeidfgbcqy435sdbhhejifdxq4o64tlsezajc272zpyxcsmz47uyc64`

 ---

### Features

* Add search/filter functionality to Files UI ([#2451](#2451)) ([c866be6](c866be6)), closes [#2447](#2447)
* DHT Provide Sweep Diagnostic Screen ([#2463](#2463)) ([fb22ea6](fb22ea6))
* **files:** resolve paths before inspect and support protocol URL ([#2465](#2465)) ([74a44d8](74a44d8))
* **files:** support additional image file extensions ([#2347](#2347)) ([371341a](371341a))

### Bug Fixes

* **ci:** test stability improvements ([#2466](#2466)) ([d11475a](d11475a))
* CLI tutor commands missing some parameters ([#2470](#2470)) ([ed8ad6a](ed8ad6a))
* **diagnostics:** handle Go zero time in DHT provide screen ([dc51cd4](dc51cd4))
* **files:** not found page ([#2455](#2455)) ([18b9b0d](18b9b0d))
* show proper error state in import notifications ([#2452](#2452)) ([391470e](391470e)), closes [#2448](#2448)

### Trivial Changes

* **ci:** skip publishPreview for dependabot PRs ([17f675e](17f675e))
* pull new translations ([#2467](#2467)) ([cc569f4](cc569f4))
* pull transifex translations ([#2464](#2464)) ([8d7a17f](8d7a17f))
@ipfs-gui-bot

🎉 This PR is included in version 4.11.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
