chore: improve CI performance by upgrading runner and removing test overhead#2076
chore: improve CI performance by upgrading runner and removing test overhead#2076nick-inkeep merged 4 commits intomainfrom
Conversation
…verhead - Upgrade ci job runner from ubuntu-latest to ubuntu-16gb for more resources - Remove OpenTelemetry NodeSDK initialization from test setup (was creating full auto-instrumentation per worker thread with no benefit in unit tests) - Reduce agents-api vitest maxThreads from 10 to 8 and minThreads from 4 to 2 to better match runner core count and reduce per-worker initialization cost Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
There was a problem hiding this comment.
PR Review Summary
(1) Total Issues | Risk: Low
🔴❗ Critical (0) ❗🔴
None.
🟠⚠️ Major (0) 🟠⚠️
None.
🟡 Minor (0) 🟡
None.
💭 Consider (1) 💭
💭 1) agents-api/package.json Potentially unused OTel devDependencies
Issue: With the removal of OTel NodeSDK initialization from test setup, two devDependencies appear to no longer have any consumers in agents-api:
@opentelemetry/exporter-trace-otlp-proto(line 98)@opentelemetry/sdk-metrics(line 99)
Why: These were only imported in the removed test setup code. Production instrumentation uses @opentelemetry/exporter-trace-otlp-http (not -proto) and doesn't use sdk-metrics. Leaving unused deps adds install overhead and may trigger false positives in dependency analysis.
Fix: Run pnpm knip to confirm these are flagged as unused, then remove if confirmed. This can be addressed in a follow-up PR.
Refs:
🧹 While You're Here (1) 🧹
🧹 1) .github/workflows/ci.yml CI job missing timeout-minutes
Issue: The ci job lacks a timeout-minutes setting, while the Cypress workflow (which also uses ubuntu-16gb) sets timeout-minutes: 30.
Why: Without a timeout, a hung build/test step could consume runner minutes until the 360-minute GitHub default. Given this PR aims to reduce CI time, an explicit timeout documents expected duration and provides a safeguard.
Fix: Add timeout-minutes: 30 (or appropriate value) to the ci job definition, similar to the Cypress workflow.
Refs:
✅ APPROVE
Summary: Clean CI performance optimization. The OTel removal is verified safe (tests mock the API layer directly, production instrumentation is unaffected), the runner upgrade matches the existing Cypress workflow pattern, and the thread pool tuning is well-reasoned for the target runner specs. The Consider item (unused devDeps) is a minor cleanup opportunity, and the While You're Here item (timeout) is a pre-existing improvement opportunity. Ship it! 🚀
Discarded (2)
| Location | Issue | Reason Discarded |
|---|---|---|
.github/workflows/ci.yml |
Missing explicit permissions: block |
Pre-existing, consistent with other workflows in repo, very low priority security hygiene |
agents-api/vitest.config.ts |
Thread pool values lack explanatory comment | AGENTS.md guidance: "No Comments: Do not add comments unless explicitly requested" |
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
5 | 0 | 1 | 1 | 0 | 0 | 2 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 5 | 0 | 1 | 1 | 0 | 0 | 2 |
Note: pr-review-standards confirmed no bugs, security issues, or AGENTS.md compliance problems. One finding ("thread pool comment now stale") was discarded as it conflicts with AGENTS.md guidance on comments.
Remove @opentelemetry/exporter-trace-otlp-proto and @opentelemetry/sdk-metrics from agents-api devDependencies since they were only used in the test setup OTel initialization that was removed in the previous commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three targeted fixes to prevent catastrophic cache invalidation in CI: 1. Exclude test files from build task inputs - prevents test file changes in core packages from cascading build hash invalidation to all 11+ downstream packages. Uses officially documented $TURBO_DEFAULT$ with negation globs (turbo 1.12+). 2. Remove transit dependency from lint task - transit is a no-op coordination task (no package defines a transit script) but its hash changes on any file change, cascading to all downstream lint tasks. Lint only reads local source files and doesn't need dependency ordering. 3. Move TURBO_TOKEN/TURBO_TEAM to job-level env - ensures all turbo invocations (check, knip) use remote cache, not just pnpm check. Also adds timeout-minutes: 30 as a safety guardrail. Evidence: PR #2068 changed 2 test files but caused 36/45 cache misses. With these fixes, the same change would cause ~6 misses (only the directly affected packages). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
PR Review Summary
(0) Total Issues | Risk: Low
🔴❗ Critical (0) ❗🔴
None.
🟠⚠️ Major (0) 🟠⚠️
None.
🟡 Minor (0) 🟡
None.
💭 Consider (0) 💭
None.
🕐 Pending Recommendations (0)
None — both items from the previous review have been addressed.
✅ APPROVE
Summary: Clean delta addressing all prior feedback. The two commits since last review implement exactly what was suggested: (1) timeout-minutes: 30 added to the CI job matching the Cypress workflow pattern, and (2) unused OTel devDependencies removed from agents-api/package.json. The additional turbo cache optimizations (test file exclusions from build inputs, transit dependency removal from lint) are correctly implemented — the devops reviewer verified these don't affect build correctness since test files are already excluded from tsdown bundling and no package defines a transit script. Ship it! 🚀
Discarded (4)
| Location | Issue | Reason Discarded |
|---|---|---|
.github/workflows/ci.yml:19 |
Previous feedback addressed verification | INFO-level confirmation, not an actionable issue |
.github/workflows/ci.yml:20-24 |
TURBO env vars hoisting verification | INFO-level confirmation, not an actionable issue |
turbo.json:31-34 |
Build input exclusions verification | INFO-level confirmation, not an actionable issue |
turbo.json:74-76 |
Lint transit dependency removal verification | INFO-level confirmation, not an actionable issue |
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-devops |
4 | 0 | 0 | 0 | 0 | 0 | 4 |
| Total | 4 | 0 | 0 | 0 | 0 | 0 | 4 |
Note: All devops findings were INFO-level verifications confirming the delta changes are correctly implemented. No issues identified.
CI Status NoteThe CI failure is caused by 4 pre-existing CORS test failures (
Turbo cache improvement signalOn the re-run, turbo cached 42 out of 45 tasks (93%), with "Run CI checks" completing in 6m42s. For comparison, PR #2068 (before these changes) cached only 7/45 tasks (15%) and took 17+ minutes for the same step. The full impact of the cache cascade fixes will be visible on subsequent PRs, where test-only file changes in |
Summary
Targeted CI performance improvements to reduce CI check times from 8-20 minutes to an expected 3-6 minutes for typical PRs.
Changes
1. Remove OpenTelemetry from test setup (biggest per-worker impact)
NodeSDKinitialization withgetNodeAutoInstrumentations()fromagents-api/src/__tests__/setup.ts@opentelemetry/exporter-trace-otlp-protoand@opentelemetry/sdk-metricsdevDependenciesvi.mock2. Tune Vitest worker threads (reduce CPU contention)
agents-api: maxThreads 10→8, minThreads 4→23. Upgrade CI runner (more resources)
runs-on: ubuntu-latest→runs-on: ubuntu-16gb(matches Cypress workflow)timeout-minutes: 30safety guardrail4. Fix turbo cache cascade invalidation (biggest systemic impact)
buildinputs:!**/*.test.*,!**/*.spec.*,!**/__tests__/**— prevents test file changes in core packages from cascading build hash invalidation to all 11+ downstream packages. Uses officially documented$TURBO_DEFAULT$+ negation glob pattern (turbo 1.12+).transitdependency fromlint:transitis a no-op coordination task (no package defines a transit script) but its hash changes on any file change, cascading to all downstream packages' lint tasks. Lint only reads local source files and doesn't need this ordering.pnpm check,pnpm knip) use remote cache — previously onlypnpm checkhad access.Evidence
Before (PR #2068 — changed 2 test files):
Expected after:
Test plan
agents-apitests pass (97 passed, 4 skipped, 4 pre-existing CORS failures from PR gate cors for local dev #2066)agents-coretests pass (84 passed, all green)turbo check --dryvalidates config (lint has no transit dependency, build excludes test files)turbo build --filter=@inkeep/agents-sdksucceeds)turbo lint --filter=@inkeep/agents-sdksucceeds)🤖 Generated with Claude Code