diff --git a/.github/workflows/chroma-issue-indexer.lock.yml b/.github/workflows/chroma-issue-indexer.lock.yml
index 6aff049917..5bbc5dfaa1 100644
--- a/.github/workflows/chroma-issue-indexer.lock.yml
+++ b/.github/workflows/chroma-issue-indexer.lock.yml
@@ -365,37 +365,45 @@ jobs:
 
           Index the 100 most recent issues from the repository into the Chroma vector database:
 
-          1. **Fetch Issues**:
-             - Use the GitHub MCP server tools to list the most recent 100 issues
-             - Include both open and closed issues
-             - Get issue number, title, body, state, created date, and author
+          1. **Create Chroma Collection First**:
+             - IMPORTANT: Check if the "issues" collection exists using `chroma_list_collections`
+             - If it doesn't exist, create it using `chroma_create_collection` with:
+               - Collection name: "issues"
+               - Use default embedding function (omit embedding_function_name parameter)
 
-          2. **Create/Update Chroma Collection**:
-             - Create a collection named "issues" if it doesn't exist (use `chroma_create_collection`)
-             - Use an appropriate embedding function for semantic search
+          2. **Fetch Issues Using GitHub MCP Tools** (NOT Python scripts):
+             - Use the `list_issues` tool from GitHub MCP server to fetch issues
+             - Fetch issues in batches of 5 at a time using the `perPage: 5` parameter
+             - Start with page 1, then page 2, page 3, etc. until you have 100 issues total
+             - Include both open and closed issues (omit state parameter to get both)
+             - Order by created date descending to get most recent first: `orderBy: "CREATED_AT"`, `direction: "DESC"`
+             - For each issue, extract: number, title, body, state, createdAt, author.login, url
 
-          3. **Index Issues**:
-             - For each issue, add it to the Chroma collection (use `chroma_add_documents`)
-             - Use ID format: `issue-{issue_number}`
-             - Document content should be: `{title}\n\n{body}` (title and body combined)
-             - Include metadata:
-               - `number`: Issue number
+          3. **Index Issues in Batches**:
+             - Process each batch of 5 issues immediately after fetching
+             - For each batch, use `chroma_add_documents` to add all 5 issues at once
+             - Use ID format: `issue-{issue_number}` (e.g., "issue-123")
+             - Document content: `{title}\n\n{body}` (combine title and body)
+             - If body is empty/null, use just the title as content
+             - Include metadata for each issue:
+               - `number`: Issue number (as string)
                - `title`: Issue title
                - `state`: Issue state (OPEN or CLOSED)
                - `author`: Issue author username
-               - `created_at`: Issue creation date
+               - `created_at`: Issue creation date (ISO 8601 format)
                - `url`: Issue URL
 
           4. **Report Progress**:
-             - Log how many issues were indexed
-             - Note any issues that couldn't be indexed (e.g., empty body)
-             - Report the total number of issues in the collection
+             - After processing all batches, use `chroma_get_collection_count` to get total issue count
+             - Report how many issues were successfully indexed
+             - Note any issues that couldn't be indexed (e.g., API errors)
 
           ## Important Notes
 
-          - Process issues in batches if needed to avoid rate limits
-          - Skip issues that have already been indexed (check if ID exists)
-          - For issues with empty bodies, use just the title as content
+          - **MUST use GitHub MCP tools** (`list_issues` tool), NOT Python scripts or `gh` CLI
+          - **MUST create collection first** before attempting to add documents
+          - Process exactly 5 issues per batch using `perPage: 5` and incrementing page number
+          - Skip duplicate issues (Chroma will update if ID exists)
           - The collection persists in `/tmp/gh-aw/cache-memory-chroma/` across runs
           - This helps other workflows search for similar issues using semantic search
 
diff --git a/.github/workflows/chroma-issue-indexer.md b/.github/workflows/chroma-issue-indexer.md
index 90bce7502b..ef62044c33 100644
--- a/.github/workflows/chroma-issue-indexer.md
+++ b/.github/workflows/chroma-issue-indexer.md
@@ -26,37 +26,45 @@ This workflow indexes issues from the repository into a Chroma vector database f
 
 Index the 100 most recent issues from the repository into the Chroma vector database:
 
-1. **Fetch Issues**:
-   - Use the GitHub MCP server tools to list the most recent 100 issues
-   - Include both open and closed issues
-   - Get issue number, title, body, state, created date, and author
-
-2. **Create/Update Chroma Collection**:
-   - Create a collection named "issues" if it doesn't exist (use `chroma_create_collection`)
-   - Use an appropriate embedding function for semantic search
-
-3. **Index Issues**:
-   - For each issue, add it to the Chroma collection (use `chroma_add_documents`)
-   - Use ID format: `issue-{issue_number}`
-   - Document content should be: `{title}\n\n{body}` (title and body combined)
-   - Include metadata:
-     - `number`: Issue number
+1. **Create Chroma Collection First**:
+   - IMPORTANT: Check if the "issues" collection exists using `chroma_list_collections`
+   - If it doesn't exist, create it using `chroma_create_collection` with:
+     - Collection name: "issues"
+     - Use default embedding function (omit embedding_function_name parameter)
+
+2. **Fetch Issues Using GitHub MCP Tools** (NOT Python scripts):
+   - Use the `list_issues` tool from GitHub MCP server to fetch issues
+   - Fetch issues in batches of 5 at a time using the `perPage: 5` parameter
+   - Start with page 1, then page 2, page 3, etc. until you have 100 issues total
+   - Include both open and closed issues (omit state parameter to get both)
+   - Order by created date descending to get most recent first: `orderBy: "CREATED_AT"`, `direction: "DESC"`
+   - For each issue, extract: number, title, body, state, createdAt, author.login, url
+
+3. **Index Issues in Batches**:
+   - Process each batch of 5 issues immediately after fetching
+   - For each batch, use `chroma_add_documents` to add all 5 issues at once
+   - Use ID format: `issue-{issue_number}` (e.g., "issue-123")
+   - Document content: `{title}\n\n{body}` (combine title and body)
+   - If body is empty/null, use just the title as content
+   - Include metadata for each issue:
+     - `number`: Issue number (as string)
      - `title`: Issue title
      - `state`: Issue state (OPEN or CLOSED)
      - `author`: Issue author username
-     - `created_at`: Issue creation date
+     - `created_at`: Issue creation date (ISO 8601 format)
      - `url`: Issue URL
 
 4. **Report Progress**:
-   - Log how many issues were indexed
-   - Note any issues that couldn't be indexed (e.g., empty body)
-   - Report the total number of issues in the collection
+   - After processing all batches, use `chroma_get_collection_count` to get total issue count
+   - Report how many issues were successfully indexed
+   - Note any issues that couldn't be indexed (e.g., API errors)
 
 ## Important Notes
 
-- Process issues in batches if needed to avoid rate limits
-- Skip issues that have already been indexed (check if ID exists)
-- For issues with empty bodies, use just the title as content
+- **MUST use GitHub MCP tools** (`list_issues` tool), NOT Python scripts or `gh` CLI
+- **MUST create collection first** before attempting to add documents
+- Process exactly 5 issues per batch using `perPage: 5` and incrementing page number
+- Skip duplicate issues (Chroma will update if ID exists)
 - The collection persists in `/tmp/gh-aw/cache-memory-chroma/` across runs
 - This helps other workflows search for similar issues using semantic search
 
diff --git a/specs/artifacts.md b/specs/artifacts.md
index 0f2beaa3f7..2b6fd30e62 100644
--- a/specs/artifacts.md
+++ b/specs/artifacts.md
@@ -24,13 +24,13 @@ This section provides an overview of artifacts organized by job name, with dupli
 
 - `agent-artifacts`
   - **Paths**: `/tmp/gh-aw/agent-stdio.log`, `/tmp/gh-aw/aw-prompts/prompt.txt`, `/tmp/gh-aw/aw.patch`, `/tmp/gh-aw/aw_info.json`, `/tmp/gh-aw/mcp-logs/`, `/tmp/gh-aw/safe-inputs/logs/`, `/tmp/gh-aw/sandbox/firewall/logs/`
-  - **Used in**: 71 workflow(s) - agent-performance-analyzer.md, agent-persona-explorer.md, agentic-campaign-generator.md, ai-moderator.md, archie.md, brave.md, breaking-change-checker.md, changeset.md, ci-coach.md, ci-doctor.md, cli-consistency-checker.md, cloclo.md, code-scanning-fixer.md, codex-github-remote-mcp-test.md, commit-changes-analyzer.md, copilot-pr-merged-report.md, copilot-pr-nlp-analysis.md, craft.md, daily-choice-test.md, daily-copilot-token-report.md, daily-fact.md, daily-file-diet.md, daily-issues-report.md, daily-news.md, daily-observability-report.md, daily-repo-chronicle.md, daily-team-status.md, deep-report.md, dependabot-go-checker.md, dev-hawk.md, dev.md, dictation-prompt.md, example-custom-error-patterns.md, example-permissions-warning.md, firewall.md, github-mcp-structural-analysis.md, glossary-maintainer.md, go-fan.md, go-pattern-detector.md, grumpy-reviewer.md, hourly-ci-cleaner.md, issue-classifier.md, issue-triage-agent.md, layout-spec-maintainer.md, mergefest.md, metrics-collector.md, notion-issue-summary.md, pdf-summary.md, plan.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, release.md, repo-audit-analyzer.md, repository-quality-improver.md, research.md, scout.md, security-compliance.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, test-create-pr-error-handling.md, tidy.md, typist.md, video-analyzer.md, weekly-issue-summary.md, workflow-generator.md, workflow-health-manager.md
+  - **Used in**: 72 workflow(s) - agent-performance-analyzer.md, agent-persona-explorer.md, agentic-campaign-generator.md, ai-moderator.md, archie.md, brave.md, breaking-change-checker.md, changeset.md, chroma-issue-indexer.md, ci-coach.md, ci-doctor.md, cli-consistency-checker.md, cloclo.md, code-scanning-fixer.md, codex-github-remote-mcp-test.md, commit-changes-analyzer.md, copilot-pr-merged-report.md, copilot-pr-nlp-analysis.md, craft.md, daily-choice-test.md, daily-copilot-token-report.md, daily-fact.md, daily-file-diet.md, daily-issues-report.md, daily-news.md, daily-observability-report.md, daily-repo-chronicle.md, daily-team-status.md, deep-report.md, dependabot-go-checker.md, dev-hawk.md, dev.md, dictation-prompt.md, example-custom-error-patterns.md, example-permissions-warning.md, firewall.md, github-mcp-structural-analysis.md, glossary-maintainer.md, go-fan.md, go-pattern-detector.md, grumpy-reviewer.md, hourly-ci-cleaner.md, issue-classifier.md, issue-triage-agent.md, layout-spec-maintainer.md, mergefest.md, metrics-collector.md, notion-issue-summary.md, pdf-summary.md, plan.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, release.md, repo-audit-analyzer.md, repository-quality-improver.md, research.md, scout.md, security-compliance.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, test-create-pr-error-handling.md, tidy.md, typist.md, video-analyzer.md, weekly-issue-summary.md, workflow-generator.md, workflow-health-manager.md
 - `agent-output`
   - **Paths**: `${{ env.GH_AW_AGENT_OUTPUT }}`
   - **Used in**: 65 workflow(s) - agent-performance-analyzer.md, agent-persona-explorer.md, agentic-campaign-generator.md, ai-moderator.md, archie.md, brave.md, breaking-change-checker.md, changeset.md, ci-coach.md, ci-doctor.md, cli-consistency-checker.md, cloclo.md, code-scanning-fixer.md, commit-changes-analyzer.md, copilot-pr-merged-report.md, copilot-pr-nlp-analysis.md, craft.md, daily-choice-test.md, daily-copilot-token-report.md, daily-fact.md, daily-file-diet.md, daily-issues-report.md, daily-news.md, daily-observability-report.md, daily-repo-chronicle.md, daily-team-status.md, deep-report.md, dependabot-go-checker.md, dev-hawk.md, dictation-prompt.md, github-mcp-structural-analysis.md, glossary-maintainer.md, go-fan.md, go-pattern-detector.md, grumpy-reviewer.md, hourly-ci-cleaner.md, issue-classifier.md, issue-triage-agent.md, layout-spec-maintainer.md, mergefest.md, notion-issue-summary.md, pdf-summary.md, plan.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, release.md, repo-audit-analyzer.md, repository-quality-improver.md, research.md, scout.md, security-compliance.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, test-create-pr-error-handling.md, tidy.md, typist.md, video-analyzer.md, weekly-issue-summary.md, workflow-generator.md, workflow-health-manager.md
 - `agent_outputs`
   - **Paths**: `/tmp/gh-aw/mcp-config/logs/`, `/tmp/gh-aw/redacted-urls.log`, `/tmp/gh-aw/sandbox/agent/logs/`
-  - **Used in**: 60 workflow(s) - agent-performance-analyzer.md, agent-persona-explorer.md, ai-moderator.md, archie.md, brave.md, breaking-change-checker.md, changeset.md, ci-coach.md, ci-doctor.md, cli-consistency-checker.md, code-scanning-fixer.md, codex-github-remote-mcp-test.md, copilot-pr-merged-report.md, copilot-pr-nlp-analysis.md, craft.md, daily-copilot-token-report.md, daily-fact.md, daily-file-diet.md, daily-issues-report.md, daily-news.md, daily-observability-report.md, daily-repo-chronicle.md, daily-team-status.md, deep-report.md, dependabot-go-checker.md, dev-hawk.md, dev.md, dictation-prompt.md, example-custom-error-patterns.md, example-permissions-warning.md, firewall.md, glossary-maintainer.md, grumpy-reviewer.md, hourly-ci-cleaner.md, issue-triage-agent.md, layout-spec-maintainer.md, mergefest.md, metrics-collector.md, notion-issue-summary.md, pdf-summary.md, plan.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, release.md, repo-audit-analyzer.md, repository-quality-improver.md, research.md, security-compliance.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, tidy.md, video-analyzer.md, weekly-issue-summary.md, workflow-generator.md, workflow-health-manager.md
+  - **Used in**: 61 workflow(s) - agent-performance-analyzer.md, agent-persona-explorer.md, ai-moderator.md, archie.md, brave.md, breaking-change-checker.md, changeset.md, chroma-issue-indexer.md, ci-coach.md, ci-doctor.md, cli-consistency-checker.md, code-scanning-fixer.md, codex-github-remote-mcp-test.md, copilot-pr-merged-report.md, copilot-pr-nlp-analysis.md, craft.md, daily-copilot-token-report.md, daily-fact.md, daily-file-diet.md, daily-issues-report.md, daily-news.md, daily-observability-report.md, daily-repo-chronicle.md, daily-team-status.md, deep-report.md, dependabot-go-checker.md, dev-hawk.md, dev.md, dictation-prompt.md, example-custom-error-patterns.md, example-permissions-warning.md, firewall.md, glossary-maintainer.md, grumpy-reviewer.md, hourly-ci-cleaner.md, issue-triage-agent.md, layout-spec-maintainer.md, mergefest.md, metrics-collector.md, notion-issue-summary.md, pdf-summary.md, plan.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, release.md, repo-audit-analyzer.md, repository-quality-improver.md, research.md, security-compliance.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, tidy.md, video-analyzer.md, weekly-issue-summary.md, workflow-generator.md, workflow-health-manager.md
 - `cache-memory`
   - **Paths**: `/tmp/gh-aw/cache-memory`
   - **Used in**: 28 workflow(s) - agent-persona-explorer.md, ci-coach.md, ci-doctor.md, cloclo.md, code-scanning-fixer.md, copilot-pr-nlp-analysis.md, daily-copilot-token-report.md, daily-issues-report.md, daily-news.md, daily-repo-chronicle.md, deep-report.md, github-mcp-structural-analysis.md, glossary-maintainer.md, go-fan.md, grumpy-reviewer.md, pdf-summary.md, poem-bot.md, pr-nitpick-reviewer.md, python-data-charts.md, q.md, scout.md, security-review.md, slide-deck-maintainer.md, stale-repo-identifier.md, super-linter.md, technical-doc-writer.md, test-create-pr-error-handling.md, weekly-issue-summary.md
@@ -671,6 +671,25 @@ This section provides an overview of artifacts organized by job name, with dupli
   - **Download path**: `/tmp/gh-aw/`
   - **Depends on jobs**: [activation agent detection]
 
+### chroma-issue-indexer.md
+
+#### Job: `agent`
+
+**Uploads:**
+
+- **Artifact**: `agent_outputs`
+  - **Upload paths**:
+    - `/tmp/gh-aw/sandbox/agent/logs/`
+    - `/tmp/gh-aw/redacted-urls.log`
+
+- **Artifact**: `agent-artifacts`
+  - **Upload paths**:
+    - `/tmp/gh-aw/aw-prompts/prompt.txt`
+    - `/tmp/gh-aw/aw_info.json`
+    - `/tmp/gh-aw/mcp-logs/`
+    - `/tmp/gh-aw/sandbox/firewall/logs/`
+    - `/tmp/gh-aw/agent-stdio.log`
+
 ### ci-coach.md
 
 #### Job: `agent`