[refactor] Semantic Function Clustering Analysis - Code Organization Opportunities #6225

This analysis examined **286 non-test Go files** across the repository (82,301 lines of code), cataloging function names, signatures, and organizational patterns to identify refactoring opportunities through semantic clustering and duplicate detection.

**Key Statistics:**
- **Total Go files analyzed**: 286 (161 in pkg/workflow, 89 in pkg/cli, 14 in pkg/parser, 22 utility files)
- **Total lines of production code**: ~82,301 lines
- **Files >1000 lines**: 16 files requiring attention
- **Validation files scattered**: 17 separate validation files in pkg/workflow alone
- **Duplicate patterns identified**: Token handling (4 similar functions), Upload generation (7 similar functions), Package collection (3 duplicates)

**Major Findings:**
- Large files mixing multiple responsibilities (compiler_yaml.go: 1,446 lines with YAML + prompts + uploads)
- Validation logic scattered across 17+ files instead of centralized
- Token handling functions with nearly identical implementations (4 variants)
- Upload artifact generation duplicated 7 times with minimal variations
- Safe outputs system fragmented across 8+ files

<details>
<summary><b>Full Analysis Report</b></summary>

## Executive Summary

This semantic function clustering analysis identified significant refactoring opportunities across three major packages (workflow, cli, parser) to improve code organization, reduce duplication, and enhance maintainability. The analysis focused on identifying functions in wrong files (outliers), duplicate implementations, and opportunities for better modularization.

**Repository Structure:**
- **pkg/workflow**: 161 files, 43,633 lines - Core workflow compilation engine
- **pkg/cli**: 89 files, 29,033 lines - Command-line interface
- **pkg/parser**: 14 files, 5,897 lines - Configuration parsing
- **Utilities**: 22 files, ~3,700 lines - Supporting packages

---

## Package 1: pkg/workflow (161 files, 43,633 lines)

### Large Files Requiring Attention

| File | Lines | Primary Issues |
|------|-------|----------------|
| compiler_yaml.go | 1,446 | Mixed YAML generation, prompt generation, and upload steps |
| compiler_jobs.go | 1,415 | Job building with helper predicates mixed |
| copilot_engine.go | 1,369 | Could extract MCP rendering logic |
| frontmatter_extraction.go | 1,047 | 22 extraction functions - focused but large |
| safe_outputs_config.go | 1,024 | Config parsing + generation + formatting mixed |
| runtime_setup.go | 982 | Detection + generation + deduplication mixed |
| mcp-config.go | 982 | Configuration + validation + parsing mixed |

---

### Issue 1.1: Token Handling Duplication (HIGH PRIORITY)

**Location**: `pkg/workflow/safe_outputs_env_helpers.go`

**Four nearly identical functions for GitHub token precedence:**

```go
// Line 32
func (c *Compiler) addSafeOutputGitHubToken(steps *[]string, data *WorkflowData)

// Line 43
func (c *Compiler) addSafeOutputGitHubTokenForConfig(steps *[]string, data *WorkflowData, configToken string)

// Line 62 
func (c *Compiler) addSafeOutputCopilotGitHubTokenForConfig(steps *[]string, data *WorkflowData, configToken string)

// Line 82
func (c *Compiler) addSafeOutputAgentGitHubTokenForConfig(steps *[]string, data *WorkflowData, configToken string)
```

**Problem**: All four follow nearly identical token precedence logic (check custom token → check copilot token → fall back to default) with only minor variations for copilot vs agent contexts.

**Recommendation**: Consolidate using a configuration-based approach:
```go
type TokenContext struct {
    ConfigToken  string
    TokenType    TokenType // Generic, Copilot, Agent
    DefaultToken string
}

func (c *Compiler) addSafeOutputGitHubTokenWithContext(
    steps *[]string, 
    data *WorkflowData, 
    context TokenContext,
)
```

**Impact**: 
- Reduce ~80 lines of duplicate code
- Single source of truth for token precedence logic
- Easier to test and modify precedence rules
- Estimated effort: 2-3 hours
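
**Illustrative sketch**: One way the consolidated helper could look, assuming the precedence described above (explicit config token, then a context-specific token, then the default). The `TokenType` constants, the `contextTokens` map with its placeholder secret expressions, and the `resolveToken` name are assumptions made for illustration and are not taken from the repository. The `Compiler` receiver and `WorkflowData` parameter are omitted to keep the example self-contained.

```go
package main

import "fmt"

// TokenType selects the context-specific fallback (hypothetical enum).
type TokenType int

const (
	TokenGeneric TokenType = iota
	TokenCopilot
	TokenAgent
)

// TokenContext mirrors the struct proposed in the recommendation.
type TokenContext struct {
	ConfigToken  string
	TokenType    TokenType
	DefaultToken string
}

// contextTokens holds placeholder expressions, NOT the repository's real secret names.
var contextTokens = map[TokenType]string{
	TokenCopilot: "${{ secrets.EXAMPLE_COPILOT_TOKEN }}",
	TokenAgent:   "${{ secrets.EXAMPLE_AGENT_TOKEN }}",
}

// resolveToken applies a single precedence rule:
// explicit config token, then context-specific token, then default.
func resolveToken(ctx TokenContext) string {
	if ctx.ConfigToken != "" {
		return ctx.ConfigToken
	}
	if tok, ok := contextTokens[ctx.TokenType]; ok {
		return tok
	}
	return ctx.DefaultToken
}

// addGitHubTokenWithContext appends the resolved token line to the step list.
func addGitHubTokenWithContext(steps *[]string, ctx TokenContext) {
	*steps = append(*steps, fmt.Sprintf("          github-token: %s\n", resolveToken(ctx)))
}

func main() {
	var steps []string
	addGitHubTokenWithContext(&steps, TokenContext{
		TokenType:    TokenCopilot,
		DefaultToken: "${{ secrets.GITHUB_TOKEN }}",
	})
	fmt.Print(steps[0])
}
```

Keeping the precedence in one pure function (`resolveToken`) means it can be covered by a table-driven test without building a full workflow.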

---

### Issue 1.2: Upload Artifact Generation Duplication (HIGH PRIORITY)

**Location**: `pkg/workflow/compiler_yaml.go`

**Seven similar functions generating artifact upload steps (lines 477-665):**

```go
func (c *Compiler) generateUploadAgentLogs(yaml *strings.Builder, logFileFull string)       // Line 477
func (c *Compiler) generateUploadAssets(yaml *strings.Builder)                                // Line 490
func (c *Compiler) generateUploadAwInfo(yaml *strings.Builder)                                // Line 618
func (c *Compiler) generateUploadPrompt(yaml *strings.Builder)                                // Line 631
func (c *Compiler) generateUploadAccessLogs(yaml *strings.Builder, tools map[string]any)      // Line 648
func (c *Compiler) generateUploadMCPLogs(yaml *strings.Builder)                               // Line 652
func (c *Compiler) generateUploadSafeInputsLogs(yaml *strings.Builder)                        // Line 665
```

**Pattern**: All follow identical structure:
```yaml
- name: Upload [X]
  if: [condition]
  uses: actions/upload-artifact@[PIN]
  with:
    name: [artifact-name]
    path: [artifact-path]
    retention-days: [days]
```

**Recommendation**: Create a unified helper function:
```go
type UploadArtifactConfig struct {
    Name           string
    Path           string
    Condition      string
    RetentionDays  int
}

func (c *Compiler) generateUploadArtifactStep(
    yaml *strings.Builder, 
    config UploadArtifactConfig,
)
```

**Impact**:
- Reduce ~100-120 lines of duplicate code
- Easier to update upload-artifact action versions
- Consistent upload configuration across all artifacts
- Estimated effort: 1-2 hours

---

### Issue 1.3: Validation Functions Scattered (MEDIUM PRIORITY)

**Current Distribution** - 17 validation files in pkg/workflow:

```
pkg/workflow/
├── agent_validation.go
├── bundler_validation.go
├── docker_validation.go
├── engine_validation.go
├── expression_validation.go
├── github_toolset_validation_error.go
├── mcp_config_validation.go
├── npm_validation.go
├── pip_validation.go
├── repository_features_validation.go
├── runtime_validation.go
├── safe_output_validation_config.go
├── schema_validation.go
├── step_order_validation.go
├── strict_mode_validation.go
├── template_validation.go
└── validation.go
```

**Problem**: Validation logic for different domains is scattered across the main workflow directory, making it hard to understand validation boundaries and maintain consistent validation patterns.

**Recommendation**: Create `pkg/workflow/validation/` subdirectory:

```
pkg/workflow/validation/
├── agent.go         (from agent_validation.go)
├── bundler.go       (from bundler_validation.go)
├── docker.go        (from docker_validation.go)
├── engine.go        (from engine_validation.go)
├── expression.go    (from expression_validation.go)
├── mcp_config.go    (from mcp_config_validation.go)
├── npm.go           (from npm_validation.go)
├── permissions.go   (extracted from permissions.go)
├── pip.go           (from pip_validation.go)
├── repository.go    (from repository_features_validation.go)
├── runtime.go       (from runtime_validation.go)
├── safe_outputs.go  (from safe_output_validation_config.go)
├── schema.go        (from schema_validation.go)
├── step_order.go    (from step_order_validation.go)
├── strict_mode.go   (from strict_mode_validation.go)
├── template.go      (from template_validation.go)
└── validation.go    (from validation.go - core types)
```

**Impact**:
- Clear module boundaries for validation logic
- Easier to locate and maintain validation rules
- Natural import path (`workflow/validation`)
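- Estimated effort: 3-4 hours if the work is mostly file moves; more if validators are methods on `Compiler` or use unexported identifiers, because a Go subdirectory is a separate package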

---

### Issue 1.4: compiler_yaml.go Mixed Responsibilities (MEDIUM PRIORITY)

**File**: `pkg/workflow/compiler_yaml.go` (1,446 lines)

**Functions**: 29 functions covering multiple concerns:
- YAML generation: `generateYAML`, `generateMainJobSteps`, `generatePostSteps`
- Prompt generation: `generatePrompt`, `generatePromptStep`, `generateEngineSpecificPromptStep` (5+ functions)
- Upload step generation: 7 `generateUpload*` functions
- Pattern conversion: `convertGoPatternToJavaScript`, `convertErrorPatternsToJavaScript`
- Helper utilities: `splitContentIntoChunks`, `generatePlaceholderSubstitutionStep`

**Problem**: Single file handles YAML orchestration, prompt generation, upload steps, pattern conversion, and utilities - too many distinct concerns.

**Recommendation**: Split into focused files:

```
pkg/workflow/
├── compiler_yaml.go                (~400 lines - main YAML orchestration)
├── compiler_yaml_prompts.go        (~300 lines - all prompt generation)
├── compiler_yaml_uploads.go        (~200 lines - all upload steps)
├── compiler_yaml_patterns.go       (~200 lines - pattern conversion utilities)
└── compiler_yaml_steps.go          (~300 lines - step generation helpers)
```

**Impact**:
- Five focused files (~200-400 lines each) vs one 1,446-line file
- Clearer separation of concerns
- Easier to test individual components
- Estimated effort: 4-5 hours

---

### Issue 1.5: Safe Outputs System Fragmentation (MEDIUM PRIORITY)

**Problem**: Safe outputs logic is scattered across 8+ files in the main workflow directory:

**Current files:**
```
pkg/workflow/
├── safe_outputs.go                     (core types)
├── safe_output_builder.go             (builder pattern)
├── safe_output_validation_config.go   (validation config)
├── safe_outputs_app.go                (app integration)
├── safe_outputs_config.go             (configuration - 1,024 lines!)
├── safe_outputs_env_helpers.go        (environment variable helpers)
├── safe_outputs_jobs.go               (job generation)
├── safe_outputs_steps.go              (step generation)
└── safe_inputs.go                     (related safe inputs system)
```

**Recommendation**: Create `pkg/workflow/safeoutputs/` subdirectory:

```
pkg/workflow/safeoutputs/
├── outputs.go           (from safe_outputs.go - core types)
├── builder.go           (from safe_output_builder.go)
├── config.go            (from safe_outputs_config.go)
├── validation.go        (from safe_output_validation_config.go)
├── app.go               (from safe_outputs_app.go)
├── env_helpers.go       (from safe_outputs_env_helpers.go)
├── jobs.go              (from safe_outputs_jobs.go)
├── steps.go             (from safe_outputs_steps.go)
└── inputs.go            (from safe_inputs.go - separate or related?)
```

**Impact**:
- Modularizes safe outputs system with clear boundary
- Natural import path (`workflow/safeoutputs`)
- Easier to understand safe outputs architecture
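- Estimated effort: 3-4 hours (mostly file moves + import updates); more where files define methods on `Compiler` (as `safe_outputs_env_helpers.go` does, see Issue 1.1), since Go does not allow methods to be declared outside the type's own package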

---

### Issue 1.6: Package Collection Pattern Duplication (LOW PRIORITY)

**Location**: `pkg/workflow/dependabot.go`

**Three functions with identical structure for different package managers:**

```go
func (c *Compiler) collectNpmDependencies(...) ([]npmPackage, error)    // Line ~100
func (c *Compiler) collectPipDependencies(...) ([]pipPackage, error)    // Line ~250
func (c *Compiler) collectGoDependencies(...) ([]goPackage, error)      // Line ~400
```

**Pattern**: All follow identical logic:
1. Iterate through actions in lock file
2. Extract package references from action specifications
3. Parse package strings
4. Deduplicate package list
5. Return typed package list

**Recommendation**: Use generics or an interface-based approach:

```go
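type PackageCollector[T Package] interface {
    ParsePackage(spec string) (T, error)
    PackageType() string
}

// Go does not allow type parameters on methods, so collectPackages
// is a package-level function that takes the compiler explicitly.
func collectPackages[T Package](
    c *Compiler,
    collector PackageCollector[T],
    actions []Action,
) ([]T, error)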
```

**Impact**:
- Reduce ~150 lines of duplicate logic
- Easier to add new package managers
- Consistent package collection behavior
- Estimated effort: 3-4 hours
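
**Illustrative sketch**: A runnable version of the parse-and-deduplicate steps listed above. Go does not permit type parameters on methods, so the sketch uses a package-level generic function. The `Package` constraint with its `Key()` method, the `npmPackage` fields, and `npmCollector` are assumptions for illustration. The input is simplified to a slice of spec strings rather than lock-file actions.

```go
package main

import (
	"fmt"
	"strings"
)

// Package is an assumed constraint: anything with a stable identity.
type Package interface {
	Key() string
}

// PackageCollector parses one package spec into a typed package.
type PackageCollector[T Package] interface {
	ParsePackage(spec string) (T, error)
}

// collectPackages parses every spec, drops duplicates by Key, and
// preserves first-seen order.
func collectPackages[T Package](collector PackageCollector[T], specs []string) ([]T, error) {
	seen := make(map[string]struct{})
	var out []T
	for _, spec := range specs {
		pkg, err := collector.ParsePackage(spec)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", spec, err)
		}
		if _, dup := seen[pkg.Key()]; dup {
			continue
		}
		seen[pkg.Key()] = struct{}{}
		out = append(out, pkg)
	}
	return out, nil
}

// npmPackage and npmCollector are illustrative stand-ins.
type npmPackage struct{ Name, Version string }

func (p npmPackage) Key() string { return p.Name + "@" + p.Version }

type npmCollector struct{}

// ParsePackage splits "name@version" at the last "@" so scoped names
// such as "@scope/pkg@1.0.0" keep their leading "@".
func (npmCollector) ParsePackage(spec string) (npmPackage, error) {
	i := strings.LastIndex(spec, "@")
	if i <= 0 {
		return npmPackage{}, fmt.Errorf("missing version in %q", spec)
	}
	return npmPackage{Name: spec[:i], Version: spec[i+1:]}, nil
}

func main() {
	specs := []string{"left-pad@1.3.0", "@scope/pkg@2.0.0", "left-pad@1.3.0"}
	pkgs, err := collectPackages[npmPackage](npmCollector{}, specs)
	fmt.Println(pkgs, err)
}
```

Adding a new package manager then means writing one small collector type; the iteration, error wrapping, and deduplication stay in one place.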

---

### Issue 1.7: Validation Files Should Use Subdirectory (BEST PRACTICE)

**Current State**: 17 validation files (most named `*_validation.go`) mixed with other workflow files in pkg/workflow/

**Best Practice Example**: The `expression_*` files show good organization:
- `expression_parser.go` - Parsing
- `expression_builder.go` - Building
- `expression_extraction.go` - Extraction
- `expression_nodes.go` - AST nodes
- `expression_validation.go` - Validation

**Recommendation**: Give validation an equally clear boundary by moving it to a subdirectory (already covered in Issue 1.3)

---

## Package 2: pkg/cli (89 files, 29,033 lines)

### Large Command Files

| File | Lines | Primary Issues |
|------|-------|----------------|
| compile_command.go | 1,474 | Compilation + watching + security tools + PR creation |
| logs.go | 1,338 | Download + parsing + analysis + rendering mixed |
| update_command.go | 1,331 | Extension updates + workflow updates + PR creation |
| mcp_inspect.go | 948 | MCP inspection + display logic |
| trial_command.go | 944 | Trial execution + git operations + result collection |
| add_command.go | 904 | Adding workflows + compilation mixed |

---

### Issue 2.1: logs.go Mixed Responsibilities (HIGH PRIORITY)

**File**: `pkg/cli/logs.go` (1,338 lines)

**Multiple Responsibilities**:
- Command creation and flag parsing
- Job status fetching from GitHub API
- Log downloading and aggregation (concurrent downloads)
- Log parsing and analysis (engine-specific)
- Error detection and reporting
- MCP tool usage analysis
- Firewall log analysis
- Metrics calculation
- Output formatting (JSON + console)
- Cache management

**Current Support Files**: logs_parsing.go, logs_metrics.go, logs_report.go, logs_download.go, logs_cache.go, logs_models.go

**Recommendation**: The support files already exist but logs.go still mixes too many concerns. Further split logs.go:

```
pkg/cli/
├── logs_command.go          (~300 lines - command setup and orchestration)
├── logs_download.go         (exists - artifact downloading)
├── logs_parsing.go          (exists - log parsing)
├── logs_analysis.go         (~400 lines - NEW - extract analysis logic from logs.go)
├── logs_metrics.go          (exists - metrics calculation)
├── logs_report.go           (exists - report generation)
├── logs_cache.go            (exists - caching)
└── logs_models.go           (exists - data models)
```

**Impact**:
- Complete separation of concerns for logs feature
- logs_command.go becomes thin orchestrator
- Each file has single, focused responsibility
- Estimated effort: 4-6 hours

---

### Issue 2.2: compile_command.go Multiple Tools Integration (MEDIUM PRIORITY)

**File**: `pkg/cli/compile_command.go` (1,474 lines)

**Functions**: Handles compilation plus integration with:
- File watching and recompilation
- Security scanning (zizmor, poutine, actionlint)
- Action SHA validation
- YAML validation
- JSON schema generation
- PR creation
- Dependabot configuration

**Recommendation**: Extract security tool integrations:

```
pkg/cli/
├── compile_command.go           (~800 lines - core compilation)
├── compile_watch.go             (~200 lines - file watching)
├── compile_security.go          (~300 lines - zizmor, poutine, actionlint)
└── compile_validation.go        (~200 lines - YAML and SHA validation)
```

**Impact**:
- Clearer responsibility boundaries
- Security tools can be tested independently
- Easier to add new security integrations
- Estimated effort: 5-6 hours

---

### Issue 2.3: Shared Flag Parsing Pattern (LOW PRIORITY - INFORMATIONAL)

**Pattern**: Flag parsing repeated 69 times across 12 command files:

```go
repoSpec, _ := cmd.Flags().GetString("repo")
format, _ := cmd.Flags().GetString("format")
verbose, _ := cmd.Flags().GetBool("verbose")
```

**Observation**: This is acceptable for Cobra-based CLIs and doesn't require refactoring. Each command has unique flags and the pattern is clear and consistent.

**No Action Recommended** - This is idiomatic Cobra usage.

---

### Issue 2.4: MCP Commands Well-Organized (GOOD EXAMPLE ✓)

**MCP command files demonstrate excellent organization:**

```
pkg/cli/
├── mcp.go                    (main command)
├── mcp_add.go               (add subcommand)
├── mcp_inspect.go           (inspect subcommand)
├── mcp_inspect_mcp.go       (inspect MCP-specific logic)
├── mcp_list.go              (list subcommand)
├── mcp_list_tools.go        (list tools helper)
├── mcp_server.go            (server management)
├── mcp_gateway.go           (gateway configuration)
├── mcp_registry.go          (registry operations)
├── mcp_logs_guardrail.go    (log analysis)
└── mcp_validation.go        (validation)
```

**Best Practice**: Clear subcommand structure with focused helper files. Use this pattern for other complex commands!

**No Action Needed** - This is exemplary organization.

---

## Package 3: pkg/parser (14 files, 5,897 lines)

### Large Files

| File | Lines | Issues |
|------|-------|--------|
| frontmatter.go | 1,283 | Mixed imports, includes, extraction, merging |
| schema.go | 1,156 | Mixed validation, suggestions, compilation |
| mcp.go | 713 | MCP parsing + validation combined |

---

### Issue 3.1: frontmatter.go Multiple Concerns (HIGH PRIORITY)

**File**: `pkg/parser/frontmatter.go` (1,283 lines)

**Functions**: 30+ functions covering:
- Import directive parsing: `ParseImportDirective()` (~34 lines)
- Import processing: `ProcessImportsFromFrontmatter()` + 3 variants (~500 lines)
- Include expansion: `ExpandIncludes()` + 3 variants (~100 lines)
- Include processing: `ProcessIncludes()` + variants (~150 lines)
- Field extraction: 12 `extract*FromContent()` functions (~200 lines)
- Content merging: `MergeTools()`, include processing (~150 lines)

**Recommendation**: Split by functional domain:

```
pkg/parser/
├── frontmatter.go                (~100 lines - core ParseImportDirective + types)
├── frontmatter_imports.go        (~350 lines - all ProcessImports* functions)
├── frontmatter_includes.go       (~250 lines - ExpandIncludes* and ProcessIncludes*)
├── frontmatter_extract.go        (~200 lines - all extract*FromContent functions)
└── frontmatter_merge.go          (~150 lines - MergeTools and merging logic)
```

**Impact**:
- Five focused files (~100-350 lines each) vs one 1,283-line file
- Clear separation: imports vs includes vs extraction vs merging
- Easier to test each domain independently
- Estimated effort: 6-8 hours

---

### Issue 3.2: schema.go Multiple Concerns (HIGH PRIORITY)

**File**: `pkg/parser/schema.go` (1,156 lines)

**Functions**: 30+ functions covering:
- Schema compilation/caching: `getCompiledMainWorkflowSchema()` etc. (~40 lines)
- Validation orchestration: 8 `Validate*` functions (~400 lines)
- Custom rule validation: `validateCommandTriggerConflicts()`, `validateEngineSpecificRules()` (~100 lines)
- Schema suggestion generation: `generateSchemaBasedSuggestions()`, navigation, examples (~200 lines)
- Deprecated field handling: `GetMainWorkflowDeprecatedFields()`, `FindDeprecatedFieldsInFrontmatter()` (~100 lines)
- Utility functions: `LevenshteinDistance()`, `removeDuplicates()`, `min()` (~100 lines)

**Recommendation**: Split by functional domain:

```
pkg/parser/
├── schema.go                     (~150 lines - public validation API + types)
├── schema_cache.go               (~100 lines - schema compilation and caching)
├── schema_validate.go            (~400 lines - validation orchestration + custom rules)
├── schema_suggestions.go         (~250 lines - error suggestions and schema navigation)
├── schema_deprecated.go          (~100 lines - deprecated field handling)
└── schema_utils.go               (~100 lines - utilities OR move to pkg/util/)
```

**Alternative for utilities**: Extract to `pkg/util/`:
- `LevenshteinDistance()` → `pkg/util/strings.go`
- `removeDuplicates()` → `pkg/util/slices.go`
- `min()`, `max()` → `pkg/util/math.go`

**Impact**:
- Five or six focused files (~100-400 lines each, depending on whether utilities move to `pkg/util/`) vs one 1,156-line file
- Reusable utilities available across packages
- Clearer separation: validation vs suggestions vs deprecated fields
- Estimated effort: 6-8 hours

---

### Issue 3.3: Extract Generic Utilities to pkg/util (LOW PRIORITY)

**Current Location**: `pkg/parser/schema.go`

**Generic utilities that should be reusable:**
```go
func LevenshteinDistance(a, b string) int          // String algorithm
func removeDuplicates(slice []string) []string     // Slice utility
func min(a, b int) int                             // Math utility
```

Also found in other packages:
- `pkg/parser/ansi_strip.go`: `StripANSI()` - could be in `pkg/util/strings.go`

**Recommendation**: Create `pkg/util/` package:

```
pkg/util/
├── strings.go      (LevenshteinDistance, StripANSI)
├── slices.go       (removeDuplicates, generic slice helpers)
└── math.go         (min, max helpers)
```

**Impact**:
- Reusable utilities across all packages
- Consistent utility implementations
- Clear location for shared helper functions
- Estimated effort: 2-3 hours
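
**Illustrative sketch**: What the extracted helpers might look like in a shared package. These are generic textbook implementations, not copies of the code in `schema.go`; in particular, this `LevenshteinDistance` counts runes, which may differ from the existing function. `RemoveDuplicates` is written with a type parameter so it is not limited to `[]string`. Since Go 1.21, `min` and `max` are built-in functions, so a separate `math.go` may not be needed if the module targets that version or later.

```go
package main

import "fmt"

// LevenshteinDistance returns the edit distance between a and b,
// counted in runes, using two rolling rows of the DP table.
func LevenshteinDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j] + 1 // deletion
			if ins := curr[j-1] + 1; ins < best {
				best = ins // insertion
			}
			if sub := prev[j-1] + cost; sub < best {
				best = sub // substitution or match
			}
			curr[j] = best
		}
		prev = curr
	}
	return prev[len(rb)]
}

// RemoveDuplicates drops later duplicates and preserves first-seen order.
func RemoveDuplicates[T comparable](items []T) []T {
	seen := make(map[T]struct{}, len(items))
	out := make([]T, 0, len(items))
	for _, it := range items {
		if _, ok := seen[it]; ok {
			continue
		}
		seen[it] = struct{}{}
		out = append(out, it)
	}
	return out
}

func main() {
	fmt.Println(LevenshteinDistance("kitten", "sitting"))
	fmt.Println(RemoveDuplicates([]string{"a", "b", "a", "c", "b"}))
}
```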

---

## Semantic Function Clustering Analysis

### Function Naming Patterns Across Packages

| Pattern | pkg/workflow | pkg/cli | pkg/parser | Purpose |
|---------|--------------|---------|------------|---------|
| `build*` | 33 | 2 | 0 | Construct structures, AST nodes |
| `generate*` | 48 | 8 | 3 | Create/produce output structures |
| `parse*` | 12 | 6 | 5 | Interpret and structure input |
| `extract*` | 26 | 3 | 12 | Retrieve data from structures |
| `validate*` | 15 | 8 | 8 | Verify correctness |
| `render*` | 15 | 10 | 0 | Transform to output format |
| `convert*` | 10 | 2 | 0 | Transform between formats |
| `collect*` | 12 | 2 | 0 | Gather items from sources |
| `New*` | 45 | 12 | 4 | Constructors |

**Observations:**
- Consistent naming conventions across packages
- Clear verb-noun structure for function names
- Domain-specific verb preferences (workflow: generate/build, cli: render, parser: parse/extract)
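
**Illustrative sketch**: The counts above were gathered with grep/awk pattern matching (see Analysis Metadata). As an alternative that is not part of the original analysis, the same kind of tally can be produced from the syntax tree with the standard `go/parser` package, which avoids false matches in comments and strings. The `countPrefixes` name and the sample source are made up for the example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// countPrefixes parses Go source and counts top-level function and method
// names by prefix. Each function is counted under the first matching prefix.
func countPrefixes(src string, prefixes []string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue // skip type, var, const, and import declarations
		}
		for _, p := range prefixes {
			if strings.HasPrefix(fn.Name.Name, p) {
				counts[p]++
				break
			}
		}
	}
	return counts, nil
}

func main() {
	src := `package demo

func generateYAML()   {}
func generatePrompt() {}
func validateEngine() {}
func NewCompiler()    {}
`
	counts, err := countPrefixes(src, []string{"generate", "validate", "New"})
	fmt.Println(counts, err)
}
```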

---

### Validation Functions - Scattered Pattern

**Total validation-related files: 25+** across repository

**pkg/workflow**: 17 validation files (Issue 1.3)
**pkg/cli**: validation functions embedded in command files
**pkg/parser**: validation in schema.go (Issue 3.2)

**Pattern**: Validation logic is distributed but could benefit from consolidation in each package.

**Recommendation**: Already covered in package-specific issues above.

---

## Priority Refactoring Recommendations

### Priority 1: High-Impact Quick Wins (1-2 Weeks)

**Estimated Total Effort: 15-21 hours**

1. **✅ Consolidate Token Handling** (pkg/workflow/safe_outputs_env_helpers.go)
   - 4 similar functions → 1 configurable function
   - Lines saved: ~80
   - Effort: 2-3 hours
   - **Impact: HIGH** - Single source of truth for token logic

2. **✅ Consolidate Upload Artifact Generation** (pkg/workflow/compiler_yaml.go)
   - 7 similar functions → 1 configurable helper
   - Lines saved: ~100-120
   - Effort: 1-2 hours
   - **Impact: HIGH** - Easier action version updates

3. **✅ Split frontmatter.go** (pkg/parser/)
   - 1,283 lines → 5 focused files (100-350 lines each)
   - Effort: 6-8 hours
   - **Impact: HIGH** - Clearer import vs include vs merge separation

4. **✅ Split schema.go** (pkg/parser/)
   - 1,156 lines → 5-6 focused files (100-400 lines each)
   - Effort: 6-8 hours
   - **Impact: HIGH** - Clearer validation vs suggestion vs deprecated separation

---

### Priority 2: Structural Improvements (2-4 Weeks)

**Estimated Total Effort: 19-25 hours**

5. **✅ Create pkg/workflow/validation/ subdirectory** (Issue 1.3)
   - Move 17 validation files to subdirectory
   - Effort: 3-4 hours
   - **Impact: MEDIUM** - Clear module boundary for validation

6. **✅ Create pkg/workflow/safeoutputs/ subdirectory** (Issue 1.5)
   - Move 8+ safe outputs files to subdirectory
   - Effort: 3-4 hours
   - **Impact: MEDIUM** - Modularizes safe outputs system

7. **✅ Split compiler_yaml.go** (Issue 1.4)
   - 1,446 lines → 5 focused files (200-400 lines each)
   - Effort: 4-5 hours
   - **Impact: MEDIUM** - Separates YAML vs prompts vs uploads

8. **✅ Split logs.go further** (Issue 2.1)
   - Extract analysis logic to logs_analysis.go
   - Effort: 4-6 hours
   - **Impact: MEDIUM** - Completes logs feature separation

9. **✅ Split compile_command.go** (Issue 2.2)
   - Extract security tools to compile_security.go
   - Effort: 5-6 hours
   - **Impact: MEDIUM** - Clearer security tool integration

---

### Priority 3: Code Quality Improvements (Ongoing)

**Estimated Total Effort: 7-10 hours**

10. **✅ Consolidate Package Collection** (Issue 1.6)
    - 3 duplicate functions → 1 generic approach
    - Lines saved: ~150
    - Effort: 3-4 hours
    - **Impact: LOW** - Easier to add new package managers

11. **✅ Extract Generic Utilities** (Issue 3.3)
    - Create pkg/util/ package
    - Effort: 2-3 hours
    - **Impact: LOW** - Reusable utilities across packages

12. **✅ Document Best Practices**
    - Document MCP command pattern as best practice
    - Document expression_* pattern as best practice
    - Effort: 2-3 hours
    - **Impact: LOW** - Maintains consistency for future development

---

## Summary of Findings

### Total Impact by Category

| Category | Files Affected | Lines to Reduce | Estimated Effort |
|----------|----------------|-----------------|------------------|
| **Duplicate Code** | 5 | ~330-350 lines | 6-9 hours |
| **File Splitting** | 5 large files | Reorganize ~6,700 lines | 25-33 hours |
| **Modularization** | 25+ files | Better boundaries | 6-8 hours |
| **Utilities** | 3-5 files | Reusable helpers | 2-3 hours |
| **TOTAL** | 35-40 files | ~330-350 lines removed, ~6,700 reorganized | 39-53 hours |

---

### Good Examples to Maintain (✓)

These areas demonstrate excellent organization and should serve as patterns:

1. **✓ Expression Handling** (pkg/workflow/):
   - expression_parser.go, expression_builder.go, expression_extraction.go, expression_nodes.go, expression_validation.go
   - **Pattern**: Clear feature prefix with responsibility suffix

2. **✓ MCP Commands** (pkg/cli/):
   - mcp.go, mcp_add.go, mcp_inspect.go, mcp_list.go, etc.
   - **Pattern**: Main command with focused subcommand files

3. **✓ Consistent Naming**:
   - Strong verb-noun structure across all packages
   - Clear function purpose from name

4. **✓ Logger Initialization**:
   - Consistent `var {name}Log = logger.New("package:feature")` pattern
   - Used consistently across 60+ files

---

## Implementation Checklist

### Phase 1: Quick Wins (Weeks 1-2)
- [ ] Consolidate token handling functions (Issue 1.1)
- [ ] Consolidate upload artifact generation (Issue 1.2)
- [ ] Split pkg/parser/frontmatter.go (Issue 3.1)
- [ ] Split pkg/parser/schema.go (Issue 3.2)
- [ ] Review and test changes

### Phase 2: Structural (Weeks 3-4)
- [ ] Create pkg/workflow/validation/ subdirectory (Issue 1.3)
- [ ] Create pkg/workflow/safeoutputs/ subdirectory (Issue 1.5)
- [ ] Split pkg/workflow/compiler_yaml.go (Issue 1.4)
- [ ] Split pkg/cli/logs.go further (Issue 2.1)
- [ ] Split pkg/cli/compile_command.go (Issue 2.2)
- [ ] Update imports and test

### Phase 3: Polish (Weeks 5-6)
- [ ] Consolidate package collection pattern (Issue 1.6)
- [ ] Create pkg/util/ and move generic utilities (Issue 3.3)
- [ ] Document best practices
- [ ] Final review and testing

---

## Analysis Metadata

- **Analysis Date**: 2025-12-12
- **Repository**: githubnext/gh-aw
- **Commit**: 6f53345
- **Files Analyzed**: 286 non-test Go files
- **Total Lines Analyzed**: 82,301 lines
- **Detection Methods**:
  - Semantic code exploration (Claude Code Explore agents)
  - Function pattern matching (grep/awk analysis)
  - Manual review of largest files
  - Comparative analysis of similar functions
  - Cross-file pattern recognition

</details>

---

## Conclusion

The codebase demonstrates **strong architectural foundations** with clear naming conventions and good separation of concerns at the package level. The primary opportunities for improvement are:

1. **Reducing duplication** in token handling and artifact upload generation (~180-200 lines)
2. **Splitting oversized files** to reduce cognitive load (16 files >1000 lines)
3. **Modularizing related files** into subdirectories (validation, safeoutputs)
4. **Maintaining excellent patterns** from expression handling and MCP commands

**Recommended Approach**: Start with Priority 1 quick wins (duplication removal and parser splits) to build momentum, then tackle Priority 2 structural improvements incrementally to avoid disrupting ongoing development.




> AI generated by [Semantic Function Refactoring](https://github.com/githubnext/gh-aw/actions/runs/20160412259)
