[refactor] Semantic Function Clustering Analysis - Code Organization Improvements #6110

@github-actions

Overview

Comprehensive semantic analysis of the Go codebase identified key refactoring opportunities focused on reducing code duplication, improving file organization, and enhancing maintainability. The analysis examined 275 non-test Go files containing 1,734 functions across all packages.

Key Findings:

  • 3 duplicate token handling functions with near-identical logic
  • 30+ files over 500 lines requiring modularization
  • Well-organized validation and engine interface patterns (positive findings)
  • Clear semantic function clustering by naming conventions

The codebase demonstrates strong architectural patterns with clear naming conventions. The identified issues represent targeted opportunities for incremental improvement.

Executive Summary

Repository Statistics:

  • Total Files Analyzed: 275 Go source files (excluding tests)
  • Total Functions: 1,734 functions
  • Largest Package: pkg/workflow (1,032 functions in 161 files)
  • Second Largest: pkg/cli (487 functions in 81 files)
  • Average Functions Per File: 6.3 functions

Function Naming Patterns (Top 10):

  1. Get* - 118 functions (getters, retrievers)
  2. New* - 70 functions (constructors)
  3. Render* - 34 functions (output rendering)
  4. Build* - 31 functions (builders, job construction)
  5. Extract* - 29 functions (data extraction)
  6. Parse* - 27 functions (parsing logic)
  7. Generate* - 24 functions (code/config generation)
  8. Format* - 22 functions (formatting)
  9. Is* - 21 functions (boolean checks)
  10. Validate* - 15 functions (validation)
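
A tally like the one above can also be produced with Go's own parser rather than grep. The sketch below is not the tool behind this report; it is a minimal illustration using the standard go/parser and go/ast packages. The helper names leadingWord and countPrefixes and the sample source are invented for the example. Unlike the list above, it treats Get and get as different prefixes and splits acronym-style names such as HTTPServer after the first letter.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"unicode"
)

// leadingWord returns the first camel-case word of a function name,
// e.g. "Get" for "GetUser" and "build" for "buildJob".
func leadingWord(name string) string {
	for i, r := range name {
		if i > 0 && unicode.IsUpper(r) {
			return name[:i]
		}
	}
	return name
}

// countPrefixes parses Go source text and tallies the leading word of
// every function and method declaration.
func countPrefixes(filename, src string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			counts[leadingWord(fn.Name.Name)]++
		}
	}
	return counts, nil
}

// sample is made-up input used only to demonstrate the tally.
const sample = `package demo

func GetName() string { return "" }
func GetID() int { return 0 }
func NewThing() int { return 0 }
func parseInput(s string) int { return len(s) }
`

func main() {
	counts, err := countPrefixes("demo.go", sample)
	if err != nil {
		panic(err)
	}
	prefixes := make([]string, 0, len(counts))
	for p := range counts {
		prefixes = append(prefixes, p)
	}
	sort.Strings(prefixes)
	for _, p := range prefixes {
		fmt.Printf("%s*: %d\n", p, counts[p])
	}
}
```

Working from the AST avoids false matches in comments and string literals, which a grep-based extraction has to filter out by hand.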

(details)
(summary)Full Analysis Report(/summary)

Critical Findings

Issue 1: Duplicate Token Handling Functions (Priority 1 - High Impact)

Location: pkg/workflow/safe_outputs.go

Three nearly identical functions for adding GitHub tokens to custom action steps:

Function 1: addCustomActionGitHubToken (lines 79-91)

func (c *Compiler) addCustomActionGitHubToken(steps *[]string, data *WorkflowData, customToken string) {
    token := customToken
    if token == "" && data.SafeOutputs != nil {
        token = data.SafeOutputs.GitHubToken
    }
    if token == "" {
        token = data.GitHubToken
    }
    if token == "" {
        token = "${{ secrets.GITHUB_TOKEN }}"
    }
    *steps = append(*steps, fmt.Sprintf("          token: %s\n", token))
}

Function 2: addCustomActionCopilotGitHubToken (lines 93-102)

func (c *Compiler) addCustomActionCopilotGitHubToken(steps *[]string, data *WorkflowData, customToken string) {
    token := customToken
    if token == "" && data.SafeOutputs != nil {
        token = data.SafeOutputs.GitHubToken
    }
    if token == "" {
        token = "${{ secrets.COPILOT_TOKEN || secrets.GITHUB_TOKEN }}"
    }
    *steps = append(*steps, fmt.Sprintf("          token: %s\n", token))
}

Function 3: addCustomActionAgentGitHubToken (lines 104-110)

func (c *Compiler) addCustomActionAgentGitHubToken(steps *[]string, data *WorkflowData, customToken string) {
    token := customToken
    if token == "" {
        token = "${{ env.GH_AW_AGENT_TOKEN }}"
    }
    *steps = append(*steps, fmt.Sprintf("          token: %s\n", token))
}

Analysis:

  • Similarity: ~85% code overlap
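  • Differences: Only the fallback chain differs: the default variant tries SafeOutputs.GitHubToken, then data.GitHubToken, then secrets.GITHUB_TOKEN; the Copilot variant skips data.GitHubToken; the agent variant skips both and falls back to env.GH_AW_AGENT_TOKEN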
  • Impact: Code duplication, maintenance burden, potential for inconsistency

Recommendation: Consolidate using configuration-based approach

// TokenConfig selects which fallback chain applies. At most one flag should be set.
type TokenConfig struct {
    UseAgentToken   bool
    UseCopilotToken bool
}

func (c *Compiler) addCustomActionGitHubToken(steps *[]string, data *WorkflowData, customToken string, config TokenConfig) {
    token := customToken

    // SafeOutputs fallback. The original agent variant never consulted
    // data.SafeOutputs, so it is skipped here to preserve behavior.
    if token == "" && !config.UseAgentToken && data.SafeOutputs != nil {
        token = data.SafeOutputs.GitHubToken
    }

    // Variant-specific fallback
    if token == "" {
        switch {
        case config.UseAgentToken:
            token = "${{ env.GH_AW_AGENT_TOKEN }}"
        case config.UseCopilotToken:
            token = "${{ secrets.COPILOT_TOKEN || secrets.GITHUB_TOKEN }}"
        case data.GitHubToken != "":
            token = data.GitHubToken
        default:
            token = "${{ secrets.GITHUB_TOKEN }}"
        }
    }

    *steps = append(*steps, fmt.Sprintf("          token: %s\n", token))
}

Estimated Effort: 2-3 hours
Benefits: Single source of truth, reduced duplication (~30 lines), easier maintenance
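
One possible variation on the consolidation, offered as a sketch rather than a prescription: a two-boolean config can express a contradictory state (both flags set), whereas a single enum-like value cannot. The names tokenKind, resolveToken, safeOutputsToken and appendTokenLine are hypothetical, and SafeOutputsConfig and WorkflowData are cut-down stand-ins that model only the fields visible in the three functions quoted above.

```go
package main

import "fmt"

// Minimal stand-ins for the real types; only the fields used by the
// quoted functions are modelled here.
type SafeOutputsConfig struct{ GitHubToken string }

type WorkflowData struct {
	GitHubToken string
	SafeOutputs *SafeOutputsConfig
}

// tokenKind is a hypothetical enum replacing the two booleans.
type tokenKind int

const (
	defaultToken tokenKind = iota
	copilotToken
	agentToken
)

// safeOutputsToken returns the SafeOutputs token, or "" when unset.
func safeOutputsToken(data *WorkflowData) string {
	if data.SafeOutputs == nil {
		return ""
	}
	return data.SafeOutputs.GitHubToken
}

// resolveToken returns the first non-empty candidate, using the same
// precedence order as the corresponding original function.
func resolveToken(kind tokenKind, data *WorkflowData, custom string) string {
	candidates := []string{custom}
	switch kind {
	case agentToken:
		// The agent variant never looks at workflow data.
		candidates = append(candidates, "${{ env.GH_AW_AGENT_TOKEN }}")
	case copilotToken:
		candidates = append(candidates, safeOutputsToken(data),
			"${{ secrets.COPILOT_TOKEN || secrets.GITHUB_TOKEN }}")
	default:
		candidates = append(candidates, safeOutputsToken(data),
			data.GitHubToken, "${{ secrets.GITHUB_TOKEN }}")
	}
	for _, c := range candidates {
		if c != "" {
			return c
		}
	}
	return ""
}

// appendTokenLine formats the YAML line the same way the original functions do.
func appendTokenLine(steps *[]string, token string) {
	*steps = append(*steps, fmt.Sprintf("          token: %s\n", token))
}

func main() {
	data := &WorkflowData{SafeOutputs: &SafeOutputsConfig{GitHubToken: "${{ secrets.SO_TOKEN }}"}}
	var steps []string
	appendTokenLine(&steps, resolveToken(copilotToken, data, ""))
	appendTokenLine(&steps, resolveToken(agentToken, data, ""))
	fmt.Print(steps[0], steps[1])
}
```

Separating resolution (a pure function) from appending the YAML line also makes the precedence rules testable without constructing a Compiler.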


Issue 2: Oversized Files Requiring Modularization (Priority 2 - Medium Impact)

Files Over 1,000 Lines:

File | Lines | Functions | Issue | Recommendation
pkg/workflow/safe_outputs.go | 1,530 | 24 | Mixed responsibilities: config extraction, job building, env var generation | Split into safe_outputs_config.go, safe_outputs_jobs.go, safe_outputs_env.go
pkg/workflow/compiler_yaml.go | 1,446 | 29 | YAML generation + prompt generation + upload logic | Split into compiler_yaml_core.go, compiler_yaml_prompts.go, compiler_yaml_uploads.go
pkg/workflow/compiler_jobs.go | 1,419 | 14 | Very large functions (~100+ lines each) | Extract job building helpers
pkg/workflow/copilot_engine.go | 1,369 | 25 | Engine implementation with many methods | Consider extracting MCP rendering and log parsing
pkg/cli/logs.go | 1,339 | 9 | Download + analysis + display mixed | Split into logs_download.go (exists?), logs_analysis.go, logs_display.go
pkg/cli/update_command.go | 1,331 | 20 | Workflow updates + action updates + PR creation | Split into update_workflows.go, update_actions.go, update_pr.go
pkg/parser/frontmatter.go | 1,283 | 33 | Import processing + includes + extraction + merging | Split into frontmatter_imports.go, frontmatter_includes.go, frontmatter_extract.go
pkg/cli/compile_command.go | 1,279 | 11 | Compilation + watching + validation | Split into compile_core.go, compile_watch.go, compile_validation.go
pkg/cli/audit_report.go | 1,247 | 21 | Data building + rendering + analysis generation | Split into audit_report_data.go, audit_report_render.go, audit_report_analysis.go
pkg/parser/schema.go | 1,156 | 34 | Validation + suggestions + compilation | Split into schema_validate.go, schema_suggest.go, schema_compile.go

Common Pattern: Files over 1,000 lines typically mix 3-4 distinct responsibilities

Recommendation: Apply single responsibility principle - split each large file into focused modules of 300-500 lines each.

Estimated Effort: 20-30 hours total (2-3 hours per file)
Benefits: Improved readability, easier testing, better code navigation


Positive Findings (Excellent Patterns to Maintain)

1. Engine Interface Pattern (pkg/workflow/)

Well-implemented polymorphism:

  • CodingAgentEngine interface with clear contract (pkg/workflow/agentic_engine.go:19)
  • Multiple implementations: ClaudeEngine, CopilotEngine, CodexEngine, CustomEngine
  • Common methods: GetInstallationSteps, GetExecutionSteps, ParseLogMetrics, RenderMCPConfig, GetErrorPatterns

Status: ✅ Excellent design - no changes needed

This is intentional polymorphism where each engine implements the same interface with engine-specific behavior. The "duplicate" function names (RenderMCPConfig, ParseLogMetrics, etc.) across 4 engine files are correct interface implementations.


2. Log Analysis Interface Pattern (pkg/cli/)

Well-implemented interface:

type LogAnalysis interface {
    AddMetrics(other LogAnalysis)
}

Implementations:

  • DomainAnalysis (pkg/cli/access_log.go:41)
  • FirewallAnalysis (pkg/cli/firewall_log.go:128)

Status: ✅ Good design - no changes needed

The AddMetrics duplication is intentional polymorphism for aggregating different log analysis types.
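
For readers unfamiliar with the pattern, the sketch below shows how such an interface lets heterogeneous analyses be aggregated. Only the LogAnalysis interface is taken from the report; the struct fields (Allowed, Denied, Requests) and the aggregate helper are assumptions made for illustration and will not match the real DomainAnalysis and FirewallAnalysis types.

```go
package main

import "fmt"

// LogAnalysis matches the interface quoted above.
type LogAnalysis interface {
	AddMetrics(other LogAnalysis)
}

// DomainAnalysis is a simplified stand-in; the field names are assumed.
type DomainAnalysis struct {
	Allowed, Denied int
}

// AddMetrics merges counts only when other has the same concrete type.
func (a *DomainAnalysis) AddMetrics(other LogAnalysis) {
	if o, ok := other.(*DomainAnalysis); ok {
		a.Allowed += o.Allowed
		a.Denied += o.Denied
	}
}

// FirewallAnalysis is a simplified stand-in; the field name is assumed.
type FirewallAnalysis struct {
	Requests int
}

// AddMetrics merges counts only when other has the same concrete type.
func (a *FirewallAnalysis) AddMetrics(other LogAnalysis) {
	if o, ok := other.(*FirewallAnalysis); ok {
		a.Requests += o.Requests
	}
}

// aggregate folds every part into total through the shared interface.
func aggregate(total LogAnalysis, parts ...LogAnalysis) {
	for _, p := range parts {
		total.AddMetrics(p)
	}
}

func main() {
	total := &DomainAnalysis{}
	aggregate(total,
		&DomainAnalysis{Allowed: 2, Denied: 1},
		&DomainAnalysis{Allowed: 3},
	)
	fmt.Printf("allowed=%d denied=%d\n", total.Allowed, total.Denied)
}
```

Each implementation owns its own merge logic, which is why the repeated method name is not duplication in the sense of Issue 1.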


3. Validation File Organization (pkg/workflow/)

Excellent modularity - each validation concern has its own file:

  • agent_validation.go
  • bundler_validation.go
  • docker_validation.go
  • engine_validation.go
  • expression_validation.go
  • mcp_config_validation.go
  • npm_validation.go
  • pip_validation.go
  • repository_features_validation.go
  • runtime_validation.go
  • schema_validation.go
  • step_order_validation.go
  • strict_mode_validation.go
  • template_validation.go
  • permissions_validator.go

Status: ✅ Best practice example - this is exactly how validation should be organized!


Function Clustering Analysis

Build Functions (31 functions)

Pattern: build* - Construct GitHub workflow jobs and steps

Key Files:

  • pkg/workflow/compiler_jobs.go (primary location)
  • pkg/workflow/safe_outputs.go (buildSafeOutputJob, etc.)

Notable Pattern: buildCreate* functions (15 occurrences)

  • buildCreateOutputAddCommentJob
  • buildCreateOutputAgentTaskJob
  • buildCreateOutputCloseDiscussionJob
  • etc.

Analysis: Well-clustered in compiler_jobs.go, clear naming convention

Recommendation: ✅ No action needed - already well-organized


Generate Functions (24 functions)

Pattern: generate* - Generate configuration, YAML, prompts dynamically

Key Files:

  • pkg/workflow/compiler_yaml.go (multiple generate functions)
  • pkg/workflow/safe_outputs.go (generateSafeOutputsConfig, generateFilteredToolsJSON)

Sub-patterns:

  • generateSafe* (9 functions) - Safe output generation
  • generateUpload* (7 functions) - Upload artifact steps

Recommendation: Consider extracting the generateUpload* functions into a dedicated helper file if they follow similar patterns (needs deeper analysis)


Render Functions (34 functions)

Pattern: render* - Render output in various formats

Key Files:

  • pkg/workflow/expression_nodes.go (14 Render methods - AST node rendering)
  • pkg/cli/audit_report.go (multiple render* functions)
  • pkg/console/render.go

Analysis:

  • expression_nodes.go: Intentional polymorphism (each AST node implements Render)
  • audit_report.go: Could be extracted to audit_report_render.go for better organization

Recommendation: Extract audit report rendering functions to separate file
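
The "intentional polymorphism" noted for expression_nodes.go can be pictured with a small sketch. The node types below (Literal, Not, And) and the Node interface are hypothetical and are not copied from the repository; they only show why every node type legitimately carries its own Render method.

```go
package main

import "fmt"

// Node is a hypothetical interface for expression tree nodes.
type Node interface {
	Render() string
}

// Literal renders its text unchanged.
type Literal struct{ Text string }

func (n Literal) Render() string { return n.Text }

// Not negates its child expression.
type Not struct{ Child Node }

func (n Not) Render() string { return "!(" + n.Child.Render() + ")" }

// And joins two sub-expressions with a logical AND.
type And struct{ Left, Right Node }

func (n And) Render() string {
	return "(" + n.Left.Render() + ") && (" + n.Right.Render() + ")"
}

func main() {
	expr := And{
		Left:  Literal{Text: "github.event_name == 'issues'"},
		Right: Not{Child: Literal{Text: "github.event.issue.locked"}},
	}
	// Rendering recurses through the tree, with each node formatting itself.
	fmt.Println(expr.Render())
}
```

Merging these methods into one function would replace dynamic dispatch with a type switch, which is usually harder to extend when new node types are added.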


Parse Functions (27 functions)

Pattern: parse* - Parse various formats (YAML, logs, URLs, etc.)

Distribution:

  • pkg/parser/ (appropriate location for parsing)
  • pkg/cli/ (command-line parsing)
  • pkg/workflow/ (workflow-specific parsing)

Analysis: Well-distributed by domain

Recommendation: ✅ No action needed - appropriate organization


Extract Functions (29 functions)

Pattern: extract* - Extract data from maps, frontmatter, configurations

Key Files:

  • pkg/workflow/frontmatter_extraction.go (22 extract functions - justified concentration)
  • pkg/workflow/safe_outputs.go (extractSafeOutputsConfig)

Analysis: The high concentration in frontmatter_extraction.go is justified - this file's purpose is extracting data from frontmatter.

Recommendation: ✅ No action needed - this is appropriate organization


Detailed File Analysis

pkg/workflow/safe_outputs.go (1,530 lines, 24 functions)

Responsibilities (mixed):

  1. Custom action step building (buildCustomActionStep - lines 18-76)
  2. Token handling (3 duplicate functions - lines 79-110)
  3. Configuration extraction (extractSafeOutputsConfig - lines 155-356)
  4. Safe output job building (buildSafeOutputJob - lines 624-691)
  5. Environment variable generation (multiple functions - lines 1149-1530)

Recommendation: Split into 3-4 focused files

  • safe_outputs_config.go - Configuration extraction
  • safe_outputs_jobs.go - Job building
  • safe_outputs_env.go - Environment variable generation
  • safe_outputs_tokens.go - Consolidated token handling

Estimated Effort: 4-6 hours
Impact: High - this is the largest file and would benefit most from modularization


pkg/workflow/compiler_yaml.go (1,446 lines, 29 functions)

Function Categories:

  • YAML generation: 8 functions
  • Prompt generation: 10+ functions
  • Upload step generation: 7 generateUpload* functions
  • Validation: 4 functions

Recommendation: Split by category

  • Keep core YAML orchestration in compiler_yaml.go
  • Extract prompts to compiler_yaml_prompts.go
  • Extract uploads to compiler_yaml_uploads.go

Estimated Effort: 3-4 hours
Impact: Medium - improves file navigability


pkg/cli/audit_report.go (1,247 lines, 21 functions)

Function Categories:

  • Data building: 3 functions
  • Rendering: 12 render* functions
  • Analysis generation: 4 generate* functions
  • Utility: 2 functions

Recommendation: Split by category

  • audit_report.go - Core orchestration and data building
  • audit_report_render.go - All render* functions
  • audit_report_analysis.go - All generate* functions

Estimated Effort: 3-4 hours
Impact: Medium-High - clearly separates display from analysis logic


pkg/parser/frontmatter.go (1,283 lines, 33 functions)

Function Categories:

  • Import processing: 6 functions
  • Include processing: 6 functions
  • Extraction: 12 functions
  • Merging: 4 functions
  • Utilities: 5 functions

Recommendation: Split by processing stage

  • frontmatter_imports.go - Import handling
  • frontmatter_includes.go - Include handling
  • frontmatter_extract.go - Extraction functions
  • frontmatter_merge.go - Merging logic
  • frontmatter.go - Core types and utilities

Estimated Effort: 4-5 hours
Impact: Medium - improves parser package organization


Prioritized Recommendations

Priority 1: High-Impact, Low-Effort (Immediate Action)

1.1 Consolidate Token Handling Functions ⭐⭐⭐

File: pkg/workflow/safe_outputs.go (lines 79-110)
Issue: 3 nearly identical functions with ~85% code overlap
Effort: 2-3 hours
Impact: Reduces ~30 lines of duplicate code, single source of truth

Action Items:

  • Create unified addCustomActionGitHubToken with TokenConfig parameter
  • Update 3 call sites to use new unified function
  • Add unit tests for all token precedence scenarios
  • Verify no behavior changes
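
The last two action items can be combined into a differential check: enumerate every empty/non-empty input combination and compare a candidate unified resolver against the precedence of the three original helpers. The sketch below is one way to do that under stated assumptions: legacy, unified, firstNonEmpty, safeOutputsToken and mismatches are hypothetical names, the types are cut-down stand-ins, and both resolvers return the token instead of appending a step so that they can be compared directly.

```go
package main

import "fmt"

// Stand-ins for the real types; only fields visible in the report are modelled.
type SafeOutputsConfig struct{ GitHubToken string }

type WorkflowData struct {
	GitHubToken string
	SafeOutputs *SafeOutputsConfig
}

type TokenConfig struct {
	UseAgentToken   bool
	UseCopilotToken bool
}

const (
	defaultFallback = "${{ secrets.GITHUB_TOKEN }}"
	copilotFallback = "${{ secrets.COPILOT_TOKEN || secrets.GITHUB_TOKEN }}"
	agentFallback   = "${{ env.GH_AW_AGENT_TOKEN }}"
)

// firstNonEmpty returns the first non-empty value, or "" if there is none.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// safeOutputsToken returns the SafeOutputs token, or "" when unset.
func safeOutputsToken(d *WorkflowData) string {
	if d.SafeOutputs == nil {
		return ""
	}
	return d.SafeOutputs.GitHubToken
}

// legacy restates the precedence of the three original helpers.
func legacy(d *WorkflowData, custom string, cfg TokenConfig) string {
	switch {
	case cfg.UseAgentToken:
		return firstNonEmpty(custom, agentFallback)
	case cfg.UseCopilotToken:
		return firstNonEmpty(custom, safeOutputsToken(d), copilotFallback)
	default:
		return firstNonEmpty(custom, safeOutputsToken(d), d.GitHubToken, defaultFallback)
	}
}

// unified is the candidate consolidation being checked.
func unified(d *WorkflowData, custom string, cfg TokenConfig) string {
	token := custom
	if token == "" && !cfg.UseAgentToken && d.SafeOutputs != nil {
		token = d.SafeOutputs.GitHubToken
	}
	if token == "" {
		switch {
		case cfg.UseAgentToken:
			token = agentFallback
		case cfg.UseCopilotToken:
			token = copilotFallback
		case d.GitHubToken != "":
			token = d.GitHubToken
		default:
			token = defaultFallback
		}
	}
	return token
}

// mismatches enumerates every input combination and reports the cases
// where the two implementations disagree.
func mismatches() []string {
	var out []string
	configs := []TokenConfig{{}, {UseCopilotToken: true}, {UseAgentToken: true}}
	safeOutputs := []*SafeOutputsConfig{nil, {}, {GitHubToken: "so"}}
	for _, cfg := range configs {
		for _, so := range safeOutputs {
			for _, wf := range []string{"", "wf"} {
				for _, custom := range []string{"", "custom"} {
					d := &WorkflowData{GitHubToken: wf, SafeOutputs: so}
					if got, want := unified(d, custom, cfg), legacy(d, custom, cfg); got != want {
						out = append(out, fmt.Sprintf("%+v custom=%q: got %q, want %q", cfg, custom, got, want))
					}
				}
			}
		}
	}
	return out
}

func main() {
	fmt.Printf("%d mismatches across 36 cases\n", len(mismatches()))
}
```

In the repository this would more naturally live in a _test.go file as a table-driven test; a plain main is used here only to keep the sketch self-contained. Removing the !cfg.UseAgentToken guard from unified makes the agent cases with a SafeOutputs token show up as mismatches, which is the kind of regression the check is meant to catch.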

Priority 2: Structural Improvements (Next Sprint)

2.1 Split pkg/workflow/safe_outputs.go ⭐⭐

Current: 1,530 lines, 24 functions
Target: 4 files (~350-400 lines each)
Effort: 4-6 hours
Impact: Significantly improves file navigability

Action Items:

  • Extract configuration extraction to safe_outputs_config.go
  • Extract job building to safe_outputs_jobs.go
  • Extract environment variables to safe_outputs_env.go
  • Keep core types in safe_outputs.go
  • Give each new file only the imports it uses (callers need no import changes because the package stays the same)
  • Run full test suite

2.2 Split pkg/cli/audit_report.go ⭐⭐

Current: 1,247 lines, 21 functions
Target: 3 files (~400 lines each)
Effort: 3-4 hours
Impact: Separates rendering from analysis logic

Action Items:

  • Extract rendering functions to audit_report_render.go
  • Extract analysis generation to audit_report_analysis.go
  • Keep data building in audit_report.go
  • Adjust the import block of each resulting file
  • Run tests

2.3 Split pkg/parser/frontmatter.go ⭐

Current: 1,283 lines, 33 functions
Target: 5 files (~250 lines each)
Effort: 4-5 hours
Impact: Clearer parser module organization

Action Items:

  • Split imports, includes, extract, merge into separate files
  • Keep core in frontmatter.go
  • Adjust the import block of each resulting file
  • Run parser tests

Priority 3: Long-term Improvements (Future Sprints)

3.1 Split Additional Large Files

Candidates (in priority order):

  1. pkg/workflow/compiler_yaml.go (1,446 lines) - Split prompts and uploads
  2. pkg/cli/update_command.go (1,331 lines) - Split workflows, actions, PRs
  3. pkg/cli/compile_command.go (1,279 lines) - Split core, watch, validation
  4. pkg/parser/schema.go (1,156 lines) - Split validate, suggest, compile

Total Estimated Effort: 12-16 hours
Impact: Improved maintainability across all major packages


Implementation Guidelines

General Principles

  1. Preserve Behavior: All refactoring must be behavior-preserving
  2. Test Coverage: Run full test suite after each change
  3. Incremental Changes: Split one file at a time, commit after each
  4. Update Documentation: Add file-level comments explaining module boundaries
  5. Consistent Naming: Follow existing naming conventions (e.g., *_config.go, *_render.go)

File Splitting Strategy

When splitting a file:

  1. Read original file to understand all dependencies
  2. Create new files with appropriate names
  3. Move functions maintaining all comments and documentation
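  4. Give each resulting file only the imports it actually uses
  5. Leave calling code unchanged: files in the same package share one namespace, so a same-package split changes no caller imports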
  6. Run go build to verify compilation
  7. Run make test-unit to verify tests pass
  8. Run make lint to verify code quality
  9. Commit with descriptive message: "refactor: split [file] into [modules]"

Testing Strategy

After each refactoring:

# Verify compilation
go build ./...

# Run unit tests
make test-unit

# Run integration tests (if available)
make test-integration

# Run linting
make lint

# Verify no regressions
make test

Success Criteria

This refactoring initiative will be successful when:

  1. Zero duplicate token handling functions - consolidated into single implementation
  2. No files over 1,000 lines - all large files split into focused modules
  3. Clear module boundaries - each file has single, well-defined responsibility
  4. All tests passing - no regressions introduced
  5. Improved code navigation - developers can find functions more easily
  6. Maintained or improved performance - no performance degradation
  7. Documentation updated - file-level comments explain module purposes

Estimated Total Effort

Priority | Tasks | Estimated Hours
Priority 1 | Token consolidation | 2-3 hours
Priority 2 | Split 3 large files (safe_outputs.go, audit_report.go, frontmatter.go) | 11-15 hours
Priority 3 | Split 4 additional files | 12-16 hours
Testing & Documentation | Comprehensive testing and docs | 5-7 hours
Total | All phases | 30-41 hours

Recommended Approach:

  • Week 1: Priority 1 (token consolidation)
  • Week 2-3: Priority 2 (split safe_outputs.go, audit_report.go, frontmatter.go)
  • Week 4-5: Priority 3 (split additional files)
  • Ongoing: Testing and documentation

Analysis Metadata

  • Analysis Date: 2025-12-11
  • Repository: githubnext/gh-aw
  • Commit: b7443c2
  • Total Files Analyzed: 275 non-test Go files
  • Total Functions Cataloged: 1,734 functions
  • Packages Analyzed: pkg/cli (81 files, 487 functions), pkg/workflow (161 files, 1,032 functions), pkg/parser (13 files, 141 functions), utilities (20 files, 74 functions)
  • Duplicate Functions Identified: 3 (token handling)
  • Large Files (>1,000 lines): 10 files
  • Files Over 500 Lines: 30 files
  • Detection Method: Grep-based function extraction + semantic clustering + manual code review
  • Analysis Tools: Bash scripts + grep + awk + manual review

Conclusion

The gh-aw codebase demonstrates strong architectural patterns with clear naming conventions, excellent validation organization, and well-designed interfaces. The primary opportunities for improvement are:

  1. Eliminating duplication in token handling functions (~30 lines, 2-3 hours)
  2. Splitting oversized files to reduce cognitive load (10 files >1,000 lines; the 7 covered by Priorities 2 and 3 are estimated at 23-31 hours)
  3. Maintaining excellent patterns from validation files and engine interfaces

The codebase is fundamentally well-designed. The identified issues are specific, actionable, and represent opportunities for incremental improvement rather than fundamental restructuring.

Next Steps:

  1. Review and prioritize recommendations
  2. Start with Priority 1 (token consolidation) for a quick win
  3. Plan Priority 2 file splits for next sprint
  4. Maintain excellent patterns from validation and interface design

(/details)


Quick Reference

Top 3 Immediate Actions:

  1. Consolidate 3 duplicate token functions (pkg/workflow/safe_outputs.go:79-110)

    • Effort: 2-3 hours | Impact: High - reduces duplication, single source of truth
  2. Split safe_outputs.go (1,530 lines → 4 files of ~350-400 lines each)

    • Effort: 4-6 hours | Impact: High - most impactful file split
  3. Split audit_report.go (1,247 lines → 3 files of ~400 lines each)

    • Effort: 3-4 hours | Impact: Medium-High - separates rendering from analysis

Total Quick-Win Effort: 9-13 hours
Total Quick-Win Impact: Eliminates major duplication, modularizes the largest file (safe_outputs.go) and audit_report.go

AI generated by Semantic Function Refactoring
