Categorize files in "source", "test", "configuration" etc and use that for sorting #20

Open
mohsen1 opened this issue Jan 19, 2025 · 0 comments
Labels: enhancement (New feature or request)


Enhance yek by detecting and labeling files according to categories such as:

  • Source: main application or library code
  • Test: files under typical test directories (e.g., tests/, __tests__/, spec/)
  • Configuration: files with extensions such as .toml, .yaml, .yml, .json, or .config, and well-known names such as docker-compose.yml
  • Documentation: .md or .rst files, or anything under a docs/ folder
  • Others: fallback category when none of the above applies

Once categorized, yek can factor each category into its sorting logic (e.g., always place source files last if they’re typically highest priority, or place test files earlier).

Motivation

  1. Improved Organization
    Many projects separate source code from tests and configuration files. Automatically recognizing these categories helps produce a more intuitive ordering or chunk assignment.

  2. Better Defaults
    Users who rely on yek for LLM consumption often want the “main code” (source) to appear last in the final chunk. By default, test files or config files might appear interleaved. Category-aware sorting will better reflect typical developer workflows.

  3. Finer-Grained Control
    Later, we can allow custom weighting per category in yek.toml or via CLI flags. For instance, a user might decide tests are less important or more important than config files.

Proposed Approach

  1. Heuristic/Path-Based Detection

    • If the file path contains test or spec, or the file is located under a directory such as tests/, __tests__/, or spec/, classify it as test.
    • If the file extension is typical for config (.toml, .yaml, .yml, .json, .ini, etc.), or if the name suggests config (docker-compose.yml, Makefile, etc.), classify it as configuration.
    • Otherwise, classify it as source or documentation based on extension or path (e.g., if it’s in docs/, classify it as documentation).
    • Fall back to other if none of the above matches.
  2. Category Priority
    After each file is categorized, attach a category-based priority offset:

    • Configuration: e.g., +5
    • Tests: e.g., +10
    • Documentation: e.g., +15
    • Source: e.g., +20
    • Others: e.g., +1
    • The default numeric values can be adjusted to fit typical usage.
  3. Integration

    • Merge with existing priority rules. If a user sets manual rules in yek.toml, those can override or combine with the category-based logic.
    • Possibly add a [category_weights] section in yek.toml to let users redefine the default offsets.
  4. Output

    • The final chunk ordering still respects the sum of user-defined priority + category offset + Git-based recency.
    • We can optionally display each file’s category in debug logs:
      DEBUG: Categorized src/main.rs as “source” (priority offset: +20)
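
The detection heuristics from step 1 could be sketched roughly as follows. This is only an illustration, not yek's actual code: the Category enum, the categorize function, and all of the extension and name lists are hypothetical, and the list of source extensions in particular is an assumption, since the issue does not define one. One deliberate deviation: instead of a plain substring check for test or spec, the sketch matches whole tokens of the file name, so that a name like latest.rs is not mistaken for a test.

```rust
use std::path::Path;

/// The five categories proposed in this issue (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Source,
    Test,
    Configuration,
    Documentation,
    Other,
}

// Illustrative lists only; the real defaults are still to be decided.
const TEST_DIRS: [&str; 3] = ["tests", "__tests__", "spec"];
const CONFIG_EXTS: [&str; 6] = ["toml", "yaml", "yml", "json", "ini", "config"];
const CONFIG_NAMES: [&str; 2] = ["docker-compose.yml", "makefile"];
const DOC_EXTS: [&str; 2] = ["md", "rst"];
// Assumption: the issue does not say which extensions count as source.
const SOURCE_EXTS: [&str; 6] = ["rs", "py", "js", "ts", "go", "java"];

fn one_of(value: &str, set: &[&str]) -> bool {
    set.iter().any(|s| *s == value)
}

/// Classifies a path using only the path string (no file I/O).
pub fn categorize(path: &str) -> Category {
    let p = Path::new(path);
    let name = p
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();
    let ext = p
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();
    // Lowercased directory components, excluding the file name itself.
    let dirs: Vec<String> = p
        .parent()
        .into_iter()
        .flat_map(|d| d.components())
        .filter_map(|c| c.as_os_str().to_str())
        .map(|s| s.to_ascii_lowercase())
        .collect();

    // 1. Tests: a test directory, or a whole "test"/"tests"/"spec" token in the name.
    let test_token = name
        .split(|c: char| !c.is_ascii_alphanumeric())
        .any(|t| matches!(t, "test" | "tests" | "spec"));
    if test_token || dirs.iter().any(|d| one_of(d, &TEST_DIRS)) {
        return Category::Test;
    }
    // 2. Configuration: by extension or by well-known file name.
    if one_of(&ext, &CONFIG_EXTS) || one_of(&name, &CONFIG_NAMES) {
        return Category::Configuration;
    }
    // 3. Documentation: by extension or by living under docs/.
    if one_of(&ext, &DOC_EXTS) || dirs.iter().any(|d| d == "docs") {
        return Category::Documentation;
    }
    // 4. Source: by extension; everything else falls back to Other.
    if one_of(&ext, &SOURCE_EXTS) {
        Category::Source
    } else {
        Category::Other
    }
}
```

Because the checks run in the order listed in step 1, a file such as docs/settings.yaml would be labeled configuration rather than documentation; whether that is the desired tie-break is part of the open question on exact heuristics.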
      

Example

If a project has:

  • src/main.rs → “source”
  • tests/test_foo.rs → “test”
  • yek.toml → “configuration”
  • docker-compose.yml → “configuration”
  • docs/intro.md → “documentation”

We then attach the category offsets. Suppose the user priority rules in yek.toml are minimal. The final priority of each file is the sum of its category offset, its user priority, and the optional Git recency. With the example offsets above, the chunking logic places all config files before the test files, which come before doc files, which come before the main source files (or follows whichever scheme we adopt).
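
To make the arithmetic concrete, here is a minimal sketch of the ordering for the five example files. It assumes the example offsets from the proposal (+5, +10, +15, +20, +1), that user priority and Git recency are both zero, and that files are emitted in ascending priority so that the highest-priority file lands last. The Entry struct and the function names are hypothetical and are not taken from yek's code base.

```rust
/// Categories from the proposal (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Source,
    Test,
    Configuration,
    Documentation,
    Other,
}

/// Example offsets from the proposal; the real defaults may differ.
pub fn category_offset(category: Category) -> i32 {
    match category {
        Category::Configuration => 5,
        Category::Test => 10,
        Category::Documentation => 15,
        Category::Source => 20,
        Category::Other => 1,
    }
}

/// Hypothetical per-file record holding the three inputs to the sum.
pub struct Entry {
    pub path: &'static str,
    pub category: Category,
    pub user_priority: i32,
    pub git_recency: i32,
}

/// Final priority = user priority + category offset + Git recency.
pub fn final_priority(entry: &Entry) -> i32 {
    entry.user_priority + category_offset(entry.category) + entry.git_recency
}

/// Sorts ascending, so the highest-priority file comes last.
/// `sort_by_key` is stable, so ties keep their input order.
pub fn order(mut entries: Vec<Entry>) -> Vec<&'static str> {
    entries.sort_by_key(final_priority);
    entries.iter().map(|e| e.path).collect()
}

/// The five files from the example, with no user rules and no Git boost.
pub fn example_order() -> Vec<&'static str> {
    let entry = |path, category| Entry {
        path,
        category,
        user_priority: 0,
        git_recency: 0,
    };
    order(vec![
        entry("src/main.rs", Category::Source),
        entry("tests/test_foo.rs", Category::Test),
        entry("yek.toml", Category::Configuration),
        entry("docker-compose.yml", Category::Configuration),
        entry("docs/intro.md", Category::Documentation),
    ])
}
```

Under these assumptions, example_order() yields yek.toml, docker-compose.yml, tests/test_foo.rs, docs/intro.md, src/main.rs. Because the three terms are simply added, a sufficiently large user priority or recency boost can still move a file past its category's usual position.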

Open Questions

  1. Exact Heuristics

    • Where do we draw the line between “source” and “configuration”? Some .js or .json files might be config or source depending on context.
    • Should we allow a fallback or an override in yek.toml?
  2. Optional vs Default

    • Should category-based sorting be enabled by default or require a flag like --categorize?
  3. Custom Category Definition

    • Should advanced users be able to define their own category patterns with custom offsets?
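
One possible shape for the override and custom-pattern questions is sketched below, purely as a discussion aid. It assumes user rules are checked before the built-in heuristics and that the first matching rule wins, and it uses simple suffix matching rather than glob patterns to stay dependency-free. UserRule, apply_user_rules, and effective_offsets are hypothetical names; nothing here reflects an existing yek.toml schema.

```rust
use std::collections::HashMap;

/// Categories from the proposal (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Category {
    Source,
    Test,
    Configuration,
    Documentation,
    Other,
}

/// Hypothetical user rule: any path ending with `suffix` gets `category`.
pub struct UserRule {
    pub suffix: String,
    pub category: Category,
}

/// Returns the category of the first matching user rule, or None so the
/// caller can fall through to the built-in heuristics.
pub fn apply_user_rules(path: &str, rules: &[UserRule]) -> Option<Category> {
    rules
        .iter()
        .find(|rule| path.ends_with(rule.suffix.as_str()))
        .map(|rule| rule.category)
}

/// Starts from the example offsets in the proposal and lets user-supplied
/// values replace individual entries.
pub fn effective_offsets(overrides: &HashMap<Category, i32>) -> HashMap<Category, i32> {
    let mut offsets = HashMap::from([
        (Category::Configuration, 5),
        (Category::Test, 10),
        (Category::Documentation, 15),
        (Category::Source, 20),
        (Category::Other, 1),
    ]);
    offsets.extend(overrides.iter().map(|(category, offset)| (*category, *offset)));
    offsets
}
```

With a rule for the suffix .config.js, a file like webpack.config.js would be labeled configuration even though its extension looks like source, which is exactly the ambiguous case raised under Exact Heuristics.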

Next Steps

  • Implement a classification function that inspects paths/extensions.
  • Integrate the resulting category offset into the existing priority computation.
  • Add or update tests ensuring we correctly label test/config/source files.
  • Decide whether to enable by default, or guard behind a config/CLI switch.

Feel free to comment with additional suggestions or open a PR implementing this feature.

mohsen1 added the documentation and enhancement labels and removed the documentation label on Jan 19, 2025