Enhance yek by detecting and labeling files according to categories such as:
Source: main application or library code
Test: files under typical test directories (e.g., tests/, __tests__/, spec/)
Configuration: files like .toml, .yaml, .yml, .json, .config, docker-compose.yml, etc.
Documentation: .md, .rst, or docs folder
Others: fallback category when none of the above applies
Once files are categorized, yek can factor each category into its sorting logic, for example by placing source files last (they are typically the highest priority) and test files earlier.
Motivation
Improved Organization
Many projects separate source code from tests and configuration files. Automatically recognizing these categories helps produce a more intuitive ordering or chunk assignment.
Better Defaults
Users who rely on yek for LLM consumption often want the “main code” (source) to appear last in the final chunk. By default, test files or config files might appear interleaved. Category-aware sorting will better reflect typical developer workflows.
Finer-Grained Control
Later, we can allow custom weighting per category in yek.toml or via CLI flags. For instance, a user might decide that tests are more or less important than config files.
Proposed Approach
Heuristic/Path-Based Detection
If the file path contains test or spec, or the file is located under a directory such as tests/, __tests__/, or spec/, classify it as test.
If the file extension is typical for config (.toml, .yaml, .yml, .json, .ini, etc.), or if the name suggests config (docker-compose.yml, Makefile, etc.), classify as configuration.
Otherwise, assume it’s source or documentation based on extension or path (e.g., if it’s in docs/, classify as documentation).
Fall back to other if none of the above matches.
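The detection rules above could be sketched roughly as follows. This is only an illustration, not yek's actual code: the Category enum, the classify and in_dir names, and the list of source extensions are all hypothetical. It also matches test and spec as whole tokens of the file name rather than as plain substrings, which is an assumption made here to avoid misfiring on names like latest.rs.

```rust
use std::path::Path;

/// Categories from the proposal. The enum itself is a hypothetical name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Source,
    Test,
    Configuration,
    Documentation,
    Other,
}

/// True if any directory component of `path` (excluding the file name) is in `names`.
fn in_dir(path: &Path, names: &[&str]) -> bool {
    path.parent().map_or(false, |dir| {
        dir.components()
            .filter_map(|c| c.as_os_str().to_str())
            .any(|name| names.contains(&name))
    })
}

/// Classify a relative path using path and extension heuristics only.
pub fn classify(path: &str) -> Category {
    let p = Path::new(path);
    let file_name = p.file_name().and_then(|n| n.to_str()).unwrap_or("");
    let ext = p
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();
    let stem = p
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();

    // 1. Test: a test-like token in the file name, or a typical test directory.
    //    Assumption: whole-token match ("test_foo", "foo.spec"), not substring.
    let test_name = stem
        .split(|c: char| matches!(c, '_' | '-' | '.'))
        .any(|tok| matches!(tok, "test" | "tests" | "spec"));
    if test_name || in_dir(p, &["tests", "__tests__", "spec"]) {
        return Category::Test;
    }

    // 2. Configuration: typical config extensions or well-known file names.
    if matches!(ext.as_str(), "toml" | "yaml" | "yml" | "json" | "ini" | "config")
        || matches!(file_name, "Makefile" | "docker-compose.yml")
    {
        return Category::Configuration;
    }

    // 3. Documentation: doc extensions or anything under docs/.
    if matches!(ext.as_str(), "md" | "rst") || in_dir(p, &["docs"]) {
        return Category::Documentation;
    }

    // 4. Source: an illustrative, deliberately incomplete extension list (assumption).
    if matches!(
        ext.as_str(),
        "rs" | "py" | "js" | "ts" | "go" | "java" | "c" | "h" | "cpp"
    ) {
        return Category::Source;
    }

    // 5. Fallback.
    Category::Other
}
```

Because the test rule runs first, a file such as tests/fixtures/config.json would be labeled test rather than configuration in this sketch; whether that ordering is desirable is one of the heuristics still to be decided.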
Category Priority
After each file is categorized, attach a category-based priority offset:
Configuration: e.g., +5
Tests: e.g., +10
Documentation: e.g., +15
Source: e.g., +20
Others: e.g., +1
The default numeric values can be adjusted to fit typical usage.
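As a rough sketch, the example offsets listed above could live in a single match on a category enum. The Category type and the default_offset method are hypothetical names, and the numbers are just the illustrative values from this issue.

```rust
/// Categories from the proposal. The enum itself is a hypothetical name.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Category {
    Source,
    Test,
    Configuration,
    Documentation,
    Other,
}

impl Category {
    /// Example priority offsets from the proposal; these are placeholders
    /// meant to be tuned, not settled defaults.
    pub const fn default_offset(self) -> i32 {
        match self {
            Category::Other => 1,
            Category::Configuration => 5,
            Category::Test => 10,
            Category::Documentation => 15,
            Category::Source => 20,
        }
    }
}
```

Keeping the values in one exhaustive match means the compiler flags any category added later that lacks an offset.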
Integration
Merge with existing priority rules. If a user sets manual rules in yek.toml, those can override or combine with the category-based logic.
Possibly add a [category_weights] section in yek.toml to let users redefine the default offsets.
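One way the override could behave, sketched under several assumptions: the [category_weights] section has already been parsed into a map of category name to offset, the key names (other, configuration, test, documentation, source) are hypothetical, and unknown keys are silently ignored. The resolve_offsets function is not an existing yek API.

```rust
use std::collections::HashMap;

/// Example default offsets from the proposal, keyed by hypothetical category names.
const DEFAULT_OFFSETS: [(&str, i32); 5] = [
    ("other", 1),
    ("configuration", 5),
    ("test", 10),
    ("documentation", 15),
    ("source", 20),
];

/// Merge user-supplied weights (e.g. from a parsed `[category_weights]` table)
/// over the defaults. Unknown category names are ignored in this sketch;
/// a real implementation might warn or return an error instead.
pub fn resolve_offsets(user: &HashMap<String, i32>) -> HashMap<String, i32> {
    let mut resolved: HashMap<String, i32> = DEFAULT_OFFSETS
        .iter()
        .map(|(name, offset)| (name.to_string(), *offset))
        .collect();

    for (name, weight) in user {
        if let Some(slot) = resolved.get_mut(name) {
            *slot = *weight;
        }
    }
    resolved
}
```

Here a user who sets only test = 2 would keep the default offsets for every other category.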
Output
The final chunk ordering still respects the sum of the user-defined priority, the category offset, and the Git-based recency.
We can optionally display each file’s category in debug logs:
DEBUG: Categorized src/main.rs as “source” (priority offset: +20)
Example
If a project has:
src/main.rs → “source”
tests/test_foo.rs → “test”
yek.toml → “configuration”
docker-compose.yml → “configuration”
docs/intro.md → “documentation”
We then attach category offsets. Suppose the user priority rules in yek.toml are minimal. The final priority of each file is the sum of its category offset, its user priority, and the optional Git recency. With the example offsets above, the chunking logic places all config files before the test files, which appear before doc files, which appear before main source files; the exact scheme is still open.
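The example can be worked through with a small sketch. It assumes that files are emitted in ascending order of final priority (so the highest-priority files land last), that user priority and Git recency are both zero, and that ties keep their input order. The Entry struct and the function names are hypothetical.

```rust
/// One file with the three priority components described in the proposal.
#[derive(Debug)]
pub struct Entry {
    pub path: &'static str,
    pub category_offset: i32,
    pub user_priority: i32,
    pub git_recency: i32,
}

impl Entry {
    /// Final priority = category offset + user priority + Git recency.
    pub fn final_priority(&self) -> i32 {
        self.category_offset + self.user_priority + self.git_recency
    }
}

/// Sort ascending so that higher-priority files appear later in the output.
/// `sort_by_key` is stable, so files with equal priority keep their input order.
pub fn order_for_output(mut entries: Vec<Entry>) -> Vec<&'static str> {
    entries.sort_by_key(|e| e.final_priority());
    entries.into_iter().map(|e| e.path).collect()
}

/// The five files from the example, with the example category offsets and
/// no user priority or Git recency (assumption).
pub fn example_project() -> Vec<Entry> {
    let entry = |path, category_offset| Entry {
        path,
        category_offset,
        user_priority: 0,
        git_recency: 0,
    };
    vec![
        entry("src/main.rs", 20),        // source
        entry("tests/test_foo.rs", 10),  // test
        entry("yek.toml", 5),            // configuration
        entry("docker-compose.yml", 5),  // configuration
        entry("docs/intro.md", 15),      // documentation
    ]
}
```

Under these assumptions, order_for_output(example_project()) yields yek.toml, docker-compose.yml, tests/test_foo.rs, docs/intro.md, src/main.rs. A non-zero user priority or recency value simply shifts a file within that order.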
Open Questions
Exact Heuristics
Where do we draw the line between “source” and “configuration”? Some .js or .json files might be config or source depending on context.
Should we allow a fallback or an override in yek.toml?
Optional vs Default
Should category-based sorting be enabled by default or require a flag like --categorize?
Custom Category Definition
Should advanced users be able to define their own category patterns with custom offsets?
Next Steps
Implement a classification function that inspects paths/extensions.
Integrate the resulting category offset into the existing priority computation.
Add or update tests ensuring we correctly label test/config/source files.
Decide whether to enable by default, or guard behind a config/CLI switch.
Feel free to comment with additional suggestions or open a PR implementing this feature.