Code Indexer Loop is a Python library designed to index and retrieve code snippets.
It uses the indexing utilities of the LlamaIndex library and the multi-language tree-sitter library to parse code from many popular programming languages. tiktoken is used to right-size retrieval based on the number of tokens, and LangChain is used to obtain embeddings (defaulting to OpenAI's text-embedding-ada-002) and store them in an embedded ChromaDB vector database. watchdog is used to continuously update the index based on file system events.
Read the launch blog post for more details about why we've built this!
Use pip to install Code Indexer Loop from PyPI:

```shell
pip install code-indexer-loop
```
- Import necessary modules:

```python
from code_indexer_loop.api import CodeIndexer
```

- Create a CodeIndexer object and have it watch for changes:

```python
indexer = CodeIndexer(src_dir="path/to/code/", watch=True)
```

- Use .query to perform a search query:

```python
query = "pandas"
print(indexer.query(query)[0:30])
```
Note: make sure the OPENAI_API_KEY environment variable is set. This is needed for generating the embeddings.
You can also use indexer.query_nodes to get the nodes of a query, or indexer.query_documents to receive the entire source code files.
Note that if you edit any of the source code files in the src_dir, the library will efficiently re-index those files using watchdog and an md5-based caching mechanism. This results in up-to-date embeddings every time you query the index.
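To illustrate the idea behind the md5-based cache (this is a minimal sketch, not the library's actual implementation), a file only needs re-embedding when the hash of its content no longer matches the cached value:

```python
import hashlib

# Hypothetical sketch of an md5-based cache: re-index a file only
# when its content hash differs from the last one we recorded.
_hash_cache: dict[str, str] = {}

def needs_reindex(path: str, content: str) -> bool:
    """Return True if `content` differs from the cached hash for `path`."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if _hash_cache.get(path) == digest:
        return False  # unchanged: skip re-embedding
    _hash_cache[path] = digest  # record the new hash
    return True
```

A watchdog event handler can call a check like this and skip the (comparatively expensive) embedding step for files whose content is unchanged.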
Check out the basic_usage notebook for a quick overview of the API.
You can configure token limits for the chunks through the CodeIndexer constructor:
```python
indexer = CodeIndexer(
    src_dir="path/to/code/", watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)
```
Note that you can choose whether max_chunk_tokens is enforced. If it is, an exception is raised whenever there is no semantic parsing that respects max_chunk_tokens.
The coalesce argument controls the limit for combining smaller chunks into single chunks, to avoid having many very small chunks. The unit for coalesce is also tokens.
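The coalescing step can be sketched as follows. This is an illustration rather than the library's actual code, and a naive whitespace token count stands in for tiktoken:

```python
def count_tokens(text: str) -> int:
    # Stand-in for a tiktoken-based count: whitespace tokens.
    return len(text.split())

def coalesce_chunks(chunks: list[str], coalesce: int) -> list[str]:
    """Merge each chunk into a running buffer until the buffer reaches
    `coalesce` tokens, so no emitted chunk is tiny."""
    merged: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if count_tokens(buffer) >= coalesce:
            merged.append(buffer)
            buffer = ""
    if buffer:  # flush any small remainder
        merged.append(buffer)
    return merged
```

Because chunks are only concatenated, never trimmed, coalescing preserves the property that joining all chunks reproduces the original text.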
Using tree-sitter for parsing, chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.
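To show the idea of splitting only at node boundaries without requiring a compiled tree-sitter grammar, here is a sketch using Python's stdlib ast module as a stand-in (the library itself uses tree-sitter, not ast):

```python
import ast

def split_at_top_level_nodes(source: str) -> list[str]:
    """Split `source` only at the start of top-level AST nodes, so a
    function or class definition is never cut in half.
    (Illustrative stand-in for tree-sitter node-boundary splitting.)"""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    starts = [node.lineno for node in tree.body] or [1]
    starts[0] = 1  # keep any leading comments in the first chunk
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else len(lines)
        chunks.append("".join(lines[start - 1:end]))
    return chunks
```

Each chunk spans whole top-level statements, and since the spans are gap-free, joining the chunks reconstructs the source exactly.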
C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript
Note, we're mainly testing Python support. Use other languages at your own peril.
Pull requests are welcome. Please make sure to update tests as appropriate. Use the tools provided within the dev dependencies to maintain the code standard.
Run the unit tests by invoking pytest in the root of the repository.
Please see the LICENSE file provided with the source code.
We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic here and here. The implementation in code_indexer_loop is modified from their original implementation, mainly to limit chunks based on tokens instead of characters and to achieve perfect document reconstruction ("".join(chunks) == original_source_code).
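The reconstruction invariant is easy to state in code. This sketch uses a hypothetical offset-based splitter to show the property being claimed: because chunks are non-overlapping, gap-free spans of the source, joining them restores the file byte-for-byte:

```python
def chunk_by_offsets(source: str, offsets: list[int]) -> list[str]:
    """Split `source` at the given character offsets (hypothetical helper,
    not part of the code_indexer_loop API)."""
    bounds = [0] + sorted(offsets) + [len(source)]
    return [source[a:b] for a, b in zip(bounds, bounds[1:])]

source = "def f():\n    return 1\n"
chunks = chunk_by_offsets(source, [9])
# Perfect document reconstruction: no characters are lost or duplicated.
assert "".join(chunks) == source
```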