-
Notifications
You must be signed in to change notification settings - Fork 191
Fix: ts upload file does not create index and document store #422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🦋 Changeset detectedLatest commit: bdd7416 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
WalkthroughA new patch titled "create-llama" has been introduced to ensure that both the index and document store are created when a file is uploaded without an existing index. This change modifies the Changes
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
.changeset/big-turtles-own.md (1)
1-5: Consider enhancing the changeset description with more details.While the current description accurately captures the core change, consider adding more context about:
- The specific scenarios this fixes (e.g., "Fixes an issue where uploading files without a pre-existing index would fail to create necessary storage structures")
- The technical changes made (e.g., "Implements proper storage context initialization and document persistence")
- The impact on users (e.g., "Users can now upload files without manually creating indexes first")
--- "create-llama": patch --- -Ensure that the index and document store are created when uploading a file with no available index. +Ensure that the index and document store are created when uploading a file with no available index. + +Previously, uploading files without a pre-existing index would fail to create necessary storage structures. This patch: +- Implements proper storage context initialization +- Ensures document persistence in the store +- Allows users to upload files without manually creating indexes firsttemplates/components/llamaindex/typescript/documents/pipeline.ts (2)
Line range hint
27-42: Inconsistent persistence behavior between branchesThe persistence behavior differs between existing and new index scenarios:
- Existing index (if block): Uses
persist()without directory- New index (else block): Uses
persist(STORAGE_CACHE_DIR)This inconsistency could lead to documents being stored in different locations.
Apply this fix to maintain consistent persistence behavior:
if (currentIndex) { await currentIndex.insertNodes(nodes); - currentIndex.storageContext.docStore.persist(); + currentIndex.storageContext.docStore.persist(STORAGE_CACHE_DIR); console.log("Added nodes to the vector store."); return documents.map((document) => document.id_); } else {
Line range hint
11-45: Consider adding error handling for storage operationsThe function performs critical storage operations without explicit error handling. Consider:
- Handling storage context initialization failures
- Catching and logging persistence errors
- Implementing cleanup in case of partial failures
This would improve reliability and debugging capabilities.
Would you like me to provide an example implementation with proper error handling?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
.changeset/big-turtles-own.md(1 hunks)templates/components/llamaindex/typescript/documents/pipeline.ts(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
templates/components/llamaindex/typescript/documents/pipeline.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
🔇 Additional comments (2)
templates/components/llamaindex/typescript/documents/pipeline.ts (2)
6-6: LGTM: Required imports added for storage context functionality
The new imports are correctly added to support the storage context and persistence directory functionality.
Also applies to: 9-9
36-41: LGTM: Proper storage context initialization and index creation
The implementation correctly:
- Initializes storage context with persistence directory
- Uses the storage context when creating new index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
templates/components/vectordbs/typescript/none/index.ts (2)
21-21: Add missing Document type importThe
Documenttype used in the parameter is not imported.Add this import at the top of the file:
-import { SimpleDocumentStore, VectorStoreIndex } from "llamaindex"; +import { Document, SimpleDocumentStore, VectorStoreIndex } from "llamaindex";
21-30: Consider adding error handlingThe function should handle potential errors during index creation and document persistence.
Consider wrapping the implementation in a try-catch block:
export async function createIndex(documents: Document[]) { + if (!documents?.length) { + throw new Error("No documents provided for index creation"); + } + try { const storageContext = await storageContextFromDefaults({ persistDir: `${STORAGE_CACHE_DIR}`, }); const index = await VectorStoreIndex.fromDocuments(documents, { storageContext, }); index.storageContext.docStore.persist(); return index; + } catch (error) { + throw new Error(`Failed to create index: ${error.message}`); + } }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (2)
templates/components/llamaindex/typescript/documents/pipeline.ts(1 hunks)templates/components/vectordbs/typescript/none/index.ts(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/components/llamaindex/typescript/documents/pipeline.ts
🧰 Additional context used
📓 Path-based instructions (1)
templates/components/vectordbs/typescript/none/index.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
🔇 Additional comments (1)
templates/components/vectordbs/typescript/none/index.ts (1)
21-30: Implementation looks good!
The core implementation correctly:
- Creates a storage context with the same configuration as
getDataSource - Creates an index from the provided documents
- Persists the document store
- Returns the created index
This should fix the issue where the TS upload file wasn't creating the index and document store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
templates/components/vectordbs/typescript/none/index.ts (1)
26-35: Consider performance optimizations for large document setsFor better performance with large document sets, consider these improvements:
- Add batch processing:
export async function createIndex(documents: Document<Metadata>[]) { // ... existing error checking ... const BATCH_SIZE = 100; const storageContext = await storageContextFromDefaults({ persistDir: `${STORAGE_CACHE_DIR}`, }); try { let index: VectorStoreIndex | null = null; // Process documents in batches for (let i = 0; i < documents.length; i += BATCH_SIZE) { const batch = documents.slice(i, i + BATCH_SIZE); if (!index) { index = await VectorStoreIndex.fromDocuments(batch, { storageContext }); } else { await index.insertDocuments(batch); } // Periodic persistence to avoid memory pressure if ((i + 1) % (BATCH_SIZE * 5) === 0) { await Promise.all([ index.storageContext.docStore.persist(), index.storage.persist() ]); } } // Final persistence if (index) { await Promise.all([ index.storageContext.docStore.persist(), index.storage.persist() ]); } return index; } catch (error) { throw new Error(`Failed to create index: ${error.message}`); } }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
templates/components/vectordbs/typescript/none/index.ts(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
templates/components/vectordbs/typescript/none/index.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
🔇 Additional comments (1)
templates/components/vectordbs/typescript/none/index.ts (1)
1-6: LGTM: Import changes are appropriate
The addition of Document and Metadata types from llamaindex is necessary for proper type definitions in the new function.
31b2550 to
ba4d536
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
helpers/env-variables.ts (1)
220-226: Consider adding directory creation handling.To ensure robust operation, consider adding logic to create the cache directory if it doesn't exist. This could prevent potential runtime errors.
Example implementation to add in the relevant file:
import { mkdir } from 'fs/promises'; import { existsSync } from 'fs'; // Add this where the cache directory is used const cacheDir = process.env.STORAGE_CACHE_DIR || '.cache'; if (!existsSync(cacheDir)) { await mkdir(cacheDir, { recursive: true }); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
helpers/env-variables.ts(1 hunks)templates/components/llamaindex/typescript/documents/pipeline.ts(2 hunks)templates/components/vectordbs/typescript/none/shared.ts(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- templates/components/llamaindex/typescript/documents/pipeline.ts
🧰 Additional context used
📓 Path-based instructions (1)
templates/components/vectordbs/typescript/none/shared.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
🔇 Additional comments (1)
helpers/env-variables.ts (1)
220-226: LGTM! Verify storage directory usage across codebase.
The addition of STORAGE_CACHE_DIR environment variable for the default case is appropriate and aligns with the PR's objective of fixing file upload issues.
Let's verify the consistent usage of this storage path across the codebase:
✅ Verification successful
Verified! Storage directory is used consistently across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistent usage of STORAGE_CACHE_DIR and .cache directory
# Search for STORAGE_CACHE_DIR references
echo "Checking STORAGE_CACHE_DIR usage:"
rg "STORAGE_CACHE_DIR"
# Search for .cache directory references
echo -e "\nChecking .cache directory references:"
rg "\.cache"
Length of output: 1101
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
templates/components/vectordbs/typescript/none/index.ts (1)
Line range hint
4-24: Consider these improvements for better type safety and error handling.While the changes look good, here are some suggestions to enhance the function:
- Remove the unused
paramsparameter if it's not needed- Add try-catch for storage context initialization
- Use type guards instead of type casting for docStore
Consider this improved implementation:
- export async function getDataSource(params?: any) { + export async function getDataSource() { const persistDir = process.env.STORAGE_CACHE_DIR; if (!persistDir) { throw new Error("STORAGE_CACHE_DIR environment variable is required!"); } - const storageContext = await storageContextFromDefaults({ - persistDir, - }); + try { + const storageContext = await storageContextFromDefaults({ + persistDir, + }); - const numberOfDocs = Object.keys( - (storageContext.docStore as SimpleDocumentStore).toDict(), - ).length; + const docStore = storageContext.docStore; + if (!(docStore instanceof SimpleDocumentStore)) { + throw new Error("Unexpected document store type"); + } + + const numberOfDocs = Object.keys(docStore.toDict()).length; + if (numberOfDocs === 0) { + return null; + } + return await VectorStoreIndex.init({ + storageContext, + }); + } catch (error) { + throw new Error(`Failed to initialize data source: ${error.message}`); + } - if (numberOfDocs === 0) { - return null; - } - return await VectorStoreIndex.init({ - storageContext, - }); }templates/components/vectordbs/typescript/none/generate.ts (2)
22-25: Enhance environment variable validationWhile the basic validation is good, consider adding more robust checks:
- Verify the directory exists
- Ensure write permissions
- Sanitize the path to prevent injection
Consider this enhanced validation:
const persistDir = process.env.STORAGE_CACHE_DIR; if (!persistDir) { throw new Error("STORAGE_CACHE_DIR environment variable is required!"); } + // Add these validation checks + import { existsSync, accessSync, constants } from 'fs'; + import { resolve } from 'path'; + + const absolutePath = resolve(persistDir); + if (!existsSync(absolutePath)) { + throw new Error(`Storage directory ${persistDir} does not exist!`); + } + try { + accessSync(absolutePath, constants.W_OK); + } catch (err) { + throw new Error(`Storage directory ${persistDir} is not writable!`); + }
Line range hint
17-35: Add detailed logging for better observabilityConsider enhancing the logging to include:
- The resolved storage directory path
- Number of documents processed
- Size of the generated index
Example enhancement:
async function generateDatasource() { console.log(`Generating storage context...`); const persistDir = process.env.STORAGE_CACHE_DIR; if (!persistDir) { throw new Error("STORAGE_CACHE_DIR environment variable is required!"); } + console.log(`Using storage directory: ${resolve(persistDir)}`); const ms = await getRuntime(async () => { const storageContext = await storageContextFromDefaults({ persistDir, }); const documents = await getDocuments(); + console.log(`Processing ${documents.length} documents...`); await VectorStoreIndex.fromDocuments(documents, { storageContext, }); }); console.log(`Storage context successfully generated in ${ms / 1000}s.`); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
templates/components/vectordbs/typescript/none/generate.ts(1 hunks)templates/components/vectordbs/typescript/none/index.ts(1 hunks)templates/components/vectordbs/typescript/none/shared.ts(0 hunks)
💤 Files with no reviewable changes (1)
- templates/components/vectordbs/typescript/none/shared.ts
🧰 Additional context used
📓 Path-based instructions (2)
templates/components/vectordbs/typescript/none/generate.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
templates/components/vectordbs/typescript/none/index.ts (1)
Pattern templates/**: For files under the templates folder, do not report 'Missing Dependencies Detected' errors.
🔇 Additional comments (3)
templates/components/vectordbs/typescript/none/index.ts (2)
5-8: LGTM! Good addition of environment variable validation.
The early validation of STORAGE_CACHE_DIR with a clear error message follows best practices for configuration management and helps prevent undefined behavior.
10-10: LGTM! Proper usage of validated configuration.
The validated persistDir is correctly used in the storage context configuration.
templates/components/vectordbs/typescript/none/generate.ts (1)
28-28: LGTM: Storage context configuration
The storage context configuration correctly uses the validated persistDir.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation