feat(kb): added json/yaml parser+chunker, added dedicated csv chunker #1539
Greptile Overview
Summary
This PR introduces comprehensive structured data processing capabilities to the knowledge base system by adding dedicated parsers and chunkers for JSON, YAML, and CSV file formats. The changes address a critical issue where CSV files were being chunked row-by-row, causing database batch insert failures due to excessive chunk creation.
The implementation includes three new chunker types: JsonYamlChunker for hierarchical data structures, StructuredDataChunker for tabular data (CSV/Excel), and an enhanced TextChunker with improved markdown-aware splitting. These chunkers intelligently group related data together - for example, batching CSV rows into appropriately-sized chunks while preserving headers for context, and keeping JSON objects semantically intact when possible.
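The row-batching idea described above can be sketched as follows. This is a minimal illustration, not the PR's StructuredDataChunker: the `chunkRowsWithHeader` function, the `RowChunk` shape, and the character-based `maxChars` budget are all assumptions made for the example (the real chunker may budget by tokens instead).

```typescript
// Hypothetical sketch: names and the character budget are assumptions,
// not taken from structured-data-chunker.ts.
interface RowChunk {
  text: string // header line followed by the grouped data rows
  startRow: number // 0-based index of the first data row in this chunk
  endRow: number // 0-based index of the last data row (inclusive)
}

function chunkRowsWithHeader(header: string, rows: string[], maxChars: number): RowChunk[] {
  const chunks: RowChunk[] = []
  let current: string[] = []
  let currentSize = header.length
  let startRow = 0

  // Emit the rows collected so far as one chunk, prefixed with the header.
  const flush = (endRow: number): void => {
    if (current.length === 0) return
    chunks.push({ text: [header, ...current].join('\n'), startRow, endRow })
    current = []
    currentSize = header.length
  }

  rows.forEach((row, i) => {
    const added = row.length + 1 // +1 for the joining newline
    // Start a new chunk when this row would exceed the budget.
    // A single oversized row still gets its own chunk.
    if (current.length > 0 && currentSize + added > maxChars) {
      flush(i - 1)
      startRow = i
    }
    current.push(row)
    currentSize += added
  })
  flush(rows.length - 1)
  return chunks
}
```

Because every chunk repeats the header line, each chunk can be embedded and retrieved on its own without losing the column names, and the number of chunks grows with the data size divided by the budget rather than with the row count.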
The file parsing layer has been extended with new json-parser.ts and yaml-parser.ts modules, while the existing csv-parser.ts has been completely rewritten to use streaming with the csv-parse library for better memory efficiency. The validation system now accepts JSON/YAML/YML file types, and both upload modals have been updated to reflect the newly supported formats.
To handle large documents more reliably, the system batches work at two levels, embedding generation (50 per batch) and database insertion (500 records per batch), and adds timeout and size limits. A new process-docs.ts script consolidates documentation processing functionality, replacing the previous separate scripts.
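The two batch sizes mentioned above could be applied with a small generic helper along these lines. Only the numbers 50 and 500 come from the summary; `toBatches`, `processInBatches`, and the constant names are hypothetical and do not reflect the actual code in embeddings/utils.ts or documents/service.ts.

```typescript
// Batch sizes are taken from the review summary; everything else is illustrative.
const EMBEDDING_BATCH_SIZE = 50
const DB_INSERT_BATCH_SIZE = 500

// Split an array into consecutive slices of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}

// Run an async handler over each batch, one batch at a time.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  handler: (batch: T[]) => Promise<R[]>
): Promise<R[]> {
  const results: R[] = []
  for (const batch of toBatches(items, batchSize)) {
    // Sequential on purpose: only one request or insert is in flight at a time.
    results.push(...(await handler(batch)))
  }
  return results
}
```

Under these assumptions, 1,200 chunks would produce 24 embedding requests and 3 insert statements. Running the batches sequentially trades throughput for predictable load on the embedding API and the database.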
The changes integrate seamlessly with the existing codebase architecture, using the established file parser interface patterns and maintaining backward compatibility with existing chunk structures while adding enhanced metadata tracking for different content types.
Important Files Changed
Changed Files
| Filename | Score | Overview |
|---|---|---|
| apps/sim/package.json | 5/5 | Added csv-parse dependency for enhanced CSV processing |
| apps/sim/lib/chunkers/index.ts | 5/5 | Created centralized export point for all chunker modules |
| apps/sim/lib/chunkers/structured-data-chunker.ts | 4/5 | New intelligent chunker for CSV/Excel data that groups rows semantically |
| apps/sim/app/workspace/[workspaceId]/knowledge/components/create-modal/create-modal.tsx | 4/5 | Updated UI to support JSON/YAML/YML file uploads with some text inconsistencies |
| apps/sim/lib/chunkers/docs-chunker.ts | 4/5 | Refactored and simplified docs chunker with hardcoded production URL |
| apps/sim/lib/uploads/validation.ts | 5/5 | Added JSON/YAML file type validation support |
| apps/sim/lib/chunkers/text-chunker.ts | 4/5 | Refactored to class-based chunker with hierarchical splitting and overlap handling |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Reduced batch size and added delays to prevent API rate limiting |
| apps/sim/lib/file-parsers/xlsx-parser.ts | 4/5 | Major rewrite with memory optimization and chunked processing for large files |
| apps/sim/lib/file-parsers/index.ts | 2/5 | Added JSON/YAML parsers but with architectural inconsistencies and error handling issues |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 3/5 | New chunker for JSON data but lacks actual YAML parsing capability |
| apps/sim/lib/file-parsers/json-parser.ts | 3/5 | New JSON parser with potential runtime errors in depth calculation |
| apps/sim/app/workspace/[workspaceId]/knowledge/[id]/components/upload-modal/upload-modal.tsx | 5/5 | Clean update to support new file formats in upload modal |
| apps/sim/lib/chunkers/types.ts | 5/5 | New type definitions supporting structured data chunking |
| apps/sim/lib/file-parsers/csv-parser.ts | 3/5 | Complete rewrite with streaming but potential memory leak and timing issues |
| apps/sim/scripts/process-docs.ts | 4/5 | New comprehensive documentation processing script with batch handling |
| apps/sim/lib/knowledge/documents/service.ts | 4/5 | Enhanced with large document handling and batch processing limits |
| apps/sim/lib/file-parsers/yaml-parser.ts | 3/5 | New YAML parser with unsafe loading and potential runtime errors |
| apps/sim/lib/knowledge/documents/document-processor.ts | 4/5 | Intelligent chunker selection based on content type with enhanced metadata |
| .github/workflows/docs-embeddings.yml | 4/5 | Updated to use new consolidated documentation processing script |
| apps/sim/scripts/process-docs-embeddings.ts | 3/5 | Deleted 215-line script as part of consolidation effort |
| apps/sim/scripts/chunk-docs.ts | 4/5 | Removed utility script for docs chunking as functionality moved to main pipeline |
| apps/sim/lib/env.ts | 2/5 | Removed PostgreSQL SSL configuration variables potentially affecting security |
Confidence score: 3/5
- This PR introduces significant functionality but has several implementation issues that could cause runtime problems
- Score reflects well-intentioned architectural improvements undermined by technical debt in key parsing components and potential security concerns
- Pay close attention to json-parser.ts, yaml-parser.ts, csv-parser.ts, and file-parsers/index.ts for runtime errors and architectural inconsistencies
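The review does not say what exactly goes wrong in the json-parser.ts depth calculation. As a hedged illustration only, two common pitfalls in this kind of code are that `typeof null` is `'object'` (so calling `Object.values(null)` throws) and that `Math.max()` with no arguments returns `-Infinity` for empty containers. The `jsonDepth` function below is a hypothetical sketch that guards against both; it is not the PR's implementation.

```typescript
// Hypothetical sketch of a defensive depth calculation for parsed JSON.
// Primitives and null have depth 0; each object or array level adds 1.
function jsonDepth(value: unknown): number {
  // typeof null === 'object', so null must be excluded explicitly.
  if (value === null || typeof value !== 'object') return 0

  const children: unknown[] = Array.isArray(value)
    ? value
    : Object.values(value as Record<string, unknown>)

  // Start from 0 instead of spreading into Math.max, so empty
  // containers yield depth 1 rather than -Infinity.
  let deepest = 0
  for (const child of children) {
    deepest = Math.max(deepest, jsonDepth(child))
  }
  return 1 + deepest
}
```

The recursion is bounded by the nesting depth of the input, so extremely deep documents could still exhaust the call stack; an iterative traversal would be needed if that is a concern.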
Sequence Diagram
sequenceDiagram
    participant User
    participant UploadModal
    participant DocumentService
    participant FileParser
    participant Chunker
    participant DocumentProcessor
    participant Database
    User->>UploadModal: "Upload files (JSON, YAML, CSV)"
    UploadModal->>UploadModal: "Validate file types and size"
    UploadModal->>DocumentService: "createDocumentRecords(documents)"
    DocumentService->>Database: "Insert document records"
    Database-->>DocumentService: "Document IDs created"
    DocumentService->>DocumentProcessor: "processDocumentsWithQueue()"
    loop For each document
        DocumentProcessor->>FileParser: "parseFile() or parseBuffer()"
        alt JSON/YAML file
            FileParser->>FileParser: "Parse and format JSON/YAML"
            FileParser-->>DocumentProcessor: "Structured content"
            DocumentProcessor->>Chunker: "JsonYamlChunker.chunkJsonYaml()"
            Chunker->>Chunker: "Chunk by object/array structure"
        else CSV/Structured data
            FileParser->>FileParser: "Parse CSV with streaming"
            FileParser-->>DocumentProcessor: "Tabular content"
            DocumentProcessor->>Chunker: "StructuredDataChunker.chunkStructuredData()"
            Chunker->>Chunker: "Group rows intelligently"
        else Other file types
            DocumentProcessor->>Chunker: "TextChunker.chunk()"
            Chunker->>Chunker: "Hierarchical text splitting"
        end
        Chunker-->>DocumentProcessor: "Generated chunks"
        DocumentProcessor->>DocumentProcessor: "Generate embeddings for chunks"
        DocumentProcessor->>Database: "Insert embeddings in batches"
        DocumentProcessor->>Database: "Update document status to completed"
    end
    DocumentProcessor-->>DocumentService: "Processing complete"
    DocumentService-->>UploadModal: "Upload successful"
    UploadModal-->>User: "Files processed successfully"
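The `alt` branches in the diagram amount to a dispatch from file type to chunker. A minimal sketch of that dispatch is below. It assumes selection by file extension, which may differ from how document-processor.ts actually decides (the file table describes it as based on content type); `ChunkerKind`, `selectChunkerKind`, and the extension sets are hypothetical names for illustration.

```typescript
// Hypothetical sketch: the kinds mirror the three chunkers named in the diagram,
// but the extension-based rule is an assumption.
type ChunkerKind = 'json-yaml' | 'structured-data' | 'text'

const JSON_YAML_EXTENSIONS = new Set(['json', 'yaml', 'yml'])
const STRUCTURED_EXTENSIONS = new Set(['csv', 'xlsx', 'xls'])

function selectChunkerKind(filename: string): ChunkerKind {
  const dot = filename.lastIndexOf('.')
  // Files without an extension fall through to the text chunker.
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : ''

  if (JSON_YAML_EXTENSIONS.has(ext)) return 'json-yaml' // JsonYamlChunker branch
  if (STRUCTURED_EXTENSIONS.has(ext)) return 'structured-data' // StructuredDataChunker branch
  return 'text' // TextChunker branch for all other file types
}
```

Keeping the decision in one pure function makes the fallback explicit: anything unrecognized goes to the text chunker, matching the "Other file types" branch.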
Additional Comments (2)
- apps/sim/app/workspace/[workspaceId]/knowledge/components/create-modal/create-modal.tsx, line 545. syntax: Inconsistent file type listing; JSON, YAML, and YML are missing compared to the primary drop zone description and error message.
- apps/sim/lib/chunkers/text-chunker.ts, line 178. style: Complex metadata calculation with overlap adjustments could benefit from extraction into a separate method for better maintainability.
Context Used: Context from dashboard - If a switch statement is large and handles multiple cases, extract each case into separate functions...
23 files reviewed, 13 comments
Summary
Type of Change
Testing
Tested manually.
Checklist