@waleedlatif1
Collaborator

Summary

  • Added a JSON/YAML parser and chunker, plus a dedicated CSV chunker
  • Previously, the chunker created a separate chunk for each CSV row, which caused the batch insert into the DB to fail

Type of Change

  • New feature

Testing

Tested manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Oct 3, 2025

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| sim | Ready | Preview | Comment | Oct 4, 2025 10:03pm |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs | Skipped | | | Oct 4, 2025 10:03pm |

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This PR introduces comprehensive structured data processing capabilities to the knowledge base system by adding dedicated parsers and chunkers for JSON, YAML, and CSV file formats. The changes address a critical issue where CSV files were being chunked row-by-row, causing database batch insert failures due to excessive chunk creation.

The implementation includes three new chunker types: JsonYamlChunker for hierarchical data structures, StructuredDataChunker for tabular data (CSV/Excel), and an enhanced TextChunker with improved markdown-aware splitting. These chunkers intelligently group related data together - for example, batching CSV rows into appropriately-sized chunks while preserving headers for context, and keeping JSON objects semantically intact when possible.
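The row-grouping idea described above can be sketched as follows. This is a minimal illustration, not the PR's StructuredDataChunker: the function name `chunkCsvRows`, the `RowChunk` shape, and the character-based size limit are all assumptions made for the example.

```typescript
// Hypothetical sketch: group CSV rows into size-bounded chunks and repeat
// the header line at the top of every chunk so each chunk keeps its context.
interface RowChunk {
  text: string
  rowCount: number
}

function chunkCsvRows(header: string, rows: string[], maxChars: number): RowChunk[] {
  const chunks: RowChunk[] = []
  let current: string[] = []
  let size = header.length

  // Emit the rows collected so far as one chunk, prefixed with the header.
  const flush = (): void => {
    if (current.length === 0) return
    chunks.push({ text: [header, ...current].join('\n'), rowCount: current.length })
    current = []
    size = header.length
  }

  for (const row of rows) {
    // +1 accounts for the newline that joins this row to the chunk.
    if (current.length > 0 && size + row.length + 1 > maxChars) flush()
    // A single row larger than maxChars still gets a chunk of its own.
    current.push(row)
    size += row.length + 1
  }
  flush()
  return chunks
}
```

Compared with one chunk per row, a file with thousands of rows yields far fewer chunks under this scheme, which is the failure mode the PR description mentions for the batch insert.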

The file parsing layer has been extended with new json-parser.ts and yaml-parser.ts modules, while the existing csv-parser.ts has been completely rewritten to use streaming with the csv-parse library for better memory efficiency. The validation system now accepts JSON/YAML/YML file types, and both upload modals have been updated to reflect the newly supported formats.

To handle large documents more reliably, the system batches work at multiple levels, covering embedding generation (50 per batch) and database insertion (500 records per batch), and adds timeout and size limits. A new process-docs.ts script consolidates documentation processing functionality, replacing the previous separate scripts.
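The two batch sizes quoted above can be illustrated with a generic batching helper. This is a sketch under assumptions: `toBatches`, `processInBatches`, and the constant names are hypothetical and are not taken from the PR; only the values 50 and 500 come from the review summary.

```typescript
// Batch sizes as quoted in the review summary; the constant names are hypothetical.
const EMBEDDING_BATCH_SIZE = 50
const DB_INSERT_BATCH_SIZE = 500

// Split an array into consecutive slices of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}

// Run an async handler over one batch at a time and collect the results in order.
// Processing sequentially keeps at most one request in flight, which is one
// simple way to stay under an API rate limit.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  handler: (batch: T[]) => Promise<R[]>
): Promise<R[]> {
  const results: R[] = []
  for (const batch of toBatches(items, batchSize)) {
    results.push(...(await handler(batch)))
  }
  return results
}
```

With these helpers, 1,200 chunks would go to the embedding step as 24 batches of 50 and to the insert step as three batches of 500, 500, and 200.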

The changes integrate seamlessly with the existing codebase architecture, using the established file parser interface patterns and maintaining backward compatibility with existing chunk structures while adding enhanced metadata tracking for different content types.

Important Files Changed

Changed Files
| Filename | Score | Overview |
| --- | --- | --- |
| apps/sim/package.json | 5/5 | Added csv-parse dependency for enhanced CSV processing |
| apps/sim/lib/chunkers/index.ts | 5/5 | Created centralized export point for all chunker modules |
| apps/sim/lib/chunkers/structured-data-chunker.ts | 4/5 | New intelligent chunker for CSV/Excel data that groups rows semantically |
| apps/sim/app/workspace/[workspaceId]/knowledge/components/create-modal/create-modal.tsx | 4/5 | Updated UI to support JSON/YAML/YML file uploads with some text inconsistencies |
| apps/sim/lib/chunkers/docs-chunker.ts | 4/5 | Refactored and simplified docs chunker with hardcoded production URL |
| apps/sim/lib/uploads/validation.ts | 5/5 | Added JSON/YAML file type validation support |
| apps/sim/lib/chunkers/text-chunker.ts | 4/5 | Refactored to class-based chunker with hierarchical splitting and overlap handling |
| apps/sim/lib/embeddings/utils.ts | 4/5 | Reduced batch size and added delays to prevent API rate limiting |
| apps/sim/lib/file-parsers/xlsx-parser.ts | 4/5 | Major rewrite with memory optimization and chunked processing for large files |
| apps/sim/lib/file-parsers/index.ts | 2/5 | Added JSON/YAML parsers but with architectural inconsistencies and error handling issues |
| apps/sim/lib/chunkers/json-yaml-chunker.ts | 3/5 | New chunker for JSON data but lacks actual YAML parsing capability |
| apps/sim/lib/file-parsers/json-parser.ts | 3/5 | New JSON parser with potential runtime errors in depth calculation |
| apps/sim/app/workspace/[workspaceId]/knowledge/[id]/components/upload-modal/upload-modal.tsx | 5/5 | Clean update to support new file formats in upload modal |
| apps/sim/lib/chunkers/types.ts | 5/5 | New type definitions supporting structured data chunking |
| apps/sim/lib/file-parsers/csv-parser.ts | 3/5 | Complete rewrite with streaming but potential memory leak and timing issues |
| apps/sim/scripts/process-docs.ts | 4/5 | New comprehensive documentation processing script with batch handling |
| apps/sim/lib/knowledge/documents/service.ts | 4/5 | Enhanced with large document handling and batch processing limits |
| apps/sim/lib/file-parsers/yaml-parser.ts | 3/5 | New YAML parser with unsafe loading and potential runtime errors |
| apps/sim/lib/knowledge/documents/document-processor.ts | 4/5 | Intelligent chunker selection based on content type with enhanced metadata |
| .github/workflows/docs-embeddings.yml | 4/5 | Updated to use new consolidated documentation processing script |
| apps/sim/scripts/process-docs-embeddings.ts | 3/5 | Deleted 215-line script as part of consolidation effort |
| apps/sim/scripts/chunk-docs.ts | 4/5 | Removed utility script for docs chunking as functionality moved to main pipeline |
| apps/sim/lib/env.ts | 2/5 | Removed PostgreSQL SSL configuration variables potentially affecting security |

Confidence score: 3/5

  • This PR introduces significant functionality but has several implementation issues that could cause runtime problems
  • Score reflects well-intentioned architectural improvements undermined by technical debt in key parsing components and potential security concerns
  • Pay close attention to json-parser.ts, yaml-parser.ts, csv-parser.ts, and file-parsers/index.ts for runtime errors and architectural inconsistencies
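The review does not say which inputs trigger the runtime errors in the JSON depth calculation. As an illustration only, here is a hedged sketch of a depth helper that avoids two common pitfalls in JavaScript: `typeof null === 'object'`, and stack overflow from deep recursion. The name `jsonDepth` is hypothetical and this is not the code in json-parser.ts.

```typescript
// Hypothetical sketch: compute the nesting depth of a parsed JSON value.
// Primitives and null have depth 0; an empty object or array has depth 1.
function jsonDepth(value: unknown): number {
  let maxDepth = 0
  // An explicit stack replaces recursion, so deeply nested input cannot
  // overflow the call stack.
  const stack: Array<{ node: unknown; depth: number }> = [{ node: value, depth: 0 }]

  while (stack.length > 0) {
    const { node, depth } = stack.pop()!
    // typeof null is 'object', so null must be excluded explicitly.
    if (node === null || typeof node !== 'object') continue

    const containerDepth = depth + 1
    if (containerDepth > maxDepth) maxDepth = containerDepth

    // Object.values returns the elements of arrays and the property values of objects.
    for (const child of Object.values(node as object)) {
      stack.push({ node: child, depth: containerDepth })
    }
  }
  return maxDepth
}
```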

Sequence Diagram

sequenceDiagram
    participant User
    participant UploadModal
    participant DocumentService
    participant FileParser
    participant Chunker
    participant DocumentProcessor
    participant Database

    User->>UploadModal: "Upload files (JSON, YAML, CSV)"
    UploadModal->>UploadModal: "Validate file types and size"
    UploadModal->>DocumentService: "createDocumentRecords(documents)"
    DocumentService->>Database: "Insert document records"
    Database-->>DocumentService: "Document IDs created"
    DocumentService->>DocumentProcessor: "processDocumentsWithQueue()"
    
    loop For each document
        DocumentProcessor->>FileParser: "parseFile() or parseBuffer()"
        alt JSON/YAML file
            FileParser->>FileParser: "Parse and format JSON/YAML"
            FileParser-->>DocumentProcessor: "Structured content"
            DocumentProcessor->>Chunker: "JsonYamlChunker.chunkJsonYaml()"
            Chunker->>Chunker: "Chunk by object/array structure"
        else CSV/Structured data
            FileParser->>FileParser: "Parse CSV with streaming"
            FileParser-->>DocumentProcessor: "Tabular content"
            DocumentProcessor->>Chunker: "StructuredDataChunker.chunkStructuredData()"
            Chunker->>Chunker: "Group rows intelligently"
        else Other file types
            DocumentProcessor->>Chunker: "TextChunker.chunk()"
            Chunker->>Chunker: "Hierarchical text splitting"
        end
        
        Chunker-->>DocumentProcessor: "Generated chunks"
        DocumentProcessor->>DocumentProcessor: "Generate embeddings for chunks"
        DocumentProcessor->>Database: "Insert embeddings in batches"
        DocumentProcessor->>Database: "Update document status to completed"
    end
    
    DocumentProcessor-->>DocumentService: "Processing complete"
    DocumentService-->>UploadModal: "Upload successful"
    UploadModal-->>User: "Files processed successfully"
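The three-way branch in the diagram can be sketched as a small selection function. This is an assumption-laden illustration: the review only says selection is based on content type, so keying on the file extension, the `ChunkerKind` type, and the `selectChunkerKind` name are all hypothetical.

```typescript
// Hypothetical labels for the three chunkers named in the diagram:
// JsonYamlChunker, StructuredDataChunker, and TextChunker.
type ChunkerKind = 'json-yaml' | 'structured-data' | 'text'

// Pick a chunker from the file extension (an assumption; the actual PR may
// inspect MIME type or content instead).
function selectChunkerKind(filename: string): ChunkerKind {
  const dot = filename.lastIndexOf('.')
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : ''

  switch (ext) {
    case 'json':
    case 'yaml':
    case 'yml':
      return 'json-yaml'
    case 'csv':
    case 'xlsx':
      return 'structured-data'
    default:
      // Everything else falls through to hierarchical text splitting.
      return 'text'
  }
}
```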

Additional Comments (2)

  1. apps/sim/app/workspace/[workspaceId]/knowledge/components/create-modal/create-modal.tsx, line 545 (link)

    syntax: Inconsistent file type listing - missing JSON, YAML, YML compared to the primary drop zone description and error message

  2. apps/sim/lib/chunkers/text-chunker.ts, line 178 (link)

    style: Complex metadata calculation with overlap adjustments could benefit from extraction into a separate method for better maintainability.

    Context Used: Context from dashboard - If a switch statement is large and handles multiple cases, extract each case into separate functions... (source)

23 files reviewed, 13 comments

@vercel vercel bot temporarily deployed to Preview – docs October 4, 2025 21:55 Inactive
@vercel vercel bot temporarily deployed to Preview – docs October 4, 2025 21:58 Inactive
@waleedlatif1 waleedlatif1 merged commit 86ed32e into staging Oct 4, 2025
4 of 5 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/kb branch October 4, 2025 21:59
3 participants