improvement(kb): optimize processes, add more robust fallbacks for large file ops #2684
Greptile Summary

This PR optimizes knowledge base document processing, with a focus on handling large files more robustly: oversized PDFs are split into chunks and OCR'd with bounded concurrency, DOC/DOCX parsing gains a secondary fallback parser, and embeddings are written to the database in batches. The changes address earlier review-thread concerns about transaction safety and improve overall robustness for large document operations.

Confidence Score: 4/5
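The primary/fallback parsing behavior described above can be sketched as a small helper. This is an illustrative sketch only: the function and parser names here are hypothetical, while the PR's actual DOC/DOCX parsers are mammoth and officeparser.

```typescript
// Illustrative primary-with-fallback parse helper (not the PR's actual code).
type Parser = (buffer: Buffer) => Promise<string>;

async function parseWithFallback(
  buffer: Buffer,
  primary: Parser,
  fallback: Parser
): Promise<string> {
  try {
    const text = await primary(buffer);
    // Treat an empty result as a soft failure so the fallback still runs.
    if (text.trim().length > 0) return text;
    throw new Error('primary parser returned no text');
  } catch (primaryErr) {
    try {
      return await fallback(buffer);
    } catch (fallbackErr) {
      // Surface both errors if the fallback also fails.
      throw new Error(
        `both parsers failed: ${String(primaryErr)}; ${String(fallbackErr)}`
      );
    }
  }
}
```

Treating an empty parse result as a failure matters for "robust fallbacks": some parsers return empty output on malformed files rather than throwing.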
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Frontend (base.tsx)
    participant API as Document Service
    participant Processor as Document Processor
    participant OCR as Mistral OCR API
    participant Parser as File Parser (DOC/DOCX)
    participant S3 as S3 Storage
    participant DB as Database

    Client->>API: processDocumentAsync(documentId)
    activate API
    API->>DB: Update status to 'processing'
    API->>Processor: processDocument(fileUrl, mimeType)
    activate Processor
    alt PDF with OCR enabled
        Processor->>Processor: getPdfPageCount(buffer)
        alt Page count > 1000
            Processor->>Processor: splitPdfIntoChunks(buffer, 1000)
            loop For each chunk batch (MAX_CONCURRENT_CHUNKS)
                Processor->>S3: Upload chunk PDF
                S3-->>Processor: presigned URL
                Processor->>OCR: Process chunk via Mistral OCR
                OCR-->>Processor: Extracted text
                Processor->>S3: Delete chunk PDF
            end
            Processor->>Processor: Combine all chunk results
        else Page count <= 1000
            Processor->>S3: Upload full PDF
            S3-->>Processor: presigned URL
            Processor->>OCR: Process full PDF
            OCR-->>Processor: Extracted text
        end
    else DOC/DOCX file
        alt Primary parser (mammoth/officeparser)
            Processor->>Parser: Parse with primary parser
            Parser-->>Processor: Extracted text
        else Primary fails
            Processor->>Parser: Fallback to secondary parser
            Parser-->>Processor: Extracted text
        end
    end
    Processor->>Processor: Chunk content (TextChunker)
    Processor-->>API: {chunks, metadata}
    deactivate Processor
    API->>API: generateEmbeddings(chunks) in batches
    API->>DB: BEGIN TRANSACTION
    activate DB
    API->>DB: DELETE embeddings for documentId
    loop For each batch of embeddings
        API->>DB: INSERT embedding batch
    end
    API->>DB: UPDATE document status to 'completed'
    DB-->>API: COMMIT TRANSACTION
    deactivate DB
    API-->>Client: Success
    deactivate API
    Note over Client: React Query refetchInterval (3s)<br/>polls while processing=true
    Client->>API: Fetch documents (auto-refresh)
    API-->>Client: Updated document status
```
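The "For each chunk batch (MAX_CONCURRENT_CHUNKS)" loop in the diagram amounts to bounded-concurrency batching: chunks within a batch run in parallel, batches run sequentially. A minimal sketch of that pattern, assuming a generic worker (the helper name `processInBatches` and the constant value are illustrative, not the PR's actual identifiers):

```typescript
// Bounded-concurrency batching: at most `batchSize` workers in flight at once.
// Illustrative sketch of the diagram's chunk loop, not the PR's actual code.
const MAX_CONCURRENT_CHUNKS = 4; // assumed value for illustration

async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Each batch runs concurrently; batches themselves run sequentially.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

In the diagram's flow, the per-chunk worker would upload the chunk PDF to S3, send the presigned URL to Mistral OCR, delete the chunk, and return the extracted text; the combined results are then joined into the full document text.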
20 files reviewed, 1 comment
21 files reviewed, 2 comments
21 files reviewed, 1 comment
Force-pushed from c62a575 to c065eb7
21 files reviewed, 3 comments
improvement(kb): optimize processes, add more robust fallbacks for large file ops (#2684)

* improvement(kb): optimize processes, add more robust fallbacks for large file ops
* stronger typing
* comments cleanup
* ack PR comments
* upgraded turborepo
* ack more PR comments
* fix failing test
* moved doc update inside tx for embeddings chunks upload
* ack more PR comments
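The commit "moved doc update inside tx for embeddings chunks upload" is the transaction-safety fix: if the status flip to 'completed' happens outside the transaction, a crash between the embedding inserts and the status update can leave the document marked complete with partial embeddings. A minimal sketch of the intended shape, assuming a generic transaction runner (the `Tx` interface and method names here are hypothetical, not the PR's actual schema or ORM calls):

```typescript
// Illustrative sketch: delete stale embeddings, insert new ones in batches,
// and flip the document status, all inside one transaction so a mid-write
// failure rolls everything back together. Not the PR's actual code.
interface Tx {
  deleteEmbeddings(documentId: string): Promise<void>;
  insertEmbeddings(rows: number[][]): Promise<void>;
  updateDocumentStatus(documentId: string, status: string): Promise<void>;
}

async function storeEmbeddings(
  runInTransaction: (fn: (tx: Tx) => Promise<void>) => Promise<void>,
  documentId: string,
  embeddings: number[][],
  batchSize: number
): Promise<void> {
  await runInTransaction(async (tx) => {
    // Replace any stale rows from a previous processing run.
    await tx.deleteEmbeddings(documentId);
    for (let i = 0; i < embeddings.length; i += batchSize) {
      await tx.insertEmbeddings(embeddings.slice(i, i + batchSize));
    }
    // Status update lives inside the same transaction, so 'completed'
    // is only ever visible alongside a full set of embeddings.
    await tx.updateDocumentStatus(documentId, 'completed');
  });
}
```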
Summary
Type of Change
Testing
Tested manually
Checklist