improvement(kb): optimize processes, add more robust fallbacks for large file ops #2684
Greptile Summary

This PR optimizes knowledge base document processing, with a focus on handling large files more robustly: oversized PDFs are split into chunks and OCR'd with bounded concurrency, DOC/DOCX parsing gains a secondary fallback parser, and embeddings are written to the database in batches. The changes address earlier review-thread concerns about transaction safety and improve overall robustness for large document operations.

Confidence Score: 4/5
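The primary/fallback parsing behavior described above can be sketched as a small helper. This is an illustrative sketch only: the function and parser names here are hypothetical, while the PR's actual DOC/DOCX parsers are mammoth and officeparser.

```typescript
// Illustrative primary-with-fallback parse helper (not the PR's actual code).
type Parser = (buffer: Buffer) => Promise<string>;

async function parseWithFallback(
  buffer: Buffer,
  primary: Parser,
  fallback: Parser
): Promise<string> {
  try {
    const text = await primary(buffer);
    // Treat an empty result as a soft failure so the fallback still runs.
    if (text.trim().length > 0) return text;
    throw new Error('primary parser returned no text');
  } catch (primaryErr) {
    try {
      return await fallback(buffer);
    } catch (fallbackErr) {
      // Surface both errors if the fallback also fails.
      throw new Error(
        `both parsers failed: ${String(primaryErr)}; ${String(fallbackErr)}`
      );
    }
  }
}
```

Treating an empty parse result as a failure matters for "robust fallbacks": some parsers return empty output on malformed files rather than throwing.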
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Frontend (base.tsx)
    participant API as Document Service
    participant Processor as Document Processor
    participant OCR as Mistral OCR API
    participant Parser as File Parser (DOC/DOCX)
    participant S3 as S3 Storage
    participant DB as Database

    Client->>API: processDocumentAsync(documentId)
    activate API
    API->>DB: Update status to 'processing'
    API->>Processor: processDocument(fileUrl, mimeType)
    activate Processor
    alt PDF with OCR enabled
        Processor->>Processor: getPdfPageCount(buffer)
        alt Page count > 1000
            Processor->>Processor: splitPdfIntoChunks(buffer, 1000)
            loop For each chunk batch (MAX_CONCURRENT_CHUNKS)
                Processor->>S3: Upload chunk PDF
                S3-->>Processor: presigned URL
                Processor->>OCR: Process chunk via Mistral OCR
                OCR-->>Processor: Extracted text
                Processor->>S3: Delete chunk PDF
            end
            Processor->>Processor: Combine all chunk results
        else Page count <= 1000
            Processor->>S3: Upload full PDF
            S3-->>Processor: presigned URL
            Processor->>OCR: Process full PDF
            OCR-->>Processor: Extracted text
        end
    else DOC/DOCX file
        alt Primary parser (mammoth/officeparser)
            Processor->>Parser: Parse with primary parser
            Parser-->>Processor: Extracted text
        else Primary fails
            Processor->>Parser: Fallback to secondary parser
            Parser-->>Processor: Extracted text
        end
    end
    Processor->>Processor: Chunk content (TextChunker)
    Processor-->>API: {chunks, metadata}
    deactivate Processor
    API->>API: generateEmbeddings(chunks) in batches
    API->>DB: BEGIN TRANSACTION
    activate DB
    API->>DB: DELETE embeddings for documentId
    loop For each batch of embeddings
        API->>DB: INSERT embedding batch
    end
    API->>DB: UPDATE document status to 'completed'
    DB-->>API: COMMIT TRANSACTION
    deactivate DB
    API-->>Client: Success
    deactivate API
    Note over Client: React Query refetchInterval (3s)<br/>polls while processing=true
    Client->>API: Fetch documents (auto-refresh)
    API-->>Client: Updated document status
```
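The "For each chunk batch (MAX_CONCURRENT_CHUNKS)" loop in the diagram amounts to bounded-concurrency batching: chunks within a batch run in parallel, batches run sequentially. A minimal sketch of that pattern, assuming a generic worker (the helper name `processInBatches` and the constant value are illustrative, not the PR's actual identifiers):

```typescript
// Bounded-concurrency batching: at most `batchSize` workers in flight at once.
// Illustrative sketch of the diagram's chunk loop, not the PR's actual code.
const MAX_CONCURRENT_CHUNKS = 4; // assumed value for illustration

async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Each batch runs concurrently; batches themselves run sequentially.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

In the diagram's flow, the per-chunk worker would upload the chunk PDF to S3, send the presigned URL to Mistral OCR, delete the chunk, and return the extracted text; the combined results are then joined into the full document text.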
20 files reviewed, 1 comment
21 files reviewed, 2 comments
21 files reviewed, 1 comment
Force-pushed from c62a575 to c065eb7
21 files reviewed, 3 comments
improvement(kb): optimize processes, add more robust fallbacks for large file ops (#2684)

* improvement(kb): optimize processes, add more robust fallbacks for large file ops
* stronger typing
* comments cleanup
* ack PR comments
* upgraded turborepo
* ack more PR comments
* fix failing test
* moved doc update inside tx for embeddings chunks upload
* ack more PR comments
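The commit "moved doc update inside tx for embeddings chunks upload" is the transaction-safety fix: if the status flip to 'completed' happens outside the transaction, a crash between the embedding inserts and the status update can leave the document marked complete with partial embeddings. A minimal sketch of the intended shape, assuming a generic transaction runner (the `Tx` interface and method names here are hypothetical, not the PR's actual schema or ORM calls):

```typescript
// Illustrative sketch: delete stale embeddings, insert new ones in batches,
// and flip the document status, all inside one transaction so a mid-write
// failure rolls everything back together. Not the PR's actual code.
interface Tx {
  deleteEmbeddings(documentId: string): Promise<void>;
  insertEmbeddings(rows: number[][]): Promise<void>;
  updateDocumentStatus(documentId: string, status: string): Promise<void>;
}

async function storeEmbeddings(
  runInTransaction: (fn: (tx: Tx) => Promise<void>) => Promise<void>,
  documentId: string,
  embeddings: number[][],
  batchSize: number
): Promise<void> {
  await runInTransaction(async (tx) => {
    // Replace any stale rows from a previous processing run.
    await tx.deleteEmbeddings(documentId);
    for (let i = 0; i < embeddings.length; i += batchSize) {
      await tx.insertEmbeddings(embeddings.slice(i, i + batchSize));
    }
    // Status update lives inside the same transaction, so 'completed'
    // is only ever visible alongside a full set of embeddings.
    await tx.updateDocumentStatus(documentId, 'completed');
  });
}
```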
Summary
Type of Change
Testing
Tested manually
Checklist