[Bug] Can't decode DOC files #303

amomra · 2024-02-08T12:27:31Z

Context / Scenario

The user uploads a document in the DOC format.

What happened?

When a user uploads a document in the legacy DOC format, the application throws an exception saying that the file is corrupt. The logs show that the application is trying to parse it as a DOCX file.

Importance

I cannot use Kernel Memory

Platform, Language, Versions

It happens on Windows and Linux with dotnet 8.

Relevant log output

System.IO.FileFormatException: File contains corrupted data.
   at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
   at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
   at DocumentFormat.OpenXml.Packaging.PackageLoader.OpenCore(Stream stream, Boolean readWriteMode)
   at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
   at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
   at Microsoft.KernelMemory.DataFormats.Office.MsWordDecoder.DocToText(Stream data) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/DataFormats/Office/MsWordDecoder.cs:line 28
   at Microsoft.KernelMemory.DataFormats.Office.MsWordDecoder.DocToText(BinaryData data) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/DataFormats/Office/MsWordDecoder.cs:line 23
   at Microsoft.KernelMemory.Handlers.TextExtractionHandler.ExtractTextAsync(FileDetails uploadedFile, BinaryData fileContent, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Handlers/TextExtractionHandler.cs:line 140
   at Microsoft.KernelMemory.Handlers.TextExtractionHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Handlers/TextExtractionHandler.cs:line 78
   at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.RunPipelineStepAsync(DataPipeline pipeline, IPipelineStepHandler handler, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 214
   at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.<>c__DisplayClass3_0.<<AddHandlerAsync>b__0>d.MoveNext() in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 148
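The trace above shows the legacy DOC bytes reaching the ZIP-based OpenXML package loader, which then reports them as corrupted. Legacy DOC files are OLE2 Compound File containers, while DOCX files are ZIP archives, so the two can be told apart by their leading bytes before any parser is chosen. The following is a minimal, language-agnostic sketch in Python, not Kernel Memory code; `sniff_container` is a hypothetical helper name.

```python
# Hypothetical helper (not part of Kernel Memory): classify an upload by its file signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary: legacy .doc, .xls, .ppt
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header: .docx and other OOXML files


def sniff_container(data: bytes) -> str:
    """Return 'ole2', 'zip', or 'unknown' based on the leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

The signature identifies only the container type. An `ole2` result could also be an old Excel or PowerPoint file, and a `zip` result could be any ZIP archive, so a check like this would complement the file extension or MIME type rather than replace it.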

dluc · 2024-02-09T22:16:50Z

@amomra how would you prefer to handle the legacy DOC format? Ignore and skip the files, maybe logging an error?

amomra · 2024-02-10T14:09:19Z

@dluc I observed that the app enters an error loop when you use RabbitMQ. This happens because the extractor throws an error on the DOC file, the message is requeued, and it is immediately retrieved again, causing the loop.
So, at least for now, I think DOC files should be ignored until there is a way to handle these files, since there isn't a portable way to do that (there is the Office Interop library, but I think it works only on Windows).
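The loop described in this comment comes from treating a permanent failure (an unsupported format) the same way as a temporary one. A rough sketch of the alternative, using an in-memory queue instead of RabbitMQ: unsupported files are acknowledged and skipped, and only transient errors are retried, with a cap. All names here (`UnsupportedFormatError`, `TransientError`, `drain`, `example_handler`) are hypothetical and do not come from Kernel Memory or a RabbitMQ client.

```python
from collections import deque


class UnsupportedFormatError(Exception):
    """Permanent failure: retrying cannot succeed."""


class TransientError(Exception):
    """Temporary failure: a retry may succeed."""


def example_handler(name):
    # Hypothetical handler: rejects legacy .doc files, fails on one flaky file.
    if name.endswith(".doc"):
        raise UnsupportedFormatError(name)
    if name == "flaky.txt":
        raise TransientError(name)


def drain(queue, handler, max_attempts=3):
    """Process (message, attempts) pairs until the queue is empty."""
    done, skipped, dead = [], [], []
    while queue:
        msg, attempts = queue.popleft()
        try:
            handler(msg)
            done.append(msg)
        except UnsupportedFormatError:
            skipped.append(msg)  # acknowledge and move on, never requeue
        except TransientError:
            if attempts + 1 < max_attempts:
                queue.append((msg, attempts + 1))  # bounded retry
            else:
                dead.append(msg)  # give up instead of looping forever
    return done, skipped, dead
```

With a real broker, the same distinction would map to acknowledging or rejecting a message without requeueing for the permanent case, and requeueing (ideally with a delay) only for the transient one.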

## Motivation and Context (Why the change? What's the scenario?)

Proposal for fix #303: Can't decode DOC files.

## High level description (Approach, Design)

As described in the issue, the legacy Word 97-2003 files will be ignored. Also, the MIME types for the XML formats are changed to reflect the actual content type for those files.

dluc · 2024-04-16T16:58:44Z

The old format is now automatically ignored unless a specific decoder is provided. The same approach is used for the old formats of Word, Excel and PowerPoint.
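The "ignored unless a decoder is provided" behavior can be pictured as a lookup keyed by MIME type, where a missing entry means the file is skipped rather than handed to the wrong parser. This is only an illustrative sketch and not Kernel Memory's actual API; `extract_text` and the `decoders` mapping are hypothetical names. The two MIME type strings are the standard ones for legacy Word and DOCX files.

```python
from typing import Callable, Dict, Optional

MIME_DOC = "application/msword"  # legacy Word 97-2003
MIME_DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

Decoder = Callable[[bytes], str]


def extract_text(mime_type: str, data: bytes, decoders: Dict[str, Decoder]) -> Optional[str]:
    """Return decoded text, or None when no decoder is registered for the type."""
    decoder = decoders.get(mime_type)
    if decoder is None:
        return None  # the caller can log a warning and skip the file
    return decoder(data)
```

Registering a decoder under `MIME_DOC` would be enough to turn the skip into real extraction, without touching the DOCX path.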

amomra added the bug and triage labels Feb 8, 2024

This was referenced Mar 12, 2024

Possible fix for 303 - Ignoring legacy Word 97-2003 files #356

Merged

[Question] Should document upload status endpoint signal when the files were ignored by the text extractor? #360

Closed

dluc closed this as completed in #356 Mar 13, 2024
