[Bug] Can't decode DOC files #303

Closed
amomra opened this issue Feb 8, 2024 · 3 comments · Fixed by #356
Labels
bug (Something isn't working), triage

Comments

Contributor

amomra commented Feb 8, 2024

Context / Scenario

The user uploads a document in the legacy DOC format.

What happened?

When the user uploads a document in the legacy DOC format, the application throws an exception saying that the file is corrupt. The logs show that the application is trying to parse it as a DOCX file.

Importance

I cannot use Kernel Memory

Platform, Language, Versions

It happens on Windows and Linux with .NET 8.

Relevant log output

System.IO.FileFormatException: File contains corrupted data.
   at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
   at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
   at DocumentFormat.OpenXml.Packaging.PackageLoader.OpenCore(Stream stream, Boolean readWriteMode)
   at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
   at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable)
   at Microsoft.KernelMemory.DataFormats.Office.MsWordDecoder.DocToText(Stream data) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/DataFormats/Office/MsWordDecoder.cs:line 28
   at Microsoft.KernelMemory.DataFormats.Office.MsWordDecoder.DocToText(BinaryData data) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/DataFormats/Office/MsWordDecoder.cs:line 23
   at Microsoft.KernelMemory.Handlers.TextExtractionHandler.ExtractTextAsync(FileDetails uploadedFile, BinaryData fileContent, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Handlers/TextExtractionHandler.cs:line 140
   at Microsoft.KernelMemory.Handlers.TextExtractionHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Handlers/TextExtractionHandler.cs:line 78
   at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.RunPipelineStepAsync(DataPipeline pipeline, IPipelineStepHandler handler, CancellationToken cancellationToken) in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 214
   at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.<>c__DisplayClass3_0.<<AddHandlerAsync>b__0>d.MoveNext() in /sistemas/projetos/teste-llama/kernel-memory/kernel-memory/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 148
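The trace above is consistent with a legacy DOC payload being handed to a ZIP-based package reader: DOC files are OLE2 compound files, while DOCX files are ZIP packages. The following is a minimal sketch of telling the two apart from their leading bytes, written in Python for brevity. The two signatures are the well-known file magic numbers; the function names are hypothetical and this is not Kernel Memory code.

```python
import io
import zipfile

# Well-known file signatures (illustrative; not taken from Kernel Memory).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy DOC/XLS/PPT container
ZIP_MAGIC = b"PK\x03\x04"                        # DOCX/XLSX/PPTX are ZIP packages


def sniff_office_container(data: bytes) -> str:
    """Classify a payload as 'ole2', 'zip', or 'unknown' from its first bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def build_tiny_zip() -> bytes:
    """Build a small in-memory ZIP archive to stand in for a DOCX package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


# A fake legacy payload: the OLE2 signature followed by padding.
fake_doc = OLE2_MAGIC + b"\x00" * 504

print(sniff_office_container(fake_doc))          # classified as OLE2
print(sniff_office_container(build_tiny_zip()))  # classified as ZIP
print(zipfile.is_zipfile(io.BytesIO(fake_doc)))  # a ZIP reader rejects the OLE2 payload
```

A check like this, done before choosing a decoder, would let the pipeline report "unsupported legacy format" instead of "file contains corrupted data".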
amomra added the bug and triage labels on Feb 8, 2024
Collaborator

dluc commented Feb 9, 2024

@amomra how would you prefer to handle the legacy DOC format? Ignore and skip the files, maybe logging an error?

Contributor Author

amomra commented Feb 10, 2024

@dluc I observed that the app enters an error loop when RabbitMQ is used. This happens because the extractor throws an error on the DOC file, the message is requeued, and it is immediately retrieved again, which causes the loop.
So, at least for now, I think DOC files should be ignored until there is a way to handle these files, since there isn't a portable way to do that (there is the Office Interop library, but I think it works only on Windows).
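The loop described here comes from treating a permanent failure as a retryable one. Below is a minimal sketch of the alternative, using an in-memory queue in Python rather than RabbitMQ: unsupported files are skipped without requeueing, and any other failure is retried only a bounded number of times. All names (`UnsupportedFormatError`, `extract_text`, `drain`) are hypothetical and do not come from Kernel Memory.

```python
from collections import deque


class UnsupportedFormatError(Exception):
    """Hypothetical non-retryable error: no decoder can handle the file."""


def extract_text(file_name: str) -> str:
    # Stand-in for a text extractor; legacy .doc files are rejected up front.
    if file_name.lower().endswith(".doc"):
        raise UnsupportedFormatError(file_name)
    return f"text of {file_name}"


def drain(queue: deque, max_attempts: int = 3):
    """Process (name, attempts) messages until the queue is empty.

    Unsupported formats are skipped and never requeued; other failures are
    requeued until max_attempts is reached, then moved to a dead list.
    """
    done, skipped, dead = [], [], []
    while queue:
        name, attempts = queue.popleft()
        try:
            done.append(extract_text(name))
        except UnsupportedFormatError:
            skipped.append(name)  # permanent failure: retrying cannot help
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((name, attempts + 1))
            else:
                dead.append(name)
    return done, skipped, dead


done, skipped, dead = drain(deque([("report.docx", 0), ("old.doc", 0)]))
print(done, skipped, dead)
```

With a real broker, the "skipped" branch would correspond to acknowledging the message (or rejecting it without requeue) instead of putting it back on the queue.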

dluc closed this as completed in #356 on Mar 13, 2024
dluc pushed a commit that referenced this issue Mar 13, 2024
## Motivation and Context (Why the change? What's the scenario?)

Proposal for fix #303: Can't decode DOC files.

## High level description (Approach, Design)

As described in the issue, legacy Word 97-2003 files will be
ignored. The MIME types for the XML formats are also changed to reflect
the actual content type of those files.
Collaborator

dluc commented Apr 16, 2024

The old format is now automatically ignored unless a specific decoder is provided. The same approach is used for the old formats of Word, Excel, and PowerPoint.
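The "ignore unless a decoder is provided" behavior can be pictured as a lookup keyed by MIME type. The sketch below is an illustration in Python under that assumption, not the actual Kernel Memory implementation. The MIME type strings are the standard registered ones; `decode_or_skip` and the decoder table are hypothetical.

```python
from typing import Callable, Dict, Optional

# Standard MIME types for the legacy binary Office formats.
LEGACY_OFFICE_MIME_TYPES = {
    "application/msword",             # .doc
    "application/vnd.ms-excel",       # .xls
    "application/vnd.ms-powerpoint",  # .ppt
}
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

Decoder = Callable[[bytes], str]


def decode_or_skip(mime_type: str, data: bytes,
                   decoders: Dict[str, Decoder]) -> Optional[str]:
    """Return decoded text, or None when a legacy format has no decoder."""
    decoder = decoders.get(mime_type)
    if decoder is not None:
        return decoder(data)  # a registered decoder always wins
    if mime_type in LEGACY_OFFICE_MIME_TYPES:
        print(f"skipping legacy format: {mime_type}")
        return None
    raise ValueError(f"no decoder registered for {mime_type}")


# Hypothetical decoder table with only a DOCX decoder registered.
decoders: Dict[str, Decoder] = {DOCX_MIME: lambda data: "docx text"}

print(decode_or_skip("application/msword", b"", decoders))  # skipped

# Registering a custom decoder for the legacy type changes the outcome.
decoders["application/msword"] = lambda data: "doc text"
print(decode_or_skip("application/msword", b"", decoders))
```

Keeping the legacy types in an explicit set separates "known but unsupported" files, which are skipped quietly, from truly unknown types, which still raise an error.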
