fix: document extractor node incorrectly handles doc and ppt files #12885

AugNSo · 2025-01-20T15:33:55Z

Summary

In the current code, the document extractor node uses the python-docx library to handle doc files, which it cannot process. This commit switches to the Unstructured API for doc files.
In the current code, the document extractor node uses partition_ppt to handle ppt files, probably a leftover from when unstructured[ppt] was still in pyproject.toml. This commit switches to the Unstructured API for ppt files as well.
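
The routing described in the summary can be sketched as a small lookup from file extension to extraction backend. This is only an illustration under stated assumptions: the backend labels and the `choose_backend` helper are hypothetical and are not names from the Dify codebase, and the sketch does not call python-docx or the Unstructured API itself.

```python
from pathlib import Path

# Hypothetical backend labels, used only for this illustration.
LOCAL_DOCX = "python-docx"
UNSTRUCTURED_API = "unstructured-api"

# Per the summary: docx can be parsed locally, while the legacy
# doc and ppt formats are sent to the Unstructured API.
_ROUTES = {
    ".docx": LOCAL_DOCX,
    ".doc": UNSTRUCTURED_API,
    ".ppt": UNSTRUCTURED_API,
}


def choose_backend(filename: str) -> str:
    """Return the backend label for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return _ROUTES[suffix]
    except KeyError:
        # Extensions outside this sketch are rejected explicitly.
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```

Keeping the mapping in one table makes it harder for a legacy format such as doc to fall through to a parser that only understands its newer counterpart.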

Tip

Close issue syntax: Fixes #<issue number> or Resolves #<issue number>, see documentation for more details.

Screenshots

Before	After
...	...

Checklist

Important

Please review the checklist below before submitting your pull request.

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

AugNSo · 2025-01-21T00:09:40Z

I'm a bit uncertain whether to remove partition_ppt from the code. It no longer works, but I'm not sure what type of error should be returned if UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY are not provided. The same goes for the doc processing logic.
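
One possible answer to the open question above, sketched under assumptions: check both settings before attempting extraction and fail with a dedicated error that names whatever is missing. The error class and the `require_unstructured_config` helper are hypothetical; the PR does not settle on an error type, and the real code may read these settings from a config object rather than from the environment.

```python
import os


class UnstructuredApiNotConfiguredError(ValueError):
    """Hypothetical error type; not taken from the Dify codebase."""


def require_unstructured_config(env=None):
    """Return (api_url, api_key), or raise if either setting is missing or empty."""
    env = os.environ if env is None else env
    names = ("UNSTRUCTURED_API_URL", "UNSTRUCTURED_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise UnstructuredApiNotConfiguredError(
            "doc and ppt extraction requires " + " and ".join(missing) + " to be set"
        )
    return env["UNSTRUCTURED_API_URL"], env["UNSTRUCTURED_API_KEY"]
```

Raising before any extraction is attempted gives the user a message that points at the missing configuration instead of a parser failure further down the call stack.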

crazywoola · 2025-01-21T02:03:38Z

Please fix the lint errors by running dev/reformat.

AugNSo · 2025-01-21T02:41:24Z

> Please fix the lint errors by running dev/reformat.

I did not manage to run dev/reformat, but I assume running ruff check --fix ./api and ruff format ./api should have the same effect, since I did not edit any .env files (ruff version 0.9.2)? Anyway, the two Python files I edited have been formatted with ruff.

AugNSo marked this pull request as ready for review January 20, 2025 15:39

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 🐞 bug (Something isn't working) labels Jan 20, 2025

crazywoola requested a review from JohnJyong January 21, 2025 02:01

AugNSo closed this Jan 21, 2025

AugNSo force-pushed the dev branch from 7739808 to 9d86147 Compare January 21, 2025 02:37

dosubot bot added the size:XS (This PR changes 0-9 lines, ignoring generated files.) label and removed the size:M (This PR changes 30-99 lines, ignoring generated files.) label Jan 21, 2025

fix: document extractor node incorrectly handles doc and ppt files

60d6b0d

AugNSo reopened this Jan 21, 2025

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) label and removed the size:XS (This PR changes 0-9 lines, ignoring generated files.) label Jan 21, 2025

AugNSo closed this Jan 21, 2025

AugNSo reopened this Jan 21, 2025

Merge branch 'langgenius:main' into dev

4d793d5

AugNSo closed this Jan 21, 2025
