fix: document extractor node incorrectly handles doc and ppt files #12885

Closed
AugNSo wants to merge 2 commits

Conversation

AugNSo
Contributor

@AugNSo AugNSo commented Jan 20, 2025

Summary

Fixes #12889

  1. In the current code, the document extractor node uses the python-docx library to handle doc files, which that library cannot process. This commit uses the unstructured API to handle doc files instead.
  2. In the current code, the document extractor node uses partition_ppt to handle ppt files, probably a leftover from when unstructured[ppt] was still in pyproject.toml. This commit uses the unstructured API to handle ppt files instead.
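
The routing described in the two points above could be sketched roughly as follows. This is not the PR's actual code: `BACKEND_BY_EXTENSION`, `pick_backend`, and `extract_text` are hypothetical names, and the `local` and `remote` callables stand in for an in-process parser and an unstructured API call respectively.

```python
from pathlib import PurePath
from typing import Callable

# Hypothetical routing table (an assumption, not taken from the PR):
# "local" means an in-process parser, "remote" means an unstructured API call.
BACKEND_BY_EXTENSION = {
    ".docx": "local",
    ".doc": "remote",
    ".ppt": "remote",
}


def pick_backend(filename: str) -> str:
    """Return which backend should handle the file, based on its extension."""
    ext = PurePath(filename).suffix.lower()
    try:
        return BACKEND_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported file extension: {ext!r}") from None


def extract_text(
    filename: str,
    content: bytes,
    local: Callable[[bytes], str],
    remote: Callable[[str, bytes], str],
) -> str:
    """Dispatch to the injected local or remote extractor."""
    if pick_backend(filename) == "local":
        return local(content)
    return remote(filename, content)
```

Injecting the two extractors as callables keeps the routing decision testable without a network call or any third-party parser installed.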

Tip

Close issue syntax: Fixes #<issue number> or Resolves #<issue number>, see documentation for more details.

Screenshots

Before After
... ...

Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

@AugNSo AugNSo marked this pull request as ready for review January 20, 2025 15:39
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. 🐞 bug Something isn't working labels Jan 20, 2025
@AugNSo
Contributor Author

AugNSo commented Jan 21, 2025

I'm a bit uncertain whether to remove partition_ppt from the code. It no longer works, but I'm not sure what type of error should be returned if UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY are not provided. The same goes for the doc processing logic.
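
One possible answer to the open question above is to fail fast with a dedicated error that names the missing settings. The sketch below is only an illustration: `UnstructuredConfigError` and `require_unstructured_config` are hypothetical names, and the project may prefer one of its existing error types.

```python
import os
from typing import Mapping, Optional, Tuple


class UnstructuredConfigError(RuntimeError):
    """Hypothetical error for a missing unstructured API setting."""


def require_unstructured_config(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_url, api_key), or raise if either setting is unset or empty."""
    env = os.environ if env is None else env
    required = ("UNSTRUCTURED_API_URL", "UNSTRUCTURED_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise UnstructuredConfigError(
            "cannot extract doc/ppt files: missing " + ", ".join(missing)
        )
    return env["UNSTRUCTURED_API_URL"], env["UNSTRUCTURED_API_KEY"]
```

Checking the settings before any file is read means a workflow run would surface a configuration problem rather than an opaque parsing failure.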

@crazywoola crazywoola requested a review from JohnJyong January 21, 2025 02:01
@crazywoola
Member

Please fix the lint errors by running dev/reformat.

@AugNSo AugNSo closed this Jan 21, 2025
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jan 21, 2025
@AugNSo AugNSo reopened this Jan 21, 2025
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 21, 2025
@AugNSo
Contributor Author

AugNSo commented Jan 21, 2025

> Please fix the lint errors by running dev/reformat.

I did not manage to run dev/reformat, but I assume running ruff check --fix ./api and ruff format ./api has the same effect, since I did not edit any .env files (ruff version 0.9.2)? In any case, the two Python files I edited have been formatted with ruff.

@AugNSo AugNSo closed this Jan 21, 2025
@AugNSo AugNSo reopened this Jan 21, 2025
@AugNSo AugNSo closed this Jan 21, 2025
Labels
🐞 bug Something isn't working size:M This PR changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Document extractor node incorrectly handles doc and ppt files
2 participants