Commit
fix: Tika converter not yielding page break tags (\f) (#8082)
* Fix TikaConverter not having \f page tag by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template.
* Add Reno
* Fix test by making Mock Tika return XML (before parsing)
* refinements and test

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
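The approach described in the first bullet (ask Tika for HTML, then convert that HTML to text while keeping page boundaries) can be illustrated with a minimal sketch. This is not the code from the commit: `PageTextParser` and `xhtml_to_text` are hypothetical names, and the sketch assumes that the HTML output wraps each page in a `<div class="page">` element with no nested `<div>` inside it.

```python
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Hypothetical helper: collects the text found inside each <div class="page">."""

    def __init__(self):
        super().__init__()
        self.in_page = False   # True while we are inside a page div
        self.current = ""      # text accumulated for the page being read
        self.pages = []        # finished pages, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True

    def handle_endtag(self, tag):
        # Assumption: page divs contain no nested divs, so the first
        # closing </div> seen while in a page ends that page.
        if tag == "div" and self.in_page:
            self.in_page = False
            self.pages.append(self.current)
            self.current = ""

    def handle_data(self, data):
        if self.in_page:
            self.current += data


def xhtml_to_text(xhtml: str) -> str:
    """Join the text of all pages with a form feed ("\\f") between pages."""
    parser = PageTextParser()
    parser.feed(xhtml)
    parser.close()
    return "\f".join(parser.pages)


sample = (
    '<html><body>'
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)
text = xhtml_to_text(sample)
```

Only the standard-library `html.parser` module is used here, so the sketch runs without a Tika server; real Tika output may contain more markup than the sample above.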
1 parent e0de423, commit 1c53aae
Showing 3 changed files with 48 additions and 2 deletions.
releasenotes/notes/fix-tika-page_number-2d600b2dc8a4faa7.yaml (5 changes: 5 additions & 0 deletions)
@@ -0,0 +1,5 @@
+---
+enhancements:
+  - |
+    `TikaDocumentConverter` now returns page breaks ("\f") in the output.
+    This only works for PDF files.
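For readers who consume the converter output, the sketch below shows one way the form-feed markers could be used downstream. It works on a plain string rather than calling the converter, and `split_pages` and `page_of_offset` are hypothetical helpers, not part of Haystack.

```python
def split_pages(text: str) -> list[str]:
    """Split converted text into pages on the form-feed character."""
    return text.split("\f")


def page_of_offset(text: str, offset: int) -> int:
    """Return the 1-based page number that contains the given character offset."""
    # Every form feed before the offset marks the end of one earlier page.
    return text.count("\f", 0, offset) + 1


# Stand-in for the content of a converted three-page PDF.
content = "Intro\fMethods\fResults"

pages = split_pages(content)
methods_page = page_of_offset(content, content.index("Methods"))
```

Counting form feeds up to an offset is a cheap way to attach a page number to a chunk of text without storing per-page metadata.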