TikaDocumentConverter does not split content by page #7949
Comments
Thanks, @vaclcer! For context, …
Hello, I'd like to take this up if no one is working on it.
@AnushreeBannadabhavi feel free to work on this! 💙 (The user who had commented earlier has removed their GitHub profile.)
Hello @AnushreeBannadabhavi, I had the exact same requirement today about Tika not yielding the page number. Tika, however, does not provide it directly: it returns the document as XHTML, with each page wrapped in a `<div class="page">` element. An alternative I've seen is from this: https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika However, in the current Haystack implementation it is not possible to request the content in HTML format. (For reference, I've seen a lot of modern RAG implementations that prefer to extract and chunk text as HTML rather than plain text, because LLMs don't mind HTML and you keep the table structure.)
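To illustrate the approach from that Stack Overflow thread, here is a minimal sketch (not part of Haystack) that asks Tika for XHTML instead of plain text. It assumes the `tika` Python client (`pip install tika`), a Tika server running with the default settings, and a placeholder file name:

```python
from tika import parser as tika_parser

# xmlContent=True returns Tika's XHTML output instead of plain text;
# each page of a PDF arrives as a separate <div class="page"> element.
parsed = tika_parser.from_file("sample.pdf", xmlContent=True)  # "sample.pdf" is a placeholder
print(parsed["content"][:300])
```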
@lambda-science I generally agree with your idea. (See the Haystack 1.x code at `haystack/haystack/nodes/file_converter/tika.py`, lines 158 to 164 in 883cd46.) It seems that the Tika parser is aware of the pages...
@anakin87 This updated version, using the previous parsing method from Haystack 1.x, seems to work well to add the `\f` page separators. Also yes, there is an old bug in it: "title of document appearing in the first extracted page".

````python
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import io
from html.parser import HTMLParser
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

from haystack import Document, component, logging
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tika'") as tika_import:
    from tika import parser as tika_parser

logger = logging.getLogger(__name__)


class TikaXHTMLParser(HTMLParser):
    # Use the built-in HTML parser with minimum dependencies
    def __init__(self):
        tika_import.check()
        self.ingest = True
        self.page = ""
        self.pages: List[str] = []
        super().__init__()

    def handle_starttag(self, tag, attrs):
        # find page div
        pagediv = [value for attr, value in attrs if attr == "class" and value == "page"]
        if tag == "div" and pagediv:
            self.ingest = True

    def handle_endtag(self, tag):
        # close page div, or a single page without page div; save page and open a new page
        if (tag == "div" or tag == "body") and self.ingest:
            self.ingest = False
            # restore words hyphenated across line breaks
            self.pages.append(self.page.replace("-\n", ""))
            self.page = ""

    def handle_data(self, data):
        if self.ingest:
            self.page += data


@component
class TikaDocumentConverter:
    """
    Converts files of different types to Documents.

    This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
    requires a running Tika server.
    For more options on running Tika,
    see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

    Usage example:
    ```python
    from datetime import datetime

    from haystack.components.converters.tika import TikaDocumentConverter

    converter = TikaDocumentConverter()
    results = converter.run(
        sources=["sample.docx", "my_document.rtf", "archive.zip"],
        meta={"date_added": datetime.now().isoformat()}
    )
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the docx file.'
    ```
    """

    def __init__(self, tika_url: str = "http://localhost:9998/tika"):
        """
        Create a TikaDocumentConverter component.

        :param tika_url:
            Tika server URL.
        """
        tika_import.check()
        self.tika_url = tika_url

    @component.output_types(documents=List[Document])
    def run(
        self,
        sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Converts files to Documents.

        :param sources:
            List of file paths or ByteStream objects.
        :param meta:
            Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will
            be zipped.
            If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
        :returns:
            A dictionary with the following keys:
            - `documents`: Created Documents
        """
        documents = []
        meta_list = normalize_metadata(meta=meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read {source}. Skipping it. Error: {error}", source=source, error=e)
                continue
            try:
                # request XHTML output so that page boundaries (<div class="page">) are preserved
                parsed = tika_parser.from_buffer(
                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
                )
                parser = TikaXHTMLParser()
                parser.feed(parsed["content"])
            except Exception as conversion_e:
                logger.warning(
                    "Failed to extract text from {source}. Skipping it. Error: {error}",
                    source=source,
                    error=conversion_e,
                )
                continue

            # Old processing code from the Haystack 1.x Tika integration
            cleaned_pages = []
            # TODO investigate title of document appearing in the first extracted page
            for page in parser.pages:
                lines = page.splitlines()
                cleaned_lines = []
                for line in lines:
                    cleaned_lines.append(line)
                page = "\n".join(cleaned_lines)
                cleaned_pages.append(page)
            # join pages with form feeds so that downstream components can split by page
            text = "\f".join(cleaned_pages)

            merged_metadata = {**bytestream.meta, **metadata}
            document = Document(content=text, meta=merged_metadata)
            documents.append(document)
        return {"documents": documents}
````
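For anyone trying this out, here is a rough sketch of how it would plug into `DocumentSplitter` (it assumes you save the class above in your own module, e.g. a hypothetical `tika_page_converter.py`, a Tika server on the default URL, and a placeholder file name):

```python
from haystack.components.preprocessors import DocumentSplitter

from tika_page_converter import TikaDocumentConverter  # hypothetical local module holding the code above

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]  # "sample.pdf" is a placeholder

# split_by="page" splits on the "\f" separators the converter now inserts
splitter = DocumentSplitter(split_by="page", split_length=1)
pages = splitter.run(documents=docs)["documents"]
print(f"{len(pages)} page documents")
```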
related: #8053
Proposed fix: #8082
The Documents generated by `TikaDocumentConverter` from PDF files do not contain `\f` page separators, so later in the pipeline `DocumentSplitter` cannot split them by page and produces one big "page" with all the text.
Page separation works as expected when using `PyPDFToDocument`.
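A minimal pipeline that reproduces the problem (the file name is a placeholder; it assumes a Tika server running with the default URL):

```python
from haystack.components.converters import TikaDocumentConverter
from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]  # content has no "\f" separators

splitter = DocumentSplitter(split_by="page", split_length=1)  # "page" splits on "\f"
result = splitter.run(documents=docs)
print(len(result["documents"]))  # expected: number of PDF pages; actual: 1
```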