
TikaDocumentConverter does not split content by page #7949

Closed
vaclcer opened this issue Jun 28, 2024 · 9 comments · Fixed by #8082
Labels: Contributions wanted! (Looking for external contributions), pdf

Comments

@vaclcer

vaclcer commented Jun 28, 2024

The Documents generated by TikaDocumentConverter from PDF files do not contain \f page separators, so later in the pipeline DocumentSplitter cannot split them by page and produces one big "page" containing all the text.

Page separation works as expected when using PyPDFToDocument.
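
A minimal sketch reproducing the behavior (illustrative; assumes a Tika server at the default http://localhost:9998/tika and a multi-page sample.pdf):

from haystack.components.converters.tika import TikaDocumentConverter
from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]

# split_by="page" splits on "\f", which Tika's plain-text output never
# contains, so everything lands in a single chunk:
splitter = DocumentSplitter(split_by="page", split_length=1)
print(len(splitter.run(documents=docs)["documents"]))  # 1, regardless of page count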

@mrm1001 mrm1001 added the pdf label Jun 28, 2024
@anakin87
Member

Thanks, @vaclcer!

For context, TikaDocumentConverter split documents by page in v1.x (v1.x TikaConverter), so it might make sense to check whether that logic is still valid and port it to v2.x.

@anakin87 anakin87 added the Contributions wanted! Looking for external contributions label Jun 28, 2024
@ghost

ghost commented Jul 8, 2024

Hello,
I'm new to contributing to open-source.
Can I take a shot at this?

@AnushreeBannadabhavi
Contributor

I'd like to take this up if no one is working on it

@anakin87
Member

@AnushreeBannadabhavi feel free to work on this! 💙

(the user who commented earlier has since deleted their GitHub profile)

@lambda-science
Contributor

lambda-science commented Jul 24, 2024

I'd like to take this up if no one is working on it

Hello @AnushreeBannadabhavi, I had the exact same requirement today, with Tika not yielding the page number.
Currently I have no idea how to get it properly.
The Splitter component, for example, counts the \f characters to determine the page number.

Tika, however, does not provide them: it emits \n\n or \n\n\n, but these are not specific to the end of a page; they can also occur in the middle of a page, so they are not reliable to use.

An alternative I've seen comes from this: https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika

    Actually Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this (just counting pages using only <p>).

So if we extract the content in HTML format, we can count the </p></div> markers, just like the Splitter does with \f.

However, in the current Haystack implementation it is not possible to request the content in HTML format.
Here: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/tika.py#L88
we can see that this parameter is not used. We should perhaps add an xmlContent parameter, defaulting to False, in __init__().
In the tika Python package, this is the name of the parameter used to get the data in HTML format: https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L64C12-L64C22

(For reference, I've seen a lot of modern RAG implementations that prefer to extract and chunk text in HTML format rather than plain text, because LLMs don't mind HTML and you keep the table structure.)
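
A rough sketch of that idea with the tika Python package (illustrative, not the current Haystack API; the endpoint and file name are assumptions):

from tika import parser as tika_parser

# Ask Tika for XHTML instead of plain text; in this mode Tika wraps each
# PDF page in <div class="page"> ... </div>.
parsed = tika_parser.from_file(
    "sample.pdf", serverEndpoint="http://localhost:9998/tika", xmlContent=True
)
page_count = parsed["content"].count('<div class="page">')
print(page_count)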

@anakin87
Member

@lambda-science I generally agree with your idea.
This 1.x code can help:

parsed = tika_parser.from_file(file_path.as_posix(), self.tika_url, xmlContent=True)
parser = TikaXHTMLParser()
parser.feed(parsed["content"])
cleaned_pages = []
# TODO investigate title of document appearing in the first extracted page
for page in parser.pages:

It seems that the Tika parser is aware of the pages...

@lambda-science
Contributor

lambda-science commented Jul 24, 2024

@anakin87 This updated version, using the previous parsing method from Haystack 1.x, seems to work well: it adds the \f separators to the content so the Splitter can count them to determine page numbers.
However, the Cleaner component removes them in every case because it uses .strip(), so, a bit annoyingly, you can't combine Tika + Cleaner + Splitter.

Also, yes, there is an old bug in it: "title of document appearing in the first extracted page".
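
As a quick illustration of the .strip() behavior mentioned above (not from the thread): \f counts as whitespace in Python, so a separator at a chunk boundary is silently dropped:

# str.strip() with no arguments removes all leading/trailing whitespace,
# and "\f" (form feed) is whitespace in Python:
page = "last line of the page\f"
print(repr(page.strip()))  # 'last line of the page' -- the \f is gone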

# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import io
from html.parser import HTMLParser
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

from haystack import Document, component, logging
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tika'") as tika_import:
    from tika import parser as tika_parser

logger = logging.getLogger(__name__)

class TikaXHTMLParser(HTMLParser):
    # Use the built-in HTML parser with minimum dependencies
    def __init__(self):
        tika_import.check()
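        # Start ingesting right away so output without an explicit page div still yields one page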
        self.ingest = True
        self.page = ""
        self.pages: List[str] = []
        super().__init__()

    def handle_starttag(self, tag, attrs):
        # find page div
        pagediv = [value for attr, value in attrs if attr == "class" and value == "page"]
        if tag == "div" and pagediv:
            self.ingest = True

    def handle_endtag(self, tag):
        # close page div, or a single page without page div, save page and open a new page
        if (tag == "div" or tag == "body") and self.ingest:
            self.ingest = False
            # restore words hyphened to the next line
            self.pages.append(self.page.replace("-\n", ""))
            self.page = ""

    def handle_data(self, data):
        if self.ingest:
            self.page += data

@component
class TikaDocumentConverter:
    """
    Converts files of different types to Documents.

    This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
    requires a running Tika server.
    For more options on running Tika,
    see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

    Usage example:
    ```python
    from haystack.components.converters.tika import TikaDocumentConverter

    converter = TikaDocumentConverter()
    results = converter.run(
        sources=["sample.docx", "my_document.rtf", "archive.zip"],
        meta={"date_added": datetime.now().isoformat()}
    )
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the docx file.'
    ```
    """

    def __init__(self, tika_url: str = "http://localhost:9998/tika"):
        """
        Create a TikaDocumentConverter component.

        :param tika_url:
            Tika server URL.
        """
        tika_import.check()
        self.tika_url = tika_url

    @component.output_types(documents=List[Document])
    def run(
        self,
        sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Converts files to Documents.

        :param sources:
            List of file paths or ByteStream objects.
        :param meta:
            Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will
            be zipped.
            If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

        :returns:
            A dictionary with the following keys:
            - `documents`: Created Documents
        """
        documents = []
        meta_list = normalize_metadata(meta=meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read {source}. Skipping it. Error: {error}", source=source, error=e)
                continue
            try:
                parsed = tika_parser.from_buffer(
                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
                )
                parser = TikaXHTMLParser()
                parser.feed(parsed["content"])
            except Exception as conversion_e:
                logger.warning(
                    "Failed to extract text from {source}. Skipping it. Error: {error}",
                    source=source,
                    error=conversion_e,
                )
                continue

            # Old Processing Code from Haystack 1.X Tika integration
            cleaned_pages = []
            # TODO investigate title of document appearing in the first extracted page
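            # NOTE: in 1.x, optional line-level cleaning happened inside this loop;
            # here lines pass through unchanged and only line endings are normalized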
            for page in parser.pages:
                lines = page.splitlines()
                cleaned_lines = []
                for line in lines:
                    cleaned_lines.append(line)

                page = "\n".join(cleaned_lines)
                cleaned_pages.append(page)
            text = "\f".join(cleaned_pages)
            merged_metadata = {**bytestream.meta, **metadata}
            document = Document(content=text, meta=merged_metadata)
            documents.append(document)
        return {"documents": documents}
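
A quick usage sketch of the proposed converter (illustrative; assumes a running local Tika server, a multi-page sample.pdf, and the modified class above):

from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]

# Pages are now joined with \f, so DocumentSplitter(split_by="page") can
# recover per-page chunks downstream:
splitter = DocumentSplitter(split_by="page", split_length=1)
pages = splitter.run(documents=docs)["documents"]
print(len(pages))  # one Document per page of sample.pdf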

@anakin87
Member

related: #8053

@lambda-science
Contributor

Proposed fix: #8082
