
TikaDocumentConverter does not split content by page #7949

Closed
vaclcer opened this issue Jun 28, 2024 · 9 comments · Fixed by #8082
Labels: Contributions wanted! (Looking for external contributions), pdf

Comments

@vaclcer

vaclcer commented Jun 28, 2024

The Documents generated by TikaDocumentConverter from PDF files do not contain \f page separators, so later in the pipeline DocumentSplitter cannot split them by page and produces one big "page" containing all the text.

Page separation works as expected when using PyPDFToDocument.
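
A minimal sketch reproducing the behavior (illustrative; assumes a Tika server at the default http://localhost:9998/tika and a multi-page sample.pdf):

from haystack.components.converters.tika import TikaDocumentConverter
from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]

# split_by="page" splits on "\f", which Tika's plain-text output never
# contains, so everything lands in a single chunk:
splitter = DocumentSplitter(split_by="page", split_length=1)
print(len(splitter.run(documents=docs)["documents"]))  # 1, regardless of page count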

@mrm1001 mrm1001 added the pdf label Jun 28, 2024
@anakin87
Member

Thanks, @vaclcer!

For context, TikaDocumentConverter split documents by page in v1.x (v1.x TikaConverter), so it might make sense to check whether that logic is still valid and port it to v2.x.

@anakin87 anakin87 added the Contributions wanted! Looking for external contributions label Jun 28, 2024
@ghost

ghost commented Jul 8, 2024

Hello,
I'm new to contributing to open-source.
Can I take a shot at this?

@AnushreeBannadabhavi
Contributor

I'd like to take this up if no one is working on it

@anakin87
Member

@AnushreeBannadabhavi feel free to work on this! 💙

(the user who commented earlier has since deleted their GitHub profile)

@lambda-science
Contributor

lambda-science commented Jul 24, 2024

I'd like to take this up if no one is working on it

Hello @AnushreeBannadabhavi, I had the exact same requirement today, with Tika not yielding the page number.
Currently I have no idea how to get it properly.
The Splitter component, for example, counts the \f characters to determine the page number.

Tika, however, does not provide them: it emits \n\n or \n\n\n, but these are not specific to the end of a page; they can also occur in the middle of a page, so they are not reliable to use.

An alternative I've seen comes from this: https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika

    Actually Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this (just counting pages using only <p>).

So if we extract the content in HTML format, we can count the </p></div> markers, just like the Splitter does with \f.

However, in the current Haystack implementation it is not possible to request the content in HTML format.
Here: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/tika.py#L88
we can see that this parameter is not used. We should perhaps add an xmlContent parameter, defaulting to False, in __init__().
In the tika Python package, this is the name of the parameter used to get the data in HTML format: https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L64C12-L64C22

(For reference, I've seen a lot of modern RAG implementations that prefer to extract and chunk text in HTML format rather than plain text, because LLMs don't mind HTML and you keep the table structure.)
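
A rough sketch of that idea with the tika Python package (illustrative, not the current Haystack API; the endpoint and file name are assumptions):

from tika import parser as tika_parser

# Ask Tika for XHTML instead of plain text; in this mode Tika wraps each
# PDF page in <div class="page"> ... </div>.
parsed = tika_parser.from_file(
    "sample.pdf", serverEndpoint="http://localhost:9998/tika", xmlContent=True
)
page_count = parsed["content"].count('<div class="page">')
print(page_count)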

@anakin87
Member

@lambda-science I generally agree with your idea.
This 1.x code can help:

parsed = tika_parser.from_file(file_path.as_posix(), self.tika_url, xmlContent=True)
parser = TikaXHTMLParser()
parser.feed(parsed["content"])
cleaned_pages = []
# TODO investigate title of document appearing in the first extracted page
for page in parser.pages:

It seems that the Tika parser is aware of the pages...

@lambda-science
Contributor

lambda-science commented Jul 24, 2024

@anakin87 This updated version, using the previous parsing method from Haystack 1.x, seems to work well: it adds the \f separators to the content so the Splitter can count them to determine page numbers.
However, the Cleaner component removes them in every case because it uses .strip(), so, a bit annoyingly, you can't combine Tika + Cleaner + Splitter.

Also, yes, there is an old bug in it: "title of document appearing in the first extracted page".
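
As a quick illustration of the .strip() behavior mentioned above (not from the thread): \f counts as whitespace in Python, so a separator at a chunk boundary is silently dropped:

# str.strip() with no arguments removes all leading/trailing whitespace,
# and "\f" (form feed) is whitespace in Python:
page = "last line of the page\f"
print(repr(page.strip()))  # 'last line of the page' -- the \f is gone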

# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import io
from html.parser import HTMLParser
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

from haystack import Document, component, logging
from haystack.components.converters.utils import get_bytestream_from_source, normalize_metadata
from haystack.dataclasses import ByteStream
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tika'") as tika_import:
    from tika import parser as tika_parser

logger = logging.getLogger(__name__)

class TikaXHTMLParser(HTMLParser):
    # Use the built-in HTML parser with minimum dependencies
    def __init__(self):
        tika_import.check()
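        # Start ingesting right away so output without an explicit page div still yields one page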
        self.ingest = True
        self.page = ""
        self.pages: List[str] = []
        super().__init__()

    def handle_starttag(self, tag, attrs):
        # find page div
        pagediv = [value for attr, value in attrs if attr == "class" and value == "page"]
        if tag == "div" and pagediv:
            self.ingest = True

    def handle_endtag(self, tag):
        # close page div, or a single page without page div, save page and open a new page
        if (tag == "div" or tag == "body") and self.ingest:
            self.ingest = False
            # restore words hyphened to the next line
            self.pages.append(self.page.replace("-\n", ""))
            self.page = ""

    def handle_data(self, data):
        if self.ingest:
            self.page += data

@component
class TikaDocumentConverter:
    """
    Converts files of different types to Documents.

    This component uses [Apache Tika](https://tika.apache.org/) for parsing the files and, therefore,
    requires a running Tika server.
    For more options on running Tika,
    see the [official documentation](https://github.com/apache/tika-docker/blob/main/README.md#usage).

    Usage example:
    ```python
    from haystack.components.converters.tika import TikaDocumentConverter

    converter = TikaDocumentConverter()
    results = converter.run(
        sources=["sample.docx", "my_document.rtf", "archive.zip"],
        meta={"date_added": datetime.now().isoformat()}
    )
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the docx file.'
    ```
    """

    def __init__(self, tika_url: str = "http://localhost:9998/tika"):
        """
        Create a TikaDocumentConverter component.

        :param tika_url:
            Tika server URL.
        """
        tika_import.check()
        self.tika_url = tika_url

    @component.output_types(documents=List[Document])
    def run(
        self,
        sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Converts files to Documents.

        :param sources:
            List of file paths or ByteStream objects.
        :param meta:
            Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will
            be zipped.
            If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.

        :returns:
            A dictionary with the following keys:
            - `documents`: Created Documents
        """
        documents = []
        meta_list = normalize_metadata(meta=meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read {source}. Skipping it. Error: {error}", source=source, error=e)
                continue
            try:
                parsed = tika_parser.from_buffer(
                    io.BytesIO(bytestream.data), serverEndpoint=self.tika_url, xmlContent=True
                )
                parser = TikaXHTMLParser()
                parser.feed(parsed["content"])
            except Exception as conversion_e:
                logger.warning(
                    "Failed to extract text from {source}. Skipping it. Error: {error}",
                    source=source,
                    error=conversion_e,
                )
                continue

            # Old Processing Code from Haystack 1.X Tika integration
            cleaned_pages = []
            # TODO investigate title of document appearing in the first extracted page
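            # NOTE: in 1.x, optional line-level cleaning happened inside this loop;
            # here lines pass through unchanged and only line endings are normalized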
            for page in parser.pages:
                lines = page.splitlines()
                cleaned_lines = []
                for line in lines:
                    cleaned_lines.append(line)

                page = "\n".join(cleaned_lines)
                cleaned_pages.append(page)
            text = "\f".join(cleaned_pages)
            merged_metadata = {**bytestream.meta, **metadata}
            document = Document(content=text, meta=merged_metadata)
            documents.append(document)
        return {"documents": documents}
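
A quick usage sketch of the proposed converter (illustrative; assumes a running local Tika server, a multi-page sample.pdf, and the modified class above):

from haystack.components.preprocessors import DocumentSplitter

converter = TikaDocumentConverter()
docs = converter.run(sources=["sample.pdf"])["documents"]

# Pages are now joined with \f, so DocumentSplitter(split_by="page") can
# recover per-page chunks downstream:
splitter = DocumentSplitter(split_by="page", split_length=1)
pages = splitter.run(documents=docs)["documents"]
print(len(pages))  # one Document per page of sample.pdf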

@anakin87
Member

related: #8053

@lambda-science
Contributor

Proposed fix: #8082
