
Docs: Create own middlewares #43

Open
1 task done
mrtj opened this issue Jul 25, 2024 · 7 comments
Assignees
Labels
documentation (Improvements or additions to documentation), triage

Comments


mrtj commented Jul 25, 2024

What were you searching in the docs?

Currently, the FAQ section of the documentation states that "we are currently working on a developer handbook to help you write your own middlewares. Stay tuned!". However, this phrase has been unchanged for at least a couple of months. I would like more guidance on how to create my own middlewares, and I suggest tracking this documentation enhancement in this issue.

Is this related to an existing documentation section?

https://awslabs.github.io/project-lakechain/general/faq/

How can we improve?

Release the handbook explaining how to create your own middlewares.

Acknowledgment

  • I understand the final update might differ from my proposed suggestion, or be refused.
@mrtj mrtj added the triage label Jul 25, 2024
@HQarroum HQarroum self-assigned this Jul 25, 2024
@HQarroum HQarroum added the documentation (Improvements or additions to documentation) label Jul 25, 2024
@HQarroum
Contributor

Hi @mrtj! Thanks for your feedback. Yes, indeed, we've been working on this, but not prioritizing it immediately. Our target is to deliver the first beta release candidate of Project Lakechain (it is currently in Alpha) in September, which will contain information on how developers can create their own middlewares using a stable API.

Out of curiosity, are there any middleware ideas that you are able to share with us?

Thanks!

@mrtj
Author

mrtj commented Jul 29, 2024

Hello,

I have a particular pipeline in mind for parsing PDF files with complex layouts. Example documents might include product brochures, maintenance manuals, and technical guides. These documents tend to contain very mixed content with complex layouts: texts with multiple columns, different kinds of lists, intricate tables with in-table sections and headers, product photos, and technical or wiring diagrams.

I tried various PDF parser Python libraries (pdfminer, pdfplumber, pypdf, pymupdf, etc.), but they often mess up the natural reading order and the table layouts. I also tried passing the page as an image to multimodal LLMs like Claude v3.5; it works quite well but still struggles with complex tables, likely due to insufficient topographic capabilities. Finally, I found that Amazon Textract with the layout feature, combined with the textractor library, works best for converting a PDF page into HTML or Markdown format. However, it does not return useful results from the figures on the page; parsing the text in a technical diagram does not yield anything useful.

So, I came up with the idea of cropping the figure from the image version of the page, passing it to a multimodal LLM with a prompt asking it to describe the image in as much detail as possible, and then injecting the description back into the page contents returned by textractor. This text description should preserve as much information as possible from the original page and would be most useful in downstream applications like a RAG Q&A agent.
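The injection step could be sketched as a small pure function. Note that the `[FIGURE]` placeholder and the `<figure-description>` tag are hypothetical conventions here, assuming the linearized page text marks each figure with a known token:

```python
def inject_figure_descriptions(page_text: str, descriptions: list[str],
                               placeholder: str = "[FIGURE]") -> str:
    """Replace each figure placeholder in the linearized page text with
    the corresponding LLM-generated description, in document order."""
    parts = page_text.split(placeholder)
    result = parts[0]
    for i, part in enumerate(parts[1:]):
        # Fall back to the bare placeholder if we ran out of descriptions.
        desc = descriptions[i] if i < len(descriptions) else placeholder
        result += f"<figure-description>{desc}</figure-description>" + part
    return result
```

A downstream consumer (e.g. a RAG indexer) would then see the figure description inline, at the position where the figure appeared on the page.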

I was wondering if this pipeline can be implemented in Lakechain. I saw that several components are already present, but I did not find anything related to calling Textract yet, and some other steps seem to be missing as well.

NB: I plan to write an article about this pipeline once I find a robust way of implementing it (hopefully in Lakechain). Please keep this idea within this issue until then.

@HQarroum HQarroum moved this to In review in Project Lakechain Jul 29, 2024
@HQarroum
Contributor

I can echo everything you mentioned about PDF parsing across all of our internal experiments; it is quite complicated. For a customer, we ended up doing exactly what you described: cropping a portion of a PDF page and combining Textract with a vision model, and it worked quite well in that specific case.

To answer your question, there aren't any Textract middlewares yet because it is quite difficult to abstract away the results provided by Textract into something that other middlewares can consume, but I haven't given up on the idea.

Regarding your pipeline idea, the difficulty would be identifying the figure reliably within the PDF page, especially if the figure layout tends to change across PDFs (i.e. it does not reside within fixed bounding box coordinates), which was our case (hand-written or scanned pages). We ended up using a table detection model, extracting the bounding box as an image, and passing it to a multimodal model for data extraction (you can do that reliably using a Tool now with the Bedrock API).
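For reference, the Tool-based structured extraction works by declaring a JSON schema the model must fill in. A minimal sketch of such a tool specification for the Bedrock Converse API follows; the `extract_table` name and schema fields are illustrative, not part of Lakechain:

```python
# Illustrative tool specification for structured table extraction via the
# Bedrock Converse API. The "extract_table" name and the schema are made up
# for this example; only the toolSpec/inputSchema envelope follows the API.
extract_table_tool = {
    "toolSpec": {
        "name": "extract_table",
        "description": "Extract the table shown in the image as headers and rows.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "headers": {"type": "array", "items": {"type": "string"}},
                    "rows": {
                        "type": "array",
                        "items": {"type": "array", "items": {"type": "string"}},
                    },
                },
                "required": ["headers", "rows"],
            }
        },
    }
}

# The spec is then passed alongside the page image in the request, e.g.:
# bedrock_runtime.converse(modelId=..., messages=[...],
#                          toolConfig={"tools": [extract_table_tool]})
```

Forcing the model to answer through the schema makes the extracted rows machine-readable instead of free-form prose.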

I don't think you can do all of that natively using the existing middlewares as you found out. I'd recommend you use a more custom approach to implement this logic (sorry for that!).

@mrtj
Author

mrtj commented Jul 29, 2024

I would like to add some more information about my experiments. First, I cropped the figures based on the bounding box coordinates returned by Textract, further processed by the textractor library. It was really nothing complicated; some pseudo-code (in Python) looked like this:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

def crop_figure(page, figure):
    # Bounding box coordinates are normalized (0..1); scale them to the
    # pixel dimensions of the rasterized page image before cropping.
    bbox = figure.bbox
    width, height = page.image.size
    return page.image.crop((
        bbox.x * width,
        bbox.y * height,
        (bbox.x + bbox.width) * width,
        (bbox.y + bbox.height) * height
    ))

extractor = Textractor()
document = extractor.start_document_analysis(
    file_source="./complex-layout.pdf",
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    save_image=True
)

for page_idx, page in enumerate(document.pages):
    for fig_idx, fig in enumerate(page.page_layout.figures):
        img = crop_figure(page, fig)
        img.save(f"page{page_idx:04d}-fig{fig_idx:04d}.png")

Regarding saving the Textract output: I would simply save the raw JSON that the Textract service returns in an S3 bucket. Later on, users could parse it, maybe with funclets, or it would be even more convenient to be able to use the textractor library. However, to do so, there should be a way to inject custom, user-written Python code into the Lambda workers, and I have no idea how to do that.
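As an illustration of what parsing the stored JSON could look like without textractor, the layout figures can be pulled out of the raw response with a few lines of plain Python (assuming the LAYOUT feature was requested, so `LAYOUT_FIGURE` blocks are present):

```python
def figure_bounding_boxes(textract_response: dict) -> list[dict]:
    """Return the normalized bounding boxes (Left/Top/Width/Height, each in
    the 0..1 range) of all LAYOUT_FIGURE blocks in a raw Textract response."""
    return [
        block["Geometry"]["BoundingBox"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LAYOUT_FIGURE"
    ]
```

A funclet could apply the same filtering on the JSON stored in S3 before handing the coordinates to a cropping step.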

As an alternative solution, I think it would already be very useful to have a middleware that calls Textract, parses the result with textractor, and saves the result in Markdown/HTML format as its output. It would definitely work better than the Python PDF parsing libraries.

from textractor.data.html_linearization_config import HTMLLinearizationConfig
config = HTMLLinearizationConfig()
# maybe allow users to customize the linearization config using middleware params
html_text = document.get_text(config)
# save the html_text into s3

@HQarroum
Contributor

HQarroum commented Jul 30, 2024

I tested textractor this evening and it is pretty cool; it works very well. I think that creating a Textract middleware based on textractor makes a lot of sense.

I came up with the following tentative design for this middleware's API. I think it covers most of the capabilities offered by textractor. What do you think?

Table data extraction.
Input(s) : PDF, Images
Output(s) : 'markdown' | 'text' | 'excel' | 'csv' | 'html'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('markdown' | 'text' | 'excel' | 'csv' | 'html')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();

Key value pair extraction.
Input(s) : PDF, Images
Output(s) : 'json' | 'csv'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    .withOutputType('json' | 'csv')
    .build())
  .build();

Visualize task.
Input(s) : PDF, Images
Output(s) : One or multiple images

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis.
Input(s) : PDF, Images
Output(s) : CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID Analysis.
Input(s) : PDF, Images
Output(s) : JSON, CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    .withOutputType('json' | 'csv')
    .build())
  .build();

Layout Analysis.
Input(s) : PDF, Images
Output(s) : PDF, Images + Metadata
Exports layout information in a structured way in the document metadata.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();

@mrtj
Author

mrtj commented Jul 30, 2024

This would be a really great feature! May I also suggest adding the text linearization function of textractor, as described here?

Also, this conversation seems to deviate from the original "create own middlewares" topic, maybe we should continue it in a new issue?

@HQarroum
Contributor

Follow up discussion here - #46.

Projects
Status: In review
Development

No branches or pull requests

2 participants