Skip to content

Unstructured-IO/pipeline-document-layout

Pre-Processing Pipeline for Layout Detection

The description for the pipeline repository goes here. The API is hosted at https://api.unstructured.io.

Developer Quick Start

  • Using pyenv to manage virtualenv's is recommended

    • Mac install instructions. See here for more detailed instructions.

      • brew install pyenv-virtualenv
      • pyenv install 3.8.15
    • Linux instructions are available here.

    • Create a virtualenv to work in and activate it, e.g. for one named document_layout:

      pyenv virtualenv 3.8.15 document_layout
      pyenv activate document_layout

  • Run make install

  • Run pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'

  • Start a local jupyter notebook server with make run-jupyter
    OR
    just start the fast-API locally with make run-web-app

Extracting whatever from some type of document

For example:

curl -X 'POST' \
  'http://localhost:8000/document-layout/v1.0.0/layout' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/example.png' -F 'model_type=yolox'| jq -C . | less -R

Where files includes the file to process, model_type can be 'default' (or blank) or 'yolox', also is possible to use force_ocr to auto in order to try text extraction from your file, or 'true', in which case OCR will be used.

Generating Python files from the pipeline notebooks

You can generate the FastAPI APIs from your pipeline notebooks by running make generate-api.

Security Policy

See our security policy for information on how to report security vulnerabilities.

Learn more

Section Description
Unstructured Community Github Information about Unstructured.io community projects
Unstructured Github Unstructured.io open source repositories
Company Website Unstructured.io product and company info