DS4SD · dolfim-ibm · Oct 17, 2024 · Oct 16, 2024 · Oct 16, 2024 · Oct 17, 2024
diff --git a/.github/workflows/cd.yml b/.github/workflows/cd.yml
@@ -10,19 +10,14 @@ env:
   PYTHON_KEYRING_BACKEND: keyring.backends.null.Keyring
 
 jobs:
-  # To be enabled when we add docs
-  # docs:
-  #   permissions:
-  #     contents: write
-  #   runs-on: ubuntu-latest
-  #   steps:
-  #     - uses: actions/checkout@v3
-  #     - uses: ./.github/actions/setup-poetry
-  #     - name: Build and push docs
-  #       run: poetry run mkdocs gh-deploy --force
-
   code-checks:
     uses: ./.github/workflows/checks.yml
+  build-deploy-docs:
+    uses: ./.github/workflows/docs.yml
+    with:
+      deploy: true
+    permissions:
+      contents: write
   pre-release-check:
     runs-on: ubuntu-latest
     outputs:

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -16,13 +16,7 @@ env:
 jobs:
   code-checks:
     uses: ./.github/workflows/checks.yml
-
-    # To enable when we add the ./docs
-  # build-docs:
-  #   runs-on: ubuntu-latest
-  #   steps:
-  #     - uses: actions/checkout@v3
-  #     - uses: ./.github/actions/setup-poetry
-  #     - name: Build docs
-  #       run: poetry run mkdocs build --verbose --clean
-
+  build-docs:
+    uses: ./.github/workflows/docs.yml
+    with:
+      deploy: false
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,28 @@
+on:
+    workflow_call:
+        inputs:
+            deploy:
+                type: boolean
+                description: "If true, the docs will be deployed."
+                default: false
+
+jobs:
+    run-docs:
+        runs-on: ubuntu-latest
+        steps:
+        - uses: actions/checkout@v4
+        - name: Install poetry
+          run: pipx install poetry==1.8.3
+          shell: bash
+        - uses: actions/setup-python@v5
+          with:
+              cache: 'poetry'
+        - name: Install dependencies
+          run: poetry install --only docs
+          shell: bash
+        - name: Build docs
+          run: poetry run mkdocs build --verbose --clean
+        - name: Build and push docs
+          if: inputs.deploy
+          run: poetry run mkdocs gh-deploy --force
+
diff --git a/README.md b/README.md
@@ -53,7 +53,6 @@ source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
 converter = DocumentConverter()
 result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
-print(result.document.export_to_document_tokens())  # output: "<document><title><page_1><loc_20>..."
 ```
 
 

diff --git a/docs/concepts/docling_format.md → docs/concepts/docling_document.md b/docs/concepts/docling_format.md → docs/concepts/docling_document.md
@@ -1,4 +1,4 @@
-With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a 
+With Docling v2, we introduce a unified document representation format called `DoclingDocument`. It is defined as a
 pydantic datatype, which can express several features common to documents, such as:
 
 * Text, Tables, Pictures, and more
@@ -9,15 +9,16 @@ pydantic datatype, which can express several features common to documents, such
 
 It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
 
-# Example document structures
+## Example document structures
 
-To illustrate the features of the `DoclingDocument` format, consider the following side-by-side comparison of a
-`DoclingDocument` converted from `test/data/word_sample.docx`. Left side shows snippets from the converted document 
-serialized as YAML, right side shows the corresponding visual parts in MS Word.
+To illustrate the features of the `DoclingDocument` format, the subsections below present side-by-side
+comparisons based on the `DoclingDocument` converted from `tests/data/word_sample.docx`. In each comparison,
+the left side shows snippets of the converted document serialized as YAML, and the right side shows the
+corresponding parts of the original MS Word document.
 
-## Basic structure
+### Basic structure
 
-A `DoclingDocument` exposes top-level fields for the document content, organized in two categories. 
+A `DoclingDocument` exposes top-level fields for the document content, organized in two categories.
 The first category is the _content items_, which are stored in these fields:
 
 - `texts`: All items that have a text representation (paragraph, section heading, equation, ...). Base class is `TextItem`.
@@ -34,32 +35,34 @@ The second category is _content structure_, which is encapsualted in:
 - `furniture`: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)
 - `groups`: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)
 
-All of the above fields are only storing `NodeItem` instances, which reference children and parents 
-through JSON pointers. 
+All of the above fields are only storing `NodeItem` instances, which reference children and parents
+through JSON pointers.
 
 The reading order of the document is encapsulated through the `body` tree and the order of _children_ in each item
 in the tree.
 
-Below example shows how all items in the first page are nested below the `title` item (`#/texts/1`). 
+The example below shows how all items on the first page are nested below the `title` item (`#/texts/1`).
 
 ![doc_hierarchy_1](../assets/docling_doc_hierarchy_1.png)
 
-## Grouping
+### Grouping
 
 Below example shows how all items under the heading "Let's swim" (`#/texts/5`) are nested as chilrden. The children of
-"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the 
+"Let's swim" are both text items and groups, which contain the list elements. The group items are stored in the
 top-level `groups` field.
 
 ![doc_hierarchy_2](../assets/docling_doc_hierarchy_2.png)
 
-## Tables
+<!--
+### Tables
 
 TBD
 
-## Pictures
+### Pictures
 
 TBD
 
-## Provenance
+### Provenance
 
-TBD
+TBD
+ -->
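
The `NodeItem` references described in `docs/concepts/docling_document.md` (for example `#/texts/1`) use JSON pointer syntax. Below is a minimal sketch of how such a reference could be resolved against a document loaded as a plain dictionary. The `resolve_ref` helper, the toy data, and the `$ref` key used for child references are assumptions made for illustration, not Docling's API.

```python
from typing import Any


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Resolve a JSON-pointer-style reference such as '#/texts/1' (hypothetical helper)."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node: Any = doc
    for part in ref[2:].split("/"):
        # JSON pointer escapes (RFC 6901): '~1' stands for '/', '~0' for '~'
        part = part.replace("~1", "/").replace("~0", "~")
        # Lists are indexed by integer position, dicts by key
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Toy document with made-up content; only the pointer layout mirrors the text above
doc = {
    "texts": [
        {"self_ref": "#/texts/0", "text": "Intro", "children": []},
        {"self_ref": "#/texts/1", "text": "Title", "children": [{"$ref": "#/texts/2"}]},
        {"self_ref": "#/texts/2", "text": "First paragraph", "children": []},
    ],
}

title = resolve_ref(doc, "#/texts/1")
child_texts = [resolve_ref(doc, child["$ref"])["text"] for child in title["children"]]
print(child_texts)
```

Walking the `children` list in order is what yields the reading order within a subtree, as the document describes for the `body` tree.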
diff --git a/docs/usage.md b/docs/usage.md
@@ -0,0 +1,171 @@
+## Conversion
+
+### Convert a single document
+
+To convert individual PDF documents, use `convert()`, for example:
+
+```python
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
+converter = DocumentConverter()
+result = converter.convert(source)
+print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
+```
+
+### CLI
+
+You can also use Docling directly from the command line to convert individual files (local or by URL) or whole directories.
+
+A simple example would look like this:
+```console
+docling https://arxiv.org/pdf/2206.01062
+```
+
+To see all available options (export formats etc.) run `docling --help`.
+
+<details>
+  <summary><b>CLI reference</b></summary>
+
+  Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
+
+  ```console
+  $ docling --help
+
+ Usage: docling [OPTIONS] source
+
+╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None]         │
+│                                 [required]                                                                                │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ --from                                     [docx|pptx|html|image|pdf]         Specify input formats to convert from.      │
+│                                                                               Defaults to all formats.                    │
+│                                                                               [default: None]                             │
+│ --to                                       [md|json|text|doctags]             Specify output formats. Defaults to         │
+│                                                                               Markdown.                                   │
+│                                                                               [default: None]                             │
+│ --ocr               --no-ocr                                                  If enabled, the bitmap content will be      │
+│                                                                               processed using OCR.                        │
+│                                                                               [default: ocr]                              │
+│ --ocr-engine                               [easyocr|tesseract_cli|tesseract]  The OCR engine to use. [default: easyocr]   │
+│ --abort-on-error    --no-abort-on-error                                       If enabled, the bitmap content will be      │
+│                                                                               processed using OCR.                        │
+│                                                                               [default: no-abort-on-error]                │
+│ --output                                   PATH                               Output directory where results are saved.   │
+│                                                                               [default: .]                                │
+│ --version                                                                     Show version information.                   │
+│ --help                                                                        Show this message and exit.                 │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+  ```
+</details>
+
+
+
+### Advanced options
+
+#### Adjust pipeline features
+
+The example file [custom_convert.py](./examples/custom_convert.py) contains multiple ways
+one can adjust the conversion pipeline and features.
+
+
+##### Control PDF table extraction options
+
+You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses the text cells from the structure prediction itself.
+Using the predicted text cells can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+Since docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (default) or `TableFormerMode.ACCURATE`, which is slower but gives better results on difficult table structures.
+
+```python
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, PdfFormatOption
+from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
+
+pipeline_options = PdfPipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
+
+doc_converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    }
+)
+```
+
+#### Impose limits on the document size
+
+You can limit the file size and the number of pages allowed per document:
+
+```python
+from docling.document_converter import DocumentConverter
+
+source = "https://arxiv.org/pdf/2408.09869"
+converter = DocumentConverter()
+# Allow at most 100 pages and 20 MiB (20 * 1024 * 1024 = 20971520 bytes) per document
+result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
+```
+
+#### Convert from binary PDF streams
+
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+
+```python
+from io import BytesIO
+from docling.datamodel.base_models import DocumentStream
+from docling.document_converter import DocumentConverter
+
+buf = BytesIO(your_binary_stream)  # your_binary_stream: the PDF content as bytes
+source = DocumentStream(name="my_doc.pdf", stream=buf)
+converter = DocumentConverter()
+result = converter.convert(source)
+```
+
+#### Limit resource usage
+
+You can limit the number of CPU threads Docling uses by setting the environment variable `OMP_NUM_THREADS`. By default, Docling uses 4 CPU threads.
+
+
+## Chunking
+
+You can perform a hierarchy-aware chunking of a Docling document as follows:
+
+```python
+from docling.document_converter import DocumentConverter
+from docling_core.transforms.chunker import HierarchicalChunker
+
+conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
+doc = conv_res.document
+chunks = list(HierarchicalChunker().chunk(doc))
+
+print(chunks[30])
+# {
+#   "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
+#   "meta": {
+#     "doc_items": [{
+#       "self_ref": "#/texts/40",
+#       "label": "text",
+#       "prov": [{
+#         "page_no": 2,
+#         "bbox": {"l": 317.06, "t": 325.81, "r": 559.18, "b": 239.97, ...},
+#       }]
+#     }],
+#     "headings": ["2 RELATED WORK"],
+#   }
+# }
+```
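
The chunk printed in the chunking example above carries its section context in `meta.headings`. Below is a minimal sketch of how chunk texts could be grouped by that context, assuming the chunks have already been turned into plain dictionaries of the printed shape. The `group_texts_by_heading` helper and the second and third sample entries are made up for illustration and are not part of Docling.

```python
from collections import defaultdict


def group_texts_by_heading(chunks):
    """Group chunk texts by their heading path (hypothetical helper)."""
    grouped = defaultdict(list)
    for chunk in chunks:
        headings = chunk.get("meta", {}).get("headings") or []
        # Join nested headings into one key; fall back to a placeholder if there are none
        key = " > ".join(headings) if headings else "(no heading)"
        grouped[key].append(chunk["text"])
    return dict(grouped)


# Sample data: the first entry mirrors the printed chunk, the others are invented
sample_chunks = [
    {
        "text": "Lately, new types of ML models for document-layout analysis have emerged [...]",
        "meta": {"headings": ["2 RELATED WORK"]},
    },
    {"text": "A second paragraph in the same section.", "meta": {"headings": ["2 RELATED WORK"]}},
    {"text": "A paragraph without any heading.", "meta": {}},
]

grouped = group_texts_by_heading(sample_chunks)
for heading, texts in grouped.items():
    print(heading, len(texts))
```

Keeping the heading path next to each text is useful when chunks are later embedded or retrieved on their own, since the section title often disambiguates the passage.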