feat: add figure in markdown (#98)
* feat: add figures in markdown

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update to new docling-core and update test results with figures

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update with improved docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
dolfim-ibm authored Sep 24, 2024
1 parent 001d214 commit 6a03c20
Showing 9 changed files with 284 additions and 58 deletions.
3 changes: 3 additions & 0 deletions docling/datamodel/document.py
@@ -324,15 +324,18 @@ def render_as_markdown(
"paragraph",
"caption",
"table",
"figure",
],
strict_text: bool = False,
image_placeholder: str = "<!-- image -->",
):
return self.output.export_to_markdown(
delim=delim,
main_text_start=main_text_start,
main_text_stop=main_text_stop,
main_text_labels=main_text_labels,
strict_text=strict_text,
image_placeholder=image_placeholder,
)

def render_as_text(
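For context, a minimal usage sketch of the new `image_placeholder` option. The `DocumentConverter` import path, the `convert_single` entry point, and the sample file name are assumptions for illustration (v1-era API); only `render_as_markdown` and the `image_placeholder` default come from this diff:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Hypothetical usage: convert a single PDF with the v1-era entry point.
doc = converter.convert_single("2203.01017v2.pdf")

# With this commit, figures are kept in the Markdown export as a
# placeholder comment; the default is "<!-- image -->".
md = doc.render_as_markdown(image_placeholder="<!-- image -->")
print(md)
```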
37 changes: 4 additions & 33 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -23,7 +23,7 @@ packages = [{include = "docling"}]
 [tool.poetry.dependencies]
 python = "^3.10"
 pydantic = "^2.0.0"
-docling-core = "^1.5.0"
+docling-core = "^1.6.2"
 docling-ibm-models = "^1.2.0"
 deepsearch-glm = "^0.21.1"
 filetype = "^1.2.0"
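Note on the dependency bump: Poetry's caret constraint `^1.6.2` resolves to `>=1.6.2,<2.0.0`, so this picks up the docling-core release that, per the commit message, provides the improved figure handling used by the Markdown export.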
42 changes: 35 additions & 7 deletions tests/data/2203.01017v2.md
@@ -14,16 +14,19 @@ The occurrence of tables in documents is ubiquitous. They often summarise quanti

Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.

Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.
| | 3 | 1 |
|----|-----|-----|
| 2 | | |

b. Red-annotation of bounding boxes, Blue-predictions by TableFormer


<!-- image -->

c. Structure predicted by TableFormer:

Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.

| 0 | 1 | 1 | 2 1 | 2 1 | |
|-----|-----|-----|-------|-------|----|
| 3 | 4 | 5 3 | 6 | 7 | |
@@ -76,6 +79,7 @@ Hybrid Deep Learning-Rule-Based approach : A popular current model for table-str
We rely on large-scale datasets such as PubTabNet [37], FinTabNet [36], and TableBank [17] datasets to train and evaluate our models. These datasets span over various appearance styles and content. We also introduce our own synthetically generated SynthTabNet dataset to fix an im-

Figure 2: Distribution of the tables across different table dimensions in PubTabNet + FinTabNet datasets
<!-- image -->

balance in the previous datasets.

@@ -94,7 +98,6 @@ Motivated by those observations we aimed at generating a synthetic table dataset
In this regard, we have prepared four synthetic datasets, each one containing 150k examples. The corpora to generate the table text consists of the most frequent terms appearing in PubTabNet and FinTabNet together with randomly generated text. The first two synthetic datasets have been fine-tuned to mimic the appearance of the original datasets but encompass more complicated table structures. The third

Table 1: Both "Combined-Tabnet" and "CombinedTabnet" are variations of the following: (*) The CombinedTabnet dataset is the processed combination of PubTabNet and Fintabnet. (**) The combined dataset is the processed combination of PubTabNet, Fintabnet and TableBank.

| | Tags | Bbox | Size | Format |
|--------------------|--------|--------|--------|----------|
| PubTabNet | 3 | 3 | 509k | PNG |
@@ -119,8 +122,10 @@ We now describe in detail the proposed method, which is composed of three main c
CNN Backbone Network. A ResNet-18 CNN is the backbone that receives the table image and encodes it as a vector of predefined length. The network has been modified by removing the linear and pooling layer, as we are not per-

Figure 3: TableFormer takes in an image of the PDF and creates bounding box and HTML structure predictions that are synchronized. The bounding boxes grabs the content from the PDF and inserts it in the structure.
<!-- image -->

Figure 4: Given an input image of a table, the Encoder produces fixed-length features that represent the input image. The features are then passed to both the Structure Decoder and Cell BBox Decoder . During training, the Structure Decoder receives 'tokenized tags' of the HTML code that represent the table structure. Afterwards, a transformer encoder and decoder architecture is employed to produce features that are received by a linear layer, and the Cell BBox Decoder. The linear layer is applied to the features to predict the tags. Simultaneously, the Cell BBox Decoder selects features referring to the data cells (' < td > ', ' < ') and passes them through an attention network, an MLP, and a linear layer to predict the bounding boxes.
<!-- image -->

forming classification, and adding an adaptive pooling layer of size 28*28. ResNet by default downsamples the image resolution by 32 and then the encoded image is provided to both the Structure Decoder , and Cell BBox Decoder .

@@ -175,7 +180,6 @@ where T$_{a}$ and T$_{b}$ represent tables in tree structure HTML format. EditDi
Structure. As shown in Tab. 2, TableFormer outperforms all SOTA methods across different datasets by a large margin for predicting the table structure from an image. All the more, our model outperforms pre-trained methods. During the evaluation we do not apply any table filtering. We also provide our baseline results on the SynthTabNet dataset. It has been observed that large tables (e.g. tables that occupy half of the page or more) yield poor predictions. We attribute this issue to the image resizing during the preprocessing step, that produces downsampled images with indistinguishable features. This problem can be addressed by treating such big tables with a separate model which accepts a large input image size.

Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).

| Model | Dataset | Simple | TEDS Complex | All |
|-------------|-----------|----------|----------------|-------|
| EDD | PTN | 91.1 | 88.7 | 89.9 |
@@ -196,7 +200,6 @@ Cell Detection. Like any object detector, our Cell BBox Detector provides boundi
our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.

Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.

| Model | Dataset | mAP | mAP (PP) |
|-------------|-------------|-------|------------|
| EDD+BBox | PubTabNet | 79.2 | 82.7 |
@@ -206,7 +209,6 @@ Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Po
Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.

Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.

| Model | Simple | TEDS Complex | All |
|-------------|----------|----------------|-------|
| Tabula | 78 | 57.8 | 67.9 |
@@ -222,8 +224,15 @@ Japanese language (previously unseen by TableFormer):

Example table from FinTabNet:


<!-- image -->


<!-- image -->

b. Structure predicted by TableFormer, with superimposed matched PDF cell text:


| | | 論文ファイル | 論文ファイル | 参考文献 | 参考文献 |
|----------------------------------------------------|-------------|----------------|----------------|------------|------------|
| 出典 | ファイル 数 | 英語 | 日本語 | 英語 | 日本語 |
@@ -237,7 +246,6 @@ b. Structure predicted by TableFormer, with superimposed matched PDF cell text:
| | 945 | 294 | 651 | 1122 | 955 |

Text is aligned to match original for ease of viewing

| | Shares (in millions) | Shares (in millions) | Weighted Average Grant Date Fair Value | Weighted Average Grant Date Fair Value |
|--------------------------|------------------------|------------------------|------------------------------------------|------------------------------------------|
| | RS U s | PSUs | RSUs | PSUs |
@@ -248,8 +256,13 @@ Text is aligned to match original for ease of viewing
| Nonvested on December 31 | 1.0 | 0.3 | 104.85 $ | $ 104.51 |

Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.
<!-- image -->


<!-- image -->

Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.
<!-- image -->

## 5.5. Qualitative Analysis

@@ -380,6 +393,7 @@ The process of generating a synthetic dataset can be decomposed into the followi
Although TableFormer can predict the table structure and the bounding boxes for tables recognized inside PDF documents, this is not enough when a full reconstruction of the original table is required. This happens mainly due the following reasons:

Figure 7: Distribution of the tables across different dimensions per dataset. Simple vs complex tables per dataset and split, strict vs non strict html structures per dataset and table complexity, missing bboxes per dataset and table complexity.
<!-- image -->

· TableFormer output does not include the table cell content.

@@ -432,19 +446,33 @@ Aditional images with examples of TableFormer predictions and post-processing ca
Figure 8: Example of a table with multi-line header.

Figure 9: Example of a table with big empty distance between cells.
<!-- image -->

Figure 10: Example of a complex table with empty cells.
<!-- image -->


<!-- image -->

Figure 11: Simple table with different style and empty cells.
<!-- image -->

Figure 12: Simple table predictions and post processing.
<!-- image -->

Figure 13: Table predictions example on colorful table.

Figure 14: Example with multi-line text.
<!-- image -->

Figure 16: Example of how post-processing helps to restore mis-aligned bounding boxes prediction artifact.
<!-- image -->


<!-- image -->

Figure 15: Example with triangular table.
<!-- image -->

Figure 17: Example of long table. End-to-end example from initial PDF cells to prediction of bounding boxes, post processing and prediction of structure.
<!-- image -->
