fix(docx): merged table cells not properly converted (#857)
* fix(docx): merged cells not properly converted

Fix a conversion issue with merged cells in Word tables that led to repeated text.
Simplify Word table conversion code.
Add docx file with several table formats for regression tests.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: add type hinting to docx backend

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
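
The repeated-text problem described in the commit message can be illustrated without any Word library. The sketch below assumes a table arrives as a rectangular grid in which a merged cell shows up once per grid position it covers, always under the same identifier. The grid representation, `Span`, and `collapse_merged_cells` are hypothetical names chosen for illustration; they are not taken from `msword_backend.py`, whose diff is not rendered on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One logical cell: its text, top-left grid position, and extent."""
    text: str
    row: int
    col: int
    row_span: int
    col_span: int


def collapse_merged_cells(grid):
    """Turn a grid of (cell_id, text) pairs into one Span per logical cell.

    Assumption: a merged cell occupies a rectangle of grid positions that
    all carry the same cell_id, and every row has the same length.
    """
    seen = set()
    spans = []
    for r, row in enumerate(grid):
        for c, (cell_id, text) in enumerate(row):
            if cell_id in seen:
                continue  # already emitted from its top-left position
            seen.add(cell_id)
            # Extend right while the same id repeats in this row.
            col_span = 1
            while c + col_span < len(row) and row[c + col_span][0] == cell_id:
                col_span += 1
            # Extend down while the same id repeats in this column.
            row_span = 1
            while r + row_span < len(grid) and grid[r + row_span][c][0] == cell_id:
                row_span += 1
            spans.append(Span(text, r, c, row_span, col_span))
    return spans


# First two rows of the "horizontal spans" table from the regression file:
# in row 1, columns 1 and 2 are one merged cell.
grid = [
    [("a", "Header 0.0"), ("b", "Header 0.1"), ("c", "Header 0.2")],
    [("d", "Cell 1.0"), ("e", "Merged Cell 1.1 1.2"), ("e", "Merged Cell 1.1 1.2")],
]
spans = collapse_merged_cells(grid)
```

Emitting each identifier only once, from its top-left position, is what keeps the merged cell's text from appearing twice; the span sizes then carry the layout information that a `colspan` or `rowspan` attribute would express.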
ceberam authored Feb 3, 2025
1 parent eff16b6 commit 0cd81a8
Showing 8 changed files with 2,715 additions and 124 deletions.
265 changes: 142 additions & 123 deletions docling/backend/msword_backend.py

Large diffs are not rendered by default.

Binary file added tests/data/docx/word_tables.docx
Binary file not shown.
75 changes: 75 additions & 0 deletions tests/data/groundtruth/docling_v2/word_tables.docx.html
@@ -0,0 +1,75 @@
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="icon" type="image/png"
href="https://ds4sd.github.io/docling/assets/logo.png"/>
<meta charset="UTF-8">
<title>
Powered by Docling
</title>
<style>
html {
background-color: LightGray;
}
body {
margin: 0 auto;
width:800px;
padding: 30px;
background-color: White;
font-family: Arial, sans-serif;
box-shadow: 10px 10px 10px grey;
}
figure{
display: block;
width: 100%;
margin: 0px;
margin-top: 10px;
margin-bottom: 10px;
}
img {
display: block;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
max-width: 640px;
max-height: 640px;
}
table {
min-width:500px;
background-color: White;
border-collapse: collapse;
cell-padding: 5px;
margin: auto;
margin-top: 10px;
margin-bottom: 10px;
}
th, td {
border: 1px solid black;
padding: 8px;
}
th {
font-weight: bold;
}
table tr:nth-child(even) td{
background-color: LightGray;
}
</style>
</head>
<h2>Test with tables</h2>
<p>A uniform table</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td>Cell 1.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.1</td><td>Cell 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with horizontal spans in inner columns</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td>Header 0.3</td></tr><tr><td>Cell 1.0</td><td colspan="2">Merged Cell 1.1 1.2</td><td>Cell 1.3</td></tr><tr><td>Cell 2.0</td><td colspan="2">Merged Cell 2.1 2.2</td><td>Cell 2.3</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with vertical spans</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td></tr></tbody></table>
<p></p>
<p>A non-uniform table with all kinds of spans and empty cells</p>
<table><tbody><tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td><td></td><td></td></tr><tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td><td></td><td></td></tr><tr><td>Cell 2.0</td><td>Cell 2.2</td><td></td><td></td></tr><tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td><td rowspan="3"></td><td></td></tr><tr><td>Cell 4.0</td><td>Cell 4.2</td><td rowspan="2">Merged Cell 4.4 5.4</td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td colspan="5"></td></tr><tr><td></td><td></td><td></td><td></td><td>Cell 8.4</td></tr></tbody></table>
<p></p>
<p></p>
</html>
19 changes: 19 additions & 0 deletions tests/data/groundtruth/docling_v2/word_tables.docx.itxt
@@ -0,0 +1,19 @@
item-0 at level 0: unspecified: group _root_
item-1 at level 1: section: group header-0
item-2 at level 2: section_header: Test with tables
item-3 at level 3: paragraph: A uniform table
item-4 at level 3: table with [3x3]
item-5 at level 3: paragraph:
item-6 at level 3: paragraph: A non-uniform table with horizontal spans
item-7 at level 3: table with [3x3]
item-8 at level 3: paragraph:
item-9 at level 3: paragraph: A non-uniform table with horizontal spans in inner columns
item-10 at level 3: table with [3x4]
item-11 at level 3: paragraph:
item-12 at level 3: paragraph: A non-uniform table with vertical spans
item-13 at level 3: table with [5x3]
item-14 at level 3: paragraph:
item-15 at level 3: paragraph: A non-uniform table with all kinds of spans and empty cells
item-16 at level 3: table with [9x5]
item-17 at level 3: paragraph:
item-18 at level 3: paragraph:
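
The table sizes listed in the `.itxt` ground truth (for example `[5x3]`) can be cross-checked against the HTML ground truth by counting the grid positions each cell covers. The following is a minimal sketch using only the standard library; `CellSpanCollector` and `grid_shape` are hypothetical helpers, not part of the docling test suite, and the sketch assumes the input holds a single, non-nested table.

```python
from html.parser import HTMLParser


class CellSpanCollector(HTMLParser):
    """Record (rowspan, colspan) for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attr = dict(attrs)
            self.rows[-1].append(
                (int(attr.get("rowspan", 1)), int(attr.get("colspan", 1)))
            )


def grid_shape(table_html):
    """Return (n_rows, n_cols) of the grid covered by a single HTML table."""
    collector = CellSpanCollector()
    collector.feed(table_html)
    occupied = set()
    for r, cells in enumerate(collector.rows):
        c = 0
        for row_span, col_span in cells:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            for dr in range(row_span):
                for dc in range(col_span):
                    occupied.add((r + dr, c + dc))
            c += col_span
    n_rows = max((r for r, _ in occupied), default=-1) + 1
    n_cols = max((c for _, c in occupied), default=-1) + 1
    return n_rows, n_cols


# The "vertical spans" table, copied from word_tables.docx.html above.
vertical_spans = (
    "<table><tbody>"
    "<tr><td>Header 0.0</td><td>Header 0.1</td><td>Header 0.2</td></tr>"
    '<tr><td>Cell 1.0</td><td rowspan="2">Merged Cell 1.1 2.1</td><td>Cell 1.2</td></tr>'
    "<tr><td>Cell 2.0</td><td>Cell 2.2</td></tr>"
    '<tr><td>Cell 3.0</td><td rowspan="2">Merged Cell 3.1 4.1</td><td>Cell 3.2</td></tr>'
    "<tr><td>Cell 4.0</td><td>Cell 4.2</td></tr>"
    "</tbody></table>"
)
shape = grid_shape(vertical_spans)
```

For this table, `shape` agrees with the `table with [5x3]` entry at item-13, even though two of the five rows contain only two `<td>` elements.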