Merge pull request #2 from camelot-dev/master

Update fork
camelot-dev · Dec 8, 2020 · 644e17e
2 parents eadc54a + 7709e58
commit 644e17e
Showing 18 changed files with 353 additions and 185 deletions.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,48 @@
+---
+name: Bug report
+about: Please follow this template to submit bug reports.
+title: ''
+labels: bug
+assignees: ''
+
+---
+
+<!-- Please read the filing issues section of the contributor's guide first: https://camelot-py.readthedocs.io/en/master/dev/contributing.html -->
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**Steps to reproduce the bug**
+Steps used to install `camelot`:
+1. Add step here (you can add more steps too)
+
+Steps to reproduce the behavior:
+1. Add step here (you can add more steps too)
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Code**
+Add the Camelot code snippet that you used.
+```
+import camelot
+
+# add your code here
+```
+
+**PDF**
+Add the PDF file that you want to extract tables from.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Environment**
+ - OS: [e.g. MacOS]
+ - Python version:
+ - Numpy version:
+ - OpenCV version:
+ - Ghostscript version:
+ - Camelot version:
+
+**Additional context**
+Add any other context about the problem here.
diff --git a/LICENSE b/LICENSE
@@ -1,12 +1,7 @@
 MIT License
 
-Modifications:
-
-Copyright (c) 2019 Camelot Developers
-
-Original project:
-
-Copyright (c) 2018 Peeply Private Ltd (Singapore)
+Copyright (c) 2019-2020 Camelot Developers
+Copyright (c) 2018-2019 Peeply Private Ltd (Singapore)
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -10,13 +10,13 @@
 [![image](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![image](https://img.shields.io/badge/continous%20quality-deepsource-lightgrey)](https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge)
 
 
-**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files!
+**Camelot** is a Python library that can help you extract tables from PDFs!
 
-**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), which is a web interface for Camelot!
+**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), the web interface to Camelot!
 
 ---
 
-**Here's how you can extract tables from PDF files.** Check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
+**Here's how you can extract tables from PDFs.** You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
 
 <pre>
 >>> import camelot
@@ -46,32 +46,35 @@
 | 2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%           |
 | 4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%           |
 
-There's a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html) too!
+Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html)!
 
 **Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
 
 ## Why Camelot?
 
-- **You are in control.**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
-- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table.
-- Each table is a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873).
-- **Export** to multiple formats, including JSON, Excel, HTML and Sqlite.
+- **Configurability**: Camelot gives you control over the table extraction process with its [tweakable settings](https://camelot-py.readthedocs.io/en/master/user/advanced.html).
+- **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
+- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.
 
-See [comparison with other PDF table extraction libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
+See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
+
+## Support the development
+
+If Camelot has helped you, please consider supporting its development with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot).
 
 ## Installation
 
 ### Using conda
 
-The easiest way to install Camelot is to install it with [conda](https://conda.io/docs/), which is a package manager and  environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
+The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
 
 <pre>
 $ conda install -c conda-forge camelot-py
 </pre>
 
 ### Using pip
 
-After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can simply use pip to install Camelot:
+After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:
 
 <pre>
 $ pip install "camelot-py[cv]"
@@ -94,50 +97,20 @@ $ pip install ".[cv]"
 
 ## Documentation
 
-Great documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
-
-## Development
-
-The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
-
-### Source code
-
-You can check the latest sources with:
-
-<pre>
-$ git clone https://www.github.com/camelot-dev/camelot
-</pre>
-
-### Setting up a development environment
-
-You can install the development dependencies easily, using pip:
-
-<pre>
-$ pip install "camelot-py[dev]"
-</pre>
-
-### Testing
-
-After installation, you can run tests using:
-
-<pre>
-$ python setup.py test
-</pre>
+The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).
 
 ## Wrappers
 
 - [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.
 
+## Contributing
+
+The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.
+
 ## Versioning
 
 Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out [HISTORY.md](https://github.com/camelot-dev/camelot/blob/master/HISTORY.md).
 
 ## License
 
 This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/master/LICENSE) file for details.
-
-## Support the development
-
-You can support our work on Camelot with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot). Organizations who use camelot can also sponsor the project for an acknowledgement on [our documentation site](https://camelot-py.readthedocs.io/en/master/) and this README.
-
-Special thanks to all the users, organizations and contributors that support Camelot!
diff --git a/camelot/handlers.py b/camelot/handlers.py
@@ -70,7 +70,8 @@ def _get_pages(self, filepath, pages):
         if pages == "1":
             page_numbers.append({"start": 1, "end": 1})
         else:
-            infile = PdfFileReader(open(filepath, "rb"), strict=False)
+            instream = open(filepath, "rb")
+            infile = PdfFileReader(instream, strict=False)
             if infile.isEncrypted:
                 infile.decrypt(self.password)
             if pages == "all":
@@ -84,6 +85,7 @@ def _get_pages(self, filepath, pages):
                         page_numbers.append({"start": int(a), "end": int(b)})
                     else:
                         page_numbers.append({"start": int(r), "end": int(r)})
+            instream.close()
         P = []
         for p in page_numbers:
             P.extend(range(p["start"], p["end"] + 1))
@@ -122,7 +124,8 @@ def _save_page(self, filepath, page, temp):
             if rotation != "":
                 fpath_new = "".join([froot.replace("page", "p"), "_rotated", fext])
                 os.rename(fpath, fpath_new)
-                infile = PdfFileReader(open(fpath_new, "rb"), strict=False)
+                instream = open(fpath_new, "rb")
+                infile = PdfFileReader(instream, strict=False)
                 if infile.isEncrypted:
                     infile.decrypt(self.password)
                 outfile = PdfFileWriter()
@@ -134,6 +137,7 @@ def _save_page(self, filepath, page, temp):
                 outfile.addPage(p)
                 with open(fpath, "wb") as f:
                     outfile.write(f)
+                instream.close()
 
     def parse(
         self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs
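
Note on the `handlers.py` hunks above: they keep a reference to the opened stream so it can be closed explicitly once the reader is done. A `with` block gives the same result and also closes the stream if an exception is raised in between. The sketch below is not the code from this commit. `read_header` and `header_of` are hypothetical names, and `read_header` only stands in for a reader such as `PdfFileReader` that consumes an already open stream. It assumes the reader is not needed after the block ends.

```python
import os
import tempfile


def read_header(stream, size=4):
    # Hypothetical stand-in for a reader that consumes an open binary stream
    return stream.read(size)


def header_of(filepath):
    # The with-block closes the stream on normal exit and on exceptions
    with open(filepath, "rb") as instream:
        header = read_header(instream)
    # Report whether the stream was closed when the block ended
    return header, instream.closed


# Usage: write a small throwaway file, read it back, then clean up
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n")
header, closed = header_of(tmp.name)
os.remove(tmp.name)
```
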

diff --git a/camelot/parsers/stream.py b/camelot/parsers/stream.py
@@ -121,6 +121,7 @@ def _group_rows(text, row_tol=2):
         row_y = 0
         rows = []
         temp = []
+
         for t in text:
             # is checking for upright necessary?
             # if t.get_text().strip() and all([obj.upright for obj in t._objs if
@@ -131,8 +132,10 @@ def _group_rows(text, row_tol=2):
                     temp = []
                     row_y = t.y0
                 temp.append(t)
+
         rows.append(sorted(temp, key=lambda t: t.x0))
-        __ = rows.pop(0)  # TODO: hacky
+        if len(rows) > 1:
+            __ = rows.pop(0)  # TODO: hacky
         return rows
 
     @staticmethod
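
Note on the `_group_rows` hunk above: the loop appends an empty placeholder row before the first real row, and the trailing `pop(0)` removes it. The new guard skips the pop when only one row exists, so empty input comes back as one empty row rather than an empty list. The sketch below is a simplified model of that behavior, not the library code. It assumes plain `(x0, y0, label)` tuples instead of pdfminer text objects, uses `math.isclose` for the tolerance check, and `group_rows` is a hypothetical name.

```python
import math


def group_rows(items, row_tol=2):
    """Group (x0, y0, label) tuples into rows of similar y0 (simplified sketch)."""
    row_y = 0
    rows = []
    temp = []
    for item in items:
        y0 = item[1]
        # Start a new row when y0 moves further than row_tol from the current row
        if not math.isclose(row_y, y0, abs_tol=row_tol):
            rows.append(sorted(temp, key=lambda it: it[0]))
            temp = []
            row_y = y0
        temp.append(item)
    rows.append(sorted(temp, key=lambda it: it[0]))
    # Drop the leading placeholder only when there is something after it
    if len(rows) > 1:
        rows.pop(0)
    return rows


sample = [(50, 700, "b"), (10, 700.5, "a"), (10, 680, "c")]
grouped = group_rows(sample)
empty = group_rows([])
```

In this model, `grouped` holds two rows, each sorted left to right, and `empty` is a single empty row.
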
@@ -345,43 +348,46 @@ def _generate_columns_and_rows(self, table_idx, tk):
         else:
             # calculate mode of the list of number of elements in
             # each row to guess the number of columns
-            ncols = max(set(elements), key=elements.count)
-            if ncols == 1:
-                # if mode is 1, the page usually contains not tables
-                # but there can be cases where the list can be skewed,
-                # try to remove all 1s from list in this case and
-                # see if the list contains elements, if yes, then use
-                # the mode after removing 1s
-                elements = list(filter(lambda x: x != 1, elements))
-                if len(elements):
-                    ncols = max(set(elements), key=elements.count)
-                else:
-                    warnings.warn(
-                        f"No tables found in table area {table_idx + 1}"
+            if not len(elements):
+                cols = [(text_x_min, text_x_max)]
+            else:
+                ncols = max(set(elements), key=elements.count)
+                if ncols == 1:
+                    # if mode is 1, the page usually contains not tables
+                    # but there can be cases where the list can be skewed,
+                    # try to remove all 1s from list in this case and
+                    # see if the list contains elements, if yes, then use
+                    # the mode after removing 1s
+                    elements = list(filter(lambda x: x != 1, elements))
+                    if len(elements):
+                        ncols = max(set(elements), key=elements.count)
+                    else:
+                        warnings.warn(
+                            f"No tables found in table area {table_idx + 1}"
+                        )
+                cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
+                cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
+                inner_text = []
+                for i in range(1, len(cols)):
+                    left = cols[i - 1][1]
+                    right = cols[i][0]
+                    inner_text.extend(
+                        [
+                            t
+                            for direction in self.t_bbox
+                            for t in self.t_bbox[direction]
+                            if t.x0 > left and t.x1 < right
+                        ]
                     )
-            cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
-            cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
-            inner_text = []
-            for i in range(1, len(cols)):
-                left = cols[i - 1][1]
-                right = cols[i][0]
-                inner_text.extend(
-                    [
-                        t
-                        for direction in self.t_bbox
-                        for t in self.t_bbox[direction]
-                        if t.x0 > left and t.x1 < right
-                    ]
-                )
-            outer_text = [
-                t
-                for direction in self.t_bbox
-                for t in self.t_bbox[direction]
-                if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
-            ]
-            inner_text.extend(outer_text)
-            cols = self._add_columns(cols, inner_text, self.row_tol)
-            cols = self._join_columns(cols, text_x_min, text_x_max)
+                outer_text = [
+                    t
+                    for direction in self.t_bbox
+                    for t in self.t_bbox[direction]
+                    if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
+                ]
+                inner_text.extend(outer_text)
+                cols = self._add_columns(cols, inner_text, self.row_tol)
+                cols = self._join_columns(cols, text_x_min, text_x_max)
 
         return cols, rows
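
Note on the `_generate_columns_and_rows` hunk above: the column count is guessed as the mode of the per-row element counts, and the new branch avoids calling `max()` on an empty collection. The sketch below isolates that guess. `guess_ncols` is a hypothetical helper, not part of Camelot. Returning `None` for empty input is an assumption made for illustration, where the commit falls back to a single column spanning `text_x_min` to `text_x_max`.

```python
def guess_ncols(elements):
    """Guess the number of columns from per-row element counts (sketch)."""
    if not elements:
        # Nothing to take a mode of; max() would raise ValueError here
        return None
    # Mode of the list: the value that occurs most often
    ncols = max(set(elements), key=elements.count)
    if ncols == 1:
        # Single-element rows may dominate a skewed list, so retry without them
        filtered = [n for n in elements if n != 1]
        if filtered:
            ncols = max(set(filtered), key=filtered.count)
    return ncols


no_rows = guess_ncols([])
typical = guess_ncols([3, 3, 1, 2])
skewed = guess_ncols([1, 1, 1, 4, 4])
only_ones = guess_ncols([1, 1])
```
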