Skip to content

Commit

Permalink
Merge pull request #2 from camelot-dev/master
Browse files Browse the repository at this point in the history
Update fork
  • Loading branch information
anakin87 authored Dec 8, 2020
2 parents eadc54a + 7709e58 commit 644e17e
Show file tree
Hide file tree
Showing 18 changed files with 353 additions and 185 deletions.
48 changes: 48 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
---
name: Bug report
about: Please follow this template to submit bug reports.
title: ''
labels: bug
assignees: ''

---

<!-- Please read the filing issues section of the contributor's guide first: https://camelot-py.readthedocs.io/en/master/dev/contributing.html -->

**Describe the bug**
A clear and concise description of what the bug is.

**Steps to reproduce the bug**
Steps used to install `camelot`:
1. Add step here (you can add more steps too)

Steps to reproduce the behavior:
1. Add step here (you can add more steps too)

**Expected behavior**
A clear and concise description of what you expected to happen.

**Code**
Add the Camelot code snippet that you used.
```
import camelot
# add your code here
```

**PDF**
Add the PDF file that you want to extract tables from.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Environment**
- OS: [e.g. MacOS]
- Python version:
- Numpy version:
- OpenCV version:
- Ghostscript version:
- Camelot version:

**Additional context**
Add any other context about the problem here.
9 changes: 2 additions & 7 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -1,12 +1,7 @@
MIT License

Modifications:

Copyright (c) 2019 Camelot Developers

Original project:

Copyright (c) 2018 Peeply Private Ltd (Singapore)
Copyright (c) 2019-2020 Camelot Developers
Copyright (c) 2018-2019 Peeply Private Ltd (Singapore)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
Expand Down
65 changes: 19 additions & 46 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,13 @@
[![image](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![image](https://img.shields.io/badge/continous%20quality-deepsource-lightgrey)](https://deepsource.io/gh/camelot-dev/camelot/?ref=repository-badge)


**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files!
**Camelot** is a Python library that can help you extract tables from PDFs!

**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), which is a web interface for Camelot!
**Note:** You can also check out [Excalibur](https://github.com/camelot-dev/excalibur), the web interface to Camelot!

---

**Here's how you can extract tables from PDF files.** Check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).
**Here's how you can extract tables from PDFs.** You can check out the PDF used in this example [here](https://github.com/camelot-dev/camelot/blob/master/docs/_static/pdf/foo.pdf).

<pre>
>>> import camelot
Expand Down Expand Up @@ -46,32 +46,35 @@
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |

There's a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html) too!
Camelot also comes packaged with a [command-line interface](https://camelot-py.readthedocs.io/en/master/user/cli.html)!

**Note:** Camelot only works with text-based PDFs and not scanned documents. (As Tabula [explains](https://github.com/tabulapdf/tabula#why-tabula), "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

## Why Camelot?

- **You are in control.**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873).
- **Export** to multiple formats, including JSON, Excel, HTML and Sqlite.
- **Configurability**: Camelot gives you control over the table extraction process with its [tweakable settings](https://camelot-py.readthedocs.io/en/master/user/advanced.html).
- **Metrics**: Bad tables can be discarded based on metrics like accuracy and whitespace, without having to manually look at each table.
- **Output**: Each table is extracted into a **pandas DataFrame**, which seamlessly integrates into [ETL and data analysis workflows](https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873). You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML and Sqlite.

See [comparison with other PDF table extraction libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).
See [comparison with similar libraries and tools](https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools).

## Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot).

## Installation

### Using conda

The easiest way to install Camelot is to install it with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.
The easiest way to install Camelot is with [conda](https://conda.io/docs/), which is a package manager and environment management system for the [Anaconda](http://docs.continuum.io/anaconda/) distribution.

<pre>
$ conda install -c conda-forge camelot-py
</pre>

### Using pip

After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can simply use pip to install Camelot:
After [installing the dependencies](https://camelot-py.readthedocs.io/en/master/user/install-deps.html) ([tk](https://packages.ubuntu.com/bionic/python/python-tk) and [ghostscript](https://www.ghostscript.com/)), you can also just use pip to install Camelot:

<pre>
$ pip install "camelot-py[cv]"
Expand All @@ -94,50 +97,20 @@ $ pip install ".[cv]"

## Documentation

Great documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).

## Development

The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

### Source code

You can check the latest sources with:

<pre>
$ git clone https://www.github.com/camelot-dev/camelot
</pre>

### Setting up a development environment

You can install the development dependencies easily, using pip:

<pre>
$ pip install "camelot-py[dev]"
</pre>

### Testing

After installation, you can run tests using:

<pre>
$ python setup.py test
</pre>
The documentation is available at [http://camelot-py.readthedocs.io/](http://camelot-py.readthedocs.io/).

## Wrappers

- [camelot-php](https://github.com/randomstate/camelot-php) provides a [PHP](https://www.php.net/) wrapper on Camelot.

## Contributing

The [Contributor's Guide](https://camelot-py.readthedocs.io/en/master/dev/contributing.html) has detailed information about contributing issues, documentation, code, and tests.

## Versioning

Camelot uses [Semantic Versioning](https://semver.org/). For the available versions, see the tags on this repository. For the changelog, you can check out [HISTORY.md](https://github.com/camelot-dev/camelot/blob/master/HISTORY.md).

## License

This project is licensed under the MIT License, see the [LICENSE](https://github.com/camelot-dev/camelot/blob/master/LICENSE) file for details.

## Support the development

You can support our work on Camelot with a one-time or monthly donation [on OpenCollective](https://opencollective.com/camelot). Organizations who use camelot can also sponsor the project for an acknowledgement on [our documentation site](https://camelot-py.readthedocs.io/en/master/) and this README.

Special thanks to all the users, organizations and contributors that support Camelot!
8 changes: 6 additions & 2 deletions camelot/handlers.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,8 @@ def _get_pages(self, filepath, pages):
if pages == "1":
page_numbers.append({"start": 1, "end": 1})
else:
infile = PdfFileReader(open(filepath, "rb"), strict=False)
instream = open(filepath, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted:
infile.decrypt(self.password)
if pages == "all":
Expand All @@ -84,6 +85,7 @@ def _get_pages(self, filepath, pages):
page_numbers.append({"start": int(a), "end": int(b)})
else:
page_numbers.append({"start": int(r), "end": int(r)})
instream.close()
P = []
for p in page_numbers:
P.extend(range(p["start"], p["end"] + 1))
Expand Down Expand Up @@ -122,7 +124,8 @@ def _save_page(self, filepath, page, temp):
if rotation != "":
fpath_new = "".join([froot.replace("page", "p"), "_rotated", fext])
os.rename(fpath, fpath_new)
infile = PdfFileReader(open(fpath_new, "rb"), strict=False)
instream = open(fpath_new, "rb")
infile = PdfFileReader(instream, strict=False)
if infile.isEncrypted:
infile.decrypt(self.password)
outfile = PdfFileWriter()
Expand All @@ -134,6 +137,7 @@ def _save_page(self, filepath, page, temp):
outfile.addPage(p)
with open(fpath, "wb") as f:
outfile.write(f)
instream.close()

def parse(
self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs
Expand Down
80 changes: 43 additions & 37 deletions camelot/parsers/stream.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,7 @@ def _group_rows(text, row_tol=2):
row_y = 0
rows = []
temp = []

for t in text:
# is checking for upright necessary?
# if t.get_text().strip() and all([obj.upright for obj in t._objs if
Expand All @@ -131,8 +132,10 @@ def _group_rows(text, row_tol=2):
temp = []
row_y = t.y0
temp.append(t)

rows.append(sorted(temp, key=lambda t: t.x0))
__ = rows.pop(0) # TODO: hacky
if len(rows) > 1:
__ = rows.pop(0) # TODO: hacky
return rows

@staticmethod
Expand Down Expand Up @@ -345,43 +348,46 @@ def _generate_columns_and_rows(self, table_idx, tk):
else:
# calculate mode of the list of number of elements in
# each row to guess the number of columns
ncols = max(set(elements), key=elements.count)
if ncols == 1:
# if mode is 1, the page usually contains not tables
# but there can be cases where the list can be skewed,
# try to remove all 1s from list in this case and
# see if the list contains elements, if yes, then use
# the mode after removing 1s
elements = list(filter(lambda x: x != 1, elements))
if len(elements):
ncols = max(set(elements), key=elements.count)
else:
warnings.warn(
f"No tables found in table area {table_idx + 1}"
if not len(elements):
cols = [(text_x_min, text_x_max)]
else:
ncols = max(set(elements), key=elements.count)
if ncols == 1:
# if mode is 1, the page usually contains not tables
# but there can be cases where the list can be skewed,
# try to remove all 1s from list in this case and
# see if the list contains elements, if yes, then use
# the mode after removing 1s
elements = list(filter(lambda x: x != 1, elements))
if len(elements):
ncols = max(set(elements), key=elements.count)
else:
warnings.warn(
f"No tables found in table area {table_idx + 1}"
)
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
inner_text = []
for i in range(1, len(cols)):
left = cols[i - 1][1]
right = cols[i][0]
inner_text.extend(
[
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
)
cols = [(t.x0, t.x1) for r in rows_grouped if len(r) == ncols for t in r]
cols = self._merge_columns(sorted(cols), column_tol=self.column_tol)
inner_text = []
for i in range(1, len(cols)):
left = cols[i - 1][1]
right = cols[i][0]
inner_text.extend(
[
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > left and t.x1 < right
]
)
outer_text = [
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
]
inner_text.extend(outer_text)
cols = self._add_columns(cols, inner_text, self.row_tol)
cols = self._join_columns(cols, text_x_min, text_x_max)
outer_text = [
t
for direction in self.t_bbox
for t in self.t_bbox[direction]
if t.x0 > cols[-1][1] or t.x1 < cols[0][0]
]
inner_text.extend(outer_text)
cols = self._add_columns(cols, inner_text, self.row_tol)
cols = self._join_columns(cols, text_x_min, text_x_max)

return cols, rows

Expand Down
Loading

0 comments on commit 644e17e

Please sign in to comment.