Skip to content

Commit

Permalink
docs: reflect supported Python versions, add badges (#10)
Browse files Browse the repository at this point in the history
* docs: reflect supported Python versions, add badges

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* minor HTML fix

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
  • Loading branch information
vagenas authored Jul 17, 2024
1 parent 0dfa454 commit 2baa35c
Showing 1 changed file with 30 additions and 25 deletions.
55 changes: 30 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,18 @@
<p align="center">
<a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
<a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
</p>

# Docling

[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

## Features
Expand All @@ -12,25 +21,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
* 📝 Extracts metadata from the document, such as title, authors, references and language
* 🔍 Optionally applies OCR (use with scanned PDFs)

## Setup
## Installation

For general usage, you can simply install `docling` through `pip` from the pypi package index.
```
To use Docling, simply install `docling` from your package manager, e.g. pip:
```bash
pip install docling
```

**Notes**:
* Works on macOS and Linux environments. Windows platforms are currently not tested.
> [!NOTE]
> Works on macOS and Linux environments. Windows platforms are currently not tested.
### Development setup

To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).

Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:

To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
```bash
poetry env use $(which python3.11)
poetry shell
poetry install
```

Expand All @@ -45,23 +49,24 @@ The output of the above command will be written to `./scratch`.

### Enable or disable pipeline features

You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
```python
doc_converter = DocumentConverter(
artifacts_path=artifacts_path,
pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
pipeline_options=PipelineOptions(
do_table_structure=False, # controls if table structure is recovered
do_ocr=True, # controls if OCR is applied (ignores programmatic content)
),
)
```

### Impose limits on the document size

You can limit the file size and number of pages which should be allowed to process per document.
You can limit the file size and number of pages which should be allowed to process per document:
```python
paths = [Path("./test/data/2206.01062.pdf")]

input = DocumentConversionInput.from_paths(
paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
conv_input = DocumentConversionInput.from_paths(
paths=[Path("./test/data/2206.01062.pdf")],
limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)
```

Expand All @@ -71,12 +76,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
```python
buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)
conv_input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(conv_input)
```
### Limit resource usage

You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.


## Contributing
Expand All @@ -86,7 +91,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main

## References

If you use `Docling` in your projects, please consider citing the following:
If you use Docling in your projects, please consider citing the following:

```bib
@software{Docling,
Expand All @@ -101,5 +106,5 @@ year = {2024}

## License

The `Docling` codebase is under MIT license.
The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.

0 comments on commit 2baa35c

Please sign in to comment.