GROBID Client Python

A simple, efficient Python client for GROBID REST services that provides concurrent processing capabilities for PDF documents, reference strings, and patents.

📋 Table of Contents

Features
Prerequisites
Installation
Quick Start
Usage
- Command Line Interface
- Python Library
Configuration
Services
Testing
Performance
Development
License

✨ Features

Concurrent Processing: Efficiently process multiple documents in parallel
Flexible Input: Process PDF files, text files with references, and XML patents
Configurable: Customizable server settings, timeouts, and processing options
Command Line & Library: Use as a standalone CLI tool or import into your Python projects
Coordinate Extraction: Optional PDF coordinate extraction for precise element positioning
Sentence Segmentation: Layout-aware sentence segmentation capabilities

📋 Prerequisites

Python: 3.8 - 3.13 (tested versions)
GROBID Server: A running GROBID service instance
- Local installation: GROBID Documentation
- Docker: docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.8.2
- Default server: http://localhost:8070
- Online demo: https://lfoppiano-grobid.hf.space (usage limits apply), more details here.

Important

GROBID supports Windows only through Docker containers. See the Docker documentation for details.

🚀 Installation

Choose one of the following installation methods:

PyPI (Recommended)

pip install grobid-client-python

Development Version

pip install git+https://github.com/kermitt2/grobid_client_python.git

Local Development

git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
pip install -e .

⚡ Quick Start

Command Line

# Process PDFs in a directory
grobid_client --input ./pdfs --output ./output processFulltextDocument

# Process with custom server
grobid_client --server https://your-grobid-server.com --input ./pdfs processFulltextDocument

Python Library

from grobid_client.grobid_client import GrobidClient

# Create client instance
client = GrobidClient(config_path="./config.json")

# Process documents
client.process("processFulltextDocument", "/path/to/pdfs", n=10)

📖 Usage

Command Line Interface

The client provides a comprehensive CLI with the following syntax:

grobid_client [OPTIONS] SERVICE

Available Services

Service	Description	Input Format
`processFulltextDocument`	Extract full document structure	PDF files
`processHeaderDocument`	Extract document metadata	PDF files
`processReferences`	Extract bibliographic references	PDF files
`processCitationList`	Parse citation strings	Text files (one citation per line)
`processCitationPatentST36`	Process patent citations	XML ST36 format
`processCitationPatentPDF`	Process patent PDFs	PDF files

Common Options

Option	Description	Default
`--input`	Input directory path	Required
`--output`	Output directory path	Same as input
`--server`	GROBID server URL	`http://localhost:8070`
`--n`	Concurrency level	10
`--config`	Config file path	Optional
`--force`	Overwrite existing files	False
`--verbose`	Enable verbose logging	False

Processing Options

Option	Description
`--generateIDs`	Generate random XML IDs
`--consolidate_header`	Consolidate header metadata
`--consolidate_citations`	Consolidate bibliographic references
`--include_raw_citations`	Include raw citation text
`--include_raw_affiliations`	Include raw affiliation text
`--teiCoordinates`	Add PDF coordinates to XML
`--segmentSentences`	Segment sentences with coordinates
`--flavor`	Processing flavor for fulltext extraction

Examples

# Basic fulltext processing
grobid_client --input ~/documents --output ~/results processFulltextDocument

# High concurrency with coordinates
grobid_client --input ~/pdfs --output ~/tei --n 20 --teiCoordinates processFulltextDocument

# Process citations with custom server
grobid_client --server https://grobid.example.com --input ~/citations.txt processCitationList

# Force reprocessing with sentence segmentation
grobid_client --input ~/docs --force --segmentSentences processFulltextDocument

Python Library

Basic Usage

from grobid_client.grobid_client import GrobidClient

# Initialize with default localhost server
client = GrobidClient()

# Initialize with custom server
client = GrobidClient(grobid_server="https://your-server.com")

# Initialize with config file
client = GrobidClient(config_path="./config.json")

# Process documents
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=20
)

Advanced Usage

# Process with specific options
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=10,
    generateIDs=True,
    consolidate_header=True,
    teiCoordinates=True,
    segmentSentences=True
)

# Process citation lists
client.process(
    service="processCitationList",
    input_path="/path/to/citations.txt",
    output_path="/path/to/output"
)

⚙️ Configuration

Configuration can be provided via a JSON file. When using the CLI, the --server argument overrides the config file settings.

Default Configuration

{
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": ["persName", "figure", "ref", "biblStruct", "formula", "s"]
}

Configuration Parameters

Parameter	Description	Default
`grobid_server`	GROBID server URL	`http://localhost:8070`
`batch_size`	Thread pool size. Tune carefully: a large batch size will result in the data being written less frequently	1000
`sleep_time`	Wait time when server is busy (seconds)	5
`timeout`	Client-side timeout (seconds)	180
`coordinates`	XML elements for coordinate extraction	See above

Tip

Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration is provided.

🔬 Services

Fulltext Document Processing

Extracts complete document structure including headers, body text, figures, tables, and references.

grobid_client --input pdfs/ --output results/ processFulltextDocument

Header Document Processing

Extracts only document metadata (title, authors, abstract, etc.).

grobid_client --input pdfs/ --output headers/ processHeaderDocument

Reference Processing

Extracts and structures bibliographic references from documents.

grobid_client --input pdfs/ --output refs/ processReferences

Citation List Processing

Parses raw citation strings from text files.

grobid_client --input citations.txt --output parsed/ processCitationList

Tip

For citation lists, input should be text files with one citation string per line.

🧪 Testing

The project includes comprehensive unit and integration tests using pytest.

Running Tests

# Install development dependencies
pip install -e .[dev]

# Run all tests
pytest

# Run with coverage
pytest --cov=grobid_client

# Run specific test file
pytest tests/test_client.py

# Run with verbose output
pytest -v

Test Structure

tests/test_client.py - Unit tests for the base API client
tests/test_grobid_client.py - Unit tests for the GROBID client
tests/test_integration.py - Integration tests with real GROBID server
tests/conftest.py - Test configuration and fixtures

Continuous Integration

Tests are automatically run via GitHub Actions on:

Push to main branch
Pull requests
Multiple Python versions (3.8-3.13)

📊 Performance

Benchmark results for processing 136 PDFs (3,443 pages total, ~25 pages per PDF) on Intel Core i7-4790K CPU 4.00GHz:

Concurrency	Runtime (s)	s/PDF	PDF/s
1	209.0	1.54	0.65
2	112.0	0.82	1.21
3	80.4	0.59	1.69
5	62.9	0.46	2.16
8	55.7	0.41	2.44
10	55.3	0.40	2.45

Additional Benchmarks

Header processing: 3.74s for 136 PDFs (36 PDF/s) with n=10
Reference extraction: 26.9s for 136 PDFs (5.1 PDF/s) with n=10
Citation parsing: 4.3s for 3,500 citations (814 citations/s) with n=10

🛠️ Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode with test dependencies
pip install -e .[dev]

# Install pre-commit hooks (optional)
pre-commit install

Creating a New Release

The project uses bump-my-version for version management:

# Install bump-my-version
pip install bump-my-version

# Bump version (patch, minor, or major)
bump-my-version bump patch

# The release will be automatically published to PyPI

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests for new functionality
Run the test suite (pytest)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

Distributed under the Apache 2.0 License. See LICENSE for more information.

👥 Authors & Contact

Main Author: Patrice Lopez (patrice.lopez@science-miner.com)
Maintainer: Luca Foppiano (luca@sciencialab.com)

Name		Name	Last commit message	Last commit date
Latest commit History 146 Commits
.github/workflows		.github/workflows
grobid_client		grobid_client
resources		resources
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Readme.md		Readme.md
config.json		config.json
example.py		example.py
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

License

kermitt2/grobid_client_python

Folders and files

Latest commit

History

Repository files navigation

GROBID Client Python

📋 Table of Contents

✨ Features

📋 Prerequisites

🚀 Installation

PyPI (Recommended)

Development Version

Local Development

⚡ Quick Start

Command Line

Python Library

📖 Usage

Command Line Interface

Available Services

Common Options

Processing Options

Examples

Python Library

Basic Usage

Advanced Usage

⚙️ Configuration

Default Configuration

Configuration Parameters

🔬 Services

Fulltext Document Processing

Header Document Processing

Reference Processing

Citation List Processing

🧪 Testing

Running Tests

Test Structure

Continuous Integration

📊 Performance

Additional Benchmarks

🛠️ Development

Setting Up Development Environment

Creating a New Release

Contributing

📄 License

👥 Authors & Contact

🔗 Links

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Contributors 11

Languages

Packages