Releases: aphp/edspdf

v0.9.1

19 Mar 13:37

Changelog

Fixed

  • It is now possible to recursively retrieve PDF files in a directory using edspdf.data.read_files
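As an illustration of what recursive retrieval means here, the sketch below collects every PDF under a directory tree using only the standard library. It does not call or reproduce edspdf.data.read_files; the helper name `find_pdfs` and its `recursive` flag are hypothetical and exist only for this example.

```python
from pathlib import Path
from typing import List


def find_pdfs(root: str, recursive: bool = True) -> List[Path]:
    """Return the PDF files found under `root`.

    Hypothetical helper for illustration only; not part of edspdf.
    """
    base = Path(root)
    # rglob descends into subdirectories, glob only looks at the top level.
    candidates = base.rglob("*") if recursive else base.glob("*")
    # Match the extension case-insensitively so "scan.PDF" is kept too.
    return [p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf"]
```

With `recursive=False` only the files directly inside `root` are returned, which is the behaviour a non-recursive reader would be limited to.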

Full Changelog: v0.9.0...v0.9.1

v0.9.0

26 Feb 10:42

What's Changed

Added

  • New unified edspdf.data API (PDF files, pandas, parquet) and LazyCollection object
    to efficiently read / write data from / to different formats & sources. This API
    has been heavily inspired by the edsnlp.data API.
  • New unified processing API to select the execution backend via data.set_processing(...)
    to replace the old accelerators API (which is now deprecated, but still available).
  • huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
  • It is now possible to convert a label to multiple labels in the simple-aggregator component:
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
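To make the effect of such a label_map concrete, here is a minimal pure-Python sketch of how lines could be grouped into output fields. It is not the simple-aggregator implementation: the function `aggregate_lines`, the (label, text) line representation, and the sample lines are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

LabelMap = Dict[str, Union[str, List[str]]]


def aggregate_lines(lines: List[Tuple[str, str]], label_map: LabelMap) -> Dict[str, str]:
    """Join line texts into one output field per key of `label_map`.

    `lines` holds (label, text) pairs in reading order.
    Hypothetical helper for illustration; not the edspdf component.
    """
    fields = {}
    for field, sources in label_map.items():
        # A field may be fed by a single label or by a list of labels.
        wanted = {sources} if isinstance(sources, str) else set(sources)
        fields[field] = "\n".join(text for label, text in lines if label in wanted)
    return fields


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
# Made-up sample lines, in reading order.
lines = [
    ("title", "Report"),
    ("header", "Page 1"),
    ("body", "First paragraph."),
]
result = aggregate_lines(lines, label_map)
```

In this sketch the "title" line ends up in both the "text" and "title" fields, while the "header" line is dropped because no field lists that label.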

Fixed

  • huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
  • huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly

Full Changelog: v0.8.1...v0.9.0

v0.8.1

26 Sep 08:42

Changelog

Fixed

  • Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
  • Removed multiprocessing accelerator debug output
  • Fixed absolute links in github-pages docs (e.g. image assets)

Changed

  • Added auto-links to components in the docs (by comparing span contents with entry points)

Full Changelog: v0.8.0...v0.8.1

v0.8.0

07 Sep 16:04

What's changed

Added

  • Add multi-modal transformers (huggingface-embedding) with windowing options
  • Add render_page option to pdfminer extractor, for multi-modal PDF features
  • Add inference utilities (accelerators), with simple single-process support and multi-GPU / multi-CPU support
  • Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline
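The windowing mentioned for huggingface-embedding can be pictured as splitting a long sequence into overlapping chunks that each fit the model's input size. The sketch below shows one common way to compute such chunks; the function name `window_spans` and its `window` / `stride` parameters are assumptions for this example and are not taken from the edspdf source.

```python
from typing import List, Tuple


def window_spans(n_items: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs that together cover range(n_items).

    Consecutive windows start `stride` items apart and hold at most
    `window` items, so they overlap whenever stride < window.
    Hypothetical helper for illustration only.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or items would be skipped")
    if n_items == 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_items)
        spans.append((start, end))
        if end >= n_items:
            break
        start += stride
    return spans
```

For example, `window_spans(10, 4, 2)` yields four windows of four items, each sharing two items with its neighbour; the embeddings of overlapping positions then have to be merged by whatever rule the model wrapper chooses.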

Changed

  • Updated API to follow EDS-NLP's refactoring
  • Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
  • Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
  • Better test coverage
  • Use hatch instead of setuptools to build the package / docs and run the tests

Fixed

  • Fixed attrs dependency only being installed in dev mode

Full Changelog: v0.7.0...v0.8.0

v0.7.0

09 Jun 14:11

What's changed

This public release comes with a major overhaul of the library since v0.5.3.

Core features

  • new pipeline system whose API is inspired by spaCy
  • first-class support for PyTorch
  • hybrid model inference and training (rules + deep learning)
  • moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
  • new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

  • new extractors: pymupdf and poppler (separate packages for licensing reasons)
  • many deep learning layers (box-transformer, 2d attention with relative position information, ...)
  • trainable deep learning classifier
  • training recipes for deep learning models

Full Changelog: v0.5.3...v0.7.0

v0.5.3

31 Aug 10:01

What's Changed

Added

  • Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
  • Improved line aggregation formula

Full Changelog: v0.5.2...v0.5.3

v0.5.2

30 Aug 09:50

What's Changed

  • ci: remove unnecessary poppler dependency by @bdura in #7
  • Fix aggregation for empty documents by @percevalw in #8

Full Changelog: v0.5.1...v0.5.2

v0.5.1

26 Jul 09:00

Changelog

Changed

  • Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)

Full Changelog: v0.5.0...v0.5.1

v0.5.0

25 Jul 17:05

EDS-PDF is a generic, pure-Python facility for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as body or metadata.