Release v0.5.0 (2024-04-07) · Filimoa/open-parse

0.5.0 (2024-04-01)

What's Changed

Added semantic processing. This is the recommended processing pipeline.
Added optional annotations to the PDF draw functions.
Fixed a reading order bug.

Breaking Changes

Renaming

Node.aggregate_position renamed to Node.reading_order.
RemoveStubs renamed to RemoveNodesBelowNTokens.
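
Code that has to run against both 0.4.x and 0.5.0 could read the renamed attribute through a small helper. This is only a sketch: `get_reading_order` is a hypothetical helper, not part of openparse, and the `SimpleNamespace` objects holding the integer `3` are placeholders for real nodes.

```python
from types import SimpleNamespace


def get_reading_order(node):
    """Return the node's reading order under either attribute name.

    Hypothetical helper: prefers the 0.5.0 name and falls back to the
    pre-0.5.0 name.
    """
    if hasattr(node, "reading_order"):
        return node.reading_order
    return node.aggregate_position


# Placeholder objects standing in for nodes from the two versions.
old_style = SimpleNamespace(aggregate_position=3)
new_style = SimpleNamespace(reading_order=3)

old_value = get_reading_order(old_style)
new_value = get_reading_order(new_style)
```

Once the upgrade is complete, the helper can be dropped in favor of direct access to `Node.reading_order`.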

Refactored processing pipelines to use a class to promote ease of reuse

Previously

import openparse
from openparse import ProcessingStep, default_pipeline, Node
from typing import List


class CustomCombineTables(ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        return nodes


# copy the default pipeline (or create a new one)
custom_pipeline = default_pipeline.copy()
custom_pipeline.append(CustomCombineTables())

parser = openparse.DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}, processing_pipeline=custom_pipeline
)
# meta10k_path: path to the document to parse (defined elsewhere)
custom_10k = parser.parse(meta10k_path)

Now becomes

import openparse
from openparse import processing, Node
from typing import List


class CustomCombineTables(processing.ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        return nodes


# instantiate the basic ingestion pipeline (or create a new one)
custom_pipeline = processing.BasicIngestionPipeline()
custom_pipeline.append_transform(CustomCombineTables())

parser = openparse.DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}, processing_pipeline=custom_pipeline
)
# meta10k_path: path to the document to parse (defined elsewhere)
custom_10k = parser.parse(meta10k_path)
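
To illustrate why a pipeline class is easier to reuse than a copied list, here is a minimal, self-contained sketch of the pattern. It does not import openparse: `Node`, `ProcessingStep`, `SketchPipeline`, `DropEmptyNodes`, and `UppercaseText` are simplified stand-ins written for this example, and the `run` method and `transformations` attribute are assumptions, not the library's documented internals.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Stand-in for a parsed node; only the text is modelled here."""
    text: str


class ProcessingStep(ABC):
    """Stand-in base class: a step maps a list of nodes to a list of nodes."""

    @abstractmethod
    def process(self, nodes: List[Node]) -> List[Node]:
        ...


class DropEmptyNodes(ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        # Discard nodes whose text is empty or whitespace only.
        return [n for n in nodes if n.text.strip()]


class UppercaseText(ProcessingStep):
    def process(self, nodes: List[Node]) -> List[Node]:
        # Return new nodes so the input list is left untouched.
        return [Node(n.text.upper()) for n in nodes]


class SketchPipeline:
    """Hypothetical pipeline class: each instance owns its own step list."""

    def __init__(self) -> None:
        self.transformations: List[ProcessingStep] = [DropEmptyNodes()]

    def append_transform(self, step: ProcessingStep) -> None:
        self.transformations.append(step)

    def run(self, nodes: List[Node]) -> List[Node]:
        # Apply every step in order, feeding each result into the next step.
        for step in self.transformations:
            nodes = step.process(nodes)
        return nodes


# Customising one instance does not affect a second, fresh instance.
first = SketchPipeline()
first.append_transform(UppercaseText())
second = SketchPipeline()

nodes = [Node("hello"), Node("   "), Node("world")]
first_result = [n.text for n in first.run(nodes)]
second_result = [n.text for n in second.run(nodes)]
```

Because every instance builds its own list of steps, there is no shared default to copy before customising, which is the reuse benefit the refactor is aiming for.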

openai and numpy are now required dependencies; this will likely be split out in the future.

Full Changelog: v0.4.1...v0.5.0
