refactor: pdf extractor #18

Merged: 18 commits, Jun 16, 2025
Commits
65f0a52
feat: Update langfuse dependency to version 3.0.0 and adjust related …
a-klos Jun 10, 2025
5a7b69f
Add comprehensive tests for PDFExtractor functionality
a-klos Jun 11, 2025
a1b3701
feat: Update dependencies and modify PDF extractor import
a-klos Jun 11, 2025
e6042ec
Merge branch 'main' into fix/orophaned-threads-issue
a-klos Jun 11, 2025
1a9d814
feat: add pytest-asyncio support for asynchronous testing
a-klos Jun 11, 2025
19d545b
feat: update langfuse dependency to version 3.0.0 and adjust related …
a-klos Jun 11, 2025
8ac26db
Refactor PDF extractor tests: remove old test files and implement com…
a-klos Jun 12, 2025
ef51597
refactor: Moved tests from test_pdf_extractor.py to pdf_extractor_tes…
a-klos Jun 12, 2025
5442463
refactor: update flake8 exclusions and clean up PDFExtractor tests fo…
a-klos Jun 12, 2025
8da09dd
chore: add pdf files using git lfs
a-klos Jun 13, 2025
4b08f1b
refactor: update parameter names in PDFExtractor class for clarity an…
a-klos Jun 13, 2025
6b55ef7
Merge branch 'main' into refactor/pdf-extractor
a-klos Jun 13, 2025
1d9d71d
chore: remove PyTorch and related dependencies from pyproject.toml
a-klos Jun 13, 2025
8a19347
refactor: remove unused text-based PDF document from test data
a-klos Jun 13, 2025
7cbe521
chore: add sample PDF document for testing in extractor-api-lib
a-klos Jun 13, 2025
5b63ab2
refactor: remove unused test methods and main execution block from pd…
a-klos Jun 13, 2025
d966a0f
chore: add pytest-asyncio as a development dependency
a-klos Jun 13, 2025
7d3fa64
Remove unused dependencies: tabula and easyocr from pyproject.toml
a-klos Jun 13, 2025
1 change: 1 addition & 0 deletions extractor-api-lib/.gitignore
@@ -4,6 +4,7 @@
__pycache__/
*.py[cod]
*$py.class
+**/.DS_Store

# C extensions
*.so
2,064 changes: 1,059 additions & 1,005 deletions extractor-api-lib/poetry.lock

Large diffs are not rendered by default.

9 changes: 8 additions & 1 deletion extractor-api-lib/pyproject.toml
@@ -9,8 +9,13 @@ description = "Extracts the content of documents, websites, etc and maps it to a
authors = ["STACKIT Data and AI Consulting <data-ai-consulting@stackit.cloud>"]
packages = [{ include = "extractor_api_lib", from = "src" }]

+[[tool.poetry.source]]
+name = "pytorch_cpu"
+url = "https://download.pytorch.org/whl/cpu"
+priority = "explicit"

[tool.flake8]
-exclude = [".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py"]
+exclude = [".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg", ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build", "buck-out", "build", "dist", "**/__init__.py", "tests/test_data/generate_test_pdfs.py"]
statistics = true
show-source = false
max-complexity = 10
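
The two `exclude` lines in this hunk are long enough that the change is hard to spot by eye. As a reading aid only (not part of the PR), a minimal Python sketch that compares the two lists copied from the hunk; the variable names are illustrative:

```python
# Old and new flake8 `exclude` lists, copied verbatim from the hunk above.
old_exclude = [
    ".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg",
    ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build",
    "buck-out", "build", "dist", "**/__init__.py",
]
new_exclude = [
    ".eggs", "./src/extractor_api_lib/models/*", ".git", ".hg",
    ".mypy_cache", ".tox", ".venv", ".devcontainer", "venv", "_build",
    "buck-out", "build", "dist", "**/__init__.py",
    "tests/test_data/generate_test_pdfs.py",
]

# Order-preserving difference in both directions.
added = [pattern for pattern in new_exclude if pattern not in old_exclude]
removed = [pattern for pattern in old_exclude if pattern not in new_exclude]

print("added:", added)
print("removed:", removed)
```

The only entry added is `tests/test_data/generate_test_pdfs.py`, and nothing is removed.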
@@ -93,10 +98,12 @@ langchain-community = "^0.3.23"
atlassian-python-api = "^4.0.3"
markdownify = "^1.1.0"
langchain-core = "0.3.63"
+camelot-py = {extras = ["cv"], version = "^1.0.0"}
fake-useragent = "^2.2.0"

[tool.poetry.group.dev.dependencies]
pytest = "^8.3.5"
+pytest-asyncio = "^0.26.0"
coverage = "^7.8.0"
flake8 = "^7.2.0"
flake8-black = "^0.3.6"
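
For context on the `pytest-asyncio` dev dependency: it lets pytest run `async def` test functions. The sketch below shows the general shape of such a test. `extract_text` is a hypothetical stand-in, not the PR's `PDFExtractor` API, and the explicit `@pytest.mark.asyncio` marker assumes the plugin's default strict mode.

```python
import asyncio

import pytest


async def extract_text(path: str) -> str:
    """Hypothetical async extraction call; the real PDFExtractor API may differ."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"text from {path}"


@pytest.mark.asyncio  # tells pytest-asyncio to run this coroutine as a test
async def test_extract_text_returns_content():
    result = await extract_text("sample.pdf")
    assert result == "text from sample.pdf"
```

Without the plugin installed, pytest does not await the coroutine and the test does not actually run, which is presumably why the dependency is added alongside the new tests.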