PDF to Markdown Conversion Tool

Description

This tool converts PDF documents into Markdown, which makes it especially useful for generating training material for Large Language Models (LLMs). It extracts text together with font information, ranks fonts by size to detect headings accurately, and is built to run efficiently on CPU.
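
To make the idea of font ranking concrete, here is a minimal sketch of how font sizes could be mapped to Markdown heading levels. It is an illustration under stated assumptions, not the tool's actual implementation: the function names `rank_font_sizes` and `to_markdown` are hypothetical, and the input is assumed to be a list of `(text, font_size)` pairs already extracted from the PDF (with PyMuPDF, such pairs can be read from the spans returned by `page.get_text("dict")`). The sketch also assumes the most-used font size is body text and that larger sizes are headings.

```python
from collections import Counter


def rank_font_sizes(spans, max_heading_levels=3):
    """Map each font size to a heading level (0 means body text).

    Assumption: the size that covers the most characters is body text,
    and larger sizes are headings, with the largest ranked as level 1.
    """
    usage = Counter()
    for text, size in spans:
        usage[round(size, 1)] += len(text)
    if not usage:
        return {}

    body_size = usage.most_common(1)[0][0]
    larger = sorted((s for s in usage if s > body_size), reverse=True)

    # Every size starts as body text; only the largest few become headings.
    levels = {size: 0 for size in usage}
    for rank, size in enumerate(larger[:max_heading_levels], start=1):
        levels[size] = rank
    return levels


def to_markdown(spans, max_heading_levels=3):
    """Render (text, font_size) pairs as Markdown using the ranking above."""
    levels = rank_font_sizes(spans, max_heading_levels)
    lines = []
    for text, size in spans:
        level = levels[round(size, 1)]
        prefix = "#" * level + " " if level else ""
        lines.append(prefix + text.strip())
    return "\n\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("Report Title", 24.0),
        ("Background", 16.0),
        ("This paragraph is long enough to be the dominant font size.", 11.0),
    ]
    print(to_markdown(sample))
```

Weighting sizes by character count rather than by span count keeps a document with many short headings from having its heading size mistaken for body text.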

Features

LLM-Focused Extraction: Optimized for LLM training material with attention to complex language structures and technical terminologies.
Efficient CPU Performance: High efficiency on various CPU systems, ensuring wide accessibility and practicality.
Font Ranking System: Sophisticated system for font size-based ranking, crucial for accurate Markdown heading levels.
Batch Processing: Ability to process multiple PDF files simultaneously for increased efficiency.
Advanced Logging: Comprehensive logging system for effective monitoring and troubleshooting.
MD5 Hash Management: Tracks an MD5 hash of each document so that files that were already processed are not converted again.
JSON Data Handling: Capability to write detailed text and font information into JSON files.
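
As an illustration of the MD5-based redundancy check listed above, the following sketch skips PDFs whose content has already been processed. It is a hedged example, not the tool's actual code: the helper names (`file_md5`, `select_unprocessed`, `record_processed`) and the idea of keeping the hashes in a small JSON file are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_seen(store_path):
    """Load the set of hashes recorded so far (empty if the store is missing)."""
    path = Path(store_path)
    return set(json.loads(path.read_text())) if path.exists() else set()


def select_unprocessed(pdf_paths, store_path):
    """Return (path, checksum) pairs for files whose content is not yet recorded."""
    seen = load_seen(store_path)
    pending = []
    for pdf_path in pdf_paths:
        checksum = file_md5(pdf_path)
        if checksum in seen:
            continue
        seen.add(checksum)  # also skips identical files within the same batch
        pending.append((pdf_path, checksum))
    return pending


def record_processed(checksum, store_path):
    """Persist a checksum once its file has been converted successfully."""
    seen = load_seen(store_path)
    seen.add(checksum)
    Path(store_path).write_text(json.dumps(sorted(seen)))
```

A batch run would call `select_unprocessed` on the input files, convert each pending file, and call `record_processed` only after a conversion succeeds, so a failed conversion is retried on the next run. Because the check is based on file content rather than file name, renamed copies of the same PDF are skipped as well.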

Installation

Installation is straightforward and requires minimal dependencies. The tool runs well on CPU and can scale.

Clone the Repository

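Start by cloning the repository to your local machine, then create a virtual environment, install the dependencies, and run the script:

git clone https://github.com/venkycs/convert_pdf_to_md.git
cd convert_pdf_to_md
python -m venv .
source bin/activate
pip install -r requirements.txt
python convert_pdf_to_md.py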

Contributing

Contributions are welcome. Please follow these steps to contribute:

Fork the repository.
Create a new branch: git checkout -b <your-branch-name>.
Make your changes and commit them: git commit -am 'Add some feature'.
Push the branch to your fork: git push origin <your-branch-name>.
Create the pull request.

License

This project is licensed under the MIT License.

Acknowledgments

PyMuPDF for PDF processing capabilities.
Contributors who participate in the development of this tool.

Version History

1.0
- Initial Release
