GitHub - samayer12/Interleave: Explore github CI through PDF processing shenanigans.

Messing around with GitHub CI and PDF processing.

This program takes two PDF files, matches their paragraphs by number (the paragraphs SHOULD be numbered), and stores the matched pairs in a .csv file.

Example usage

python interleave.py file1.pdf file2.pdf output.csv

Example output

Document1,Document2
1. First Entry.,1. First Entry
2. Second Entry.,2. Second Entry
3. Third Entry,3. Third Entry
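
The pairing step can be sketched roughly as follows. This is a minimal illustration, not the code in interleave.py: it assumes the text has already been extracted from both PDFs, that each paragraph starts a line with "<number>. ", and that a paragraph missing from one document is left as an empty cell. The names split_numbered_paragraphs, pair_paragraphs and write_pairs are hypothetical.

```python
import csv
import io
import re

# Assumption: a paragraph begins at the start of a line with "<number>. ".
PARAGRAPH_START = re.compile(r"^(\d+)\.\s", re.MULTILINE)


def split_numbered_paragraphs(text):
    """Map each paragraph number to its text (number included)."""
    matches = list(PARAGRAPH_START.finditer(text))
    paragraphs = {}
    for i, match in enumerate(matches):
        # A paragraph runs until the next numbered paragraph or end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line breaks left over from text extraction.
        paragraphs[int(match.group(1))] = " ".join(text[match.start():end].split())
    return paragraphs


def pair_paragraphs(text1, text2):
    """Pair paragraphs that share a number; a missing side becomes ""."""
    first = split_numbered_paragraphs(text1)
    second = split_numbered_paragraphs(text2)
    numbers = sorted(first.keys() | second.keys())
    return [(first.get(n, ""), second.get(n, "")) for n in numbers]


def write_pairs(pairs, handle):
    """Write the pairs as CSV with the header shown in the example output."""
    writer = csv.writer(handle)
    writer.writerow(["Document1", "Document2"])
    writer.writerows(pairs)


# Small demonstration on in-memory text instead of real PDFs.
doc1 = "1. First Entry.\n2. Second\nEntry.\n3. Third Entry\n"
doc2 = "1. First Entry\n2. Second Entry\n"
pairs = pair_paragraphs(doc1, doc2)
buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

Keying on the paragraph number rather than on position means a paragraph that is missing from one document shows up as a blank cell instead of shifting every later pair.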

Current Status

102 of the 1088 paragraphs in the test data have an anomaly. A complete list of observed errors is in Errors.csv. Error types:

Here's a tabular representation of the anomalies.

	1/2	3/4*	5/6	7/8	Total
EJ	17	6	8	17	48
EPA	15	3	21	15	54
Total	32	9	29	32	102
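
As a quick consistency check, the counts in the table can be summed and compared with the 102/1088 figure. The sketch below copies the numbers from the table; the column labels are treated as opaque keys, since their meaning is not stated here, and the variable names are illustrative only.

```python
# Anomaly counts copied from the table, keyed by row and column label.
anomalies = {
    "EJ": {"1/2": 17, "3/4*": 6, "5/6": 8, "7/8": 17},
    "EPA": {"1/2": 15, "3/4*": 3, "5/6": 21, "7/8": 15},
}
TOTAL_PARAGRAPHS = 1088

# Sum across each row and down each column.
row_totals = {doc: sum(counts.values()) for doc, counts in anomalies.items()}
column_totals = {
    column: sum(counts[column] for counts in anomalies.values())
    for column in anomalies["EJ"]
}
grand_total = sum(row_totals.values())

# Share of test paragraphs that show an anomaly.
anomaly_rate = grand_total / TOTAL_PARAGRAPHS

print(row_totals)
print(column_totals)
print(f"{grand_total}/{TOTAL_PARAGRAPHS} = {anomaly_rate:.1%}")
```

The row and column sums agree with the Total row and column, and the overall rate works out to a little under one paragraph in ten.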
