ECBC2024-5

The pipeline for turning a raw, full-length PDF into OCR-ed text is stored in the folder "PDFImageOCRPipeline". Run the files in it in this order:
1 - "separatePDFintoChunks.py": separates a PDF file into chunks, in case the user's local machine does not have enough memory to process the entire PDF at once.
2 - "ConvertPDFToIndividualImage.py": further separates the PDF chunks into individual JPG images, one JPG per PDF page.
3 - "tesseractOCR.py": iterates through the individual JPG images, performs OCR using Tesseract, and stores the OCR-ed content in a JSON file, one JSON file per full text.
4 - optional: "processOCRBetweenTXTandJSON.py": if needed, converts the JSON into a TXT file.
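
The first three steps above can be sketched roughly as follows. This is an illustrative outline, not the code in "PDFImageOCRPipeline": the function names, the chunk size, the file-naming scheme, and the JSON layout (one entry per page) are all assumptions, and the choice of `pypdf`, `pdf2image` (which needs Poppler installed), and `pytesseract` (which needs the Tesseract binary) is a guess at suitable libraries rather than a statement of what the scripts import.

```python
import json
from pathlib import Path


def chunk_ranges(total_pages, chunk_size):
    """Return (start, end) page-index pairs, end exclusive, covering every page."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_pdf(pdf_path, out_dir, chunk_size=50):
    """Step 1 (sketch): write the PDF out as smaller chunk PDFs."""
    from pypdf import PdfReader, PdfWriter  # assumed library: pip install pypdf

    reader = PdfReader(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: the chunk is named after its first page index.
        chunk_path = out_dir / f"chunk_{start:05d}.pdf"
        with open(chunk_path, "wb") as handle:
            writer.write(handle)
        chunk_paths.append(chunk_path)
    return chunk_paths


def pdf_to_jpgs(chunk_path, image_dir, first_page_number=1, dpi=300):
    """Step 2 (sketch): save one JPG per page of a chunk PDF."""
    from pdf2image import convert_from_path  # assumed library; requires Poppler

    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    image_paths = []
    for offset, image in enumerate(convert_from_path(str(chunk_path), dpi=dpi)):
        # Zero-padded page numbers keep the files in page order when sorted.
        image_path = image_dir / f"page_{first_page_number + offset:05d}.jpg"
        image.save(image_path, "JPEG")
        image_paths.append(image_path)
    return image_paths


def ocr_images(image_paths, json_path):
    """Step 3 (sketch): OCR every JPG and store the text in one JSON file."""
    import pytesseract  # assumed library; requires the Tesseract binary
    from PIL import Image

    pages = {}
    for image_path in sorted(image_paths):
        with Image.open(image_path) as image:
            pages[Path(image_path).stem] = pytesseract.image_to_string(image)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(pages, handle, ensure_ascii=False, indent=2)
    return pages
```

Under these assumptions, a full run would call `split_pdf` once, then `pdf_to_jpgs` on each chunk (passing the chunk's first page number so the image names stay unique), and finally `ocr_images` on the combined list of JPG paths. Splitting first keeps only one chunk's worth of page images in memory at a time.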

The "OCRJSON" folder stores the OCR-ed text of the Records of the Virginia Company, vols. 2-4, in JSON format. The "OCRTXT" folder stores the same content in TXT format.
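
As a rough illustration of how the two formats relate, the sketch below flattens a JSON file into a TXT file. It assumes a JSON layout that the repository does not document here: a single object mapping zero-padded page keys to page text. The function name `json_to_txt` and the blank-line page separator are likewise hypothetical, so the actual files may need a different reader.

```python
import json
from pathlib import Path


def json_to_txt(json_path, txt_path, separator="\n\n"):
    """Join the per-page OCR text from a JSON file into one TXT file."""
    with open(json_path, encoding="utf-8") as handle:
        pages = json.load(handle)
    # Assumed layout: {"page_00001": "text", ...}; zero-padded keys sort in page order.
    text = separator.join(pages[key].strip() for key in sorted(pages))
    Path(txt_path).write_text(text + "\n", encoding="utf-8")
    return text
```

The TXT form is convenient for keyword searches across a whole volume, while the JSON form keeps page boundaries available for anything that needs to cite a page.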

