- PDFs are serialized into an Avro file (a sketch of this step follows the list)
- the Avro file is distributed as a Spark RDD in X partitions
- each partition is collected and stored as a CSV part
- the CSV parts are then merged and compressed
- the archive goes back to the application server, which loads the PostgreSQL table
- 50 million PDFs of 3 pages on average were transformed and dumped to text in 2 hours of runtime
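A minimal sketch of the PDF-to-Avro serialization step (the PdfAvro side), assuming a simple record schema with an `id` string and a `content` bytes field; the actual PdfAvro schema and layout may differ:

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.file.Files

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Hypothetical PDF-to-Avro serializer: one Avro record per PDF, with the file
// name as "id" and the raw PDF bytes as "content".
object PdfToAvroSketch {
  val schema: Schema = SchemaBuilder.record("Pdf").fields()
    .requiredString("id")
    .requiredBytes("content")
    .endRecord()

  def main(args: Array[String]): Unit = {
    val Array(pdfDir, avroOut) = args
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File(avroOut))

    // Pack all the PDFs of the input directory into a single Avro container
    // file, which HDFS handles far better than millions of small files.
    for (pdf <- new File(pdfDir).listFiles() if pdf.getName.endsWith(".pdf")) {
      val record = new GenericData.Record(schema)
      record.put("id", pdf.getName)
      record.put("content", ByteBuffer.wrap(Files.readAllBytes(pdf.toPath)))
      writer.append(record)
    }
    writer.close()
  }
}
```

The resulting Avro file is then pushed to HDFS (e.g. into inputAvroHdfsFolder/) before the Spark job runs.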
- run `make build`
- transform the PDFs to Avro (see the PdfAvro folder)
- push the 2 jars to the Spark compute cluster
- `spark-submit --jars wind-pdf-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar --driver-java-options "-Dlog4j.configuration=file:log4jmaster" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4jslave" --num-executors 120 --executor-cores 1 --master yarn pdfextractor_2.11-0.1.0-SNAPSHOT.jar inputAvroHdfsFolder/ outputCsvHdfsFolder/ 400`
- it is crucial to use only one executor core (`--executor-cores 1`)
- set `ulimit -n 64000` (the default of 1024 is far too low)
- the problem is that HDFS does not handle many small files well
- Avro is a better choice
- hence the idea:
  - put each PDF as bytes into an Avro file
  - load the Avro file as a Spark RDD
  - run PDFBox on the bytes using a ByteArrayInputStream (see the sketch after this list)
  - append the result as an Avro file into HDFS, OR store each partition as a CSV part (the approach used in the pipeline above)
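A minimal sketch of the Spark extraction job itself, assuming the same `id`/`content` schema as above, the spark-avro data source, and the PDFBox 2.x API; the real job's schema and output handling may differ:

```scala
import java.io.ByteArrayInputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

// Hypothetical extraction job: reads PDFs stored as bytes in Avro, runs PDFBox
// on each record, and writes one CSV part per partition into HDFS.
object PdfExtractorSketch {
  def main(args: Array[String]): Unit = {
    val Array(inputAvro, outputCsv, numPartitions) = args

    val spark = SparkSession.builder().appName("pdf-extractor").getOrCreate()

    // Load the Avro records (one PDF as a byte array per record) as an RDD.
    val pdfs = spark.read.format("avro").load(inputAvro)
      .rdd
      .repartition(numPartitions.toInt)

    // Run PDFBox on each record; a ByteArrayInputStream wraps the raw bytes.
    val texts = pdfs.map { row =>
      val id    = row.getAs[String]("id")
      val bytes = row.getAs[Array[Byte]]("content")
      val doc   = PDDocument.load(new ByteArrayInputStream(bytes))
      try {
        val text = new PDFTextStripper().getText(doc)
        // Flatten the extracted text into a single CSV-friendly line.
        s"$id,${text.replaceAll("[\\r\\n,]", " ")}"
      } finally {
        doc.close()
      }
    }

    // Each partition becomes one CSV part file in the output folder.
    texts.saveAsTextFile(outputCsv)
    spark.stop()
  }
}
```

The three arguments match the ones passed on the spark-submit line above (inputAvroHdfsFolder/, outputCsvHdfsFolder/, 400 partitions); the CSV parts are then merged and compressed as described in the overview.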