
A way to distribute PDF-to-text extraction over Spark and PDFBox

EDS-APHP-legacy/SparkPdfExtractor

GOAL

  • PDFs are serialized into Avro
  • the Avro file is distributed as a Spark RDD across X partitions
  • each partition is collected and stored as a CSV part (see the sketch after this list)
  • the CSV parts are then merged and compressed
  • the archive goes back to the application server, which loads the PostgreSQL table
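
The extraction job itself is not reproduced on this page; the following is a minimal sketch of the RDD step described above, assuming PDFBox 2.x, the spark-avro reader, an Avro record carrying a file name and the raw PDF bytes (the field names fileName and content are placeholders, not the actual schema of the PdfAvro tool), and a naive semicolon-separated line per document.

```scala
import java.io.ByteArrayInputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object PdfToText {
  def main(args: Array[String]): Unit = {
    val Array(inputAvro, outputCsv, numPartitions) = args

    val spark = SparkSession.builder().appName("pdf-extractor").getOrCreate()

    // Read the Avro records; "avro" requires the spark-avro module
    // (older Spark 2.x builds use "com.databricks.spark.avro" instead).
    val pdfs = spark.read.format("avro").load(inputAvro)
      .select("fileName", "content")          // assumed field names
      .rdd
      .repartition(numPartitions.toInt)

    // Extract text partition by partition, one PDF at a time per core.
    val texts = pdfs.mapPartitions { rows =>
      rows.map { row =>
        val name  = row.getString(0)
        val bytes = row.getAs[Array[Byte]](1)
        val doc   = PDDocument.load(new ByteArrayInputStream(bytes))
        try {
          val text = new PDFTextStripper().getText(doc)
          // crude CSV line: file name, then text with separators stripped
          s"$name;${text.replaceAll("[\r\n;]", " ")}"
        } finally doc.close()
      }
    }

    texts.saveAsTextFile(outputCsv)
    spark.stop()
  }
}
```

saveAsTextFile writes one part file per partition; those part files are the CSV parts that get merged and compressed in the following steps.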

PERFORMANCE

  • 50 million PDFs, averaging 3 pages each, were converted to text in about 2 hours of runtime

BUILD

  • make build

USE (yarn)

  1. transform the PDFs to Avro (see the PdfAvro folder; a sketch of this packaging step follows the list)
  2. push the two jars to the Spark cluster
  3. spark-submit --jars wind-pdf-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar --driver-java-options "-Dlog4j.configuration=file:log4jmaster" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4jslave" --num-executors 120 --executor-cores 1 --master yarn pdfextractor_2.11-0.1.0-SNAPSHOT.jar inputAvroHdfsFolder/ outputCsvHdfsFolder/ 400
  4. it is crucial to use only one core per executor (--executor-cores 1)
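
Step 1 is performed by the separate PdfAvro tool, which is not shown on this page. Below is a minimal sketch of such a packaging step using the plain Avro Java API, assuming a two-field record (fileName, content); the schema and field names are illustrative placeholders, not the tool's actual format.

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.file.Files

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

object PdfToAvro {
  // Assumed schema: one record per PDF, with its name and raw bytes.
  val schema: Schema = SchemaBuilder.record("Pdf").fields()
    .requiredString("fileName")
    .requiredBytes("content")
    .endRecord()

  def main(args: Array[String]): Unit = {
    val Array(pdfDir, avroOut) = args

    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File(avroOut))

    // Pack every PDF of the input directory into a single Avro container file.
    new File(pdfDir).listFiles().filter(_.getName.endsWith(".pdf")).foreach { f =>
      val record = new GenericData.Record(schema)
      record.put("fileName", f.getName)
      record.put("content", ByteBuffer.wrap(Files.readAllBytes(f.toPath)))
      writer.append(record)
    }
    writer.close()
  }
}
```

The resulting container file is then pushed to HDFS and passed as inputAvroHdfsFolder/ in the spark-submit command above.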

CONFIGURATION

  • ulimit -n 64000 (the default of 1024 is far too low)

READING
