- PDFs are serialized into an Avro file (a sketch of this step follows the list)
- the Avro file is distributed as a Spark RDD in X partitions
- each partition is collected and stored as a CSV part
- the CSV parts are then merged and compressed
- the archive goes back to the application server, which loads the PostgreSQL table
- 50 million PDFs of 3 pages on average were transformed and dumped to text in 2 hours of runtime
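A minimal sketch of the PDF-to-Avro serialization step (the PdfAvro side), assuming a simple record schema with an `id` string and a `content` bytes field; the actual PdfAvro schema and layout may differ:

```scala
import java.io.File
import java.nio.ByteBuffer
import java.nio.file.Files

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Hypothetical PDF-to-Avro serializer: one Avro record per PDF, with the file
// name as "id" and the raw PDF bytes as "content".
object PdfToAvroSketch {
  val schema: Schema = SchemaBuilder.record("Pdf").fields()
    .requiredString("id")
    .requiredBytes("content")
    .endRecord()

  def main(args: Array[String]): Unit = {
    val Array(pdfDir, avroOut) = args
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File(avroOut))

    // Pack all the PDFs of the input directory into a single Avro container
    // file, which HDFS handles far better than millions of small files.
    for (pdf <- new File(pdfDir).listFiles() if pdf.getName.endsWith(".pdf")) {
      val record = new GenericData.Record(schema)
      record.put("id", pdf.getName)
      record.put("content", ByteBuffer.wrap(Files.readAllBytes(pdf.toPath)))
      writer.append(record)
    }
    writer.close()
  }
}
```

The resulting Avro file is then pushed to HDFS (e.g. into inputAvroHdfsFolder/) before the Spark job runs.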
- run `make build`
- transform the PDFs to Avro (see the PdfAvro folder)
- push the 2 jars to the Spark compute cluster
- `spark-submit --jars wind-pdf-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar --driver-java-options "-Dlog4j.configuration=file:log4jmaster" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4jslave" --num-executors 120 --executor-cores 1 --master yarn pdfextractor_2.11-0.1.0-SNAPSHOT.jar inputAvroHdfsFolder/ outputCsvHdfsFolder/ 400`
- it is crucial to use only one executor core (`--executor-cores 1`)
- set `ulimit -n 64000` (the default of 1024 is far too low)
- the problem is that HDFS does not handle many small files well
- Avro is a better choice
- hence the idea:
  - put each PDF as bytes into an Avro file
  - load the Avro file as a Spark RDD
  - run PDFBox on the bytes using a ByteArrayInputStream (see the sketch after this list)
  - append the result as an Avro file into HDFS, OR store each partition as a CSV part (the approach used in the pipeline above)
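A minimal sketch of the Spark extraction job itself, assuming the same `id`/`content` schema as above, the spark-avro data source, and the PDFBox 2.x API; the real job's schema and output handling may differ:

```scala
import java.io.ByteArrayInputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

// Hypothetical extraction job: reads PDFs stored as bytes in Avro, runs PDFBox
// on each record, and writes one CSV part per partition into HDFS.
object PdfExtractorSketch {
  def main(args: Array[String]): Unit = {
    val Array(inputAvro, outputCsv, numPartitions) = args

    val spark = SparkSession.builder().appName("pdf-extractor").getOrCreate()

    // Load the Avro records (one PDF as a byte array per record) as an RDD.
    val pdfs = spark.read.format("avro").load(inputAvro)
      .rdd
      .repartition(numPartitions.toInt)

    // Run PDFBox on each record; a ByteArrayInputStream wraps the raw bytes.
    val texts = pdfs.map { row =>
      val id    = row.getAs[String]("id")
      val bytes = row.getAs[Array[Byte]]("content")
      val doc   = PDDocument.load(new ByteArrayInputStream(bytes))
      try {
        val text = new PDFTextStripper().getText(doc)
        // Flatten the extracted text into a single CSV-friendly line.
        s"$id,${text.replaceAll("[\\r\\n,]", " ")}"
      } finally {
        doc.close()
      }
    }

    // Each partition becomes one CSV part file in the output folder.
    texts.saveAsTextFile(outputCsv)
    spark.stop()
  }
}
```

The three arguments match the ones passed on the spark-submit line above (inputAvroHdfsFolder/, outputCsvHdfsFolder/, 400 partitions); the CSV parts are then merged and compressed as described in the overview.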