Place pst files in pst-extract/pst/

bin/explode_psts.sh
- runs readpst to convert pst to mbox

bin/normalize_mbox.sh
- converts mbox files to json

bin/run_spark_tika.sh
- runs Tika to extract the text of attachments

bin/run_tika_content_join.sh
- joins attachment text with email json

bin/run_spark_content_split.sh
- removes base64-encoded attachments from the email json and puts that json into a separate directory

bin/run_spark_emailaddr.sh
- email address extraction and community assignment

bin/run_spark_email_community_assign.sh
- assigns communities to email json objects

bin/run_spark_topic_clustering.sh
- assigns topic clusters to the email json objects output by community assign

bin/run_spark_mitie.sh
- runs MITIE to generate entities for each email and adds them to the email json generated by topic clustering

bin/run_spark_es_ingest_emailaddr.sh
- ingests emailaddrs into the ES index

bin/run_spark_es_ingest_attachments.sh
- ingests attachments into the ES index

bin/run_spark_es_ingest_emails.sh
- ingests emails with entities into the ES index

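
Several of the steps above consume the output of an earlier step, so they are presumably meant to run in the order listed. A minimal sketch of a driver that runs them in sequence and stops at the first failure could look like this. The `run_steps` function and the `PIPELINE_STEPS` variable are hypothetical and not part of the repository; the sketch assumes it is run from the directory that contains bin/ and that each script exits non-zero on failure.

```sh
# Pipeline scripts in the order they are listed above.
# Assumption: the listing order is also the required execution order.
PIPELINE_STEPS="
bin/explode_psts.sh
bin/normalize_mbox.sh
bin/run_spark_tika.sh
bin/run_tika_content_join.sh
bin/run_spark_content_split.sh
bin/run_spark_emailaddr.sh
bin/run_spark_email_community_assign.sh
bin/run_spark_topic_clustering.sh
bin/run_spark_mitie.sh
bin/run_spark_es_ingest_emailaddr.sh
bin/run_spark_es_ingest_attachments.sh
bin/run_spark_es_ingest_emails.sh
"

# Hypothetical helper (not part of the repository): run each script
# given as an argument, in order, and stop at the first failure.
run_steps() {
  for step in "$@"; do
    echo "running ${step}"
    if ! "${step}"; then
      echo "step failed: ${step}" >&2
      return 1
    fi
  done
}

# Usage, from the directory that contains bin/. The variable is left
# unquoted on purpose so the shell splits it into one argument per
# script; the paths contain no spaces.
# run_steps $PIPELINE_STEPS
```

Stopping at the first failure keeps a later step from reading missing or partial output from the step before it.
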
** Location Extraction **
Locations extracted from text
bin/build_clavin_index.sh
- sets up the location index (only needs to be run once)

bin/run_location_extract.sh
- extracts locations from the text body; uses the output of the bin/run_spark_content_split.sh task as input

Locations extracted by IP

bin/setup_geo2ip.sh
- sets up the geoip index

bin/run_spark_originating_location.sh
- extracts location from IP address

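
Because the location index only needs to be built once, a wrapper can skip that step on later runs. The sketch below records a successful run with a marker file. The `run_once` function and the marker path are hypothetical and not part of the repository, which may track this differently or not at all.

```sh
# Hypothetical helper: run a setup command only if its marker file is
# absent, and create the marker only after the command succeeds.
run_once() {
  marker="$1"
  shift
  if [ -e "${marker}" ]; then
    echo "skipping, already done: $*"
    return 0
  fi
  "$@" && touch "${marker}"
}

# Usage (the marker path is a placeholder):
# run_once .clavin_index.done bin/build_clavin_index.sh
```

If the setup command fails, no marker is written, so the next invocation tries it again.
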
This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.