
See original announcements on:

For more information, see gallia-core documentation, in particular:

Description

This is the Spark RDD-powered counterpart to the genemania parent repo, which uses Gallia's "poor man's scaling" instead of Spark.

Test Run

You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli and don't mind the cost (see below).

The script does the following:

  • Creates an S3 bucket for the code and data
  • Retrieves code and uploads it to the bucket (source+binaries)
  • Retrieves the data (or a subset thereof) and uploads it to the bucket
  • Creates an EMR Spark cluster and runs the program as a single step
  • Waits for the cluster to terminate and logs the results
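
The steps above could be sketched as the following dry run, which only prints AWS CLI commands instead of executing them, so nothing is billed. This is not the actual content of testrun.sh: the function names, bucket name, local paths, jar name, EMR release label, instance type, and the "workers + 1" instance count (one extra node for the master) are all placeholders assumed for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run sketch of the listed steps; NOT the real testrun.sh.
# Bucket, paths, jar name, release label and instance type are assumed placeholders.

# Print a command instead of running it.
run() { printf '%s\n' "$*"; }

plan_test_run() {
  bucket="$1"    # assumed bucket name
  workers="$2"   # number of worker nodes

  # 1. Create the bucket that holds code and data.
  run aws s3 mb "s3://$bucket"
  # 2. Upload the code (sources and binaries).
  run aws s3 cp ./code "s3://$bucket/code/" --recursive
  # 3. Upload the input data.
  run aws s3 cp ./input "s3://$bucket/input/" --recursive
  # 4. Create a Spark cluster that runs a single step, then terminates.
  #    Assumption: instance count = workers plus one master node.
  run aws emr create-cluster \
    --name testrun \
    --release-label emr-6.15.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count "$((workers + 1))" \
    --use-default-roles \
    --auto-terminate \
    --steps "Type=Spark,Name=run,ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://$bucket/code/app.jar]"
  # 5. Block until the cluster has terminated (the id comes from step 4's output).
  run aws emr wait cluster-terminated --cluster-id '<cluster-id>'
}

plan_test_run my-test-bucket 4
```

Printing the plan first makes it easy to review what would be created, and therefore charged, before running anything for real.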

To run it on a small subset (expect ~$3[2] in AWS charges), use:

./testrun.sh 10 4 # process first 10 files, using 4 workers

To run it in full (expect ~$18[2] in AWS charges), use:

./testrun.sh ALL <number-of-workers> # e.g. 60 workers
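
The two invocations above suggest that the first argument is either a file count or ALL, and that the second is a worker count. A minimal sketch of how such arguments might be validated is shown here; the function names and error messages are hypothetical and are not taken from testrun.sh.

```shell
#!/bin/sh
# Hypothetical argument check; the real testrun.sh may handle this differently.

# Succeeds only if $1 consists of digits and is greater than zero.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

# Validates "<file-count-or-ALL> <workers>" and echoes the parsed values.
check_args() {
  files="$1"
  workers="$2"
  if [ "$files" != "ALL" ] && ! is_positive_int "$files"; then
    echo "first argument must be ALL or a positive file count" >&2
    return 1
  fi
  if ! is_positive_int "$workers"; then
    echo "second argument must be a positive worker count" >&2
    return 1
  fi
  echo "files=$files workers=$workers"
}

check_args 10 4
check_args ALL 60
```

Rejecting bad arguments before any AWS call is made avoids creating billable resources for a run that cannot succeed.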

The full EMR run will take about 120 minutes with 60 workers[1]. As one would expect, it follows the distribution below:

[Figure: distribution]

Input

Same input as the parent repo, except that it is first uploaded to an S3 bucket: s3://<bucket>/input/

Output

Same output as the parent repo, except that it is made available in the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files.
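
Because the output is split across gzip-compressed part files, a consumer would typically download them and decompress them in part order. The sketch below assumes the files have already been copied to a local directory (for instance with `aws s3 cp s3://<bucket>/output/ ./output --recursive`); the function name is hypothetical and the two tiny files are stand-ins, not real output.

```shell
#!/bin/sh
# Hypothetical helper: decompress every part file of a local copy of the
# output directory to standard output, in part order.
merge_parts() {
  dir="$1"
  # The glob expands in sorted order, so part-00000.gz precedes part-00001.gz.
  for f in "$dir"/part-*.gz; do
    [ -e "$f" ] || { echo "no part files in $dir" >&2; return 1; }
    gzip -dc "$f"
  done
}

# Self-contained demonstration with two stand-in part files.
demo="$(mktemp -d)"
printf 'first\n'  | gzip -c > "$demo/part-00000.gz"
printf 'second\n' | gzip -c > "$demo/part-00001.gz"
merge_parts "$demo"
rm -rf "$demo"
```

Relying on the zero-padded part numbers keeps the merged stream in the same order as the parts were written.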

Limitations

Notable limitations are:

  • Only available for Scala 2.12 because:
    • sbt-assembly does not seem to be available for 2.13
    • Spark support for 2.13 is still immature
  • The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment:

See list of spark-related tasks for more limitations.

Footnotes

  • [1] Plus roughly 1 hour to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
  • [2] The cost estimates provided are not guaranteed in any way; run it at your own risk (but please let me know if your costs are significantly different)

Contact

You may contact the author at cros.anthony@gmail.com