
Troubleshooting Spark

Tom White edited this page Nov 3, 2017 · 1 revision

Illustrative commands

The following commands show some typical settings for running Spark pipelines.

See Spark evaluation scripts for more details about running Spark pipelines, and Spark Evaluation Results for performance numbers.

Reads pipeline on exome data running on a 10-node (n1-standard-16) GCS cluster

./gatk-launch ReadsPipelineSpark \
    -I hdfs:///user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam \
    -O hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/out/NA12878.ga2.exome.maq.raw.vcf \
    -R hdfs:///user/tom/exome_spark_eval/Homo_sapiens_assembly18.2bit \
    --knownSites hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/dbsnp_138.hg18.vcf \
    -pairHMM AVX_LOGLESS_CACHING \
    -maxReadsPerAlignmentStart 10 \
    -apiKey /home/tom/.gcs/broad-gatk-collab-0853abf3a8f1.json \
    -- \
    --sparkRunner GCS --cluster tw-cluster-2 \
    --num-executors 20 --executor-cores 7 --executor-memory 28g \
    --driver-memory 4g \
    --conf spark.dynamicAllocation.enabled=false

Reads pipeline on WGS data running on a 20-node (n1-standard-16) GCS cluster

./gatk-launch ReadsPipelineSpark \
    -I hdfs:///user/tom/q4_spark_eval/WGS-G94982-NA12878-no-NC_007605.bam \
    -O hdfs://tw-cluster-2-m:8020/user/tom/q4_spark_eval/out/WGS-G94982-NA12878.vcf \
    -R hdfs:///user/tom/q4_spark_eval/human_g1k_v37.2bit \
    --knownSites hdfs://tw-cluster-2-m:8020/user/tom/q4_spark_eval/dbsnp_138.b37.vcf \
    -pairHMM AVX_LOGLESS_CACHING \
    -maxReadsPerAlignmentStart 10 \
    -apiKey /home/tom/.gcs/broad-gatk-collab-0853abf3a8f1.json \
    -- \
    --sparkRunner GCS --cluster tw-cluster-2 \
    --num-executors 20 --executor-cores 8 --executor-memory 46g \
    --driver-memory 8g \
    --conf spark.dynamicAllocation.enabled=false

Common Errors

AbstractMethodError or NoSuchMethodError

If the job fails quickly with a java.lang.AbstractMethodError or java.lang.NoSuchMethodError, you are probably running Spark 1.6 rather than Spark 2. On a CDH cluster you need to request Spark 2 explicitly by adding --sparkSubmitCommand spark2-submit to the Spark-specific arguments to gatk-launch (the ones after --).
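As a rough sketch of where the flag goes, the snippet below assembles a command with --sparkSubmitCommand spark2-submit placed after the -- separator and prints it for review. Only that flag comes from the text above; the namenode:8020 host, the file paths, and the --sparkRunner SPARK --sparkMaster yarn settings are placeholders assumed for illustration and will differ on your cluster.

```shell
# Spark-specific arguments: everything here must come after the "--" separator.
# --sparkRunner/--sparkMaster values are assumptions for a YARN-based CDH cluster.
SPARK_ARGS="--sparkRunner SPARK --sparkMaster yarn --sparkSubmitCommand spark2-submit"

# Tool arguments. All hosts and paths below are hypothetical placeholders.
GATK_ARGS="ReadsPipelineSpark"
GATK_ARGS="$GATK_ARGS -I hdfs://namenode:8020/path/to/input.bam"
GATK_ARGS="$GATK_ARGS -O hdfs://namenode:8020/path/to/output.vcf"
GATK_ARGS="$GATK_ARGS -R hdfs://namenode:8020/path/to/reference.2bit"
GATK_ARGS="$GATK_ARGS --knownSites hdfs://namenode:8020/path/to/known_sites.vcf"

# Assemble the full command line and print it instead of running it.
CMD="./gatk-launch $GATK_ARGS -- $SPARK_ARGS"
echo "$CMD"
```

Once the placeholders are replaced with real paths, the printed line can be pasted into a shell on the cluster gateway.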

OutOfMemoryError

Spark in general is sensitive to memory settings, and they will need tuning for any non-trivial job. The settings in the "Illustrative commands" section above should provide a good starting point.

  • Driver. Usually 4g or 8g for --driver-memory will suffice.
  • Executors. In general, prefer a smaller number of larger executors over a larger number of smaller executors, since this allows more complex GATK tools like BQSR to share resources (e.g. known sites) in the same JVM. For ReadsPipelineSpark on exome-sized data, it is better to use one executor with 7 cores and 28g of memory than 7 executors each with one core and 4g of memory. The total number of executors should be determined by the cluster size: in the exome example above, 20 executors (each with 7 cores and 28g of memory) were chosen since that is the maximum that fits in the cluster. Alternatively, you might consider setting spark.dynamicAllocation.enabled to true to have Spark scale up the number of executors.
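The "maximum that fits" reasoning can be approximated with a little arithmetic. The function below is a minimal sketch, not part of GATK: it takes the smaller of the core-limited and memory-limited executor counts per node and multiplies by the node count. It assumes 16 vCPUs and 60 GB per n1-standard-16 node and ignores YARN memory overhead and resources reserved for the OS, so treat the result as an upper bound.

```shell
# max_executors NODES NODE_CORES NODE_MEM_GB EXEC_CORES EXEC_MEM_GB
# Hypothetical helper: upper bound on executors, ignoring YARN/OS overhead.
max_executors() {
    nodes=$1; node_cores=$2; node_mem_gb=$3; exec_cores=$4; exec_mem_gb=$5
    by_cores=$(( node_cores / exec_cores ))   # executors per node, limited by cores
    by_mem=$(( node_mem_gb / exec_mem_gb ))   # executors per node, limited by memory
    if [ "$by_cores" -lt "$by_mem" ]; then
        per_node=$by_cores
    else
        per_node=$by_mem
    fi
    echo $(( nodes * per_node ))
}

# Node size (16 vCPUs, 60 GB) is an assumption for n1-standard-16.
max_executors 10 16 60 7 28   # exome settings: prints 20
max_executors 20 16 60 8 46   # WGS settings: prints 20
```

Both results match the --num-executors values in the illustrative commands: the exome run fits two executors per node, while the WGS run is memory-bound at one executor per node.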

Couldn't write file ... because writing failed with exception ... Unable to find _SUCCESS file

Make sure that all file paths on HDFS are absolute, for example hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam. In some cases the hostname and port must also be specified, so if in doubt include them in the path.
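A quick pre-flight check can catch paths that lack a hostname and port before the job is submitted. The helper below is a hypothetical sketch (the function name is made up for this page) that only inspects the string; it does not contact HDFS or verify that the file exists.

```shell
# has_host_and_port PATH
# Hypothetical helper: succeeds if PATH looks like hdfs://host:port/...
has_host_and_port() {
    rest=${1#hdfs://}
    [ "$rest" != "$1" ] || return 1      # not an hdfs:// URI at all
    authority=${rest%%/*}                # text between "hdfs://" and the next "/"
    case "$authority" in
        *:*) ;;                          # contains a host:port separator
        *) return 1 ;;
    esac
    host=${authority%:*}
    port=${authority##*:}
    [ -n "$host" ] || return 1           # empty hostname
    case "$port" in
        ''|*[!0-9]*) return 1 ;;         # port missing or not numeric
    esac
    return 0
}

# Check the two path styles used in the exome command above.
for p in \
    "hdfs:///user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam" \
    "hdfs://tw-cluster-2-m:8020/user/tom/exome_spark_eval/NA12878.ga2.exome.maq.raw.bam"
do
    if has_host_and_port "$p"; then
        echo "ok: $p"
    else
        echo "no host:port: $p"
    fi
done
```

The first path is reported as lacking a host and port; the second passes.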