Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per position allele count reporting utility ( like bam-readcount) #297

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jpdna
Copy link
Member

@jpdna jpdna commented Feb 20, 2018

Looking for general comments about utility and approach - not ready for final review yet.

This PR adds functionality to produce a per-position report of counts of reads reporting an A/C/G/T/N or ins/del at each genomic position covered by at least one read in an input alignmentRecordRDD

The functionality is intended to mirror that of the bam-readcount program https://github.com/genome/bam-readcount which is useful to get such per-position stats for use in downstream applications such as testing new variant calling models or evaluating background sequencing noise at a given position.

The approach of this implementation is to perform a alignmentRecordRDD.mapPartitions, in which each partition builds in memory a hash of type mutable.Map[ReferencePosition, AlleleCounts] which is then returned as RDD[ReferencePosition, AlleleCounts], then combined across partitions by reduceByKey adding counts at positions, to resolve partition overlapping reads. This per-partition approach seemed efficient given the size of the hash is limited to the fraction of the genome in a partition ( which can be small if genomic pos sorted), and avoids a flatmap that would produce a large RDD with every base of every read as elements.

The CIGAR/MD reading algorithm was adapted from existing DiscoverVariants, but the context of reporting every covered position and use of the mapPartitions is sufficiently different that I think a new function/object makes sense rather than adding options to DiscoverVariants

  • Before finishing up this PR I wanted to get feedback on:
  1. Just checking that this user facing per-position summary function doesn't already exist in Avocado or elsewhere in ADAM?
  2. Is a command line utility in Avocado a reasonable place for this functionality to live?

Todo:

  1. further validate, compare against Bam-readcount output, add tests.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@fnothaft
Copy link
Member

+1! This looks like a great start. I'll make more of a pass over this later this week. Thanks for the first cut @jpdna!

@fnothaft
Copy link
Member

Jenkins, add to whitelist.

@coveralls
Copy link

coveralls commented Feb 20, 2018

Coverage Status

Coverage decreased (-3.2%) to 76.705% when pulling d6137af on jpdna:posAlleleStats into e0979dd on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/avocado-prb/232/

Build result: FAILURE

GitHub pull request #297 of commit d6137af automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-staging-worker-02 (ubuntu staging-02 staging) in workspace /home/jenkins/workspace/avocado-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/avocado.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/avocado.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/avocado.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/297/merge^{commit} # timeout=10 > git branch -a -v --no-abbrev --contains 19dfbc2 # timeout=10Checking out Revision 19dfbc2 (origin/pr/297/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 19dfbc267513fdb14751f43d9775194719e3c13fFirst time build. Skipping changelog.Triggering avocado-prb » 2.6.0,2.11,2.0.0,centosTriggering avocado-prb » 2.3.0,2.10,2.0.0,centosTriggering avocado-prb » 2.3.0,2.11,2.0.0,centosTriggering avocado-prb » 2.6.0,2.10,2.0.0,centosavocado-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILUREavocado-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILUREavocado-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILUREavocado-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
Test FAILed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants