Adding support for a Sequence Dictionary from BAM files #9

Closed

Commits on Dec 2, 2013

  1. Adding support for sequence dictionary extraction and handling.

    (1) Adding referenceLength and referenceUrl to the ADAMRecord schema.
    
    ADAMRecord now contains a (flattened) version of the sequence dictionary element
    corresponding to each read.
    
    An entire sequence dictionary can therefore be read back out of the records associated
    with a single BAM or SAM file.
    
    (2) Added SequenceDictionary class
    
    Aggregates a consistent set of SequenceRecord objects into a single dictionary.
    
    SequenceRecords are derived from the reference{Id, Name, Length, Url} fields of the
    ADAMRecord.
    
    Most of the work here goes into providing methods for combining and mapping between
    sequence dictionaries, which will be used for cross-BAM or cross-ADAM comparisons.
    
    (3) Extended Bam2Adam to create a SequenceDictionary .dict parquet file
    
    When bam2adam is run on a BAM, we extract the sequence dictionary from the BAM's header
    and insert its values into each ADAMRecord.
    
    Also updated the adamBamLoad method in ADAMContext to extract and insert the dictionary
    values in the same way, even without the explicit parquet file.
    
    (4) Adding the ListDict command.
    
    ListDict reads a SequenceDictionary out of a set of ADAMRecords and prints out all the
    component SequenceRecord objects contained in it.
    
    (5) Adding the CompareAdam command
    
    CompareAdam is a command that takes two ADAM files as arguments.
    
    It assumes that the same read names are present in *both* input ADAM files --
    for example, if the two files were produced by different alignment tools run
    on the same set of reads.
    
    It calculates the number of reads that are unique to each ADAM file, as well as
    the number of shared reads that have the same alignment position in both ADAM
    files.
    
    This position is computed irrespective of the ordering of the Sequences in the
    sequence dictionaries of the two ADAM files, and therefore is a useful end-to-end
    test of the SequenceDictionary code, as well as being the basis for more useful
    pairwise comparisons of ADAM files in the future.
    
    (6) Extending ADAMContext with methods to automate the loading of SequenceDictionary
    objects out of any conformant Avro schema.
    tdanford committed Dec 2, 2013
    06aca04
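
The dictionary mapping in (2) and the order-independent position comparison in (5) can be illustrated with a small sketch. This is not the ADAM code (which is Scala and operates on ADAMRecords); it is a hypothetical Python model in which every name (`SequenceRecord`, `SequenceDictionary`, `map_to`, `compare`) is an assumption made for illustration, sequences are matched across dictionaries by name only, and each read is reduced to a `(reference id, start position)` pair keyed by read name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class SequenceRecord:
    """One reference sequence; mirrors the reference{Id, Name, Length, Url} fields."""
    id: int
    name: str
    length: int
    url: Optional[str] = None


class SequenceDictionary:
    """A consistent set of SequenceRecords, indexed by id and by name (hypothetical model)."""

    def __init__(self, records: Iterable[SequenceRecord] = ()):
        self.by_id: Dict[int, SequenceRecord] = {}
        self.by_name: Dict[str, SequenceRecord] = {}
        for rec in records:
            self.add(rec)

    def add(self, rec: SequenceRecord) -> None:
        # Reject a record that contradicts one already stored under the same id or name.
        for existing in (self.by_id.get(rec.id), self.by_name.get(rec.name)):
            if existing is not None and existing != rec:
                raise ValueError(f"inconsistent record: {rec} vs {existing}")
        self.by_id[rec.id] = rec
        self.by_name[rec.name] = rec

    def map_to(self, other: "SequenceDictionary") -> Dict[int, int]:
        """Map this dictionary's ids onto the other's ids, matching sequences by name."""
        return {
            rec.id: other.by_name[rec.name].id
            for rec in self.by_id.values()
            if rec.name in other.by_name
        }


Alignment = Tuple[int, int]  # (reference id, start position)


def compare(
    reads_a: Dict[str, Alignment],
    dict_a: SequenceDictionary,
    reads_b: Dict[str, Alignment],
    dict_b: SequenceDictionary,
) -> Tuple[int, int, int]:
    """Return (reads unique to A, reads unique to B, shared reads at the same position)."""
    id_map = dict_a.map_to(dict_b)
    shared = reads_a.keys() & reads_b.keys()
    same = 0
    for name in shared:
        ref_a, pos_a = reads_a[name]
        ref_b, pos_b = reads_b[name]
        # Translate A's reference id into B's numbering before comparing.
        if id_map.get(ref_a) == ref_b and pos_a == pos_b:
            same += 1
    return len(reads_a) - len(shared), len(reads_b) - len(shared), same


# The two dictionaries list the same sequences in a different order.
dict_a = SequenceDictionary([SequenceRecord(0, "chr1", 1000), SequenceRecord(1, "chr2", 800)])
dict_b = SequenceDictionary([SequenceRecord(0, "chr2", 800), SequenceRecord(1, "chr1", 1000)])
reads_a = {"r1": (0, 100), "r2": (1, 50), "r3": (0, 7)}
reads_b = {"r1": (1, 100), "r2": (0, 60), "r4": (1, 7)}
print(compare(reads_a, dict_a, reads_b, dict_b))  # (1, 1, 1)
```

Here read `r1` sits on `chr1` at position 100 in both inputs even though `chr1` has id 0 in one dictionary and id 1 in the other, so it counts as a shared read with the same position. A comparison of raw reference ids would have missed it.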

Commits on Dec 3, 2013

  1. Made Carl's suggested comment change; eliminated other uses of ADAMSequenceRecord in comments.
    tdanford committed Dec 3, 2013
    3ee0b55