groups in slivar

motivation

Groups allow a user to define aliases (labels) for samples so that a single expression can be applied to many groups of samples.

examples

quartet

A simple example: we have 3 families, each with a mom, dad, proband, and unaffected sibling. Given sample ids of s1..s12 that appear in the VCF, we could create an alias file like:

#proband dad mom sibling
s1       s2  s3      s4
s5       s6  s7      s8
s9       s10 s11     s12

where the headers indicate the labels that we can use in a --group-expr. Then a --group-expr might look like:

--group-expr "denovo:mom.alts == 0 && dad.alts == 0 && sibling.alts == 0 \ # all unaffecteds are hom-ref
    && proband.alts == 1 \ # proband is heterozygous
    && mom.AD[1] == 0 && dad.AD[1] == 0 && sibling.AD[1] == 0 \ # make sure no alternate alleles are seen in unaffecteds
    && proband.AB > 0.2 && proband.AB < 0.8 \ # make sure the allele balance is reasonable
    && INFO.gnomad_popmax_af < 1e-3" # variant must be rare in gnomad

This would add an INFO field of denovo=$proband to any variant that matches these criteria. The first column, in this case proband, is used as the entry in the INFO field. Note that these labels are for human readability only; they can be whatever the user chooses. For example, the above header could instead be #affected mom dad unaffected if that makes the expressions more readable.
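As a mental model of how an alias file maps labels to samples, the sketch below parses the quartet file in plain JavaScript. It is not slivar's actual implementation: parseAliasFile is a hypothetical name used only for illustration, and whitespace-separated columns are assumed from the example above. Each row becomes one group in which every label is bound to that row's sample id.

```javascript
// Illustrative only: a hypothetical parser, not slivar's own code.
// Assumes whitespace-separated columns and a header line starting with '#'.
function parseAliasFile(text) {
  var lines = text.split("\n").filter(function (l) { return l.trim() !== ""; });
  var labels = lines[0].replace(/^#/, "").trim().split(/\s+/);
  return lines.slice(1).map(function (line) {
    var ids = line.trim().split(/\s+/);
    var group = {};
    labels.forEach(function (label, i) {
      // Labels ending in "s" hold comma-separated lists of samples
      // (see the multi-generational pedigree example).
      group[label] = /s$/.test(label) ? ids[i].split(",") : ids[i];
    });
    return group;
  });
}

var groups = parseAliasFile(
  "#proband dad mom sibling\n" +
  "s1       s2  s3      s4\n" +
  "s5       s6  s7      s8\n" +
  "s9       s10 s11     s12\n"
);
// groups[1] is { proband: "s5", dad: "s6", mom: "s7", sibling: "s8" }
```

In this model, the expression is applied once per group, and the sample in the first column (here proband) is the one reported in the INFO field.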

somatic variants

For somatic variants, the intuitive labels may be "tumor" and "normal". For 4 patients, each with 3 tumor time-points, a file might look like:

#normal  tumor1   tumor2   tumor3
s1n      s1t1     s1t2     s1t3      
s2n      s2t1     s2t2     s2t3      
s3n      s3t1     s3t2     s3t3      
s4n      s4t1     s4t2     s4t3

Then, to find somatic variants that increase in allele frequency across the tumor time-points, we can specify an expression like:

--group-expr "increasing:normal.alts == 0 && normal.AD[1] == 0 \ # no evidence in normal
       && tumor1.AB > 0 && tumor2.AB > tumor1.AB && tumor3.AB > tumor2.AB"

This will create a new INFO field, increasing, that lists the normal (first column) samples that met the criteria for each variant.
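If the number of time-points grows, the chained comparisons could be moved into a helper function that is loaded with --load. The sketch below is one hypothetical way to write it: strictly_increasing_ab is not part of slivar, and it assumes each sample object exposes the same AB attribute used in the expression above.

```javascript
// Hypothetical helper (not provided by slivar): returns true when the allele
// balance starts above zero and rises strictly from each sample to the next.
// An empty array returns true, so callers should pass at least one sample.
function strictly_increasing_ab(samples) {
    var previous = 0
    for (var i = 0; i < samples.length; i++) {
      if (!(samples[i].AB > previous)) {
          return false
      }
      previous = samples[i].AB
    }
    return true
}
```

Assuming the expression accepts an array literal, the tumor comparisons could then be written as strictly_increasing_ab([tumor1, tumor2, tumor3]).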

multi-generational pedigree

For pedigrees with 3 generations, we may want to find de novos in the F1 that are transmitted to the F2. The --alias file for this might look like:

#f1 spouse gma gpa kids
s1  s2     s3  s4  s5,s6,s7
s8  s9     s10 s11 s12,s13
s14 s15    s16 s17 s18,s19,s20,s21,s22,s23,s24

Note that any column whose name ends with s will be available in the expression as a JavaScript array, and multiple samples can be specified, separated by commas. In this case, there are multiple kids and each family has a different number of kids. An expression for this might look like:

--group-expr "transmitted:f1.alts == 1 && f1.AB > 0.2 && f1.AB < 0.8 \
    && gma.alts == 0 && gpa.alts == 0 \ # must be absent in the parents of f1 to be denovo
    && spouse.alts == 0 \ # make sure the variant did not come from spouse.
    && proportion_kids_with_alt(kids) > 0.25"

So, here, we have specified a de novo in f1 that must appear in more than 25% of its offspring. We must provide the implementation of the proportion_kids_with_alt function in a file and pass it via --load. An implementation might look like:

function proportion_kids_with_alt(kids) {
    // count kids that are heterozygous with a reasonable allele balance
    var n = 0
    for(var i=0; i < kids.length; i++){
      if(kids[i].alts == 1 && kids[i].AB > 0.2 && kids[i].AB < 0.8){
          n += 1
      }
    }
    return n / kids.length
}

This function can be as simple or complex as needed. Note that it does not assume anything about the length of the kids array, which is a requirement for this example. This function would be placed in a file and passed to slivar with the --load argument.
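Because the function is plain JavaScript, it can be sanity-checked outside slivar (for example with Node.js) before it is passed via --load. The sketch below repeats the function with the intended allele-balance bounds (0.2 < AB < 0.8) so that it runs on its own, then calls it with mock objects. The mock objects carry only the alts and AB fields; they are an assumption for illustration, not slivar's real sample objects.

```javascript
// Same logic as proportion_kids_with_alt, with the bounds 0.2 < AB < 0.8.
function proportion_kids_with_alt(kids) {
    var n = 0
    for (var i = 0; i < kids.length; i++) {
      if (kids[i].alts == 1 && kids[i].AB > 0.2 && kids[i].AB < 0.8) {
          n += 1
      }
    }
    return n / kids.length
}

// Mock samples: only the fields the function reads (an assumption for testing).
var kids = [
    {alts: 1, AB: 0.5},   // heterozygous, balanced: counted
    {alts: 0, AB: 0.0},   // hom-ref: not counted
    {alts: 1, AB: 0.1},   // het call with poor allele balance: not counted
    {alts: 1, AB: 0.45}   // heterozygous, balanced: counted
]
console.log(proportion_kids_with_alt(kids) > 0.25)  // prints true (2 of 4 kids)
```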
