Configuration
Most users should get started using one of the several configuration "presets" built-in for various gVCF variant callers, specified with glnexus_cli --config XXX. See glnexus_cli --help for the available list. GLnexus displays its effective configuration on the console log and in the output pVCF header.
The configuration presets are hard-coded in cli_utils.cc. The remainder of this page documents how to customize the configuration, if needed.
The configuration can be customized by supplying a YAML file, glnexus_cli --config YYY.yml, with contents like the following (reflecting the DeepVariantWES preset).
unifier_config:
  min_AQ1: 35
  min_AQ2: 20
  min_GQ: 20
  monoallelic_sites_for_lost_alleles: true
genotyper_config:
  required_dp: 0
  revise_genotypes: true
  liftover_fields:
    - orig_names: [MIN_DP, DP]
      name: DP
      description: '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">'
      type: int
      combi_method: min
      number: basic
      count: 1
      ignore_non_variants: true
    - orig_names: [AD]
      name: AD
      description: '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">'
      type: int
      number: alleles
      combi_method: min
      default_type: zero
      count: 0
    - orig_names: [GQ]
      name: GQ
      description: '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">'
      type: int
      number: basic
      combi_method: min
      count: 1
      ignore_non_variants: true
    - orig_names: [PL]
      name: PL
      description: '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype Likelihoods">'
      type: int
      number: genotype
      combi_method: missing
      count: 0
      ignore_non_variants: true
The allele unifier configuration controls allele sensitivity and representation in the output pVCF.
- min_AQ1: threshold for allele inclusion.
  - The Allele Quality (AQ) for an allele A is defined in terms of the genotype likelihoods of a gVCF record as the likelihood ratio max(likelihoods of genotypes including allele A)/max(likelihoods of genotypes not including allele A), Phred-scaled.
  - GLnexus includes an allele in the output pVCF if at least one individual in the cohort shows AQ_i >= min_AQ1.
- min_AQ2: GLnexus also includes an allele in the output pVCF if at least two individuals show AQ_i >= min_AQ2.
  - Thus we may have a lower quality threshold for alleles observed in multiple individuals, compared to singletons.
  - All else equal, increasing the min_AQ thresholds increases specificity and reduces sensitivity, and also speeds up the genotyper (by processing fewer weak sites).
- min_allele_copy_number: only include alleles with at least this many (estimated) copies in the cohort.
- drop_filtered (true/false): if true, exclude alleles that have no observations passing all of the defined VCF FILTERs, even if they pass the other quality thresholds.
- min_GQ: Genotype Quality (GQ) score threshold used in estimating cohort allele copy numbers.
  - Increasing this will bias allele frequency estimates downwards (and conversely decreasing it biases upwards).
  - This affects the output genotypes only insofar as allele frequency estimates factor into revising them. In particular, it is not a hard threshold on output GQ.
  - We suggest setting this equal to min_AQ2.
- monoallelic_sites_for_lost_alleles (true/false): if false, suppress generation of the "monoallelic" sites that capture alleles which don't unify cleanly into non-overlapping multi-allelic sites. See Reading GLnexus pVCFs for explanation.
  - We recommend leaving this on, as the monoallelic sites may provide the only representation of certain alleles, and they're easily recognized by the FILTER field.
- max_alleles_per_site: the maximum number of alleles to include in any one multiallelic site (counts the reference allele).
  - Alleles exceeding this threshold will be "kicked out" into monoallelic sites.
- preference ("common" or "small"): if set to "small", the unifier prefers to merge small alleles (editing the shortest portion of the reference) into multiallelic sites before longer ones, even if the latter are more common in the cohort. This controls the allelic representation and the proportion of alleles and genotypes involved in monoallelic sites.
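
To make the AQ definition and the two inclusion thresholds above concrete, here is a minimal Python sketch. It is not GLnexus code: the function names are hypothetical, it assumes diploid genotypes with PL values in the standard VCF ordering, and it does not clamp negative values, which the actual implementation may handle differently. Because PL values are already Phred-scaled, the likelihood ratio reduces to a difference of minima.

```python
from itertools import combinations_with_replacement


def genotype_index(j, k):
    """Position of diploid genotype j/k in a VCF PL array (VCF ordering)."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def allele_quality(pl, allele, n_alleles):
    """Phred-scaled ratio of the best likelihood among genotypes carrying
    `allele` to the best likelihood among genotypes without it.

    PL values are Phred-scaled (lower means more likely), so the ratio
    becomes a difference of minima. Assumes n_alleles >= 2.
    """
    with_allele, without_allele = [], []
    for j, k in combinations_with_replacement(range(n_alleles), 2):
        value = pl[genotype_index(j, k)]
        if allele in (j, k):
            with_allele.append(value)
        else:
            without_allele.append(value)
    return min(without_allele) - min(with_allele)


def include_allele(aq_per_sample, min_aq1=35, min_aq2=20):
    """True if one sample reaches min_AQ1 or at least two reach min_AQ2."""
    n1 = sum(aq >= min_aq1 for aq in aq_per_sample)
    n2 = sum(aq >= min_aq2 for aq in aq_per_sample)
    return n1 >= 1 or n2 >= 2
```

With the DeepVariantWES thresholds shown earlier (min_AQ1: 35, min_AQ2: 20), a heterozygous call with PL = [40, 0, 60] gives AQ = 40 for the alternate allele, which is enough to include it on its own. Two samples with AQ 25 and 22 would also include the allele, whereas a single sample with AQ 25 would not.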
The genotyper configuration controls genotype revision and many details of calculating the output QC values.
- revise_genotypes (true/false): enables frequency-based genotype revision
- min_assumed_allele_frequency (float, default 0.0001): minimum assumed frequency of any allele to use in the revision calculations.
  - Ensures consistent sensitivity once the cohort is large enough to distinguish common and rare variants.
  - Increasing this tends to make the revision less aggressive.
- required_dp: any called allele will be revised to non-called if supported by fewer than this many reads (per AD or analogous field)
- allele_dp_format (default "AD"): the gVCF FORMAT field from which to source the allele-specific read depths.
  - Changing this usually requires special-case code to read some variant caller's unique way of recording this information.
- ref_dp_format (default "MIN_DP"): the gVCF FORMAT field from which to source read depth in reference bands.
- allow_partial_data (true/false): if true, present pVCF genotypes even if the gVCF records only partially cover the output pVCF site (these would be non-called by default)
- squeeze (true/false): if true, suppress usually-unnecessary QC detail from the output to reduce its size.
  - In entries indicating zero non-reference reads (AD=*,0), report only GT and DP, rounding DP down to a power of two; leave all other FORMAT fields missing.
  - Speeds up the genotyper since it does not spend time calculating these QC values.
  - Can be used alone or in a pipeline feeding spVCF.
- more_PL (true/false): if true, include PL values from reference bands and other cases omitted by default; also populate uninformative PL entries with 0 or 990 instead of missing values.
  - This extra detail can be useful for downstream tools requiring 100% of PL values to be populated.
  - However, it inflates the output and slows down its generation for a marginal gain in useful information.
- liftover_fields (list): a list of YAML objects specifying each FORMAT QC field, its entry in the header and how to calculate it from the input gVCF fields.
  - TODO: detailed documentation here
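
The effects of required_dp and of the DP rounding performed by squeeze, both described above, can be illustrated with a short Python sketch. This is an illustration of the documented behavior under stated assumptions, not the GLnexus implementation: the function names are hypothetical, a non-called allele is represented as None, and a depth of zero is assumed to stay zero.

```python
def squeeze_dp(dp):
    """Round a read depth down to the nearest power of two (0 stays 0)."""
    if dp <= 0:
        return 0
    return 1 << (dp.bit_length() - 1)


def apply_required_dp(genotype, allele_depths, required_dp):
    """Replace called alleles supported by fewer than required_dp reads with None.

    genotype: list of allele indices (None for an already non-called allele).
    allele_depths: per-allele read depths, e.g. from the AD field.
    """
    return [
        allele
        if allele is not None and allele_depths[allele] >= required_dp
        else None
        for allele in genotype
    ]
```

For example, squeeze_dp(37) yields 32, and apply_required_dp([0, 1], [12, 2], 3) yields [0, None] because the alternate allele is supported by only two reads. With required_dp: 0, as in the DeepVariantWES preset, no allele is revised on depth grounds.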