Skip to content

Commit

Permalink
updating prancstr readme
Browse files Browse the repository at this point in the history
  • Loading branch information
gymreklab committed Nov 20, 2023
1 parent a89196a commit b96efde
Show file tree
Hide file tree
Showing 3 changed files with 43 additions and 38 deletions.
51 changes: 25 additions & 26 deletions trtools/prancSTR/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,11 +45,11 @@ Other general parameters:
* :code:`--region <string>`: Restrict to the region chr:start-end. VCF file must be bgzipped and indexed to use this option.
* :code:`--samples <string>`: Restrict to the given list of samples. Samples are comma separated.
* :code:`--vcftype <string>`: Specify the tool which generated the vcf call file for STRs. Currently this will fail if using anything other than :code:`hipstr` VCFs.
* :code:`--only-passing <action=store_true>`: Filters out the VCF records with non-passing FILTER column
* :code:`--output-all <action=store_true>`: Force tool to output results for all loci.
* :code:`--readfield <string>`: Specify which field to utilize for extracting read information.
* :code:`--debug <action=store_true>`: Print helpful debug messages.
* :code:`--quiet <action=store_true>`: Restrict printing of any messages.
* :code:`--only-passing: Filters out the VCF records with non-passing FILTER column
* :code:`--output-all: Force tool to output results for all loci.
* :code:`--readfield <string>`: Specify which VCF format field output by HipSTR to utilize for extracting read information. We recommend setting this to "MALLREADS". "ALLREADS" is also accepted but we have found produces unreliable results.
* :code:`--debug: Print helpful debug messages.
* :code:`--quiet: Restrict printing of any messages.
* :code:`--version`: Print the version of the tool

Notes:
Expand All @@ -64,27 +64,26 @@ Output files

The prancSTR output file contains mosaicism predictions generated for each locus.
Note, this file contains statistics for all tested loci and it is up to the user's discretion to filter out for high confidence mosaic allele calls.
The output generated is a tab-delimited file with 16 columns where each row represents predicted mosaicism at STR:
The output generated is a tab-delimited file with one row summarizing evidence of mosaicism for each call analyzed, with the following columns:

* :code:`sample`: The ID of the sample whose locus is being analyzed.
* :code:`chrom`: The chromosome on which the mosaic short tandem repeat (STR) lies.
* :code:`sample`: The ID of the sample being considered.
* :code:`chrom`: Chromsome of the STR being considered
* :code:`locus`: Reference ID for the short tandem repeat.
* :code:`motif`: The nucleotide sequence that the repeat is comprised of.
* :code:`A`: Represents the first genotype allele for the given STR in repeat units relative to the reference.
* :code:`B`: Represents the second genotype allele for the given STR in repeat units relative to the reference.
* :code:`C`: This is the mosaic allele inferred by prancSTR in repeat units relative to the reference.
* :code:`f`: This represents the mosaic allele fraction.
* :code:`pval`: Gives the p-value metric for how significant our findings are.
* :code:`reads`: Gives representation for how many reads support each allele.
* :code:`motif`: The nucleotide sequence of the repeat unit.
* :code:`A`: The first germline allele for the given STR in repeat units relative to the reference (copied from HipSTR).
* :code:`B`: The second germline allele for the given STR in repeat units relative to the reference (copied from HipSTR)
* :code:`C`: Candidate mosaic allele inferred by prancSTR in repeat units relative to the reference (inferred by prancSTR).
* :code:`f`: Estimated mosaic allele fraction (inferred by prancSTR.
* :code:`pval`: Gives the p-value testing the null hypothesis that f=0.
* :code:`reads`: Gives representation for how many reads support each allele (copied from HipSTR VCF field corresponding to the specified :code:`--readfield`).
* :code:`mosaic_support`: The number of reads that support the mosaic allele.
* :code:`stutter parameter u`: Gives the probability that stutter error causes an increase in obs. STR size.
* :code:`stutter parameter d`: Gives the probability that stutter error causes a decrease in obs. STR size.
* :code:`stutter parameter rho`: Parameter for geometric step size distribution.
* :code:`quality factor`: Reports the posterior probability of the genotype as a mesaure of quality of individual sample's genotype.
* :code:`read depth`: Reports the total depth/number of informative reads for all samples at the locus.
* :code:`stutter parameter u`: The probability that stutter error causes an increase in obs. STR size (copied from HipSTR INFRAME_UP field).
* :code:`stutter parameter d`: The probability that stutter error causes a decrease in obs. STR size (copied from HipSTR INFRAME_DOWN field).
* :code:`stutter parameter rho`: Parameter for geometric step size distribution of stutter errors (copied from HipSTR INFRAME_PGEOM field).
* :code:`quality factor`: Quality score of the germline genotype (copied from HipSTR Q field).
* :code:`read depth`: Reports the total depth/number of informative reads for all samples at the locus (copied from HipSTR DP field).

Stutter parameter u, stutter parameter d, stutter parameter rho, quality factor and read depth are the parameters that have been imported from what HipSTR outputs.
Below is an example file which contains 5 STR loci
Below shows several example output lines from running prancSTR:

+---------+-------+---------+---------------+-------+----+---+---+----------+--------------+------------------+----------------+---------------------+--------------------+----------------------+----------------+------------+
| sample | chrom | pos | locus | motif | A | B | C | f | pval | reads | mosaic_support | stutter parameter u | stutter paramter d | stutter paramter rho | quality factor | read depth |
Expand All @@ -106,7 +105,7 @@ As a starting point, we suggest filtering output on the following parameters to
* :code:`read depth`: of greater than or equal to 10
* :code:`quality factor` of greater than or equal to 0.8
* :code:`mosaic_support` of greater than or equal to 3
* :code:`f`: of less than equal to 0.3
* :code:`f`: of less than equal to 0.3. Higher f values are often indicative of a heterozygous genotype miscalled as homozygous.

Example Commands
----------------
Expand All @@ -116,7 +115,7 @@ Below are :code:`prancSTR` examples using HipSTR VCFs. Data files can be found a
# Example command running prancSTR for only one chromosome with hipstr output file
# --only-passing skips VCF records with non-passing filters
prancSTR \
--vcf CEU_subset.vcf.gz \
--vcf example-files/CEU_subset.vcf.gz \
--out CEU_chr1 \
--vcftype hipstr \
--only-passing \
Expand All @@ -125,9 +124,9 @@ Below are :code:`prancSTR` examples using HipSTR VCFs. Data files can be found a
# Example command running prancSTR for only one sample
# --only-passing skips VCF records with non-passing filters
prancSTR \
--vcf CEU_subset.vcf.gz \
--vcf example-files/CEU_subset.vcf.gz \
--only-passing \
--out NA12878_mosaicSTR \
--out NA12878_chr1 \
--samples NA12878


Expand Down
29 changes: 17 additions & 12 deletions trtools/prancSTR/prancSTR.py
Original file line number Diff line number Diff line change
Expand Up @@ -460,8 +460,6 @@ def main(args):
if args.samples is not None:
usesamples = args.samples.split(",")

start_time = time.time()
nrecords = 0

if args.out == "stdout":
outf = sys.stdout
Expand All @@ -476,22 +474,27 @@ def main(args):
"quality factor", "read depth"]
outf.write("\t".join(header_cols)+"\n")

start_time = time.time()
nrecords = 0 # Number STRs processed
ntests = 0 # Number total tests
for record in region:
nrecords += 1
trrecord = trh.HarmonizeRecord(vcftype, record)

if args.only_passing and not args.output_all and (record.FILTER is not None):
common.WARNING("Skipping record %s with non-passing VCF FILTER field." %
if args.debug:
common.WARNING("Skipping record %s with non-passing VCF FILTER field." %
str(trrecord))
continue

########### Extract necessary info from the VCF file #######
# Stutter params for the locus. These are the same for all samples
# First check we have all the fields we need
if args.readfield not in trrecord.format.keys():
common.WARNING("Could not find read field %s for %s" %
(args.readfield, str(trrecord)))
continue

########### Extract necessary info from the VCF file #######
nrecords += 1 # only increment if we're actually testing it
# Stutter params for the locus. These are the same for all samples
if "INFRAME_UP" not in trrecord.info.keys() or \
"INFRAME_DOWN" not in trrecord.info.keys() or \
"INFRAME_PGEOM" not in trrecord.info.keys():
Expand Down Expand Up @@ -556,10 +559,10 @@ def main(args):
# Discard locus if: only a single allele seen in the reads
if len(set(reads)) == 1 and not args.output_all:
continue

ntests += 1
locname = "%s:%s" % (record.CHROM, record.POS)
best_C, best_f = MaximizeMosaicLikelihoodBoth(reads, A, B, stutter_probs,
locname=locname, quiet=args.quiet)
locname=locname, quiet=not(args.debug))
pval = ComputePvalue(reads, A, B, best_C, best_f, stutter_probs)

outf.write('\t'.join([samples[i], record.CHROM, str(record.POS),
Expand All @@ -574,12 +577,14 @@ def main(args):
if args.debug:
common.WARNING("Inferred best_C=%s best_f=%s" %
(best_C, best_f))
#############################################################
if nrecords % 50 == 0 and not args.quiet:
common.MSG("Finished {} records, time/record={:.5}sec".format(nrecords,

if nrecords > 0 and nrecords % 50 == 0 and not args.quiet:
common.MSG("Finished {} records, {} total tests. "
" time/record={:.5}sec".format(nrecords, ntests,
(time.time() - start_time)/nrecords), debug=True)

if not args.quiet: common.MSG("Performed analysis on {} records".format(nrecords), debug=True)
if not args.quiet:
common.MSG("Performed analysis on {} records, {} total tests".format(nrecords, ntests), debug=True)

if outf is not None and args.out != "stdout":
outf.close()
Expand Down
1 change: 1 addition & 0 deletions trtools/prancSTR/tests/test_prancSTR.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ def test_RightFile(args, vcfdir):
assert retcode==0

args.quiet = False
args.debug = True
retcode = main(args)
assert retcode==0

Expand Down

0 comments on commit b96efde

Please sign in to comment.