-
Notifications
You must be signed in to change notification settings - Fork 20
User guide
The command to run small variant genotyping is:
graphtyper genotype <REFERENCE.fa> --sams=<BAMLIST_OR_CRAMLIST> --region=<chrA:begin-end>
where REFERENCE.fa
is the FASTA reference genome, BAMLIST_OR_CRAMLIST
are the input BAM/CRAM files (one per line), and T is the maximum amount of threads you wish to allocate. Note that T should be equal or lower than your number of input SAM files as there is always just one thread reading each SAM file. For more details and other options see the help page
./graphtyper genotype --help
./graphtyper genotype --advanced --help # To see advanced options
For SV genotyping you should instead use genotype_sv
subcommand.
./graphtyper genotype_sv <REFERENCE.fa> <input.vcf.gz> --sams=<BAMLIST_OR_CRAMLIST> --region=<chrA:begin-end> --threads=<T>
For more details and other options see the help page
./graphtyper genotype_sv --help
./graphtyper genotype --advanced --help # To see advanced options
In our experience, discovering structural variants (SVs) using Manta diploidSV.vcf.gz yields the best results as Manta provides very detailed information on the exact breakpoint sequence of SVs - some of which are essential for graphtyper. Use svimmer to merge candidate from multiple diploadSV files together.
Some regions of the genome abnormally high sequence depth, even ten times the average coverage or more. You may improve the overall time of graphtyper by subsampling reads in such regions using --avg_cov_by_readlen=FILE
The FILE in --avg_cov_by_readlen=FILE
which should contain the average coverage divided by read length for each BAM/CRAM (one value per line). The list provided in --sams=FILE
should be in the same order as the avg_cov_by_readlen
file. So for example if have a sample with 30x coverage with 100bp reads you should put the value "0.3". You may calculate the values for an input BAMLIST
using:
parallel -k "samtools idxstats {1} | head -n -1 | awk '{sum+=\$3+\$4; ref+=\$2} END{print sum/ref}'" :::: ${BAMLIST} > ${BAMLIST}.avg_cov_by_readlen
If graphtyper is using too much memory for your system you might try setting a lower --max_files_open=VAL
value (the trade off is a bit slower running speed). Set it to, for example, two times your CPU thread count. This advanced option will determine how many files each tread can open at the same time and a lower value forces graphtyper to processess fewer individuals at a time.
The graphtyper output files will be in small regions. If you only have one or a few sample you might want to concatenate them and for that you can use the bcftools concat --naive
command, for example
$ echo chr{1..22} chrX | tr ' ' '\n' | while read chrom; do if [[ ! -d results/${chrom} ]]; then continue; fi; find results/${chrom} -name "*.vcf.gz" | sort; done > vcf_file_list
$ bcftools concat --naive --file-list vcf_file_list -Oz -o large.vcf.gz
$ tabix -p vcf large.vcf.gz
Note: replace results
with sv_results
if you are genotyping SVs.
See SNP and indel filtering page
The INFO/AAScore has a score between 0 and 1, higher score indicates a better alternative allele quality. Consider removing anything from your analysis with a score below 0.5, or find another threshold that suits your application.