To be compiled, SurVIndel requires g++ 4.7.2.
htslib 1.7 is included in the source as a zipped archive. First, build it using the provided script:
./build_htslib.sh
If htslib does not build correctly, please refer to https://github.com/samtools/htslib
Then, run
cmake -DCMAKE_BUILD_TYPE=Release . && make
The reference fasta file should be indexed by both bwa and samtools. For example, assuming the file is hg19.fa, you should run
bwa index hg19.fa
samtools faidx hg19.fa
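Before running SurVIndel, it can be handy to confirm that both indexing steps actually produced their output. The following is a minimal sketch, not part of SurVIndel: the helper name missing_index_files is hypothetical, and it assumes the default file naming of bwa index (.amb, .ann, .bwt, .pac, .sa) and samtools faidx (.fai) next to the fasta file.

```python
import os


def missing_index_files(fasta, exists=os.path.exists):
    """Return the expected index files for `fasta` that are not present.

    Hypothetical helper. Assumes default naming: bwa index writes
    <fasta>.amb/.ann/.bwt/.pac/.sa and samtools faidx writes <fasta>.fai.
    `exists` can be replaced by another predicate, e.g. for testing.
    """
    suffixes = [".amb", ".ann", ".bwt", ".pac", ".sa", ".fai"]
    return [fasta + s for s in suffixes if not exists(fasta + s)]
```

Calling missing_index_files("hg19.fa") returns an empty list when all six files are found; any file names it returns point to the indexing step that needs to be rerun.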
Although not mandatory, SurVIndel will generally give higher quality results if a simple repeats file is provided. This can normally be downloaded from the simpleRepeats table in UCSC. The header must be removed and only the chromosome, the start, the end and the period columns must be retained, i.e.:
cat downloaded-file | grep -v "#" | cut -f2,3,4,6 > file-for-survindel.bed
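The same filtering can be sketched in Python, which may help if the file needs further checks before use. This is an illustration only: the function name ucsc_to_bed_lines is made up, and the column positions simply mirror the cut -f2,3,4,6 in the command above (chromosome, start, end, period).

```python
def ucsc_to_bed_lines(lines):
    """Mimic `grep -v "#" | cut -f2,3,4,6` for tab-separated rows.

    Hypothetical helper, not part of SurVIndel. Lines containing '#'
    (the header) and empty lines are dropped; rows with fewer than six
    columns are skipped rather than guessed at.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or "#" in line:
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        # cut uses 1-based columns 2,3,4,6 -> zero-based 1,2,3,5
        out.append("\t".join(fields[i] for i in (1, 2, 3, 5)))
    return out
```

The function takes any iterable of lines (for instance an open file) and returns the lines to write to file-for-survindel.bed.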
Alternatively, you can run TRF (https://tandem.bu.edu/trf/trf.html) and use the provided trf-to-bed.sh, i.e.:
cat trf-output.dat | ./trf-to-bed.sh > file-for-survindel.bed
SurVIndel has so far been tested only on BAM files generated by BWA MEM, so we recommend using it for the alignment. The BAM file should also be run through Picard FixMateInformation (http://broadinstitute.github.io/picard/command-line-overview.html#FixMateInformation); in particular, it should have the MQ and MC tags. Finally, the file should be sorted and indexed as usual using samtools.
Supposing file.bam is the file resulting from the alignment:
java -jar picard.jar FixMateInformation I=file.bam
samtools sort file.bam > sorted.bam
samtools index sorted.bam
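To verify that the MQ and MC tags are present, one option is to inspect the text output of samtools view. The sketch below is not part of SurVIndel: the helper name missing_mate_tags is hypothetical, and it assumes that only paired reads with both the read and its mate mapped are expected to carry the tags (SAM flag bits 0x1, 0x4 and 0x8).

```python
def missing_mate_tags(sam_lines, required=("MC", "MQ")):
    """Return names of reads that lack any of the `required` tags.

    Hypothetical helper working on SAM text lines. Header lines are
    skipped, as are unpaired reads and pairs where the read or its
    mate is unmapped (an assumption about where the tags are expected).
    """
    missing = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if not (flag & 0x1) or (flag & 0x4) or (flag & 0x8):
            continue
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE
        tags = set(f.split(":", 1)[0] for f in fields[11:])
        if any(t not in tags for t in required):
            missing.append(fields[0])
    return missing
```

It could be fed, for example, with the first few thousand lines of samtools view sorted.bam read from standard input; an empty result suggests the tags are in place.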
Once the C++ code is compiled, SurVIndel can be run. Python and the libraries NumPy (http://www.numpy.org/), PyFaidx (https://github.com/mdshw5/pyfaidx) and PySam (https://github.com/pysam-developers/pysam) are required. Python 2.7, NumPy 1.10, PyFaidx 0.4 and PySam 0.12 are the recommended versions.
The bare minimum command for running SurVIndel is
python surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta
Other parameters which may be important are the number of threads, the location of the bwa and samtools executables and a simple repeats catalogue (which can be downloaded from the UCSC Genome Browser or, if not available there, generated with TRF).
python surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta --threads 40 --samtools /path/to/samtools --bwa /path/to/bwa --simple-rep /path/to/simple/repeats/file
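When SurVIndel is launched from another script, the command line can be assembled programmatically. The following is a minimal sketch: the function name build_surveyor_command is hypothetical, and it uses only the positional arguments and the four options shown above.

```python
def build_surveyor_command(bam, workdir, reference, threads=None,
                           samtools=None, bwa=None, simple_rep=None):
    """Build the argument list for surveyor.py (hypothetical helper).

    Optional settings left as None are omitted, which gives the bare
    minimum command.
    """
    cmd = ["python", "surveyor.py", bam, workdir, reference]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if samtools is not None:
        cmd += ["--samtools", samtools]
    if bwa is not None:
        cmd += ["--bwa", bwa]
    if simple_rep is not None:
        cmd += ["--simple-rep", simple_rep]
    return cmd
```

The resulting list can be passed to subprocess.check_call, which raises an error if SurVIndel exits with a non-zero status.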
After SurVIndel has been successfully run, the calls can be retrieved with the command
./filter /path/to/working/directory alpha-value score-cutoff min-size simple-repeats
Here, alpha-value is the maximum p-value for an indel to be accepted, score-cutoff is the positive-to-negative ratio cutoff, min-size is the minimum size for an indel to be reported and simple-repeats is the simple-repeats file. The recommended values are 0.01 for alpha-value and 0.33 for score-cutoff. Larger alpha-values and lower score-cutoffs will yield more predictions, but at the expense of precision.
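The filter call can be wrapped in the same way, with the recommended values as defaults. This is only a sketch: the function name build_filter_command is hypothetical, and min-size has no recommended value in this document, so it must be supplied by the caller.

```python
def build_filter_command(workdir, min_size, simple_repeats,
                         alpha=0.01, score_cutoff=0.33):
    """Build the argument list for ./filter (hypothetical helper).

    Argument order follows the usage line:
    working directory, alpha-value, score-cutoff, min-size, simple-repeats.
    The defaults are the recommended alpha-value and score-cutoff.
    """
    return ["./filter", workdir, str(alpha), str(score_cutoff),
            str(min_size), simple_repeats]
```

As with the previous command, the list can be handed to subprocess.check_call; raising alpha or lowering score_cutoff trades precision for more predictions.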