Version 0.21
Weaver is a tool that, given whole genome sequencing data from a tumour sample (in BAM format) and a reference FASTA file as input, returns allele-specific copy numbers of genomic regions and phased breakpoints of structural variants. The principal model in the framework is a Markov Random Field whose hidden states are the phase and copy number of genomic loci. It takes germline SNP data from the 1000 Genomes database, GC content and mappability as auxiliary inputs.
Installing Weaver requires the following dependencies.
- CMake
- Bamtools libraries (included in Weaver_SV/lib and Weaver_SV/inc)
- Parallel::ForkManager Perl package
- Bedtools
- Samtools
- BOOST C++ library
- BWA
- Bowtie
- libz (linked with the -lz flag)
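As a quick pre-install check, the command-line tools in the list above can be looked up on the PATH. This is a minimal sketch, not part of Weaver: the helper name missing_tools is hypothetical, and the executable names shown in the usage comment (cmake, bedtools, samtools, bwa, bowtie, perl) are assumed to be the usual ones for these packages.

```shell
# Hypothetical helper (not shipped with Weaver).
# Prints the name of every given tool that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (executable names are assumptions):
#   missing_tools cmake bedtools samtools bwa bowtie perl
# The Perl module can be checked separately; this exits non-zero if it is absent:
#   perl -MParallel::ForkManager -e 1
```

An empty result means every listed tool was found. Boost, Bamtools and libz are libraries rather than executables, so they are not covered by this check.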
We recommend that the system on which Weaver is installed have more than 40 GB of memory, since Weaver incorporates variant calling. For small inputs, memory usage can be much lower, depending on the number of threads.
In order to install Weaver, we need to run the following commands.
``export LD_LIBRARY_PATH=<PREFIX>/Weaver/Weaver_SV/lib/:$LD_LIBRARY_PATH``
Define the variables $BOOST and ${BOOST_OPT} as the location of your Boost installation and the path to the library file for the Boost program options library, respectively. The latter is traditionally located at:
``$BOOST/bin.v2/libs/program_options/build/gcc-4.8/release/link-static/threading-multi/libboost_program_options.a``
Then run the following command.
``./INSTALL.sh $BOOST ${BOOST_OPT}``
The Weaver executable will be located in bin/ within the installation directory.
Weaver requires input data that is available for download.
``wget http://genome.compbio.cs.cmu.edu/~ashokr/data/Weaver_data.tar.gz``
The individual files must be stored in the folder data/ in the installation directory.
An example of running the core Weaver program is available at the following link.
``http://genome.compbio.cs.cmu.edu/~ashokr/data/Weaver_example.tar``
Download the entire folder into the Weaver directory. To run the example, just run . cmd.sh
We will assume the following input variables, which we need for all the scripts.
- ${BAM}: Sorted and indexed BAM file. The index must be present in the same directory.
- ${REFDIR}: The path and prefix of the reference FASTA file, excluding the extension (.fa or .fasta). The location must also contain the Bowtie (*.ebwt) and BWA indices, with the same prefix. For the sake of exposition, we will assume that the reference FASTA file has the extension .fa.
- ${GAPALPHA}: The file GAP_20140416 in the data/ directory.
- ${GAP}: The file GAP_20140416_num in the data/ directory. This is the same as $GAPALPHA, but does not have the chr prefix preceding chromosome names.
- ${T1000}: The 1000 Genomes Phase 1 haplotypes data. This can be obtained from the 1000 Genomes project website. (A shell variable name cannot begin with a digit, so the commands below use T1000 rather than 1000G.)
- ${SEX}: The sex of the subject from whom the sample was obtained.
- ${MAP}: The file wgEncodeCrgMapabilityAlign100mer_number.bd in the data/ directory.
- ${THREADS}: User-specified number of threads on which the program will run.
We will use ${BIN} as the directory under which the Weaver executables are stored. This will be the full path to the directory bin/ in the installation.
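As a minimal sketch, these variables might be set in a shell session as follows. Every path is a placeholder, WEAVER_HOME is a hypothetical convenience variable, and the values of SEX and THREADS are only examples. The 1000 Genomes variable is written as T1000, matching the pipeline commands in this document, because a shell variable name cannot start with a digit.

```shell
# All paths below are placeholders; substitute your own locations.
WEAVER_HOME=/path/to/Weaver          # hypothetical: the installation directory
BIN="$WEAVER_HOME/bin"               # Weaver executables
BAM=/path/to/tumour.bam              # sorted BAM, with its index in the same directory
REFDIR=/path/to/reference/genome     # FASTA prefix without .fa; Bowtie and BWA indices share it
GAPALPHA="$WEAVER_HOME/data/GAP_20140416"      # chromosome names with the chr prefix
GAP="$WEAVER_HOME/data/GAP_20140416_num"       # chromosome names without the chr prefix
T1000=/path/to/1000G_phase1          # 1000 Genomes Phase 1 haplotypes data
SEX=F                                # F or M
MAP="$WEAVER_HOME/data/wgEncodeCrgMapabilityAlign100mer_number.bd"
THREADS=8                            # example value
```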
In order to run Weaver, we must first call the single nucleotide and structural variants in the sample. To do this, we will use the Perl script bin/Weaver_pipeline.pl.
Usage: $BIN/Weaver_pipeline.pl ALL <mode> \
    -p/--thread   number of cores
    -f/--fa       [MANDATORY] bowtie and bwa reference dir/name
    -g/--gap      [MANDATORY] Gap file
    -b/--bam      bam file
    -o/--output   output dir
    -k/--onekg    1000 Genomes Project data dir
    -s/--sex      Female (F) or Male (M). Y chromosome will not be used if the bam is from female tissue.
    -h/--help
<mode> takes one of the following arguments: SV, SNP, WIG.
perl $BIN/Weaver_pipeline.pl ALL SV \
    -f ${REFDIR} \
    -g ${GAPALPHA} \
    -b ${BAM} \
    -k ${T1000} \
    -s ${SEX} \
    -p ${THREADS}
perl $BIN/Weaver_pipeline.pl ALL SNP \
    -f ${REFDIR} \
    -g ${GAPALPHA} \
    -b ${BAM} \
    -k ${T1000} \
    -s ${SEX} \
    -p ${THREADS}
perl $BIN/Weaver_pipeline.pl ALL WIG \
    -f ${REFDIR} \
    -g ${GAPALPHA} \
    -b ${BAM} \
    -k ${T1000} \
    -s ${SEX}
The core Weaver program needs haplotype level coverage for the cancer and normal genomes as input. We can estimate this using the following command, run from the same directory in which Weaver_pipeline.pl was executed. Assume that the variable ${NEWGAP} is equal to $GAPALPHA if the reference FASTA and BAM file have chromosome names prefixed with chr, and equal to $GAP otherwise.
$BIN/Weaver PLOIDY -f ${REFDIR}.fa \
    -S ${BAM}.Weaver.GOOD \
    -s SNP_dens \
    -g ${NEWGAP} \
    -w ${BAM}.wig \
    -z ${TILESIZE} \
    -r 1 \
    -m $MAP \
    -p $THREADS
Inputs:
- -f reference file (FASTA); it should match the reference used for the original BAM file. In particular, for most TCGA datasets the alignment was performed against //www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta, which does not have the "chr" prefix. [MANDATORY]
- -S SV file, with format consistent with Weaver_SV. [MANDATORY]
- -s SNP file, with ref and alt mappings [MANDATORY]
- -w wig file from bam, storing the coverage information [MANDATORY]
- -z tile size argument: positive integer size to partition genome. [default 5000]
- -r 1, if first time running (generating temp files); 0 if want to use existing temp files. [default 1]
- -m mappability file, extracted from http://genome.compbio.cs.cmu.edu/~ashokr/data/Weaver_data.tar.gz [MANDATORY]
- -p number of cores [default 1]
Output:
- TARGET: File containing haplotype level coverage of different regions
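The rule for choosing ${NEWGAP}, stated before the PLOIDY command, can be sketched as a small shell helper. The function name choose_gap is hypothetical and not part of Weaver. It inspects only the first header line of the reference FASTA and assumes the BAM file uses the same chromosome naming, which can be confirmed by reading the BAM header with samtools view -H.

```shell
# Hypothetical helper (not shipped with Weaver).
# Usage: choose_gap <reference.fa> <GAPALPHA path> <GAP path>
# Prints the gap file matching the chromosome naming of the reference.
choose_gap() {
    local fasta="$1" gapalpha="$2" gap="$3"
    local first_header
    # First FASTA header line, e.g. ">chr1" or ">1"; fail if there is none.
    first_header=$(grep -m 1 '^>' "$fasta") || return 1
    case "$first_header" in
        '>chr'*) printf '%s\n' "$gapalpha" ;;   # names carry the chr prefix
        *)       printf '%s\n' "$gap" ;;        # names without the prefix
    esac
}

# Example usage:
#   NEWGAP=$(choose_gap "${REFDIR}.fa" "$GAPALPHA" "$GAP")
```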
The tile size argument must be set by trial and error, so that the TARGET file is properly populated. The appropriate value varies from sample to sample, but the usual range is from 500 to 5000, with 1000 being common for TCGA samples.
Once this is obtained, we use the following command to obtain the haplotype level coverage.
$BIN/solo_ploidy TARGET 2
The 2 here indicates a diploid normal genome. This will write the estimated haplotype level normal and tumour coverage to STDOUT.
Finally, in order to obtain the main result, we run the following script. Here, we assume that ${TUMOUR_COV} and ${NORMAL_COV} are the tumour and normal haplotype level coverage, respectively, obtained in the previous step.
$BIN/Weaver LITE -f ${REFDIR}.fa \
    -S ${BAM}.Weaver.GOOD \
    -s SNP_dens \
    -g ${NEWGAP} \
    -w ${BAM}.wig \
    -z ${TILESIZE} \
    -r 1 \
    -m $MAP \
    -t ${TUMOUR_COV} \
    -n ${NORMAL_COV} \
    -p $THREADS
Wiggle file:
The wiggle file needs to be declared with fixedStep, step 1 and span 1:
fixedStep chrom=chr1 start=9994 step=1 span=1
If a chromosome has multiple declaration lines, they need to be sorted by position. The following is not allowed:
fixedStep chrom=chr1 start=9994 step=1 span=1
X
X
X
fixedStep chrom=chr1 start=100 step=1 span=1
X
X
X
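The wiggle requirements above can be checked before running Weaver. The following is a minimal sketch, not part of Weaver: check_wig is a hypothetical helper that only looks at declaration lines, and it assumes that step and span are written out explicitly on each of them.

```shell
# Hypothetical helper (not shipped with Weaver): check_wig <file.wig>
# Prints one message per problem and returns non-zero if any was found.
check_wig() {
    awk '
    /^variableStep/ {
        print "line " NR ": variableStep found, fixedStep is required"
        bad = 1
        next
    }
    /^fixedStep/ {
        chrom = ""; start = 0; step = ""; span = ""
        # Parse the key=value fields of the declaration line.
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "chrom") chrom = kv[2]
            else if (kv[1] == "start") start = kv[2] + 0
            else if (kv[1] == "step") step = kv[2]
            else if (kv[1] == "span") span = kv[2]
        }
        if (step != "1" || span != "1") {
            print "line " NR ": step and span must both be 1"
            bad = 1
        }
        # Declaration lines of one chromosome must have increasing starts.
        if ((chrom in last) && start <= last[chrom]) {
            print "line " NR ": " chrom " start=" start " is out of order"
            bad = 1
        }
        last[chrom] = start
    }
    END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example usage:
#   check_wig "${BAM}.wig" || echo "wiggle file needs fixing"
```

The check does not verify that the blocks of one chromosome are non-overlapping or that the values themselves are valid.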
BAM file:
Must be sorted and indexed.
SNP file:
NGS SNP link file
1KGP SNP link
Genome region file:
GAP regions in assembly are annotated.
Storing phased allele specific copy number of genome
CHR BEGIN END ALLELE_1_CN ALLELE_2_CN
Structural variation copy number, phasing and category
CHR_1 POS_1 ORI_1 ALLELE_ CHR_2 POS_2 ORI_2 ALLELE_ CN germline/somatic_post_aneuploidy/somatic_pre_aneuploidy
Yang Li
Jian Ma's Computational Genomics Lab at Carnegie Mellon
The code was developed by Yang Li when the Ma lab was at the University of Illinois at Urbana-Champaign.