These are scripts I wrote (mostly in perl) to do some basic bioinformatics munging. They are no longer supported, but you are welcome to try using them. Archived readme follows
To get a copy of them in your directory, run:
git clone https://github.com/johnlees/bioinformatics/
Scripts generally to create and deal with de novo assemblies, or assembly only methods of calling
Annotates a vcf with the features they fall in as well as gene names (using a gff file)
./annotate_vcf_from_gff.pl <gff_in> <vcf_file>
Simulates SNPs in a fasta file using the JC69 rate matrix and the specified mutation rate
./mutate_fasta.pl
Usage: ./mutate_fast.pl -f <fasta_file> -r <mutation_rate> -o <output_file> (-s <seed>) > mutations.txt
Creates SNPs in a multifasta using JC69 rate matrix
Options:
-f --fasta Multi-fasta file of sequence to create SNPs in
-r --rate Per-base mutation rate
-s --seed Optional seed for random number generator
-o --output Output file name
-h --help This help message
Prints mutations tab separated to stdout and logs to stderr
Simple wrapper to run quake on pairs of read files (fastq(.gz)) - which error corrects reads based on kmer frequencies. For the kmer to use see the help
This should be bsubbed with bsub -o quake.%J.o -n4 -R "span[hosts=1]" -R "select[mem>3000] rusage[mem=3000]" -M3000 ./run_quake.pl <options>
./run_quake.pl
One of the options was not defined. All are required
Usage: ./run_quake.pl -r <reads_file> -k <kmer_size> -t <threads>
Runs quake error correction on paired end reads provided as fastq(.gz) files.
Options:
-r --reads Tab delimited file of fastq or fastq.gz file locations.
One line per sample. First column sample name, second
column forward reads, third column reverse reads.
It is best to give absolute paths
-k --kmer The kmer size to use with quake. Set to
ceil[log(200*G)/log(4)] where G is the genome length
-t --threads The number of threads to allow quake to use
-s --separate Error correct each sample separately. Probably required
for simulated data
-h --help This help message
Output will be in cwd/quake
Calls variants (at the moment only SNPs, but soon INDELs too) between two samples without needing to provide a reference. The output is an annotated vcf with just the variants
You'll need to provide the assembly of one of the samples as input (it is only used to place the variants, not in calling). See the extended help with -h for advice on how to prepare this.
bsub commands can be found in the extended help by running with the flag -h. I'd advise using 2 or 4 cores
You can use either cortex or sga to do this. Early tests suggests cortex has better power. Reads are error corrected with quake first
/reference_free_variant_caller.pl
Usage: ./reference_free_variant_caller.pl [--sga|--cortex] -a <assembly.fasta> -g <annotation.gff> -r <reads_file>
Uses cortex_var or sga to call variants between two samples without providing a reference
sequence
Options
--cortex Use cortex for variant calling
--sga Use sga for variant calling
-a, --assembly Contigs of a de-novo assembly of one of the samples in
multi-fasta format. See help for more info.
-g, --annotation Annotation of the -a assembly in gff v3.
-r, --reads Tab delimited file of fastq or fastq.gz file locations.
One line per sample. First column sample name, second
column forward reads, third column reverse reads.
It is best to give absolute paths
-o, --output Prefix for output vcfs
-c, --med-cov Median coverage of reads over the assembly. Required by
sga, unused by cortex
--no-error-correct Don't run quake on reads. Input reads are fed directly
into cortex
--separate-correct Error correct each sample separately. Probably required
for simulated data
--dirty Don't clean up temporary files
-h, --help Shows more detailed help.
For cortex, requires: cortex_var, bcftools, quake and jellyfish.
For sga, requires: sga
See help for more details
Run ./batch_assemble.pl
This takes a list of run_lane#tag and runs (over LSF):
- SPAdes assembler (using --careful mode which include hammer read error correction)
- improvement pipeline
- annotation pipeline
This script does not need to be bsub'd. The last three options can be used to continue failed jobs (e.g. with more memory)
./batch_assemble.pl
Usage: ./batch_assemble.pl <options> --lanes lane_file.txt
Submit assembly and annotation jobs across lsf
Takes a file which contains ids in the form run_lane#tag, one per line
Options
--lanes A list, one per line, of sequencing ids of the form
run_lane#tag
--genus The genus of the bacteria (used for annotation)
--output Output directory (default ./)
--tmp Directory to write temporary files to (default /tmp)
--threads Number of threads to use (default 1)
--mem Comma separated list of memory to submit each step with
Default: 4000,2000,1200
--nosymlinks Do not create subdirectories with symlinks, as they
already exist
--improve Run from improvement step
--annotate Run from annotation step
--help Displays this help message
Anything else
Does a blastn of nucleotides in a 300bp window around a variant. This can be used to confirm variants are the same when they have been called after mapping onto different reference sequences
./compare_variants.pl --vcf1 1.vcf.gz --vcf2 2.vcf.gz --ref1 ref1.fa --ref2 ref2.fa
Strep. pneumo. specific
Extracts the allele type for the hsdS gene in the type I R-M system in S. pneumo which is one of six types (see Manso et al, Nat. Comms. 2014).
You can try and do this from assemblies, or using mapping. They work in different ways, but I think mapping is better
Run with:
./ivr-typer.pl --map --ref_dir /lustre/scratch108/bacteria/jl11/pairs_analysis/RM-locus/mapping -f forward_reads.fastq.gz -r reverse_reads.fastq.gz > ivr_allele.txt
./ivr-typer.pl [--map|--assembly] <options>
Given S. pneumo reads (via mapping mode) or annotated assembly (via assembly mode),
returns information on the likely allele type for the ivr/hsd R-M system locus
Using mapping mode will take around 1Gb memory for bam sorting, and around 5 mins
of CPU time if mapping is required
STDOUT can be piped to get tab separated output with a header
Options
--map Infer by mapping reads to reference alleles
--assembly Infer by annotation of gene, then a blast with reference
genes
--ref_dir Directory containing sequences of reference alleles
(default ./)
<map mode>
-m, --mapping A bam file which already has reads mapped to the D39 ref
OR (to create the bam)
-f, --forward_reads Location of forward reads (fastq(.gz))
-r, --reverse_reads Location of reverse reads (fastq(.gz))
<assembly mode>
-a, --annotation Annotation of the -a assembly in gff v3.
-b, --batch Batch mode. Interprets the annotation option as a
list of gff files instead of a single file. A second
(tab separated) field with samples names can be used
STDOUT can be piped to get list of alleles by sample
-h, --help Displays this message
To make sense of the output of this script run on many samples you can use ivr_format.pl
Very similar to ~sh16/scripts/multiple_mappings_to_bam.py
, developed originally to map two very similar sequences to a draft assembly. More features have been added recently
Designed for two samples at a time (c.f. reference_free_variant_caller.pl), but put in as many/few as you want. Creates one large merged bam and calls from this. I doubt this will work with hundreds/thousands of samples
Advantages are:
- Uses new bcftools caller which should be better
- Calls are haploid not diploid
- Allows use of a more accurate prior for bacteria rather than the human default
- VCFs are annotated if they fall in a region with a feature. Frameshifting indels are annotated
- Can use snap for mapping, which is much quicker than bwa mem or smalt
- Works with multiple samples concurrently
Disadvantages:
- No pseudo-genome
- Not great scaling (use the same number of threads as samples, but would be better as job arrays). This should be ok if you write a script to bsub each sample
- Memory scaling of snap seems capricious
- Multiple cores is dodgy with snap
Future improvements:
- Allow use of GATK genotyper (quicker)
- Try and call large deletions
Usage: ./map_snp_call.pl -r <reference.fasta> -a <reference_annotation.gff> -r <reads_file> -o <output_prefix>
Maps input reads to reference, and calls snps
Options
-a, --assembly Fasta file of reference to map to.
-g, --annotation Annotation of the reference to annotate variants with.
-r, --reads Tab delimited file of fastq or fastq.gz file locations.
One line per sample. First column sample name, second
column forward reads, third column reverse reads.
It is best to give absolute paths
-o, --output Prefix for output files
-m, --mapper Choose snap, bwa or smalt. Default snap
--gatk Use gatk to improve bams (realign around indels)
(recommended!)
-t, --threads Number of threads to use. Default 1
-p, --prior Prior for number of snps expected. Default 1e-3
--dirty Don't remove temporary files
-h, --help Shows more detailed help.
Requires: samtools, bcftools
At least one of: smalt, bwa, snap
Optional: picard, gatk