-
Notifications
You must be signed in to change notification settings - Fork 10
Tutorial
Here I show how to apply seer to 603 pneumococcal genomes isolated from the nasopharynx of US (MA) children (see Croucher et. al. Nat Gen 6(45) 656--663 doi: 10.1038/ng.2625). 224 were resistant or intermediately resistant to penicillin.
We start with genomes assembled by velvet (in assemblies) and the list of which were resistant and susceptible to penicillin (penicillin.pheno)
paste <(ls assemblies | cut -d "." -f 1) <(ls assemblies/*.fa) > fsm_files.txt
fsm-lite -v -l fsm_files.txt -t tmp_idx -s 10 -S 593 > fsm_kmers.txt
split -d -n l/16 fsm_kmers.txt fsm_out
rm fsm_kmers.txt
gzip fsm_out*
These first steps count the k-mers from the assemblies and split the output into files to facilitate parallelisation later. We use k-mers to test for association as these allow us to look for both gene and SNP variation simultaneously, and they don't rely on a single reference genome.
The parameters -s and -S should be set to around 1% and 99% of the sample size respectively.
This step takes about 4 hours. If you are running into issues with disk space you should pipe the fsm-lite
command directly to gzip
, and leave out the split
command.
Make sure you are using v1.1.3 or later. The order of samples in ls assemblies/*.fa
and cut -f 1 penicillin.pheno
should be the same. Otherwise in the final command give a file with sample names in the order of the former command.
mash sketch -o reference assemblies/*.fa
mash dist reference.msh reference.msh > mash_distances.txt
perl ~/seer/scripts/mash2matrix.pl mash_distances.txt > all_distances.csv
perl ~/seer/scripts/R_mds.pl -d all_distances.csv -p penicillin.pheno -o projection
- Create a minhash of all the assemblies
- Calculate the pairwise distance between all samples
- Convert the mash output into a distance matrix
- Project the 603x603 distance matrix into 3 dimensions
Bacteria are clonal, and so a simple test of association will bring up all variants in a lineage associated with the phenotype as well as the causal variants. By assessing the similarity between samples we can approximate this population structure, and control for it in seer.
seer -k fsm_out00.gz -p penicillin.pheno --struct projection --threads 8 --maf 0.05 > seer00.txt
This runs the association. It uses the counted kmers -k, the phenotype -p and the population structure calculated above --struct. We also filter out low frequency kmers, and multithread on the vm for speed.
This command can be run in a reasonable amount of time (<20 minutes) to get the output from one of the kmer files. Using less -S, have a look in the output file to see what seer generates as output.
This can then be filtered to just those kmers positively associated with resistance
filter_seer -k seer00.txt --pos_beta > seer00_filtered.txt
parallel --results answers -j 8 seer -k fsm_out{}.gz -p penicillin.pheno --struct projection --maf 0.05 ::: 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
for i in $(find . -name stdout); do filter_seer -k $i --pos_beta | sed '1d' >> seer_filtered.txt; done
This runs all the seer jobs in parallel and then filters them. This takes about two hours.
~/seer/scripts/hits_to_fastq.pl -k seer_filtered.txt -b 10e-8 > seer_filtered.fastq
bowtie2 -p 8 --reorder -q -U seer_filtered.fastq --ignore-quals -D 24 -R 3 -N 0 -L 7 -i S,1,0.50 -x Streptococcus_pneumoniae_ATCC_700669_v1 -S seer_filtered.sam
~/seer/scripts/mapping_to_phandango.pl -s seer_filtered.sam -k seer_filtered.txt -m 20 > phandango.plot
- Converts significant kmers to a fastq for mapping
- Maps significant kmers to a reference genome
- Visualise as a Manhattan plot in phandango
Note the message from bowtie2 -> if kmers are unmapped another reference can easily be mapped to without repeating any of the long steps above. Therefore the pan-genome has been tested, not just the core
Copy the gff and plot file locally and drag onto http://jameshadfield.github.io/phandango/
The four biggest peaks respectively are pbp2x, orf15 (in the ICE), pbp1a, pbp2b All three penicillin binding proteins!
Bonus question for fans of the pneumococcus: why does the ICE appear to be associated with penicillin resistance?