In this example we identify S. lycopersicum-specific k-mers of length 32 bases that are present in at least 2 of 5 S. lycopersicum chloroplast genomes and in none of the 1714 given nontarget taxa chloroplast genomes. We also filter out k-mers that occur with a frequency of at least 10 in whole-genome sequencing raw reads of S. tuberosum or S. pimpinellifolium. The sequencing reads are downloaded from the NCBI SRA database.
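The selection rules above can be illustrated with a toy sketch. It is not the PlantTaxSeeker implementation: it uses k = 4 instead of 32, a read-frequency cutoff of 2 instead of 10, made-up sequences, and plain shell and awk instead of GenomeTester4. All variable and function names are hypothetical, only one strand is considered, and the reads are treated as a single pooled sample, which is an assumption.

```shell
#!/bin/sh
# Toy parameters (the real example uses k=32, at least 2 of 5 targets, cutoff 10).
k=4
min_targets=2
read_cutoff=2

# Hypothetical sequences, space-separated.
targets="ACGTACGG TTACGTAA GGGGCCCC"
nontargets="CACGTC"
reads="TACGTACG CTACGA"

# Print every k-mer occurrence of sequence $1 with length $2, one per line.
kmers_all() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
    }'
}

# Print the distinct k-mers of sequence $1 with length $2.
kmers() {
    kmers_all "$1" "$2" | LC_ALL=C sort -u
}

# Label each k-mer with its origin (T = target, N = nontarget, R = read),
# then keep k-mers that are in enough targets, in no nontarget,
# and below the frequency cutoff in the reads.
result=$(
    {
        for t in $targets; do kmers "$t" "$k" | sed 's/^/T /'; done
        for n in $nontargets; do kmers "$n" "$k" | sed 's/^/N /'; done
        for r in $reads; do kmers_all "$r" "$k" | sed 's/^/R /'; done
    } | awk -v m="$min_targets" -v c="$read_cutoff" '
        $1 == "T" { in_targets[$2]++ }   # number of target genomes with this k-mer
        $1 == "N" { in_nontarget[$2] = 1 }
        $1 == "R" { read_freq[$2]++ }    # total occurrences in the reads
        END {
            for (x in in_targets)
                if (in_targets[x] >= m && !(x in in_nontarget) && (read_freq[x] + 0) < c)
                    print x
        }
    ' | LC_ALL=C sort
)
printf '%s\n' "$result"
```

In this toy input ACGT, CGTA and TACG are shared by two target sequences; ACGT is dropped because it also occurs in the nontarget sequence, and TACG is dropped because it occurs three times in the reads, so only CGTA remains.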
Before you can start, you need to:
- a) download the PlantTaxSeeker repository, containing bins, scripts and readme files, from GitHub
- b) move to the folder "example"
- c) download the S. lycopersicum and nontarget taxa chloroplast genome sequences as FASTA-format files, and the sequencing data of 4 samples as FASTQ-format files, from HERE.
Make sure you have enough space for storing these files. The FASTA files used in this example are ca 255 MB. The FASTQ files are ca 157 GB unpacked. The result files containing the k-mer lists are ca 100 KB.
The FASTA files contain 5 S. lycopersicum and 1714 nontarget taxa chloroplast genome sequences. The FASTQ files contain whole-genome sequencing raw reads of 4 samples of S. tuberosum or S. pimpinellifolium (NCBI SRA accession numbers ERR418080, SRR1608100, SRR2069941 and SRR1481624).
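Free disk space can be checked before starting the download. The following is a minimal sketch that is not part of the repository; the function name `has_enough_space` is hypothetical, and the 160 GB threshold is an assumed round figure derived from the file sizes given above (packed downloads may temporarily need more).

```shell
#!/bin/sh
# Succeeds if the filesystem holding directory $2 (default: current directory)
# has at least $1 GB of free space.
has_enough_space() {
    required_kb=$(( $1 * 1024 * 1024 ))
    # With -P and -k, df prints one line per filesystem in 1024-byte units;
    # column 4 of the second line is the available space.
    available_kb=$(df -Pk "${2:-.}" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}

# Assumed threshold: ca 157 GB FASTQ + ca 255 MB FASTA + small result files.
if has_enough_space 160; then
    echo "Enough free space for the example data."
else
    echo "Less than 160 GB free here; consider another location."
fi
```

Run it from the directory where the files will be stored, since the check applies to the filesystem of the current directory.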
Use the following command lines to perform the example analysis ("bash test.sh" downloads the FASTA and FASTQ files, moves the bins and scripts needed for the analysis to the folder "example" and executes the scripts):
git clone https://github.com/bioinfo-ut/PlantTaxSeeker/
cd PlantTaxSeeker/example/
bash test.sh
The 4 result files contain the lists of S. lycopersicum-specific k-mers before ("Specific_kmers_32.txt", "Specific_kmers_32.list") and after additional filtering. The k-mer lists are given as binary files (which enable additional operations using GenomeTester4 programs) and also as human-readable TXT files.