Gene copy number prediction from k-mer frequencies
Pull Github repository (or download only GeneToKmer.py and KmerToCN.py). Also download glistmaker, glistquery, glistcompare, gmer_counter tools from GenomeTester4 toolkit (https://github.com/bioinfo-ut/GenomeTester4 - see the manual for GenomeTester4 and FastGT)
GeneToCN is implemented on Python 3 and requires following packages: biopython, pandas, matplotlib.
We suggest using anaconda and create a new virtual enivironment
conda create --name <env_name> biopython pandas matplotlib
Outputs a binary file <output_prefix>_.list, the suggested k-mer length for human genome is 25
glistmaker <reference_sequences> -w <k-mer length> -o <output_prefix>
Activate the conda environment, if not already activated
conda activate <env_name>
python GeneToKmer.py <region_file> <chr_reference_sequence>.fa <reference_kmer_list> -o <location_for_output_files> -i -gt <location_for_GenomeTester_source_code>
Needs to be run separately for each chromosome, requires the fasta sequence for only one chromosome. The region (gene) names have to be unique.
Example of the region file (fields separated by a space):
AMY1 G 1:103655290-103664554 1:103687415-103696680 1:103750406-103758690
AMY2A G 1:103616811-103625780
AMY2B G 1:103553815-103579534
Option 1 (faster) - use premade k-mer databases for human genome (GRCh38):
python extract_flanking_kmers.py <region_file>
To change the default option for the number of flanking k-mers, use -n <nr_of_flanking_kmers> (default: n=2000).
The script takes a subset of at least n/2 reference k-mers from both sides of each gene (a total of n kmers) using the k-mers in "Kmer_db/Ref_kmers" for human genome. One k-mer database is created for each gene (may contain the same set of k-mers for closely located genes).
Option 2:
Add flanking regions to the region file to create k-mer databases when running GeneToKmer.py
Flanking_AMY F 1:103500000-103550000 1:103760000-103800000
Cat together all the k-mer database files to be able to estimate all the copy numbers with the same run (the order is not important).
Create a file containing the original k-mer db locations for each gene. Genes using one flanking region should be separated with an empty row and flanking db always before the gene (or genes) for which it should be used. Fields separated by a space.
An example:
Flanking_AMY Flanking_AMY_NIPT_103M_kmers_cleaned.db
AMY1 AMY1_kmers.db
AMY2A AMY2A_kmers.db
AMY2B AMY/AMY2B_kmers.db
Flanking_NPY4R NPY4R/Flanking_NPY4R_NIPT_46M47M_kmers_cleaned.db
NPY4R NPY4R/NPY4R_kmers.db
The program will create new directories (counts/ counter_errors/ plots/) into the specified output directory containing the k-mer counts, possible errors from the gmer_counter and plots for visual inspection of the normalized k-mer frequencies in the gene regions.
The copy number estimations are written in the outpur direction into file <output_prefix>_cn.txt
Use -i to print out some information while running
python KmerToCN.py -db <kmer_db> -kp <original_db_location_file> -s <fastq files> -d <output_directory> -o <output_prefix> -gm <location_for_gmer_counter> -r <region_file>
Some K-mer databases and region files are included in the folder Kmer_db.
The k-mers in Ref_kmers folder can be used for choosing reference k-mers from flanking regions.
Gene_kmers folder contains k-mers used for the validation of the method.
All k-mer databases are text files formatted in a similar way with one k-mer per row:
chr:position nr_of_kmers kmers
An example:
1:103656318 1 CTTCAAAGCAAAATGAAGCTCTTTT
1:103656319 1 TTCAAAGCAAAATGAAGCTCTTTTG
1:103656320 1 TCAAAGCAAAATGAAGCTCTTTTGG
1:103656321 1 CAAAGCAAAATGAAGCTCTTTTGGT
1:103656322 1 AAAGCAAAATGAAGCTCTTTTGGTT
GeneToCN method as well as the results of the validation are described in the paper.
Pajuste, FD., Remm, M. GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads. Sci Rep 13, 17765 (2023). https://doi.org/10.1038/s41598-023-44636-z
Fanny-Dhelia Pajuste, University of Tartu