- Probe-design for SARS-CoV-2
- Genome sequence is stored here: data\Wuhan-Hu-1_2019.gb
- For a blast search against this genome, we use: data\genomes\cov2\cov2.fasta
- For a blast against >2500 aligned cov-2 genomes, we use: data\genomes\cov2\cov2_aligned.fasta. Please see data\genomes\cov2\gisaid_acknowledgements.txt for the acknowledgments.
- Target regions are defined in the file data\Wuhan-Hu-1_2019__target_regions.tsv. This can be the entire genome, or sub-regions.
Probes targeting the regions with the highest NGS coverage are selected. All relevant data are in the folder data\coverage.
If new coverage files are added, they first have to be analyzed with the script workflows\probe_design\0a_analyze_NGS_coverage.py.
What is coverage in NGS? Coverage refers to the average number of times a single base is read during a sequencing run. In theory, a sequencing run will generate reads that sample a genome randomly and independently. However, these reads are not distributed equally across the entire genome; some bases are covered by fewer reads than the average coverage, some by more.
As a simplification, coverage can be assumed to be proportional to the abundance of the RNA species in the sample, and should hence help to pinpoint regions that are more likely to be present in a sample (as genomes/subgenomes/transcripts).
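To make the definition concrete, here is a minimal sketch of how per-base and average coverage can be computed from read positions. It is not taken from 0a_analyze_NGS_coverage.py; the function name, the toy genome length and the read intervals are assumptions made for illustration only.

```python
def per_base_coverage(genome_length, reads):
    """Count how many reads cover each base.

    reads: iterable of (start, end) pairs, 0-based start, exclusive end.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # The running sum of the difference array is the depth at each base.
    coverage = []
    depth = 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Toy example: a 10-base "genome" and three hypothetical reads.
coverage = per_base_coverage(10, [(0, 5), (3, 8), (4, 10)])
mean_coverage = sum(coverage) / len(coverage)
print(coverage)
print(mean_coverage)
```

Even in this toy case the per-base values differ from the average, which is the uneven distribution described above.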
Identified probes should not be specific against any other beta-coronavirus.
data\genomes\beta-corona: genomes are stored in a multi-fasta file.
beta-coronavirus | ID | Included |
---|---|---|
SARS | NC_004718.3 | [x] |
MERS | NC_019843.3 | [x] |
HKU1 | NC_006577.2 | [x] |
OC43 | NC_006213.1 | [x] |
NL63 | JX504050.1 | [x] |
229E | NC_002645.1 | [x] |
Identified probes should not be specific against other pathogens/viruses causing similar symptoms.
data\genomes\viruses: genomes are stored in separate fasta files.
Virus/pathogen | ID | Included |
---|---|---|
Mycobacterium tuberculosis | NC_000962.3 | [x] |
Human parainfluenza virus type 1 | NC_003461 | [x] |
Human parainfluenza virus type 2 | NC_003443.1 | [x] |
Human parainfluenza virus type 3 | NC_001796 | [x] |
Human parainfluenza virus type 4 | NC_021928.1 | [x] |
Respiratory syncytial virus | NC_001803 | [x] |
Human metapneumovirus | NC_039199 | [x] |
Mycoplasma pneumoniae | NZ_CP010546 | [x] |
Chlamydophila pneumoniae | NC_005043.1 | [x] |
Influenza A H3N2 | | [x] |
Influenza A H1N1 (such as A/Alaska/58/2017(H1N1)) | txid2043069 | [x] |
Influenza B (such as B/Yamagata/16/88 and B/Victoria/2/87) | txid416674 | [x] |
Influenza C | GCF_000856665.10 | [x] |
Influenza D | GCF_002867775.1 | [x] |
Rhinovirus/enterovirus | NC_038312.1 | [x] |
Identified probes should not be specific against the transcriptome of commonly used host organisms.
Only transcripts are kept from the blast hits, i.e. RefSeq annotations starting with 'NM_', 'XM_', 'NR_', 'XR_'.
Host | Identifier | Tested |
---|---|---|
Homo sapiens | Human G+T | [x] |
Mus musculus | Mouse G+T | [x] |
African Green monkey | 60711 | [x] |
Hamsters | 10026 | [x] |
Ferret | 9669 | [x] |
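
The transcript filter described above can be sketched as a simple prefix test on the subject accession. This is an illustration, not the code used in the workflow: it assumes that the subject id of a hit is a plain RefSeq accession, that the non-coding RNA prefix has the RefSeq form NR_, and the example accessions are only placeholders.

```python
# RefSeq accession prefixes for transcripts (mRNA and non-coding RNA,
# curated and predicted).
TRANSCRIPT_PREFIXES = ("NM_", "XM_", "NR_", "XR_")


def is_transcript(subject_id):
    """True if a RefSeq accession looks like a transcript."""
    return subject_id.startswith(TRANSCRIPT_PREFIXES)


# Illustrative subject ids: two transcripts, one chromosome, one contig.
hits = ["NM_001101.5", "NC_000007.14", "XR_002956789.1", "NT_187361.1"]
transcript_hits = [h for h in hits if is_transcript(h)]
print(transcript_hits)
```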
Takes the file with the target regions, extracts their sequences, and creates a separate fasta file for each target region type. If several sequences are defined for a target region type, the script will create a multi-fasta file, i.e. multiple fasta entries in one file.
In this script, you can define whether the fasta sequence should be the reverse complement.
Note: fasta file names, sequence ids or descriptions should NOT contain underscores. In Oligostan, underscores are used to split strings to extract some meta-data.
Function: probe_design\1_create_fasta_of_target_region.py
Input:
data\Wuhan-Hu-1_2019.gb
data\Wuhan-Hu-1_2019_potential_target_regions.tsv
Different region types can be defined as well if needed (specified by the name in the first column). Currently supported are cov-2, spike-glycoprotein, target-region and primer-region.
For a new region type, several target ranges (specified by their start and end position) can be specified.
If you define a new region type, you have to modify the Python script 1_create_fasta_of_target_region.py to take the new entry type into consideration.
Output:
data\Wuhan-Hu-1_2019.gb
data\Wuhan-Hu-1_2019_potential_target_regions.tsv
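The core of this step (cut a region out of the genome, optionally reverse-complement it, and write it as a fasta entry) can be sketched as follows. This is a simplified illustration and not the code of 1_create_fasta_of_target_region.py: the function names, the 1-based inclusive coordinates and the toy sequence are assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def region_to_fasta(genome, start, end, name, revcomp=False, width=60):
    """Return a fasta entry for positions start..end (1-based, inclusive)."""
    seq = genome[start - 1:end]
    if revcomp:
        seq = reverse_complement(seq)
    # Wrap the sequence at a fixed line width, as is common for fasta files.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


genome = "ATGCGTACGTTAGC"  # toy sequence, not the real genome
# The entry name avoids underscores, as required for Oligostan.
record = region_to_fasta(genome, 3, 8, "target-region-1", revcomp=True)
print(record)
```

Several such entries written to the same file give the multi-fasta file mentioned above.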
We use Oligostan for probe-design.
See also the provided document workflows\probe_design\Oligostan_documentation.doc for more details.
IMPORTANT: in the probe design script you have to update the path to the fasta sequence (line 8).
- Runs in R as a script, which designs probes against the fasta sequences created in step 1.
- Requires installation of some extra packages:
install.packages(c("ade4", "seqinr", "zoo"))
- Takes fasta files as an input.
- Multi-fasta is supported.
- File names, sequence ids or descriptions can NOT contain underscores.
- When specifying path names under Windows, \ has to be replaced by /.
Function: probe_design\oligostan.r
Input:
data\Wuhan-Hu-1_2019.gb
data\Wuhan-Hu-1_2019_potential_target_regions.tsv
Output:
Will create a folder data\fasta\Probes__region-name, where region-name is the name of the region given in step 1, e.g. cov-2 when the annotated region was the default region covering the entire genome.
The folder contains a fasta file and a summary text file for all probes identified (ALL), and for the probes passing the specified filters (FILT).
Perform a local blast against several reference genomes. Results are stored as blast hit files with output format 6.
Blast is performed with the task blastn-short, which is specifically optimized for shorter sequences (<50 bases). This changes the following parameters:
- word_size: 7
- gapopen: 5
- gapextend: 2
- reward: 1
- penalty: -3
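As an illustration of such a search, the sketch below assembles a blastn call from Python. It is not the code of 3_blast_local.py: the helper name and the file and database names are placeholders, while -task, -query, -db, -outfmt and -out are standard blastn options.

```python
import subprocess


def blastn_short_command(query_fasta, database, out_file):
    """Build a blastn call for short queries with tabular output (format 6)."""
    return [
        "blastn",
        "-task", "blastn-short",  # parameter preset for short sequences
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "6",           # tabular hit table
        "-out", out_file,
    ]


# Placeholder names, adapt to your local files and databases.
cmd = blastn_short_command("Probes__cov-2_ALL.fasta", "beta-corona", "beta-corona.txt")
print(" ".join(cmd))

# Uncomment to run; requires blast+ in the system path and a built database.
# subprocess.run(cmd, check=True)
```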
IMPORTANT: you have to build the databases on your local installation (see below).
You can download the blast+ suite from this link. Instructions for different operating systems are provided.
Important: blast commands have to be in system path as described here.
- Local databases can be built for alignment against any fasta sequence.
FASTA files for different viruses/pathogens/beta-coronaviruses are provided. The blast database has to be built for each new local installation.
Build commands are listed in data\genomes\info.md
The general command is makeblastdb -in beta-corona.fasta -dbtype nucl -title beta-corona -max_file_sz 500000 -parse_seqids
- The -parse_seqids option is required to keep the original sequence identifiers.
- The -max_file_sz option may help to avoid an error under Windows.
This has to be performed for each of the genomes where a blast is desired.
The build can provoke error messages under Windows. This can be resolved by setting the environment variable BLASTDB_LMDB_MAP_SIZE=1000000.
Steps to create or modify environment variables are summarized below:
- Search for "Environment Variables" and select "Edit environmental ...".
- Click the "New" button under the "User variable for ..." panel.
- Type the environment variable BLASTDB_LMDB_MAP_SIZE and set its value to 1000000.
- Click "OK" to close the prompts.
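If the databases are built from a Python script, the variable can also be set only for that call instead of system-wide. The following is a sketch under that assumption; the helper names are hypothetical, and the makeblastdb options are the ones from the general command above.

```python
import os
import subprocess


def makeblastdb_command(fasta_file, title):
    """Build the makeblastdb call for a nucleotide database."""
    return [
        "makeblastdb",
        "-in", fasta_file,
        "-dbtype", "nucl",
        "-title", title,
        "-max_file_sz", "500000",
        "-parse_seqids",  # keep the original sequence identifiers
    ]


def blast_env(map_size="1000000"):
    """Copy of the current environment with BLASTDB_LMDB_MAP_SIZE set."""
    env = dict(os.environ)
    env["BLASTDB_LMDB_MAP_SIZE"] = map_size
    return env


cmd = makeblastdb_command("beta-corona.fasta", "beta-corona")
env = blast_env()
print(" ".join(cmd))

# Uncomment to run; requires blast+ in the system path.
# subprocess.run(cmd, env=env, check=True)
```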
Function: probe_design\3_blast_local.py
Input:
- Fasta files with probe sequences:
data\fasta\Probes__cov-2\Probes__cov-2_ALL.fasta
- Local blastn databases:
data\genomes\
Output:
- Blast hit files for each database, stored in data\fasta\Probes__cov-2\blast\local
- Blast website
- Enter query sequence: upload the file with the probe sequences: data\fasta\Probes__cov-2\Probes__cov-2_ALL.fasta
- Choose Search Set: set Database to Standard databases (nr etc.) and under Organism add the taxid of the host organism
- Program Selection: Somewhat similar sequences
- Perform Blast.
- Download results as hit file (csv). We performed blast against
Results of all blast searches are combined, and a summary file with details of the probe design and the blast results is created.
The file data\blast_identifiers.json permits specifying an identifier for a blast search. If absent, the file name will be used.
Function: probe_design\5_blast_combine.py
Input:
- Fasta files from local search.
- Fasta files from web search.
Output:
- csv file with all information of probes.
- Some plots showing mismatch distribution.
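The combination step, including the fallback to the file name when no identifier is defined, can be sketched as below. This is not the code of 5_blast_combine.py: the layout assumed for the identifier mapping (file name without extension mapped to an identifier), the function names, and the choice of keeping the fewest mismatches per probe and search are all assumptions for illustration.

```python
import os


def search_identifier(hit_file, identifiers):
    """Configured identifier for a hit file, else its file name."""
    name = os.path.splitext(os.path.basename(hit_file))[0]
    return identifiers.get(name, name)


def combine_hits(hits_by_file, identifiers):
    """Summarize hits as {probe_id: {identifier: fewest mismatches}}.

    hits_by_file: {hit_file: [(probe_id, mismatches), ...]}
    """
    summary = {}
    for hit_file, hits in hits_by_file.items():
        ident = search_identifier(hit_file, identifiers)
        for probe_id, mismatches in hits:
            best = summary.setdefault(probe_id, {})
            best[ident] = min(mismatches, best.get(ident, mismatches))
    return summary


# Hypothetical identifiers and hits from two searches.
identifiers = {"beta-corona": "betaCoV"}
hits_by_file = {
    "blast/local/beta-corona.txt": [("probe-1", 3), ("probe-1", 1), ("probe-2", 0)],
    "blast/local/viruses.txt": [("probe-1", 4)],
}
summary = combine_hits(hits_by_file, identifiers)
print(summary)
```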
For each probe, we calculate the NGS coverage (summarized with the script workflows\probe_design\0a_analyze_NGS_coverage.py).
Function: workflows\probe_design\6_coverage.py
Input:
- csv file with all information of probes and blast results.
Output:
- csv file with all information of probes, blast results, and probe coverage
- Plots showing coverage of each probe.
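A simple way to attach a coverage value to a probe is to average the per-base coverage over the positions the probe spans. The sketch below assumes this mean-based summary and 1-based inclusive probe positions; whether 6_coverage.py uses the mean or another summary is not stated here, and the numbers are made up.

```python
def probe_mean_coverage(coverage, start, end):
    """Mean coverage over a probe spanning start..end (1-based, inclusive)."""
    window = coverage[start - 1:end]
    return sum(window) / len(window)


# Made-up per-base coverage and probe positions.
coverage = [10, 12, 15, 20, 18, 5, 4, 4]
probes = {"probe-1": (1, 4), "probe-2": (5, 8)}
probe_coverage = {
    name: probe_mean_coverage(coverage, start, end)
    for name, (start, end) in probes.items()
}
print(probe_coverage)
```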
Results are provided as a csv file combining all information about probe design, blast analysis, and probe coverage.
Probes are queried for
- GC content
- Passing sequence filters
- Highest NGS coverage
- Maximum overlap with different cov-2 genomes
- Minimum overlap with other beta-coronaviruses and with other viruses/pathogens causing similar symptoms.
Function: probe_design\7_probes_query.py
Input:
- Summary csv file.
Output:
- Fasta file with queried sequence.
- Plot with distribution of probes along sequence.
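Such a query can be sketched as a filter followed by a sort. The example below is an illustration, not the code of 7_probes_query.py: the field names, the GC thresholds and the probe records are hypothetical, and the blast-based criteria are left out for brevity.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def query_probes(probes, gc_min=0.4, gc_max=0.6, min_coverage=0.0):
    """Keep probes passing all criteria, sorted by decreasing coverage."""
    selected = [
        p for p in probes
        if p["passes_filters"]
        and gc_min <= gc_content(p["seq"]) <= gc_max
        and p["coverage"] >= min_coverage
    ]
    return sorted(selected, key=lambda p: p["coverage"], reverse=True)


# Hypothetical probe records (shortened sequences for readability).
probes = [
    {"name": "probe-1", "seq": "ATGCGCATGA", "passes_filters": True, "coverage": 120.0},
    {"name": "probe-2", "seq": "GGGCGCGCCC", "passes_filters": True, "coverage": 300.0},
    {"name": "probe-3", "seq": "ATGCGATCGA", "passes_filters": True, "coverage": 250.0},
    {"name": "probe-4", "seq": "ATGCGATCGA", "passes_filters": False, "coverage": 400.0},
]
selected_names = [p["name"] for p in query_probes(probes)]
print(selected_names)
```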
Installed under C:\Program Files\NCBI\blast-2.10.0+
Path was automatically updated with the bin directory: C:\Program Files\NCBI\blast-2.10.0+\bin
Obtained when downloading blastn internet searches as a hit table. BLASTn tabular output format 6.
- qseqid: query (e.g., unknown gene) sequence id
- sseqid: subject (e.g., reference genome) sequence id
- pident: percentage of identical matches
- length: alignment length (sequence overlap)
- mismatch: number of mismatches
- gapopen: number of gap openings
- qstart: start of alignment in query
- qend: end of alignment in query
- sstart: start of alignment in subject
- send: end of alignment in subject
- evalue: expect value
- bitscore: bit score
Note: additional fields could be added. From blastn -help:
- qseq stands for aligned part of query sequence
- sseq stands for aligned part of subject sequence
Note that qseq/sseq may contain gaps (- characters).
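A hit line with the twelve default columns can be read into a dictionary as in the sketch below. It assumes tab-separated values, as written by a local search with output format 6 (a hit table downloaded as csv would need a comma as separator), and the example line is made up.

```python
# Default columns of BLAST tabular output (format 6), in order.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_hit(line, sep="\t"):
    """Parse one hit line into a dict with typed values."""
    hit = dict(zip(FIELDS, line.rstrip("\n").split(sep)))
    for key in INT_FIELDS:
        hit[key] = int(hit[key])
    for key in FLOAT_FIELDS:
        hit[key] = float(hit[key])
    return hit


# Made-up hit of a 20-nt probe with one mismatch.
line = "probe-1\tNC_004718.3\t95.000\t20\t1\t0\t1\t20\t1500\t1519\t0.002\t32.2"
hit = parse_hit(line)
print(hit["sseqid"], hit["mismatch"], hit["pident"])
```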