DIAMOND nr

DIAMOND `nr` search (AWS EC2)

To quickly test a collection of sequences against the BLAST nr (non-redundant protein) database, we set up an AWS EC2 instance with a DIAMOND database.

The objective is to test whether a sequence is already known.

1. Launch EC2 instance

Launch an EC2 instance via the AWS Console with the parameters below. Use a c5n.xlarge for the initial, network-bound download, then switch to an r5d.4xlarge for building the index and running the search.

EC2 Set-up Parameters

OS: Amazon Linux 2 AMI (HVM) x86
ami: ami-0be2609ba883822ec
instance: c5n.xlarge // r5d.4xlarge
description: "c5n.xlarge (- ECUs, 4 vCPUs, 3.4 GHz, -, 10.5 GiB memory, EBS only)"
description: "r5d.4xlarge (16 vCPUs, 128 GiB memory, 2 x 300 GB NVMe SSD)"
storage: 450 GiB SSD (gp3)
encryption: false
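
For scripted set-ups, the same parameters can be passed to the AWS CLI instead of the Console. The following is a minimal sketch, not the procedure used here: `KEY_NAME` and `SECURITY_GROUP` are hypothetical placeholders, the root device name `/dev/xvda` is an assumption for this AMI, and AMI IDs are region-specific. By default the function only prints the command it would run.

```shell
# Sketch: launch parameters from the table above, passed to `aws ec2 run-instances`.
AMI='ami-0be2609ba883822ec'
KEY_NAME='my-key'                 # hypothetical, replace with your key pair
SECURITY_GROUP='sg-REPLACE_ME'    # hypothetical, replace with your security group

# 450 GiB gp3 volume, unencrypted; /dev/xvda is an assumed root device name
BLOCK_DEVICES='[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":450,"VolumeType":"gp3","Encrypted":false}}]'

launch_instance() {
  # $1 = instance type (c5n.xlarge for the download, r5d.4xlarge for indexing)
  set -- aws ec2 run-instances \
    --image-id "$AMI" \
    --instance-type "$1" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SECURITY_GROUP" \
    --block-device-mappings "$BLOCK_DEVICES"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print for inspection only; shell quoting is lost in this output
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

launch_instance c5n.xlarge
```

Run with `DRY_RUN=0` to actually issue the call. Changing the instance type later requires stopping the instance first.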

2. Install `DIAMOND`

# From base amazon linux 2
sudo yum install -y docker git

# From `serratus-align` container
mkdir diamond; cd diamond

# Install diamond2 
# Libraries for building diamond2
sudo yum -y install git gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel make

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo cp ./diamond /usr/bin/diamond
sudo chmod 755 /usr/bin/diamond
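
Before moving on, it can help to confirm that the tools used in the following steps are on the `PATH`. This is a small sketch; the tool list is taken from the commands on this page, and `pigz` in particular is not part of the install commands above, so it may be reported as missing.

```shell
# Return 0 if the named tool is on PATH, non-zero otherwise
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in diamond wget unzip pigz; do
  if have_tool "$tool"; then
    echo "ok:      $tool -> $(command -v "$tool")"
  else
    echo "missing: $tool"
  fi
done

# Print the DIAMOND build that was just installed, if present
if have_tool diamond; then
  diamond version
fi
```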

3. Download `nr` database

As of 210721, the uncompressed nr database (as downloaded below) is 192 GB, so this step may take some time.
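
Given these sizes, a quick free-space check before starting the download can save a failed run. The sketch below assumes the data lives under `$HOME`; the 400 GiB threshold is an arbitrary round figure chosen to cover both the 192 GB FASTA and the 198 GB DIAMOND database built in step 4.

```shell
# Succeed only if the filesystem holding directory $1 has at least $2 GiB free
has_free_gib() {
  avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# 400 is an assumed threshold, adjust to the current size of nr
if has_free_gib "$HOME" 400; then
  echo "enough free space under $HOME"
else
  echo "less than 400 GiB free under $HOME" >&2
fi
```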

# DOWNLOAD BLAST DB - NR
# Requires pigz; `gzip -dc` is a slower drop-in replacement
mkdir -p ~/nr; cd ~/nr
wget -O - ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
 | pigz -d - \
 > nr.fa
 
# And taxonomy data
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
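
The next step needs four of these files: `nr.fa`, `prot.accession2taxid.gz`, and the `nodes.dmp` and `names.dmp` extracted from `taxdmp.zip`. A minimal sketch for checking that they exist and are non-empty (the function name is illustrative; it does not validate file contents):

```shell
# Report any makedb input that is missing or empty in directory $1;
# returns non-zero if at least one file fails the check
check_makedb_inputs() {
  status=0
  for f in nr.fa prot.accession2taxid.gz nodes.dmp names.dmp; do
    if [ ! -s "$1/$f" ]; then
      echo "missing or empty: $1/$f" >&2
      status=1
    fi
  done
  return "$status"
}

if check_makedb_inputs "$HOME/nr"; then
  echo "all makedb inputs present"
fi
```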

4. Create DIAMOND `nr` database

For this step, switch to an r5d.4xlarge instance, which provides more threads and more memory.

# Switch to r5d.4xlarge instance with 450 GB block storage
# Make diamond nr db
# md5sum: 7158f0b4dfddc6f8e3c9d349a09e4f23
# size: 198GB

cd ~/nr
diamond makedb -p 14 --in nr.fa \
  --taxonmap prot.accession2taxid.gz \
  --taxonnodes nodes.dmp \
  --taxonnames names.dmp \
  -d nr
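
Once `makedb` finishes, the result can be sanity-checked. The sketch below assumes the md5sum recorded in the comment above refers to `nr.dmnd`; that value is only valid for that particular nr snapshot, so a mismatch on a newer download is expected. `diamond dbinfo` prints summary information about the database file.

```shell
# Compare the MD5 digest of file $1 with the expected hex string $2
md5_matches() {
  actual=$(md5sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
}

DB="$HOME/nr/nr.dmnd"
EXPECTED='7158f0b4dfddc6f8e3c9d349a09e4f23'   # from the comment above; assumed to be nr.dmnd

if [ -f "$DB" ]; then
  if md5_matches "$DB" "$EXPECTED"; then
    echo "checksum matches the recorded build"
  else
    echo "checksum differs (expected for a newer nr snapshot)" >&2
  fi
  diamond dbinfo -d "$DB"
fi
```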

5. Run DIAMOND search

INFA='epsy_120_diamond.fa'
OUT='epsy_120_diamond'

# Diamond blastp alignment
time diamond blastp \
  -q "$INFA" \
  -d ~/nr/nr.dmnd \
  --masking 0 \
  --unal 1 \
  --mid-sensitive -l 1 \
  -p 14 -k 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue \
       full_qseq \
  > "${OUT}.pro"

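Because the search is run with `--unal 1`, queries without a hit also appear in the output, which makes it straightforward to split the input into known and unknown sequences. The sketch below assumes the column order of the command above (`sseqid` is column 6) and that unaligned queries carry `*` in that column; the function names are illustrative.

```shell
# List query IDs from a DIAMOND tabular file ($1), split by whether they hit nr.
# Column 6 is sseqid in the field list used above; "*" is assumed to mark no hit.
known_queries() {
  awk -F '\t' '$6 != "*" { print $1 }' "$1" | sort -u
}
unknown_queries() {
  awk -F '\t' '$6 == "*" { print $1 }' "$1" | sort -u
}

OUT='epsy_120_diamond'
if [ -f "$OUT.pro" ]; then
  echo "known:   $(known_queries "$OUT.pro" | wc -l)"
  echo "unknown: $(unknown_queries "$OUT.pro" | wc -l)"
fi
```

The IDs from `unknown_queries` are the candidates for sequences not yet represented in nr at this sensitivity setting.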
