Serratus SQL Database Management

serratus-aurora is the main instance. It is an Aurora Serverless instance and can be scaled up to meet ingestion demand during batch uploads. This instance "sleeps" at 0 capacity units while inactive.

Periodically, a publicly accessible, always-on, Aurora Provisioned instance is created from a snapshot of serratus-aurora. See Creating a Public Instance.

Latest public instance connection info

Endpoint: serratus-aurora-20210225.cluster-ro-ccz9y6yshbls.us-east-1.rds.amazonaws.com
Database: summary
Username: public_reader
Password: serratus

Tables/Views

Serratus summary info. Created by serratus-summary-uploader.
- Nucleotide
  - nsra
  - nfamily
  - nsequence
- Protein
  - psra
  - pfamily
  - protein
  - psequence
- RdRP
  - rsra
  - rphylum
  - rfamily
  - rdrp
BioSample/SRA metadata. Created by biosample-sql and sraruninfo-sql.
- biosample: all BioSamples (as of 2021/02/23) with geospatial data extracted if available
- biosample_geocode: internal table for mapping geospatial text values to x/y coordinate values
- biosample_geo_coordinates: all from biosample with x/y coordinates combined with biosample_geocode
- srarun: metadata for all SRA run accessions in the Serratus search space
- srarun_geo_coordinates: all from srarun with x/y coordinates from biosample_geo_coordinates
rdrp_pos: uploaded from Artem's rdrp_pos.csv

Creating Users

serratus is the main user. Password stored in Secrets Manager. This user is used for table curation and creating the other users below.

-- revoke default access for all users
REVOKE ALL ON SCHEMA public FROM public;

-- read-only group
CREATE ROLE viewer NOSUPERUSER NOINHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
GRANT CONNECT ON DATABASE summary TO viewer;
GRANT USAGE ON SCHEMA public TO viewer;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO viewer; -- re-run any time a table/view is replaced

-- users
CREATE USER public_reader WITH PASSWORD 'serratus' IN ROLE viewer;
CREATE USER tantalus WITH PASSWORD 'serratus' IN ROLE viewer;
CREATE USER web_api WITH PASSWORD 'serratus' IN ROLE viewer;

Creating Views

CREATE MATERIALIZED VIEW biosample_geo_coordinates AS
        SELECT biosample_id,
            coordinate_x,
            coordinate_y,
            b.geo_text_extracted AS from_text
        FROM biosample b
        INNER JOIN biosample_geocode bgeo
            ON (b.geo_text_extracted = bgeo.geo_text_extracted)
        WHERE coordinate_x IS NOT NULL
    UNION ALL
        SELECT biosample_id,
            geo_coordinate_x AS coordinate_x,
            geo_coordinate_y AS coordinate_y,
            NULL AS FROM_text
        FROM biosample
        WHERE geo_coordinate_x IS NOT NULL

CREATE MATERIALIZED VIEW srarun_geo_coordinates AS
    SELECT run AS sra_id,
        bio_sample AS biosample_id,
        release_date,
        coordinate_x, coordinate_y, from_text
    FROM srarun
    INNER JOIN biosample_geo_coordinates bgeo
        ON (srarun.bio_sample = bgeo.biosample_id)

CREATE MATERIALIZED VIEW nfamily_counts AS
    SELECT family_name, score, percent_identity, COUNT(*)
    FROM nfamily
    GROUP BY family_name, score, percent_identity

When referenced tables are replaced, these views should be dropped and recreated. REFRESH MATERIALIZED VIEW doesn't seem to work in this case.

Sometimes, biosample_geocode may be uploaded with text data type for numeric columns. If that happens, run this before creating views:

ALTER TABLE biosample_geocode
ALTER COLUMN coordinate_x TYPE DOUBLE PRECISION,
ALTER COLUMN coordinate_y TYPE DOUBLE PRECISION

Creating Indexes

Indexes are used for optimizing queries.

CREATE INDEX nfamily_sra_id_index ON nfamily (sra_id);
CREATE INDEX nfamily_family_name_index ON nfamily (family_name);
CREATE INDEX nfamily_score_index ON nfamily (score);
CREATE INDEX nfamily_percent_identity_index ON nfamily (percent_identity);

CREATE INDEX nsequence_sra_id_index ON nsequence (sra_id);
CREATE INDEX nsequence_genbank_id_index ON nsequence (genbank_id);
CREATE INDEX nsequence_score_index ON nsequence (score);
CREATE INDEX nsequence_percent_identity_index ON nsequence (percent_identity);
CREATE INDEX nsequence_genbank_id_score_index ON nsequence (genbank_id, score);

CREATE INDEX srarun_run_index ON srarun (run);
CREATE INDEX srarun_bio_sample_index ON srarun (bio_sample);

CREATE INDEX biosample_geo_coordinates_biosample_id_index ON biosample_geo_coordinates (biosample_id);
CREATE INDEX srarun_geo_coordinates_sra_id_index ON srarun_geo_coordinates (sra_id);

CREATE INDEX rdrp_pos_sra_id_index ON rdrp_pos (sra_id);

Creating a Public Instance

export ORIGINAL_CLUSTER_ID="serratus-aurora"
export SNAPSHOT_ID="serratus-aurora-20210223"
export RESTORE_CLUSTER_ID="serratus-aurora-20210223"
export RESTORE_INSTANCE_ID="serratus-aurora-20210223-main"

# create snapshot
aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier $SNAPSHOT_ID \
    --db-cluster-identifier $ORIGINAL_CLUSTER_ID

# restore from snapshot
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier $RESTORE_CLUSTER_ID \
    --snapshot-identifier $SNAPSHOT_ID \
    --db-subnet-group-name default-vpc-025ef5ccc841b5b86 \
    --engine-mode provisioned \
    --engine aurora-postgresql \
    --vpc-security-group-ids sg-07e2ad8dccb1d4ba6 \
    --engine-version 10.12
aws rds create-db-instance \
    --db-instance-identifier $RESTORE_INSTANCE_ID \
    --db-instance-class db.t3.medium \
    --engine aurora-postgresql \
    --db-cluster-identifier $RESTORE_CLUSTER_ID \
    --publicly-accessible

Known References

Readers
Writers

Overview

Architecture and Pipeline

Raw Data

Serratus Explorer (serratus.io)

Usage

Running Serratus
- Serratus-Lite, local
Finding Novel Viruses (tutorials)
Papers using Serratus
Containers
Summarizer usage
Cloud Budgeting
Serratus SQL Database Management
Data Policy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly