This directory contains a Makefile that builds a sourmash database
for the genome in GCF_000018865.1_ASM1886v1_genomic.fna.gz and the
proteome in GCF_000018865.1_ASM1886v1_protein.faa.gz.

Run make to run the pipeline. (You'll need sourmash v4.4.0 installed.)

The Makefile does the following:

First, it uses the script ../genbank-to-fromfile.py to scan the
genomes and produce a summary file, build.csv, that contains names
and source genomes for sourmash signatures. (Note that in this case
there is only one genome and one proteome.) Names for the genomes are
taken from the NCBI assembly_summary.txt file that is distributed
with RefSeq and GenBank assemblies.
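
To make the shape of build.csv concrete, here is a minimal Python sketch that parses one example row. The column names (name, genome_filename, protein_filename) are an assumption based on the input format that sourmash sketch fromfile expects; the file written by ../genbank-to-fromfile.py may include additional columns. The name value is a placeholder, not the real entry from assembly_summary.txt.

```python
import csv
import io

# Hypothetical contents of build.csv. The column names are an assumption,
# and the "name" value is a placeholder rather than the real NCBI name.
EXAMPLE_BUILD_CSV = """\
name,genome_filename,protein_filename
GCF_000018865.1 placeholder name,GCF_000018865.1_ASM1886v1_genomic.fna.gz,GCF_000018865.1_ASM1886v1_protein.faa.gz
"""


def read_build_rows(text):
    """Parse build.csv text into a list of dicts, one per signature name."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_build_rows(EXAMPLE_BUILD_CSV)
for row in rows:
    # Each row ties one signature name to its genome and proteome files.
    print(row["name"], "->", row["genome_filename"], row["protein_filename"])
```

To inspect the real file instead, pass the text of build.csv to read_build_rows.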

Next, the Makefile runs
sourmash sketch fromfile build.csv -p dna -p protein -o all.zip
to sketch all of the genomes in build.csv. The parameter string
-p dna tells sourmash to construct DNA sketches using the default
parameters, and -p protein tells sourmash to construct protein
sketches, too. The names for the output signatures are taken from
build.csv.
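
If you would rather drive the sketching step from a script than from make, the same command line can be assembled in Python. This is only a sketch: sketch_command is a hypothetical helper, not part of sourmash or of this Makefile, and it simply rebuilds the command shown above with one -p flag per parameter string.

```python
import shlex


def sketch_command(csv_path="build.csv", out_path="all.zip",
                   params=("dna", "protein")):
    """Build the argv list for 'sourmash sketch fromfile' (hypothetical helper)."""
    argv = ["sourmash", "sketch", "fromfile", csv_path]
    for param in params:
        # One -p flag per parameter string, e.g. "dna" or "protein".
        argv += ["-p", param]
    argv += ["-o", out_path]
    return argv


cmd = sketch_command()
print(shlex.join(cmd))

# To actually run it (needs sourmash on PATH and build.csv present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing params=("dna",) would build the command for DNA sketches only.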

You can run sourmash sig summarize all.zip to get a summary of the
contents of the zip file, or sourmash sig describe all.zip to get a
listing of all the signatures.
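
Because all.zip is an ordinary zip archive, you can also peek at its raw members without sourmash installed. The sketch below uses only Python's standard zipfile module. list_zip_members is a hypothetical helper, and the member names you see depend on how your sourmash version lays out the archive. This is not a substitute for the sourmash sig commands, which interpret the signatures themselves.

```python
import os
import zipfile


def list_zip_members(path):
    """Return the names of all files stored in the zip archive at path."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Only list members if the pipeline has already produced all.zip.
if os.path.exists("all.zip"):
    for member in list_zip_members("all.zip"):
        print(member)
```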