Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make a sourmash sketch fromfile to support large scale sketching. #1671

Closed
ctb opened this issue Jul 14, 2021 · 50 comments
Closed

make a sourmash sketch fromfile to support large scale sketching. #1671

ctb opened this issue Jul 14, 2021 · 50 comments

Comments

@ctb
Copy link
Contributor

ctb commented Jul 14, 2021

ref #1652, you want to be able to sketch certain ksizes/moltypes for certain identifiers only, but on a large scale. how can we best do this?

@ctb
Copy link
Contributor Author

ctb commented Jul 14, 2021

this may actually be a real world use case for having the full genomes be available as "signatures" per #1647

@ctb
Copy link
Contributor Author

ctb commented Jul 14, 2021

a much simpler version is to just have the 'nomatch' file output by mom-extract-sigs be a full manifest...

@bluegenes
Copy link
Contributor

bluegenes commented Jul 15, 2021

a much simpler version is to just have the 'nomatch' file output by mom-extract-sigs be a full manifest...

One way this could work if we enabled it in sourmash -

  1. New utility:sourmash sig check (with picklist functionality) = similar to extract, but don't actually extract/load any signatures, just check for the right signatures and output a missing manifest/picklist if they don't exist.

  2. add --picklist to sketch, to enable building just missing sigs (and ultimately/ideally adding them to an existing zip file :)

With picklist in sketch, sketch could also internally check for the sig existence before sketching. Seems handy, at least as an option, but not sure this should be default. But enabling check as a separate step allows us to get around any issues where the genome needs to be downloaded prior to sketching.

@ctb
Copy link
Contributor Author

ctb commented Jul 15, 2021

I like sourmash sig check or something like it!

Adding --picklist to sketch is much harder, because at least some of the picklist values (md5sum, for example) don't exist until after the signature is calculated. So we'd have to special case picklists to deal with that. Not impossible, just doesn't feel quite right.

Backing up a bit, the functionality we need looks something like this:

(1) given some input specification, one or more of -

  • a select statement?
  • a set of identifiers?
  • a set of identifiers and a set of sourmash sketch param strings,

(2) then inspect a manifest and figure out what remains to be calculated,

(3) and then go do it.


for (1), I think we can probably hack it together fairly easily.

for (2), there's some interesting interaction to work out, but it should be easy to say "you want X? X isn't in the manifest."

for (3), we need to have some way to connect identifiers to files, and to make things worse, we need to connect the same identifier to at least two different files - DNA and protein.


so for now it seems like there are several steps to work out 😄

@bluegenes
Copy link
Contributor

bluegenes commented Dec 3, 2021

Returning to this because I really would like a way to name sigs when building many at once, e.g. from a directory or a file list. I guess we could add something like --from-csv to sketch or sourmash sig rename, but enabling naming while sketching via picklist would be far cleaner.

Perhaps what we want is a different standard picklist format, e.g. sketchlist with standard column names for sketching. Columns I would envision allowing:

  • name
  • source (or input)
  • destination (or output)
  • outdir
  • license? Though we currently only support CC0, right?
  • all the param string info (ksize, scaled, moltype, etc or param_string)

We could unite this with --from-file handling for sketching, if desired -- e.g. if the most basic type of sketchlist is a file with a single column of filenames, then just check for the right column names first, if none, fall back to current --from-file behavior). But maybe this overcomplicates.

While at it, I would imagine we would also want to add a --manifest output file option to sketch, so we could output all sketch info in a format that can be used as a picklist downstream. As an added bonus, this also would create a nice output file a snakemake sketching --sketchlist rule.

I guess something I'm implicitly stating here is that I care far less about being able to select things from a sketchlist - I would be happy to have individual lines for each sketch I'd like, even if that makes the file slightly less human-friendly (though param_string column would make that easier). What I think we really need is a way to specify sketches properly via this sort of file.

use case for ref: Building large databases from many signatures makes for slow and/or intractable snakemake DAGs. I can use file lists for sketching and zipping, but I can't think of a way to (batch) name the sigs properly via sourmash cli. Am I forgetting something obvious!?

edit: #1315 could be a way to make this work

@ctb
Copy link
Contributor Author

ctb commented Dec 4, 2021

I haven't thought about this for a while, so apologies if this is obviously wrong - but, over in #1647, I made the comment that

A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.

I wonder if we could some thing where we unify this idea with selection/sketching and do something where when FASTA sequences are stored in signatures or directories or zipfiles, we can calculate the MinHash sketches and output them somewhere.

At first glance it seems outside the current sourmash architecture but ...


separately, I worry about integrating large-batch sketch calculation into sourmash's Python implementation because it would not be parallelized (due to Python limitations).

@bluegenes
Copy link
Contributor

bluegenes commented Dec 6, 2021

We could make a base class SourmashSignatureInfo, or the like, which contains properties such as name, source/inputfile, destination/outputfile etc (all non-sketching components of SourmashSignature). This could be a great time/place to add dnafile and proteinfile as separate inputfile options, as we suggested wanting above. This would store everything for sketching except for the param string(s), which we could either also store (seems handy), or keep track of separately.

To support the use case you're thinking of, we could then have two classes that inherit from the base class -- the fasta sketchtype, and the FracMinHash sketchtype (our current SourmashSignature). Ofc this leaves the door open for additional sketchtypes.

For the use case I've been thinking of, the base class could be used to store info from --sketchlist csv input. We could build signature info for all desired sigs (from cli + sketchlist input), then build signatures. This would also enable the sig check --> building signatures utility discussed above.

separately, I worry about integrating large-batch sketch calculation into sourmash's Python implementation because it would not be parallelized (due to Python limitations).

two thoughts:

  1. What are the barriers to future parallelization in rust?
  2. Parallelization via snakemake batching makes this worthwhile, even if sourmash is working single threaded -- massive numbers of snakemake jobs are still not tractable.

@ctb
Copy link
Contributor Author

ctb commented Dec 6, 2021

This is not a response to the above yet, but it reminds me of an idle thought I had yesterday - what about a new command sourmash sketch fromfile that would support a custom format containing the necessary information - dna, protein, ksize, etc. - and this could be matched against a bunch of manifests before deciding which new signatures are created?

@bluegenes
Copy link
Contributor

bluegenes commented Dec 6, 2021

This is not a response to the above yet, but it reminds me of an idle thought I had yesterday - what about a new command sourmash sketch fromfile that would support a custom format containing the necessary information - dna, protein, ksize, etc. - and this could be matched against a bunch of manifests before deciding which new signatures are created?

this is the exact behavior I'd want, and I would happily use it in this format. It would be an added bonus to avoid needing to completely separate dna vs protein sketching.

@ctb
Copy link
Contributor Author

ctb commented Dec 6, 2021

this is the exact behavior I'd want, and I would happily use it in this format. It would be an added bonus to avoid needing to completely separate dna vs protein sketching.

excellent ;).

@ctb
Copy link
Contributor Author

ctb commented Dec 7, 2021

so, I poked around a little bit with this today, and came up with the following Python syntax:

sketch_spec1 = MinHashSpec(ksizes=[31], output_type="DNA")
sketch_spec2 = MinHashSpec(ksizes=[10], output_type="protein")

source1 = SketchFromFile(name="sketch name goes here",
                  protein_filename="data/some_prot_file.faa",
                  dna_filename="data/GCF_000005845.2_ASM584v2_genomic.fna.gz")

source2 = SketchFromRecords(map_record_to_name=fn_or_mapping,
                  protein_filename="some_big_file_of_proteins.faa")

missing = check_specs_against_manifests([sketch_spec1, sketch_spec2],
                                        [source1, source2], manifests)

The idea is that you build spec objects that detail the various sketch types you're interested in building, and then cross-product them with the data sources, to create a list of the actual sketches you want built. This can then be checked against manifests of existing signatures.

Questions:

  1. what does the missing object allow you to do here, in terms of actually generating the sketches? Some ideas (not mutually exclusive):
  • it could produce a set of command lines that could then be executed (with e.g. GNU parallel)
  • it could produce a set of output files that should be created (for snakemake compatibility)
  • it could support 'check only' mode where it fails/succeeds based on whether anything is missing (again, snakemake compatibility)
  • it could provide functions to actually, y'know, build the missing sketches
  1. if we support generating the sketches, what is the output save format for the sketches? we could just support all of the formats supported by sourmash CLI (that is, the .sig, .sig.gz, directory/, and .zip formats). Or just limit to .zip, which comes with built-in manifests.

Lots of additional thoughts too -

  • should use databases, rather than manifests in check_specs_against_manifests - that allows for more general things like traversing directories etc.
  • do we want to support declarative syntaxes like YAML for this, or is the Python code an OK place to start? (I chose to start with the Python code version of this because it was easy for me to write and then validate syntactically via execution (the above code parses just fine :). )

@bluegenes
Copy link
Contributor

bluegenes commented Dec 7, 2021

The idea is that you build spec objects that detail the various sketch types you're interested in building, and then cross-product them with the data sources, to create a list of the actual sketches you want built. This can then be checked against manifests of existing signatures.

I like it. This is definitely the separation we need!

what does the missing object allow you to do here, in terms of actually generating the sketches? Some ideas (not mutually exclusive):
it could produce a set of command lines that could then be executed (with e.g. GNU parallel)
it could produce a set of output files that should be created (for snakemake compatibility)
it could support 'check only' mode where it fails/succeeds based on whether anything is missing (again, snakemake compatibility)
it could provide functions to actually, y'know, build the missing sketches

  • it could produce a csv for consumption by sketchfromfile

if we support generating the sketches, what is the output save format for the sketches? we could just support all of the formats supported by sourmash CLI (that is, the .sig, .sig.gz, directory/, and .zip formats). Or just limit to .zip, which comes with built-in manifests.

I think it would be ok to only enable .zip. This functionality is most useful for large numbers of files, and at that level, zips are far more useful than individual sigfiles. Should we want to use sigs individually downstream, picklists enable easy extraction for outputting sigfiles or selecting on the fly (e.g. prior to search/gather).

Do we want to enable something separate like sig append that could append sigfiles (any format) to an existing zipfile without rewriting all existing sigs via sig cat? Seems like it might be useful in a few cases. I think this is a separate issue though :)

do we want to support declarative syntaxes like YAML for this, or is the Python code an OK place to start? (I chose to start with the Python code version of this because it was easy for me to write and then validate syntactically via execution (the above code parses just fine :). )

starting with python code seems good! Not terribly hard to add file consumption after the structure is in place, right?

Ultimately, for source (SketchFromFile, SketchFromRecords), csv input seems useful, since each name or ident might have dnafilename and proteinfilename. yaml could work too, just seems harder for many files if we link the dna and protein inputfiles.

For spec, yaml could be a good fit, since we're usually using the same spec for all sigs in a collection.

Here's an idea for format:

MinHashSpec:
  dna:
    - ksize: [21,31,51]
    - scaled: 1000
  protein:
    - input: [dna, protein]
    - ksize: 10
    - scaled: [100, 200]
  dayhoff:
    - ksize: 16

...with dna internally only allowing dna input and anything missing falling back to defaults. It's pretty easy to allow single input or lists within the yaml for ksize, input, and scaled. I think it's useful to let each alphabet be separate, since different ksize/scaled make sense. With this spec, we could execute only the alphabets that are specified.

The only thing I don't like is that if we use different formats (csv, yaml) for specification, then we can't output a single file for consumption with sketchfromfile. BUT, this minor annoyance is likely overshadowed by the fact that it is VERY handy to be able to easily edit a yaml to build a new alphabet/ksize sketch for many files.

@luizirber
Copy link
Member

luizirber commented Dec 9, 2021

may I suggest toml instead of YAML? It would look something like this:

[[MinHashSpec.dna]]
ksize = [ 21, 31, 51 ]
scaled = 1_000

[[MinHashSpec.protein]]
input = [ "dna", "protein" ]
ksize = 10
scaled = [ 100, 200 ]

[[MinHashSpec.dayhoff]]
ksize = 16

Many places (including Python with pyproject.toml) are moving away from YAML because it has some confusing parsing issues (due to being underspecified).

@ctb ctb changed the title add picklists to sourmash sketch? add picklists to sourmash sketch? or, make a sourmash sketch fromfile... Jan 15, 2022
@ctb
Copy link
Contributor Author

ctb commented Jan 15, 2022

this issue has come back to the forefront of my brain because of dib-lab/genome-grist#130, where the construction of a private database is ...annoying because it's hard to properly name signatures in bulk.

One thing that I did in dib-lab/genome-grist#130 that I kinda liked was to have the copy_private_genomes utility scan a pile of FASTA files and then output a CSV in a way that made it easy to edit and consume downstream. I wonder if we could take that approach here, in terms of (one way of) generating a list of FASTA sources for signatures?

@ctb
Copy link
Contributor Author

ctb commented Jan 15, 2022

A specific proposal -

I'm leaning towards implementing something like sourmash sketch fromfile that takes in a CSV file that is a list of sequence sources. Then we can do the standard param string thing (-p k=31 etc) with it. Crucially, this CSV file would have signature output names in it. It would be designed to have it be easy to construct with only a few lines of Python code, and we can provide some simple examples for us and others.

SEPARATELY, we would also add a --param-file or -P option (name is provisional :) to sourmash sketch commands that would take in the kind of TOML file that @luizirber roughs out above.

So that way we'd have params cross-product sources.

Finally, a separate piece of functionality would then be to enable sourmash sketch commands to take a list of databases (? probably better than manifests or picklists... but of course would use manifests/picklists underneath) that it would check against the names proposed in the fromfile above.

@ctb
Copy link
Contributor Author

ctb commented Feb 14, 2022

ugh, running into this YET AGAIN in some work on sketching PFAM. Need/could use a standard way to check which files have not been sketched yet.

🎶 motivation 🎶

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2022

getting started with the construction side of the input CSV file here: https://github.com/ctb/2022-sourmash-sketchfrom

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2022

This works and is not terribly hacky:

% ./fasta-to-source-list.py ncbi-assemblies/* -o xyz.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing genome 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
processing genome 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
---
wrote 1 entries to 'xyz.csv'

and it produces xyz.csv which looks like this:

ident,name,genome_filename,protein_filename
GCF_000018865,s__Chloroflexus aurantiacus,ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz,ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz

and (crucially) has correctly associated the protein faa.gz with the nucleotide fna.gz, based on the accession in the filename.

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2022

Do we want to enable something separate like sig append that could append sigfiles (any format) to an existing zipfile without rewriting all existing sigs via sig cat? Seems like it might be useful in a few cases. I think this is a separate issue though :)

as a side note - you can already do this with sourmash sig cat <new sigs> -o <existing zip file>.zip. No need for a sig append. This is not obvious or consistent behavior for -o, unfortunately - -o .sig will overwrite the existing file, and -o dirname/ will append...

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2022

Side thought: if we have a list of FASTA DNA and protein sequences together with names, we might need to have two such files if we build different names for GTDB and NCBI taxonomies.

I think a better solution will be to provide a separate mass-renaming function, perhaps via sig rename, which could also solve some other problems. Will think on it more.

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2022

renaming idea in: #1883

@ctb ctb changed the title add picklists to sourmash sketch? or, make a sourmash sketch fromfile... make a sourmash sketch fromfile to support large scale sketching. Mar 13, 2022
@ctb
Copy link
Contributor Author

ctb commented Mar 13, 2022

progress!!

NOTE: uses https://github.com/ctb/2022-sourmash-sketchfrom, requires code in #1884

round 0: build a CSV file with source genome/protein information

Starting from a directory ncbi-assemblies/, which contains:

% ls ncbi-assemblies/
GCF_000018865.1_ASM1886v1_genomic.fna.gz
GCF_000018865.1_ASM1886v1_protein.faa.gz

we construct a CSV file that automagically builds GTDB names:

% ./fasta-to-source-list.py ncbi-assemblies/* -o xyz.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
(new record for name 'GCF_000018865 s__Chloroflexus aurantiacus')
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz.csv'

The resulting file contains ident, name, and source files for sourmash sketch to use:

ident,full_ident,name,genome_filename,protein_filename
GCF_000018865,GCF_000018865.1,GCF_000018865 s__Chloroflexus aurantiacus,ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz,ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz

round 1: sketch some stuff

Now build the sketches:

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -o round1.zip
...
** Of 3 total requested in cross-product, skipped 0, built 3)

results 🎉

% sourmash sig summarize round1.zip 

** loading from 'round1.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/2022-sourmash-sketchfrom/round1.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 18468
summary of sketches:
   1 sketches with protein, k=10, scaled=200          6774 total hashes
   1 sketches with protein, k=9, scaled=200           6719 total hashes
   1 sketches with DNA, k=21, scaled=1000             4975 total hashes

round 2: try sketching the same stuff

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -o round2.zip --already-done round1.zip
Loaded 1 pre-existing names from manifest(s)
** Of 3 total requested in cross-product, skipped 3, built 0)

no soup for you!

round 3: try sketching the same stuff and some new stuff

Add -p k=31,k=51,dna:

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -p k=31,k=51,dna -o round3.zip --already-done round1.zip
...
** Of 5 total requested in cross-product, skipped 3, built 2)

and voila, only the new stuff is there:

% sourmash sig summarize round3.zip

...

** loading from 'round3.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/2022-sourmash-sketchfrom/round3.zip
is database? yes
has manifest? yes
num signatures: 2
** examining manifest...
total hashes: 10210
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000             5050 total hashes
   1 sketches with DNA, k=51, scaled=1000             5160 total hashes

@ctb
Copy link
Contributor Author

ctb commented Mar 13, 2022

round 4 -

construct names from GTDB taxonomy:

% ./fasta-to-source-list.py ncbi-assemblies2/* -o xyz2.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz'
(new record for name 'GCA_903797575 s__Salmonella enterica')
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz2.csv'
% csvtk cut -f name xyz2.csv
name
GCA_903797575 s__Salmonella enterica

construct names from GenBank taxonomy:

% ./fasta-to-source-list.py ncbi-assemblies2/* -o xyz3.csv --names-from all_genbank_lineages.20200727.db --ident-from-filename
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz'
(new record for name 'GCA_903797575 Salmonella enterica')
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz3.csv'
% csvtk cut -f name xyz3.csv
name
GCA_903797575 Salmonella enterica

and these files can be used to build signatures with different names:

GTDB:

% ./sketch_from.py xyz2.csv -p k=31,dna -p k=10,protein -o xyz2.zip
Loaded 0 pre-existing names from manifest(s)
** Building 2 sketches for 2 files
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
calculated 1 signatures for 71 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
calculated 1 signatures for 4382 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
saved 2 signature(s) to 'xyz2.zip'. Note: signature license is CC0.
** Of 2 total requested in cross-product, skipped 0, built 2)
% sourmash sig describe xyz2.zip

signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz2.zip
signature: GCA_903797575 s__Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
md5: 2956485d23a2b6ba353b13e7ca5efeec
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4797
sum hashes: 4797
signature license: CC0

---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz2.zip
signature: GCA_903797575 s__Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
md5: 5589e961dd83929032ed1edbf96dc1d3
k=10 molecule=protein num=0 scaled=200 seed=42 track_abundance=0
size: 6460
sum hashes: 6460
signature license: CC0

loaded 2 signatures total, from 1 files

GenBank:

./sketch_from.py xyz3.csv -p k=31,dna -p k=10,protein -o xyz3.zip
Loaded 0 pre-existing names from manifest(s)
** Building 2 sketches for 2 files
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
calculated 1 signatures for 71 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
calculated 1 signatures for 4382 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
saved 2 signature(s) to 'xyz3.zip'. Note: signature license is CC0.
** Of 2 total requested in cross-product, skipped 0, built 2)
% sourmash sig describe xyz3.zip
...
---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz3.zip
signature: GCA_903797575 Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
md5: 2956485d23a2b6ba353b13e7ca5efeec
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4797
sum hashes: 4797
signature license: CC0

---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz3.zip
signature: GCA_903797575 Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
md5: 5589e961dd83929032ed1edbf96dc1d3
k=10 molecule=protein num=0 scaled=200 seed=42 track_abundance=0
size: 6460
sum hashes: 6460
signature license: CC0

loaded 2 signatures total, from 1 files

@ctb
Copy link
Contributor Author

ctb commented Mar 13, 2022

notes to self, collection/summary of the above comments and other thoughts

  • batch renaming punted to add large scale/batch signature renaming #1883
  • support for batch renaming - i.e. output that implements whatever we decide.
  • need/could use a standard way to check which files have not been sketched yet.
  • eventually want --param-file support (punted to consider providing standardized parameter files/-P for sourmash sketch #1910)
  • think about possible alternative output such as a list of commands that could be fed to parallel
  • also possible output format: a --check-only and/or some form of diagnostic about for what remains to be built
  • revisit with @bluegenes the question of outputting picklists. what was the purpose of this - to retrieve all the sketches from other databases? to provide just the missing sketches?
  • related: provide a way to build a database that contains all the sketches in the csv file x param list. Maybe that's what the picklist in previous item can support. 😄
  • specific implementation thought: we will need a way to go FROM ComputeParameters objects TO param strings.
  • idea might already be robust enough to implement sketch fromfile directly in sourmash, at least as a WIP/EXP PR.

@ctb
Copy link
Contributor Author

ctb commented Mar 14, 2022

ok, so this is all lovely, but I think that I'm missing one of the key use cases we had in mind in the beginning: what about situations where you have a list of accessions (e.g. a listing of GenBank, or a GTDB release), and you want to build a comprehensive database, or check that you have the right set of genomes to build signatures for those accessions?

@ctb
Copy link
Contributor Author

ctb commented Mar 14, 2022

do we want to do anything about translate? not currently supported by plans.

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2022

latest update - bulk bulk bulk!

(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258407 entries from 'ntpierce-genomic.txt'
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processing file 'GCA_900543455.1_protein.faa.gz' (10978/516813)
processed 516813 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
all entries had matched genome and protein files!
36642 files had no content (zero size).
1 filenames yielded duplicate identifiers.
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.

and indeed there is one duplicate identifer, tsk tsk 😆

ntpierce-genomic.txt:/home/ntpierce/2021-rank-compare/genbank/genomes/GCF_001060105.1_ASM106010v1_genomic.fna.gz
ntpierce-genomic.txt:/home/ntpierce/2021-rank-compare/genbank/genomes/GCF_001060105.1_genomic.fna.gz

@bluegenes
Copy link
Contributor

bluegenes commented Mar 17, 2022

and indeed there is one duplicate identifer, tsk tsk 😆

Funny story! The original download of this file was empty - likely some download issue. I think the sketch may actually be empty in my gtdb databases, because we currently ignore empty files. @taylorreiter noticed it while running gtdb charcoal, so I manually re-downloaded the file and transferred, but initially forgot to change its name, as I was doing in the automated workflow. When she said the file was still empty, I retransferred and renamed, but forgot to check for the old one.

tl;dr - this file is here because we ignore empty files when sketching, and we should definitely do something differently ;)
ref: #1887

...and I shall go remove it now 💨

@bluegenes
Copy link
Contributor

bluegenes commented Mar 17, 2022

do we want to do anything about translate? not currently supported by plans.

I would like if we could support translate at some point, but two thoughts:

  1. we won't want to use translate for reference genome databases (aka our immediate use case; though I very much do want to use it on large-scale metagenome samples and/or MAG queries)
  2. Right now it's prohibitively slow on large datasets

Other translate thoughts: We could have a flag, --translate or --allow-translation that allows protein sketching to be done on DNA input files, if the matching protein file does not exist? Would only trigger when protein params are passed...

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2022

Better output - I fixed some reporting (but forgot to fix the duplicate entry, will do that now).

(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258407 entries from 'ntpierce-genomic.txt'
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processing file 'GCF_003798365.1_protein.faa.gz' (2046/516813)
processed 516813 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
36631 entries had only protein (and no genome) files.
11 entries had only genome (and no protein) files.
(missing files do not cause error exit without --strict)
36642 files had no content (zero size).
1 filenames yielded duplicate identifiers.
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.

@bluegenes
Copy link
Contributor

bluegenes commented Mar 17, 2022

36642 files had no content (zero size).

Most of these are entries that had no *faa.gz protein files (so I ran prodigal on the genomes instead). Forgot to give you the location for the prodigal proteomes: /home/ntpierce/2021-rank-compare/genbank/prodigal/.

Probably more useful, I have a file with ident,prodigal_filepath for all relevant proteomes here: /home/ntpierce/2021-rank-compare/gtdb-rs202.prodigal-filenames.csv (n=36,631).

@ctb
Copy link
Contributor Author

ctb commented Mar 17, 2022

my mighty script should be able to figure it all out, given only the path (and an appropriately named set of files!) I will let it loose! 🦁

@bluegenes
Copy link
Contributor

my mighty script should be able to figure it all out, given only the path (and an appropriately named set of files!) I will let it loose! 🦁

same naming convention, just a different location. Let er rip! 🦖

@ctb
Copy link
Contributor Author

ctb commented Mar 18, 2022

we're down to the following errors - after eliminating zero size files, we have:

duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_000739575.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_003543635.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_002302865.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_009876525.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_013296485.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_000435395.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_900766585.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_002103955.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_000025685.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_900759255.1_protein.faa.gz
missing genome file: GCF_002124775
missing genome file: GCF_005819405
missing genome file: GCA_009838785
missing genome file: GCF_000699565
missing genome file: GCF_002244155
missing genome file: GCF_900768585
missing genome file: GCF_000568735
missing genome file: GCF_000261045
missing genome file: GCF_001659225
missing genome file: GCF_003572455
missing genome file: GCF_001877985

the script is getting pretty good at this :). I haven't layered on the official list of identifiers yet, I'll try that.

@ctb
Copy link
Contributor Author

ctb commented Mar 18, 2022

(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-prodigal.txt -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv --picklist gtdb-rs202.taxonomy.v2.db.ident.csv:ident:identprefix
Loaded 36641 entries from 'ntpierce-prodigal.txt'
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258406 entries from 'ntpierce-genomic.txt'
picking column 'ident' of type 'identprefix' from 'gtdb-rs202.taxonomy.v2.db.ident.csv'
loaded 258406 distinct values into picklist.
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processed 553453 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
11 entries had only genome (and no protein) files.
(missing files do not cause error exit without --strict)
36642 files had no content (zero size).
10 filenames yielded duplicate identifiers.
for given picklist, found 258406 matches to 258406 distinct values
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.

key new message: for given picklist, found 258406 matches to 258406 distinct values

so except for the 11 entries with only genome (and no protein) files, we could now go forth and build 🎉

@ctb
Copy link
Contributor Author

ctb commented Mar 18, 2022

daaaaaaaamn

% sourmash sketch fromfile gtdb-rs202-entire.csv -p protein

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** 258406 total requested; built 258406, skipped 0

took about 3 seconds to run. (It just checked to see how many to build.)

@ctb
Copy link
Contributor Author

ctb commented Mar 21, 2022

interesting ~new challenge to consider: I'm looking at /group/ctbrowngrp/irber/data/wort-data/wort-genomes and /group/ctbrowngrp/irber/data/ncbi/genbank and thinking about how to properly deal with a pile of unindexed sigs.

this might be an opportunity to revisit the manifest-of-manifests thing, #1685. The basic idea:

  • build a manifest-of-manifests for all of those signatures, stick in sqlite3 database
  • patch in the sketch-fromfile stuff to be able to use manifests-of-manifests
  • fun, profit

@ctb
Copy link
Contributor Author

ctb commented Mar 22, 2022

Confronted the horror of too many interconnected things, went for a nice hike, realized that maybe if we just could load manifest CSV files directly on the command line, we could build tooling around creating/maintaining such manifests for large wort collections without adding a whole lot of complexity to sourmash.

See #1891 for code.

@ctb
Copy link
Contributor Author

ctb commented Mar 24, 2022

wort-genbank, all sigs, in a single manifest, using #1891:

(sourmash-dev) ctbrown@c6-50:~/scratch/fromfile/split.irber-genome-sigs$ sourmash sig summarize entire2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'entire2.csv'

path filetype: StandaloneManifestIndex
location: entire2.csv
is database? no
has manifest? yes
num signatures: 4697901
** examining manifest...
total hashes: 55022206752
summary of sketches:
   1565967 sketches with DNA, k=21, scaled=1000, abund 17260335263 total hashes
   1565967 sketches with DNA, k=31, scaled=1000, abund 18364942000 total hashes
   1565967 sketches with DNA, k=51, scaled=1000, abund 19396929489 total hashes

Per /usr/bin/time -v, took about 1 minute, and ~6 GB of RAM:

        User time (seconds): 56.74
        System time (seconds): 3.74
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03.78
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 6238316

@ctb
Copy link
Contributor Author

ctb commented Mar 24, 2022

ok, to recap:

there's some support code in https://github.com/ctb/2022-sourmash-sketchfrom that we don't need to integrate into sourmash (or may never need) -

  • code to mass-rename signatures based on identifer: mass-rename.py
  • code to take a list of genbank-named genome and proteome files and turned them into an input file for sketch fromfile: genbank-to-fromfile.py
  • code to take a generic set of FASTA files and do the same: fasta-to-fromfile.py
  • code to take a large pile of signature files and build an abspath manifest from them: sigs-to-manifest.py

Things we could maybe use at this point, beyond making those experimental PRs mergable:

  • is it worthwhile to have sourmash sketch fromfile output a manifest of all of those signatures it has found? probably. This could be used as a catalog and a picklist.
  • a simple human-readable summarization output from fromfile.
  • some way of connecting a list of accessions into all of the above - "sudo make me a GTDB release database", essentially.

Things we are punting on, for now; we'll need to make issues for these:

@ctb
Copy link
Contributor Author

ctb commented Mar 26, 2022

finally realizing that @bluegenes knew what she was talking about all along in #1365 and now #1902, I implemented sig check to check picklists against existing databases in #1891. Combined with the direct manifest loading stuff, it's pretty nice!!

Here we are checking ~100 identifiers against a manifest for 4.7 million wort genbank sigs and pulling out the matching manifest entries:

% sourmash sig check --picklist bacteria.ident:ident:ident split.irber-genome-sigs/entire2.csv -o 1.csv --save-manifest 2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'bacteria.ident'
loaded 98 distinct values into picklist.
notify done
loaded 294 total signatures that matched ksize & molecule type
for given picklist, found 98 matches to 98 distinct values
wrote 294 matching manifest rows to '2.csv'

...which can then be directly summarized, searched, loaded, etc:

sourmash sig summarize 2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '2.csv'
path filetype: StandaloneManifestIndex
location: 2.csv
is database? yes
has manifest? yes
num signatures: 294
** examining manifest...
total hashes: 1752580
summary of sketches:
   98 sketches with DNA, k=21, scaled=1000, abund     581638 total hashes
   98 sketches with DNA, k=31, scaled=1000, abund     583825 total hashes
   98 sketches with DNA, k=51, scaled=1000, abund     587117 total hashes

(If there had been picklist values that weren't matched, they would have ended up in 1.csv)

@ctb
Copy link
Contributor Author

ctb commented Mar 27, 2022

building genbank stuff from assembly reports for archaea

Using code from #1885,

download assembly_summary

curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt

construct identifier picklist

echo ident > archaea.assembly_summary.idents.csv
cut -f 1 assembly_summary.txt | grep -v ^# >> archaea.assembly_summary.idents.csv

run sig check

% sourmash sig check --picklist archaea.assembly_summary.idents.csv:ident:ident ../split.irber-genome-sigs/entire2.csv -o archaea.assembly_summary.missing.csv --save-manifest archaea.manifest.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'archaea.assembly_summary.idents.csv'
loaded 1210 distinct values into picklist.
notify done3
loaded 3627 signatures.
for given picklist, found 1209 matches to 1210 distinct values
WARNING: 1 missing picklist values.
saved 1 non-matching rows of 1210 picklist rows to 'archaea.assembly_summary.missing.csv'
wrote 3627 matching manifest rows to 'archaea.manifest.csv'

oh no, one is missing!

% more archaea.assembly_summary.missing.csv
ident
GCF_022489005.1

...but the rest are there:

% sourmash sig summarize archaea.manifest.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'archaea.manifest.csv'
path filetype: StandaloneManifestIndex
location: archaea.manifest.csv
is database? yes
has manifest? yes
num signatures: 3627
** examining manifest...
total hashes: 10739411
summary of sketches:
   1209 sketches with DNA, k=21, scaled=1000, abund   3564584 total hashes
   1209 sketches with DNA, k=31, scaled=1000, abund   3581871 total hashes
   1209 sketches with DNA, k=51, scaled=1000, abund   3592956 total hashes

@ctb
Copy link
Contributor Author

ctb commented Mar 27, 2022

building gtdb genomic for rs207

Looks like there might be a new GTDB release coming? See rs207 directory.

download files with identifiers

mkdir gtdb
cd gtdb
curl -L -O https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/bac120_taxonomy_r207.tsv.tsv.tar.gz
curl -L -O https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/ar53_taxonomy_r207.tsv.tar.gz

tar xzf ar53_taxonomy_r207.tsv.tar.gz 
tar xzf bac120_taxonomy_r207.tsv.tsv.tar.gz 
echo ident > gtdb-all-rs207.ident.csv
cut -f 1 *.tsv | cut -c 4- >> gtdb-all-rs207.ident.csv
wc gtdb-all-rs207.ident.csv

see how many signatures we can retrieve from wort

then run sig check:

% sourmash sig check --picklist gtdb-all-rs207.ident.csv:ident:ident ../split.irber-genome-sigs/entire2.csv -o missing-idents.csv --save-manifest matching-sigs.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'gtdb-all-rs207.ident.csv'
loaded 317542 distinct values into picklist.
loaded 952626 signatures.
for given picklist, found 317542 matches to 317542 distinct values
(no remaining picklist entries; not saving to 'missing-idents.csv')
wrote 952626 matching manifest rows to 'matching-sigs.csv'

🎉 no missing signatures! wort rulez!

examine output manifest for ksizes etc.

Then examine the output manifest matching-sigs.csv:

% sourmash sig summarize matching-sigs.csv 

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'matching-sigs.csv'
path filetype: StandaloneManifestIndex
location: matching-sigs.csv
is database? yes
has manifest? yes
num signatures: 952626
** examining manifest...
total hashes: 3541080185
summary of sketches:
   317542 sketches with DNA, k=21, scaled=1000, abund 1180828455 total hashes
   317542 sketches with DNA, k=31, scaled=1000, abund 1178655672 total hashes
   317542 sketches with DNA, k=51, scaled=1000, abund 1181596058 total hashes

construct new GTDB release

for k=31:

% sourmash sig cat matching-sigs.csv -o gtdb-rs207.genomic.dna.k31.zip -k 31

@ctb
Copy link
Contributor Author

ctb commented Mar 28, 2022

(sourmash-dev) ctbrown@c4-68:~/scratch/fromfile/gtdb$ ls -lh *zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 17:04 gtdb-rs207.genomic.dna.k21.zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 14:40 gtdb-rs207.genomic.dna.k31.zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 19:02 gtdb-rs207.genomic.dna.k51.zip

@ctb
Copy link
Contributor Author

ctb commented Apr 1, 2022

some documentation and examples for the fromfile stuff has been added here: https://github.com/ctb/2022-sourmash-sketchfrom/

@ctb
Copy link
Contributor Author

ctb commented Apr 2, 2022

some documentation and examples for the fromfile stuff has been added here: https://github.com/ctb/2022-sourmash-sketchfrom/

renamed to sourmash-bio/database-examples!

And I'm closing this issue now that #1885 has been merged. 🎉

@ctb ctb closed this as completed Apr 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants