Merge pull request #84 from qiyunzhu/2.0b3
2.0b3
qiyunzhu authored Nov 25, 2021
2 parents 1101fd1 + bdb606c commit 31e28a2
Showing 13 changed files with 458 additions and 269 deletions.
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -1,16 +1,27 @@
Change Log
==========

## Version 2.0b3 (12/27/2020)
## Version 2.0b3 (2021, ongoing)

### Added
- Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which donors are reported.
- Added a quick-start guide to the homepage.
- Let the database construction task resume from an interrupted run, reusing already downloaded files while detecting and skipping corrupt ones.
- Added an option `--above` to sample genomes at all ranks from the designated one to phylum during database construction.
- Added an option `--typemater` to include NCBI-defined type materials in genome sampling during database construction.
- Added an option `--manual` to export URLs of sampled genomes during database construction, and let the user download them manually.

### Changed
- Updated pre-built default database to 2021-11-21 (after NCBI RefSeq release 209).
- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
- Updated recommended dependency versions; the program should remain compatible with previous versions.
- Minor tweaks with no visible impact on program behavior.

### Fixed
- Fixed an issue with the NCBI FTP server connection during database construction. NCBI now recommends rsync over FTP, so the download protocol has been updated accordingly.
- Fixed compatibility with latest scikit-learn (1.0.1).
- Fixed compatibility with latest DIAMOND (2.0.13).


## Version 2.0b2 (5/3/2020)

10 changes: 6 additions & 4 deletions README.md
@@ -30,7 +30,7 @@ References
Set up a Conda environment and install dependencies:

```bash
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
conda activate hgtector
```

@@ -40,13 +40,15 @@ Install HGTector2:
pip install git+https://github.com/qiyunlab/HGTector.git
```

Build a reference database using the default protocol:
Then you will be able to type `hgtector` to run the program. See [installation](doc/install.md) for more details.

Build a reference [database](doc/database.md) using the default protocol:

```bash
hgtector database -o db_dir --default
```

This will retrieve the latest genomic data from NCBI. If this does not work (e.g., due to network issues), or you need some customization, please read the [database](doc/database.md) page.
Or [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) a pre-built database as of 2021-11-21, and [compile](doc/database.md#Manual-compiling) it.

Prepare input file(s). They should be multi-FASTA files of amino acid sequences (faa). Each file represents the whole protein set of a complete or partial genome.

@@ -69,7 +71,7 @@ It is recommended that you read the [first run](doc/1strun.md), [second run](doc

## License

Copyright (c) 2013-2020, [Qiyun Zhu](mailto:qiyunzhu@gmail.com) and [Katharina Dittmar](mailto:katharinad@gmail.com). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
Copyright (c) 2013-2021, [Qiyun Zhu](mailto:qiyunzhu@gmail.com) and [Katharina Dittmar](mailto:katharinad@gmail.com). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).


## Citation
2 changes: 1 addition & 1 deletion doc/1strun.md
@@ -11,7 +11,7 @@ A small example is provided in the subdirectory [example](../example). The input
Let's analyze this small example using HGTector.

**Note**: It has been increasingly infeasible as of 2020 to run remote search through the NCBI BLAST server. If you experience a very slow run when going through this tutorial, please skip and move on to [second run](2ndrun.md).
**Note**: Automatic remote BLAST search via the URL API is inefficient as of 2021 and has been [deprecated](https://ncbi.github.io/blast-cloud/dev/api.html) by NCBI. Therefore this tutorial is for reference only. Unless you are willing to wait for hours, please skip it and move on to the [second run](2ndrun.md).


## Search
2 changes: 1 addition & 1 deletion doc/2ndrun.md
@@ -37,7 +37,7 @@ Here we will analyze a single genome of [_Escherichia coli_ O55:H7](https://www.
Download the whole protein set of _E. coli_ O55:H7 (5,278 protein sequences in total) from the NCBI server:

```bash
wget -O o55h7.faa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
wget -O o55h7.faa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
```

You don't need to unzip the file, as HGTector can automatically parse compressed files in gzip, bzip2 and lzma formats.
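Transparent decompression like this can be sketched in a few lines of Python by sniffing each format's magic bytes. The `smart_open` helper below is a hypothetical illustration; HGTector's actual implementation may differ:

```python
import bz2
import gzip
import lzma

# Magic bytes of the three compression formats mentioned above.
MAGIC = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"BZh": bz2.open,             # bzip2
    b"\xfd7zXZ\x00": lzma.open,   # xz/lzma
}

def smart_open(path):
    """Open a file as text, decompressing on the fly if needed."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt")
    return open(path, "r")
```

With this, a gzipped `o55h7.faa.gz` and a plain `o55h7.faa` can be read by the same code path.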
77 changes: 53 additions & 24 deletions doc/database.md
@@ -1,6 +1,9 @@
Database
========

## Index
- [Overview](#overview) | [Default protocol](#default-protocol) | [Pre-built database](#pre-built-database) | [Database files](#database-files) | [Considerations](#considerations) | [Command-line reference](#command-line-reference)

## Overview

The `database` command is an automated workflow for sampling reference genomes, downloading non-redundant protein sequences, and building local databases for sequence homology search. It provides various options for flexible customization of the database, to address specific research goals including HGT prediction or other general purposes.
@@ -9,33 +12,36 @@ The `database` command is an automated workflow for sampling reference genomes,
hgtector database -o <output_dir> <parameters...>
```

### Default protocol
The workflow consists of the following steps:

HGTector provides a default protocol for database building.
1. Download NCBI taxonomy database (taxdump).
2. Download NCBI RefSeq assembly summary.
3. Sample genomes based on various properties and taxonomic information.
4. Download protein sequences associated with sampled genomes.
5. Compile local databases using DIAMOND and/or BLAST.

```bash
hgtector database -o <output_dir> --default
```

This will download all protein sequences of NCBI RefSeq genomes of bacteria, archaea, fungi and protozoa, keep one genome per species, plus all NCBI-defined reference and representative genomes. Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:
## Default protocol
This will download all protein sequences of NCBI RefSeq genomes of **bacteria**, **archaea**, **fungi** and **protozoa**, keep _one genome per species_ that has a Latinate name, plus one genome per taxonomic group at higher ranks, regardless of whether that genome has a Latinate species name, plus all NCBI-defined **reference**, **representative** and **type material** genomes (prioritized during taxonomy-based sampling, and added afterwards if not sampled). Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:

```bash
hgtector database --output <output_dir> --cats microbe --sample 1 --rank species --reference --representative --compile diamond
hgtector database --output <output_dir> --cats microbe --sample 1 --rank species_latin --above --reference --represent --typemater --compile diamond
```

A pre-built default database as of 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0). It needs to be [compiled](#Manual-compiling) using the aligner of your choice.

### Procedures
## Pre-built database

The workflow consists of the following steps:
A database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\). It needs to be [compiled](#Manual-compiling) using the aligner of your choice.

1. Download NCBI taxonomy database (taxdump).
2. Download NCBI RefSeq assembly summary.
3. Sample genomes based on various properties and taxonomic information.
4. Download protein sequences associated with sampled genomes.
5. Compile local databases using DIAMOND and/or BLAST.
This database, sampled from NCBI RefSeq after release 209, contains 68,977,351 unique protein sequences from 21,754 microbial genomes, representing 3 domains, 74 phyla, 145 classes, 337 orders, 783 families, 3,753 genera and 15,932 species.

### Database files
Building this database used a maximum of 63 GB memory. Searching this database using DIAMOND v2.0.13 requires ~7 GB memory.

A previous version of the database built on 2019-10-21 is available [here](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0).


## Database files

File or directory | Description
--- | ---
@@ -60,6 +66,9 @@ The protein-to-TaxID map is already integrated into the compiled databases, so o

Feel free to delete (e.g., `download/`) or compress the intermediate files (e.g., `db.faa`) to save disk space.


## Considerations

### More examples

```bash
@@ -80,12 +89,31 @@ hgtector database -g gids.txt -o .

This will only download genomes specified in the file `gids.txt`. Useful for controlled tests.

### Clean up

After the database is successfully built, you may consider compressing `db.faa` and deleting `download/` (or just `download/faa/`) to save disk space. HGTector won't do this automatically.

### Break and resume

Should any of the download steps be interrupted by, e.g., a network failure, one can resume the downloading process by re-executing the same command. The program will skip already downloaded files in the new run. In some instances, one may need to manually remove the last file from the failed run (it may be corrupt) before re-running the program.

If one wants to overwrite downloaded files (e.g., upgrading), add `--overwrite` to the command.

### Manual downloading

One may want to download genomes manually in a more controlled manner, instead of letting HGTector run for hours to days to retrieve them one after another before moving to the next step. In this case, add `--manual` to the command; the program will generate `urls.txt`, a list of URLs of the sampled genomes, and quit.

Then one can choose the most appropriate method to download them. For example, one may use the [rsync protocol](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols), as recommended by NCBI:

```bash
while read url
do
rsync -Ltq rsync${url#https}/${url##*/}_protein.faa.gz download/faa/
done < urls.txt
```
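The shell parameter expansions in this loop are terse: `${url#https}` strips the leading `https`, and `${url##*/}` keeps the last path component (the assembly name). The same transformation can be sketched in Python. The helper name is hypothetical, and the sketch assumes `urls.txt` lists HTTPS genome directory URLs as in the NCBI layout:

```python
def to_rsync_source(url):
    """Convert an HTTPS genome directory URL (one line of urls.txt)
    into the rsync source path of its protein FASTA file.
    Mirrors the shell expansions shown above."""
    assert url.startswith("https")
    rsync_base = "rsync" + url[len("https"):]  # rsync://ftp.ncbi...
    assembly = url.rsplit("/", 1)[-1]          # last path component
    return f"{rsync_base}/{assembly}_protein.faa.gz"
```

Each returned path can then be passed to `rsync -Ltq <source> download/faa/`, exactly as the bash loop does.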

After all genomes (protein sequences) are downloaded to `download/faa/`, one may restart the program without `--manual`, and the program will take the downloaded files and move to the next step.

### Manual compiling

The sampling and downloading steps (1-4) require extensive network traffic (usually several hours, if the bottleneck is not on the recipient side) but little local computing load, whereas the compiling step (5) requires extensive local computing power but no networking.
@@ -95,12 +123,11 @@ Therefore, it is a reasonable plan to only download database files without compi
For DIAMOND:

```bash
echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
zcat taxon.map.gz | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid
echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL

diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db
diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

rm prot.accession2taxid
rm prot.accession2taxid.FULL
```
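For readers not on a POSIX shell, the `prot.accession2taxid.FULL` file can be assembled in Python instead of the `echo`/`zcat` pipeline above. The function name is a hypothetical helper for illustration:

```python
import gzip

def write_accession2taxid(taxon_map_gz, out_path):
    """Prepend the two-column header DIAMOND's --taxonmap expects,
    then copy the gzipped taxon map through verbatim."""
    with gzip.open(taxon_map_gz, "rt") as src, open(out_path, "w") as dst:
        dst.write("accession.version\ttaxid\n")
        for line in src:
            dst.write(line)
```

Run it as `write_accession2taxid("taxon.map.gz", "prot.accession2taxid.FULL")`, then invoke `diamond makedb` as shown and delete the intermediate file.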

For BLAST:
@@ -174,17 +201,19 @@ Option | Default | Description

Option | Default | Description
--- | --- | ---
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling).
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed.
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling). Prior to sampling, genomes will be sorted by NCBI genome category: reference > representative > type material, then by assembly level: complete genome or chromosome > scaffolds > contigs. Sampling will start from the top of the list.
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed. Can be any taxonomic rank defined in the NCBI taxonomy database. A special case is "species_latin", which will sample from species that have Latinate names.
`--above` | - | Sampling will also be performed on ranks from the one given by `-r` to phylum (low to high). They will not overlap the already sampled ones. For example, if two _E. coli_ genomes are already sampled, no more genome will be added when sampling in genus _Escherichia_. This flag is useful in the case of `-r species_latin`, because some ranks above species may be undersampled.

### Genome sampling

Option | Default | Description
--- | --- | ---
`--genbank` | - | By default the program only downloads RefSeq genomes (`GCF`). This flag will let the program also download GenBank genomes (`GCA`). But RefSeq has higher priority than GenBank if the same genome is hosted by both catalogs.
`--complete` | - | Only include complete genomes, i.e., `assembly_level` is `Complete Genome` or `Chromosome`.
`--reference` | - | Add NCBI-defined reference genomes to selection (after taxon sampling).
`--representative` | - | Add NCBI-defined representative genomes to selection (after taxon sampling).
`--reference` | - | Include NCBI-defined reference genomes.
`--represent` | - | Include NCBI-defined representative genomes.
`--typemater` | - | Include NCBI-defined type material genomes.

### Taxonomic filter

8 changes: 4 additions & 4 deletions doc/install.md
@@ -18,7 +18,7 @@ HGTector is written in Python 3. One needs at least Python 3.6 to run the progra
### Option 1: Through Conda (recommended)

```bash
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn
conda activate hgtector
pip install git+https://github.com/qiyunlab/HGTector.git
```
@@ -49,7 +49,7 @@ conda install -c bioconda diamond blast

HGTector has a command `database` for automated database construction. It defaults to the **NCBI** RefSeq microbial genomes and taxonomy. Meanwhile, we also provide instructions for using **GTDB** and custom databases. See [details](database.md).

A standard database built using the default protocol on 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0), together with [instruction](database.md#Manual-compiling) for compiling.
A standard database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\), together with [instruction](database.md#Manual-compiling) for compiling.

A small, pre-compiled test database is also available for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).

@@ -80,8 +80,8 @@ conda env remove -n hgtector --all

## Compatibility

If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies (note: DIAMOND version is too tricky to specify).
If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies.

```bash
conda create -n hgtector -c bioconda python=3.7.7 pyyaml=5.3.1 pandas=1.0.3 matplotlib=3.1.3 scikit-learn=0.22.1 diamond blast=2.9.0
conda create -n hgtector-dev -c conda-forge python=3.10.0 pyyaml=6.0 pandas=1.3.4 matplotlib=3.5.0 scikit-learn=1.0.1 bioconda::diamond=2.0.13 bioconda::blast=2.12.0
```
16 changes: 8 additions & 8 deletions hgtector/analyze.py
@@ -887,9 +887,8 @@ def perform_kde(self, data):
https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/
"""
bw = self.bandwidth
data_ = data[:, np.newaxis]
scaler = StandardScaler()
data_ = scaler.fit_transform(data[:, np.newaxis])
data_ = scaler.fit_transform(data.reshape(-1, 1))
estimator = KernelDensity(kernel='gaussian')

# grid search optimization
@@ -914,8 +913,8 @@

# get density function
x, y = self.density_func(data_, kde)
x = scaler.inverse_transform(x)
y = scaler.inverse_transform(y)
x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
return x, y, bw

@staticmethod
@@ -1127,7 +1126,7 @@ def smart_kde(self, group):
"""
data = self.df[group].values
scaler = StandardScaler()
data_ = scaler.fit_transform(data[:, np.newaxis])
data_ = scaler.fit_transform(data.reshape(-1, 1))
estimator = KernelDensity(kernel='gaussian')
bwspace = np.logspace(0, -1, self.bw_steps)

@@ -1137,8 +1136,8 @@
setattr(estimator, 'bandwidth', bw)
kde = estimator.fit(data_)
x, y = self.density_func(data_, kde)
x = scaler.inverse_transform(x)
y = scaler.inverse_transform(y)
x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
try:
peak, valley = self.first_hill(x, y)
except ValueError:
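`first_hill` locates the first peak of the density curve and the valley that follows it, raising `ValueError` when either is absent, which the `except` clause above catches to try the next bandwidth. A hypothetical re-implementation for illustration only; HGTector's own function may differ in detail:

```python
import numpy as np

def first_hill(x, y):
    """Return (peak_x, valley_x): the first local maximum of the
    density curve y(x) and the first local minimum after it.
    Raises ValueError if no such hill exists."""
    for i in range(1, len(y) - 1):
        if y[i] >= y[i - 1] and y[i] > y[i + 1]:
            peak = i  # first point where y stops rising
            break
    else:
        raise ValueError("no peak found")
    for j in range(peak + 1, len(y) - 1):
        if y[j] <= y[j - 1] and y[j] < y[j + 1]:
            return x[peak], x[j]  # first point where y rises again
    raise ValueError("no valley found")
```

On a bimodal density this returns the location of the first mode and the trough separating it from the second.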
@@ -1177,7 +1176,8 @@ def calc_cluster_props(self):
# calculate centroid
clf = NearestCentroid()
clf.fit(data_, self.df['hgt'])
cent = scaler.inverse_transform(clf.centroids_[1])
cent = scaler.inverse_transform(clf.centroids_[1].reshape(
-1, data_.shape[1])).reshape(-1)
return cent

def refine_cluster(self, cent):
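The `reshape(-1, 1)` changes throughout `analyze.py` reflect a behavior of scikit-learn 1.0: transformers such as `StandardScaler` validate their input and expect 2-D arrays (samples × features) in both `fit_transform` and `inverse_transform`. A minimal sketch of the round-trip pattern used in the diff:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([1.0, 2.0, 3.0, 4.0])

scaler = StandardScaler()
# fit_transform requires a 2-D array: one column per feature
scaled = scaler.fit_transform(data.reshape(-1, 1))

# inverse_transform likewise expects 2-D input; flatten afterwards
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).reshape(-1)
```

Passing a 1-D array (as the pre-2.0b3 code did via `data[:, np.newaxis]` only at fit time) raises a validation error under scikit-learn 1.0.1, which is what the compatibility fix in the changelog addresses.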
8 changes: 4 additions & 4 deletions hgtector/config.yml
Expand Up @@ -101,8 +101,8 @@ blast:
remote:

# remote search database
# options: nr, refseq_protein, swissprot, pdb, etc.
db: nr
# options: nr, refseq_select_prot, refseq_protein, swissprot, pdb, etc.
db: refseq_select_prot

# remote search algorithm
# options: blastp psiBlast deltaBlast kmerBlastp phiBlast, etc.
@@ -114,13 +114,13 @@ remote:
# typically cannot exceed 8,000 characters)
retries: 5 # maximum number of retries per search
delay: 60 # seconds between two search requests
timeout: 900 # seconds before program gives up waiting
timeout: 1800 # seconds before program gives up waiting

# entrez query text
entrez: all [filter] NOT(environmental samples[filter] OR metagenomes[orgn]) txid131567[orgn]

# extra URL arguments for remote search
extrargs: "&FILTER=m%20S"
extrargs: "&WORD_SIZE=6&FILTER=m%20S"


## Fetch information from remote server
