genome_updater v0.2.5 (#31)
* centralized wget, filter gtdb

* update Readme

* update README
pirovc authored May 4, 2021
1 parent a60fc33 commit 67a320e
Showing 4 changed files with 95 additions and 30 deletions.
21 changes: 16 additions & 5 deletions README.md
@@ -6,7 +6,7 @@ Bash script to download and update snapshots of the NCBI genomes (refseq/genbank

## Description:

- genome_updater runs on a working directory (**-o**) and creates snapshots/versions (**-b**) of refseq/genbank genome repositories based on selected parameters: database (**-d**), organism group or species/taxids (**-g**), RefSeq category (**-c**), assembly level (**-l**), top assemblies (**-j**) and file type(s) (**-f**)
- genome_updater runs on a working directory (**-o**) and creates snapshots/versions (**-b**) of refseq/genbank genome repositories based on selected parameters: database (**-d**), organism group or species/taxids (**-g**), RefSeq category (**-c**), assembly level (**-l**), top assemblies (**-j**), GTDB [3] compatible (**-z**) and file type(s) (**-f**)
- genome_updater can update the selected repository by executing the same command again. It will identify previous files and update the working directory with the most recent version, keeping track of changes and just downloading/removing updated files
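
A minimal sketch of this update workflow (the database, organism group, output directory and version labels below are illustrative):

./genome_updater.sh -d "refseq" -g "archaea" -f "genomic.fna.gz" -o "archaea_refseq" -b "v1" -t 12
# re-running the same selection later with a new label creates a second version, downloading/removing only what changed
./genome_updater.sh -d "refseq" -g "archaea" -f "genomic.fna.gz" -o "archaea_refseq" -b "v2" -t 12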

## Installation:
@@ -24,7 +24,7 @@ or
- genome_updater depends only on the GNU Core Utilities and additional tools (`awk` `bc` `find` `join` `md5sum` `parallel` `sed` `tar` `xargs` `wget`), which are commonly available in most distributions. If you are not sure you have them all, just run genome_updater.sh and it will tell you if something is missing (otherwise it will show the help page).
- To test genome_updater basic functions, run the script `tests/tests.sh`. It should print "All tests finished successfully" at the end.
- Make sure you have access to the NCBI ftp folders: `ftp://ftp.ncbi.nlm.nih.gov/genomes/` and `ftp://ftp.ncbi.nih.gov/pub/taxonomy/`
- If you still run into some issues it may be that some tools are running with incompatible or old versions. Please open an issue (https://github.com/pirovc/genome_updater/issues) with the description and the output of the command `genome_updater.sh -D`.
- If you still run into issues it may be that some utilities are incompatible or outdated. Please open an issue (https://github.com/pirovc/genome_updater/issues) describing the problem and the output of the command `genome_updater.sh -D`.

## Simple example:

@@ -43,6 +43,7 @@ Data selection:
- **-l**: filter by Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]
- **-c**: filter by RefSeq Category [all, reference genome, representative genome, na]
- **-j**: select [top assemblies](#top-assemblies) for species or taxids: (**-j "species:3"**) to download the top 3 assemblies for each species selected or (**-j "taxids:1"**) to download only the top assembly for each taxid selected.
- **-z**: select only assemblies included in the latest GTDB release
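
For example (database, organism groups and output directory here are illustrative), **-z** can be combined with the usual selection parameters:

./genome_updater.sh -d "refseq" -g "archaea,bacteria" -f "genomic.fna.gz" -o "gtdb_refseq" -t 12 -z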

Utilities:
- **-i**: fixes current snapshot in case of network or any other failure during download
@@ -79,7 +80,7 @@ Reports:

### Download one genome assembly for each bacterial species in genbank

./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_refseq" -t 12 -j "species:1"
./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_genbank" -t 12 -j "species:1"

### Download all E. coli assemblies available on GenBank and RefSeq with a named label (v1)

@@ -166,7 +167,10 @@ or

## Parameters:

genome_updater v0.2.2 by Vitor C. Piro http://github.com/pirovc
┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─
v0.2.5

-g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)]. Example: archaea,bacteria
or Species level taxids (one or more comma-separated entries). Example: species:622,562
@@ -178,8 +182,12 @@
Default: all
-l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]
Default: all
-f File formats [genomic.fna.gz,assembly_report.txt, ... - check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats]
-f File formats [genomic.fna.gz,assembly_report.txt, ...]
check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats
Default: assembly_report.txt
-j Number of top references for each species/taxids to download ["", species:INT, taxids:INT]. Example: "species:3". Selection is based on 1) RefSeq Category, 2) Assembly level, 3) Relation to type material and 4) Date (most recent first)
Default: ""
-z Keep only assemblies present on the latest GTDB release

-k Dry-run, no data is downloaded or updated - just checks for available sequences and changes
-i Fix failed downloads or any incomplete data from a previous run, keep current version
@@ -207,8 +215,11 @@ or

-D Print debug information and exit


## References:

[1] ftp://ftp.ncbi.nlm.nih.gov/genomes/

[2] Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

[3] https://gtdb.ecogenomic.org/
88 changes: 65 additions & 23 deletions genome_updater.sh
@@ -4,7 +4,7 @@ IFS=$' '

# The MIT License (MIT)

# Copyright (c) 2020 - Vitor C. Piro - http://github.com/pirovc
# Copyright (c) 2021 - Vitor C. Piro - pirovc.github.io
# All rights reserved.

# Permission is hereby granted, free of charge, to any person obtaining a copy
@@ -25,27 +25,36 @@ IFS=$' '
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

version="0.2.4"
version="0.2.5"

wget_tries=${wget_tries:-3}
wget_timeout=${wget_timeout:-120}
export wget_tries wget_timeout
export wget_tries wget_timeout
gtdb_urls=( "https://data.gtdb.ecogenomic.org/releases/latest/ar122_taxonomy.tsv.gz"
"https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz" )

# Export locale numeric to avoid errors on printf in different setups
export LC_NUMERIC="en_US.UTF-8"

#activate aliases in the script
shopt -s expand_aliases
alias sort="sort --field-separator=$'\t'"

get_taxdump()
download_url() # parameter: ${1} url, ${2} output file/directory (omit/empty to STDOUT)
{
wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" > ${1}
}

get_new_taxdump()
{
wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz" > ${1}
outfiledir="${2:-}"
if [[ ! -z "${outfiledir}" ]]; then
if [[ -d "${outfiledir}" ]]; then
out="--directory-prefix=${outfiledir}"
else
out="--output-document=${outfiledir}"
fi
else
out="--output-document=-" # STDOUT
fi
wget "${out}" --quiet --continue --tries="${wget_tries}" --read-timeout="${wget_timeout}" ${1}
}
export -f download_url #export it to be accessible to the parallel call
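
For reference, the three output modes implemented above correspond to calls made elsewhere in this commit (a sketch, not additional script code):

download_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/assembly_summary_${d}.txt" | tail -n+3   # no 2nd argument: stream to STDOUT
download_url "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" "${target_output_prefix}taxdump.tar.gz"   # 2nd argument is a file path: write to it
download_url "${url}" "${target_output_prefix}${files_dir}"   # 2nd argument is an existing directory: save into it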

unpack() # parameter: ${1} file, ${2} output folder[, ${3} files to unpack]
{
@@ -65,9 +74,9 @@ parse_new_taxdump() # parameter: ${1} taxids - return all taxids on of provided
taxids=${1}
echolog "Downloading taxdump and generating lineage" "1"
tmp_new_taxdump="${target_output_prefix}new_taxdump.tar.gz"
tmp_taxidlineage="${working_dir}taxidlineage.dmp"
get_new_taxdump "${tmp_new_taxdump}"
download_url "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz" "${tmp_new_taxdump}"
unpack "${tmp_new_taxdump}" "${working_dir}" "taxidlineage.dmp"
tmp_taxidlineage="${working_dir}taxidlineage.dmp"
tmp_lineage=${working_dir}lineage.tmp
for tx in ${taxids//,/ }; do
txids_lin=$(grep "[^0-9]${tx}[^0-9]" "${tmp_taxidlineage}" | cut -f 1) #get only taxids in the lineage section
@@ -89,7 +98,7 @@ get_assembly_summary() # parameter: ${1} assembly_summary file, ${2} database, $
for d in ${2//,/ }
do
if [[ ! -z "${taxids}" || ! -z "${species}" ]]; then # Get complete assembly_summary for database
wget --tries="${wget_tries}" --read-timeout="${wget_timeout}" -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/assembly_summary_${d}.txt | tail -n+3 >> "${1}"
download_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/assembly_summary_${d}.txt" | tail -n+3 >> "${1}"
else
for og in ${3//,/ }
do
Expand All @@ -98,7 +107,7 @@ get_assembly_summary() # parameter: ${1} assembly_summary file, ${2} database, $
then
og="vertebrate_mammalian/Homo_sapiens"
fi
wget --tries="${wget_tries}" --read-timeout="${wget_timeout}" -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/${og}/assembly_summary.txt | tail -n+3 >> "${1}"
download_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/${og}/assembly_summary.txt" | tail -n+3 >> "${1}"
done
fi
done
@@ -127,6 +136,20 @@ filter_assembly_summary() # parameter: ${1} assembly_summary file - return numbe
count_lines_file "${1}"
}

filter_gtdb() # parameter: ${1} assembly_summary file - return number of lines
{
gtdb_acc=${working_dir}"gtdb_acc"
for url in "${gtdb_urls[@]}"
do
# awk to remove prefix RS_ or GB_
download_url "${url}" | zcat | awk -F "\t" '{print substr($1, 4, length($1))}' >> "${gtdb_acc}"
done
join -1 1 -2 1 <(sort -k 1,1 "${1}") <(sort -k 1,1 "${gtdb_acc}") -t$'\t' -o "1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,1.14,1.15,1.16,1.17,1.18,1.19,1.20,1.21,1.22" | sort | uniq > "${1}_gtdb"
mv "${1}_gtdb" "${1}"
rm "${gtdb_acc}"
count_lines_file "${1}"
}
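
A small illustration of the prefix handling above (the taxonomy string is made up; GTDB prepends RS_ to RefSeq accessions and GB_ to GenBank accessions):

echo -e "RS_GCF_000145985.1\td__Archaea;p__Crenarchaeota" | awk -F "\t" '{print substr($1, 4, length($1))}'
# prints GCF_000145985.1, which is then joined against the first column of the assembly_summary file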

top_assembly_summary() # parameter: ${1} assembly_summary file, ${2} top_assemblies - return number of lines
{
if [[ " ${2} " =~ "taxids:" ]]; then
@@ -190,11 +213,12 @@ check_file_folder() # parameter: ${1} url, ${2} log (0->before download/1->after
rm -vf "${target_output_prefix}${files_dir}${file_name}" >> "${log_file}" 2>&1
return 1
else
if [ "${2}" -eq 0 ]; then
echolog "${file_name} file found on the output folder [${target_output_prefix}${files_dir}${file_name}]" "0"
else
echolog "${file_name} downloaded successfully [${1} -> ${target_output_prefix}${files_dir}${file_name}]" "0"
fi
# Disabled log in case of success, hard to detect failures
#if [ "${2}" -eq 0 ]; then
# echolog "${file_name} file found on the output folder [${target_output_prefix}${files_dir}${file_name}]" "0"
#else
# echolog "${file_name} downloaded successfully [${1} -> ${target_output_prefix}${files_dir}${file_name}]" "0"
#fi
return 0
fi
}
@@ -205,7 +229,7 @@ check_md5_ftp() # parameter: ${1} url - returns 0 (ok) / 1 (error)
if [ "${check_md5}" -eq 1 ]; then # Only if md5 checking is activated
md5checksums_url="$(dirname ${1})/md5checksums.txt" # ftp directory
file_name=$(basename ${1}) # downloaded file name
md5checksums_file=$(wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "${md5checksums_url}")
md5checksums_file=$(download_url "${md5checksums_url}")
if [ -z "${md5checksums_file}" ]; then
echolog "${file_name} MD5checksum file download failed [${md5checksums_url}] - FILE KEPT" "0"
return 0
@@ -245,7 +269,7 @@ download() # parameter: ${1} url, ${2} job number, ${3} total files, ${4} url_su
dl=1
fi
if [ "${dl}" -eq 1 ]; then # If file is not yet on folder, download it
wget ${1} --quiet --continue --tries="${wget_tries}" --read-timeout="${wget_timeout}" -P "${target_output_prefix}${files_dir}"
download_url "${1}" "${target_output_prefix}${files_dir}"
if ! check_file_folder ${1} "1"; then # Check if file was downloaded
ex=1
elif ! check_md5_ftp ${1}; then # Check file md5
@@ -386,6 +410,7 @@ refseq_category="all"
assembly_level="all"
file_formats="assembly_report.txt"
top_assemblies=""
gtdb_only=0
download_taxonomy=0
delete_extra_files=0
check_md5=0
@@ -427,6 +452,7 @@ function showhelp {
echo $' -l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]\n\tDefault: all'
echo $' -f File formats [genomic.fna.gz,assembly_report.txt, ...]\n\tcheck ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats\n\tDefault: assembly_report.txt'
echo $' -j Number of top references for each species/taxids to download ["", species:INT, taxids:INT]. Example: "species:3". Selection is based on 1) RefSeq Category, 2) Assembly level, 3) Relation to type material and 4) Date (most recent first)\n\tDefault: ""'
echo $' -z Keep only assemblies present on the latest GTDB release'
echo
echo $' -k Dry-run, no data is downloaded or updated - just checks for available sequences and changes'
echo $' -i Fix failed downloads or any incomplete data from a previous run, keep current version'
@@ -464,7 +490,7 @@ done
if [ "${tool_not_found}" -eq 1 ]; then exit 1; fi

OPTIND=1 # Reset getopts
while getopts "d:g:c:l:o:e:b:t:f:j:n:akixmurpswhD" opt; do
while getopts "d:g:c:l:o:e:b:t:f:j:zn:akixmurpswhD" opt; do
case ${opt} in
d) database=${OPTARG} ;;
g) organism_group=${OPTARG// } ;; #remove spaces
@@ -476,6 +502,7 @@ while getopts "d:g:c:l:o:e:b:t:f:j:n:akixmurpswhD" opt; do
t) threads=${OPTARG} ;;
f) file_formats=${OPTARG// } ;; #remove spaces
j) top_assemblies=${OPTARG} ;;
z) gtdb_only=1 ;;
a) download_taxonomy=1 ;;
k) just_check=1 ;;
i) just_fix=1 ;;
@@ -635,6 +662,7 @@ echolog "RefSeq category: ${refseq_category}" "0"
echolog "Assembly level: ${assembly_level}" "0"
echolog "File formats: ${file_formats}" "0"
echolog "Top assemblies: ${top_assemblies}" "0"
echolog "GTDB Only: ${gtdb_only}" "0"
echolog "Download taxonomy: ${download_taxonomy}" "0"
echolog "Just check for updates: ${just_check}" "0"
echolog "Just fix/recover current version: ${just_fix}" "0"
@@ -673,6 +701,13 @@ if [[ "${MODE}" == "NEW" ]]; then
echolog " - ${all_lines} entries available" "1"
filtered_lines=$(filter_assembly_summary "${new_assembly_summary}")
echolog " - $((all_lines-filtered_lines)) entries removed with filters: RefSeq category=${refseq_category}, Assembly level=${assembly_level}, Version status=latest, valid URLs" "1"
#GTDB
if [ "${gtdb_only}" -eq 1 ]; then
gtdb_lines=$(filter_gtdb "${new_assembly_summary}")
echolog " - $((filtered_lines-gtdb_lines)) entries removed not in GTDB" "1"
filtered_lines=${gtdb_lines}
fi
#TOP ASSEMBLIES
if [[ ! -z "${top_assemblies}" ]]; then
top_lines=$(top_assembly_summary "${new_assembly_summary}" "${top_assemblies}")
echolog " - $((filtered_lines-top_lines)) entries removed with top filter ${top_assemblies}" "1"
@@ -773,6 +808,13 @@ else # update/fix
echolog " - ${all_lines} entries available" "1"
filtered_lines=$(filter_assembly_summary "${new_assembly_summary}")
echolog " - $((all_lines-filtered_lines)) entries removed with filters: RefSeq category=${refseq_category}, Assembly level=${assembly_level}, Version status=latest, valid URLs" "1"
#GTDB
if [ "${gtdb_only}" -eq 1 ]; then
gtdb_lines=$(filter_gtdb "${new_assembly_summary}")
echolog " - $((filtered_lines-gtdb_lines)) entries removed not in GTDB" "1"
filtered_lines=${gtdb_lines}
fi
#TOP ASSEMBLIES
if [[ ! -z "${top_assemblies}" ]]; then
top_lines=$(top_assembly_summary "${new_assembly_summary}" "${top_assemblies}")
echolog " - $((filtered_lines-top_lines)) entries removed with top filter ${top_assemblies}" "1"
@@ -866,7 +908,7 @@ fi
if [ "${just_check}" -eq 0 ]; then
if [ "${download_taxonomy}" -eq 1 ]; then
echolog "Downloading current Taxonomy database [${target_output_prefix}taxdump.tar.gz] " "1"
get_taxdump "${target_output_prefix}taxdump.tar.gz"
download_url "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" "${target_output_prefix}taxdump.tar.gz"
echolog " - Done" "1"
echolog "" "1"
fi
2 changes: 2 additions & 0 deletions tests/assembly_summary_gtdb.txt
@@ -0,0 +1,2 @@
GCA_000145985.1 PRJNA33361 SAMN00016987 na 583356 334771 Ignisphaera aggregans DSM 17230 strain=DSM 17230 latest Complete Genome Major Full 2010/08/24 ASM14598v1 US DOE Joint Genome Institute (JGI-PGF) GCF_000145985.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/145/985/GCA_000145985.1_ASM14598v1 missing rRNA genes assembly from type material
GCA_XXXXXXXXX.X PRJNA202 SAMN02744041 na 414004 46770 Cenarchaeum symbiosum A latest Chromosome Major Full 2006/11/20 ASM20071v1 DOE Joint Genome Institute GCF_000200715.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/200/715/GCA_000200715.1_ASM20071v1 derived from environmental source
14 changes: 12 additions & 2 deletions tests/tests.sh
@@ -90,8 +90,18 @@ out_2="tests/tst_taxids2493627"
# check if both runs have the same files
diff <(find "${out_all}"/v1/files/ -xtype f -printf "%f\n" | sort) <(find "${out_1}"/v1/files/ "${out_2}"/v1/files/ -xtype f -printf "%f\n" | sort)

echo ""
echo ""

####################### GTDB filter test
out_gtdb="tests/tst_gtdb_filter"
./genome_updater.sh -o "${out_gtdb}" -e "tests/assembly_summary_gtdb.txt" -m -b v1 -t ${threads} -p -z
# should download only one file
test $(find "${out_gtdb}/v1/files/" -xtype f | wc -l) -eq 1
# test if nothing failed (it should not even be attempted to download, removed in filter step)
test $(cat "${out_gtdb}/v1/"*_url_failed.txt | wc -l) -eq 0
# and one successful
test $(cat "${out_gtdb}/v1/"*_url_downloaded.txt | wc -l) -eq 1


echo ""
echo "*******************************"
echo "All tests finished successfully"
