diff --git a/README.md b/README.md index 44b0f33..9f78084 100755 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ Bash script to download and update snapshots of the NCBI genomes (refseq/genbank ## Description: -- genome_updater runs on a working directory (**-o**) and creates snapshots/versions (**-b**) of refseq/genbank genome repositories based on selected parameters: database (**-d**), organism group or species/taxids (**-g**), RefSeq category (**-c**), assembly level (**-l**), top assemblies (**-j**) and file type(s) (**-f**) +- genome_updater runs on a working directory (**-o**) and creates snapshots/versions (**-b**) of refseq/genbank genome repositories based on selected parameters: database (**-d**), organism group or species/taxids (**-g**), RefSeq category (**-c**), assembly level (**-l**), top assemblies (**-j**), GTDB [3] compatible (**-z**) and file type(s) (**-f**) - genome_updater can update the selected repository by executing the same command again. It will identify previous files and update the working directory with the most recent version, keeping track of changes and just downloading/removing updated files ## Installation: @@ -24,7 +24,7 @@ or - genome_updater depends only on the GNU Core Utilities and additional tools (`awk` `bc` `find` `join` `md5sum` `parallel` `sed` `tar` `xargs` `wget`) which are commonly available in most distributions. If you are not sure if you have them all, just run genome_updater.sh and it will tell you if something is missing (otherwise the it will show the help page). - To test genome_updater basic functions, run the script `tests/tests.sh`. It should print "All tests finished successfully" at the end. - Make sure you have access to the NCBI ftp folders: `ftp://ftp.ncbi.nlm.nih.gov/genomes/` and `ftp://ftp.ncbi.nih.gov/pub/taxonomy/` - - If you still run into some issues it may be that some tools are running with incompatible or old versions. Please open an issue (https://github.com/pirovc/genome_updater/issues) with the description and the output of the command `genome_updater.sh -D`. + - If you still run into issues it may be that some utilities are incompatible or outdated. Please open an issue (https://github.com/pirovc/genome_updater/issues) describing the problem and the output of the command `genome_updater.sh -D`. ## Simple example: @@ -43,6 +43,7 @@ Data selection: - **-l**: filter by Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig] - **-c**: filter by RefSeq Category [all, reference genome, representative genome, na] - **-j**: select [top assemblies](#top-assemblies) for species or taxids: (**-j "species:3"**) to download the top 3 assemblies for each species selected or (**-j "taxids:1"**) to download only the top assembly for each taxid selected. +- **-z**: select only assemblies included in the latest GTDB release Utilities: - **-i**: fixes current snapshot in case of network or any other failure during download @@ -79,7 +80,7 @@ Reports: ### Download one genome assembly for each bacterial species in genbank - ./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_refseq" -t 12 -j "species:1" + ./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_genbank" -t 12 -j "species:1" ### Download all E. Coli assemblies available on GenBank and RefSeq with a named label (v1) @@ -166,7 +167,10 @@ or ## Parameters: - genome_updater v0.2.2 by Vitor C. Piro http://github.com/pirovc + ┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐ + │ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘ + └─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─ + v0.2.5 -g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)]. Example: archaea,bacteria or Species level taxids (one or more comma-separated entries). Example: species:622,562 @@ -178,8 +182,12 @@ or Default: all -l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig] Default: all - -f File formats [genomic.fna.gz,assembly_report.txt, ... - check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats] + -f File formats [genomic.fna.gz,assembly_report.txt, ...] + check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats Default: assembly_report.txt + -j Number of top references for each species/taxids to download ["", species:INT, taxids:INT]. Example: "species:3". Selection is based on 1) RefSeq Category, 2) Assembly level, 3) Relation to type material and 4) Date (most recent first) + Default: "" + -z Keep only assemblies present on the latest GTDB release -k Dry-run, no data is downloaded or updated - just checks for available sequences and changes -i Fix failed downloads or any incomplete data from a previous run, keep current version @@ -207,8 +215,11 @@ or -D Print print debug information and exit + ## References: [1] ftp://ftp.ncbi.nlm.nih.gov/genomes/ [2] Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47. + +[3] https://gtdb.ecogenomic.org/ diff --git a/genome_updater.sh b/genome_updater.sh index df2d95e..db66257 100755 --- a/genome_updater.sh +++ b/genome_updater.sh @@ -4,7 +4,7 @@ IFS=$' ' # The MIT License (MIT) -# Copyright (c) 2020 - Vitor C. Piro - http://github.com/pirovc +# Copyright (c) 20201 - Vitor C. Piro - pirovc.github.io # All rights reserved. # Permission is hereby granted, free of charge, to any person obtaining a copy @@ -25,11 +25,14 @@ IFS=$' ' # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # THE SOFTWARE. -version="0.2.4" +version="0.2.5" wget_tries=${wget_tries:-3} wget_timeout=${wget_timeout:-120} -export wget_tries wget_timeout +export wget_tries wget_timeout +gtdb_urls=( "https://data.gtdb.ecogenomic.org/releases/latest/ar122_taxonomy.tsv.gz" + "https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz" ) + # Export locale numeric to avoid errors on printf in different setups export LC_NUMERIC="en_US.UTF-8" @@ -37,15 +40,21 @@ export LC_NUMERIC="en_US.UTF-8" shopt -s expand_aliases alias sort="sort --field-separator=$'\t'" -get_taxdump() +download_url() # parameter: ${1} url, ${2} output file/directory (omit/empty to STDOUT) { - wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" > ${1} -} - -get_new_taxdump() -{ - wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz" > ${1} + outfiledir="${2:-}" + if [[ ! -z "${outfiledir}" ]]; then + if [[ -d "${outfiledir}" ]]; then + out="--directory-prefix=${outfiledir}" + else + out="--output-document=${outfiledir}" + fi + else + out="--output-document=-" # STDOUT + fi + wget "${out}" --quiet --continue --tries="${wget_tries}" --read-timeout="${wget_timeout}" ${1} } +export -f download_url #export it to be accessible to the parallel call unpack() # parameter: ${1} file, ${2} output folder[, ${3} files to unpack] { @@ -65,9 +74,9 @@ parse_new_taxdump() # parameter: ${1} taxids - return all taxids on of provided taxids=${1} echolog "Downloading taxdump and generating lineage" "1" tmp_new_taxdump="${target_output_prefix}new_taxdump.tar.gz" - tmp_taxidlineage="${working_dir}taxidlineage.dmp" - get_new_taxdump "${tmp_new_taxdump}" + download_url "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz" "${tmp_new_taxdump}" unpack "${tmp_new_taxdump}" "${working_dir}" "taxidlineage.dmp" + tmp_taxidlineage="${working_dir}taxidlineage.dmp" tmp_lineage=${working_dir}lineage.tmp for tx in ${taxids//,/ }; do txids_lin=$(grep "[^0-9]${tx}[^0-9]" "${tmp_taxidlineage}" | cut -f 1) #get only taxids in the lineage section @@ -89,7 +98,7 @@ get_assembly_summary() # parameter: ${1} assembly_summary file, ${2} database, $ for d in ${2//,/ } do if [[ ! -z "${taxids}" || ! -z "${species}" ]]; then # Get complete assembly_summary for database - wget --tries="${wget_tries}" --read-timeout="${wget_timeout}" -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/assembly_summary_${d}.txt | tail -n+3 >> "${1}" + download_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/assembly_summary_${d}.txt" | tail -n+3 >> "${1}" else for og in ${3//,/ } do @@ -98,7 +107,7 @@ get_assembly_summary() # parameter: ${1} assembly_summary file, ${2} database, $ then og="vertebrate_mammalian/Homo_sapiens" fi - wget --tries="${wget_tries}" --read-timeout="${wget_timeout}" -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/${og}/assembly_summary.txt | tail -n+3 >> "${1}" + download_url "ftp://ftp.ncbi.nlm.nih.gov/genomes/${d}/${og}/assembly_summary.txt" | tail -n+3 >> "${1}" done fi done @@ -127,6 +136,20 @@ filter_assembly_summary() # parameter: ${1} assembly_summary file - return numbe count_lines_file "${1}" } +filter_gtdb() # parameter: ${1} assembly_summary file - return number of lines +{ + gtdb_acc=${working_dir}"gtdb_acc" + for url in "${gtdb_urls[@]}" + do + # awk to remove prefix RS_ or GB_ + download_url "${url}" | zcat | awk -F "\t" '{print substr($1, 4, length($1))}' >> "${gtdb_acc}" + done + join -1 1 -2 1 <(sort -k 1,1 "${1}") <(sort -k 1,1 "${gtdb_acc}") -t$'\t' -o "1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,1.14,1.15,1.16,1.17,1.18,1.19,1.20,1.21,1.22" | sort | uniq > "${1}_gtdb" + mv "${1}_gtdb" "${1}" + rm "${gtdb_acc}" + count_lines_file "${1}" +} + top_assembly_summary() # parameter: ${1} assembly_summary file, ${2} top_assemblies - return number of lines { if [[ " ${2} " =~ "taxids:" ]]; then @@ -190,11 +213,12 @@ check_file_folder() # parameter: ${1} url, ${2} log (0->before download/1->after rm -vf "${target_output_prefix}${files_dir}${file_name}" >> "${log_file}" 2>&1 return 1 else - if [ "${2}" -eq 0 ]; then - echolog "${file_name} file found on the output folder [${target_output_prefix}${files_dir}${file_name}]" "0" - else - echolog "${file_name} downloaded successfully [${1} -> ${target_output_prefix}${files_dir}${file_name}]" "0" - fi + # Disabled log in case of success, hard to detect failures + #if [ "${2}" -eq 0 ]; then + # echolog "${file_name} file found on the output folder [${target_output_prefix}${files_dir}${file_name}]" "0" + #else + # echolog "${file_name} downloaded successfully [${1} -> ${target_output_prefix}${files_dir}${file_name}]" "0" + #fi return 0 fi } @@ -205,7 +229,7 @@ check_md5_ftp() # parameter: ${1} url - returns 0 (ok) / 1 (error) if [ "${check_md5}" -eq 1 ]; then # Only if md5 checking is activated md5checksums_url="$(dirname ${1})/md5checksums.txt" # ftp directory file_name=$(basename ${1}) # downloaded file name - md5checksums_file=$(wget -qO- --tries="${wget_tries}" --read-timeout="${wget_timeout}" "${md5checksums_url}") + md5checksums_file=$(download_url "${md5checksums_url}") if [ -z "${md5checksums_file}" ]; then echolog "${file_name} MD5checksum file download failed [${md5checksums_url}] - FILE KEPT" "0" return 0 @@ -245,7 +269,7 @@ download() # parameter: ${1} url, ${2} job number, ${3} total files, ${4} url_su dl=1 fi if [ "${dl}" -eq 1 ]; then # If file is not yet on folder, download it - wget ${1} --quiet --continue --tries="${wget_tries}" --read-timeout="${wget_timeout}" -P "${target_output_prefix}${files_dir}" + download_url "${1}" "${target_output_prefix}${files_dir}" if ! check_file_folder ${1} "1"; then # Check if file was downloaded ex=1 elif ! check_md5_ftp ${1}; then # Check file md5 @@ -386,6 +410,7 @@ refseq_category="all" assembly_level="all" file_formats="assembly_report.txt" top_assemblies="" +gtdb_only=0 download_taxonomy=0 delete_extra_files=0 check_md5=0 @@ -427,6 +452,7 @@ function showhelp { echo $' -l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]\n\tDefault: all' echo $' -f File formats [genomic.fna.gz,assembly_report.txt, ...]\n\tcheck ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats\n\tDefault: assembly_report.txt' echo $' -j Number of top references for each species/taxids to download ["", species:INT, taxids:INT]. Example: "species:3". Selection is based on 1) RefSeq Category, 2) Assembly level, 3) Relation to type material and 4) Date (most recent first)\n\tDefault: ""' + echo $' -z Keep only assemblies present on the latest GTDB release' echo echo $' -k Dry-run, no data is downloaded or updated - just checks for available sequences and changes' echo $' -i Fix failed downloads or any incomplete data from a previous run, keep current version' @@ -464,7 +490,7 @@ done if [ "${tool_not_found}" -eq 1 ]; then exit 1; fi OPTIND=1 # Reset getopts -while getopts "d:g:c:l:o:e:b:t:f:j:n:akixmurpswhD" opt; do +while getopts "d:g:c:l:o:e:b:t:f:j:zn:akixmurpswhD" opt; do case ${opt} in d) database=${OPTARG} ;; g) organism_group=${OPTARG// } ;; #remove spaces @@ -476,6 +502,7 @@ while getopts "d:g:c:l:o:e:b:t:f:j:n:akixmurpswhD" opt; do t) threads=${OPTARG} ;; f) file_formats=${OPTARG// } ;; #remove spaces j) top_assemblies=${OPTARG} ;; + z) gtdb_only=1 ;; a) download_taxonomy=1 ;; k) just_check=1 ;; i) just_fix=1 ;; @@ -635,6 +662,7 @@ echolog "RefSeq category: ${refseq_category}" "0" echolog "Assembly level: ${assembly_level}" "0" echolog "File formats: ${file_formats}" "0" echolog "Top assemblies: ${top_assemblies}" "0" +echolog "GTDB Only: ${gtdb_only}" "0" echolog "Download taxonomy: ${download_taxonomy}" "0" echolog "Just check for updates: ${just_check}" "0" echolog "Just fix/recover current version: ${just_fix}" "0" @@ -673,6 +701,13 @@ if [[ "${MODE}" == "NEW" ]]; then echolog " - ${all_lines} entries available" "1" filtered_lines=$(filter_assembly_summary "${new_assembly_summary}") echolog " - $((all_lines-filtered_lines)) entries removed with filters: RefSeq category=${refseq_category}, Assembly level=${assembly_level}, Version status=latest, valid URLs" "1" + #GTDB + if [ "${gtdb_only}" -eq 1 ]; then + gtdb_lines=$(filter_gtdb "${new_assembly_summary}") + echolog " - $((filtered_lines-gtdb_lines)) entries removed not in GTDB" "1" + filtered_lines=${gtdb_lines} + fi + #TOP ASSEMBLIES if [[ ! -z "${top_assemblies}" ]]; then top_lines=$(top_assembly_summary "${new_assembly_summary}" "${top_assemblies}") echolog " - $((filtered_lines-top_lines)) entries removed with top filter ${top_assemblies}" "1" @@ -773,6 +808,13 @@ else # update/fix echolog " - ${all_lines} entries available" "1" filtered_lines=$(filter_assembly_summary "${new_assembly_summary}") echolog " - $((all_lines-filtered_lines)) entries removed with filters: RefSeq category=${refseq_category}, Assembly level=${assembly_level}, Version status=latest, valid URLs" "1" + #GTDB + if [ "${gtdb_only}" -eq 1 ]; then + gtdb_lines=$(filter_gtdb "${new_assembly_summary}") + echolog " - $((filtered_lines-gtdb_lines)) entries removed not in GTDB" "1" + filtered_lines=${gtdb_lines} + fi + #TOP ASSEMBLIES if [[ ! -z "${top_assemblies}" ]]; then top_lines=$(top_assembly_summary "${new_assembly_summary}" "${top_assemblies}") echolog " - $((filtered_lines-top_lines)) entries removed with top filter ${top_assemblies}" "1" @@ -866,7 +908,7 @@ fi if [ "${just_check}" -eq 0 ]; then if [ "${download_taxonomy}" -eq 1 ]; then echolog "Downloading current Taxonomy database [${target_output_prefix}taxdump.tar.gz] " "1" - get_taxdump "${target_output_prefix}taxdump.tar.gz" + download_url "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" "${target_output_prefix}taxdump.tar.gz" echolog " - Done" "1" echolog "" "1" fi diff --git a/tests/assembly_summary_gtdb.txt b/tests/assembly_summary_gtdb.txt new file mode 100644 index 0000000..4601cc3 --- /dev/null +++ b/tests/assembly_summary_gtdb.txt @@ -0,0 +1,2 @@ +GCA_000145985.1 PRJNA33361 SAMN00016987 na 583356 334771 Ignisphaera aggregans DSM 17230 strain=DSM 17230 latest Complete Genome Major Full 2010/08/24 ASM14598v1 US DOE Joint Genome Institute (JGI-PGF) GCF_000145985.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/145/985/GCA_000145985.1_ASM14598v1 missing rRNA genes assembly from type material +GCA_XXXXXXXXX.X PRJNA202 SAMN02744041 na 414004 46770 Cenarchaeum symbiosum A latest Chromosome Major Full 2006/11/20 ASM20071v1 DOE Joint Genome Institute GCF_000200715.1 identical ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/200/715/GCA_000200715.1_ASM20071v1 derived from environmental source diff --git a/tests/tests.sh b/tests/tests.sh index 6ba8d6b..23e19f6 100755 --- a/tests/tests.sh +++ b/tests/tests.sh @@ -90,8 +90,18 @@ out_2="tests/tst_taxids2493627" # check if both runs have the same files diff <(find "${out_all}"/v1/files/ -xtype f -printf "%f\n" | sort) <(find "${out_1}"/v1/files/ "${out_2}"/v1/files/ -xtype f -printf "%f\n" | sort) -echo "" -echo "" + +####################### GTDB filter test +out_gtdb="tests/tst_gtdb_filter" +./genome_updater.sh -o "${out_gtdb}" -e "tests/assembly_summary_gtdb.txt" -m -b v1 -t ${threads} -p -z +# should download only one file +test $(find "${out_gtdb}/v1/files/" -xtype f | wc -l) -eq 1 +# test if nothing failed (it should not even be attempted to download, removed in filter step) +test $(cat "${out_gtdb}/v1/"*_url_failed.txt | wc -l) -eq 0 +# and one successful +test $(cat "${out_gtdb}/v1/"*_url_downloaded.txt | wc -l) -eq 1 + + echo "" echo "*******************************" echo "All tests finished successfully"