diff --git a/README.md b/README.md index 25a59f9..15482e6 100755 --- a/README.md +++ b/README.md @@ -34,6 +34,7 @@ Some days later, update the repository: genome_updater downloads and keeps several snapshots of a certain sub-set of the genomes repository, without redundancy and with incremental track of changes. - it runs on a working directory (defined with `-o`) and creates a snapshot (optionally named with `-b`, timestamp by default) of refseq and/or genbank (`-d`) genome repositories based on selected organism groups (`-g`) and/or taxonomic ids (`-T`) with the desired files type(s) (`-f`) +- files are downloaded to a single folder by default ("{prefix}files/") but can be also saved in the NCBI ftp file structure (`-N`) - filters can be applied to refine the selection: refseq category (`-c`), assembly level (`-l`), dates (`-D`/`-E`), custom filters (`-F`), [top assemblies](#Top-assemblies) (`-A`) - `-M gtdb` enables GTDB [3] compability. Only assemblies from the latest GTDB release will be kept and taxonomic filters will work based on GTDB nodes (e.g. `-T "c__Hydrothermarchaeia"` or `-A genus:3`) - the repository can be updated or changed with incremental changes. outdated files are kept in their respective version and repeated files linked to the new version. genome_updater keepts track of all changes and just downloads what is necessary @@ -49,7 +50,7 @@ or direct file download: wget https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh chmod +x genome_updater.sh - - genome_updater is portable and depends on the GNU Core Utilities + few additional tools (`awk` `bc` `find` `join` `md5sum` `parallel` `sed` `tar` `xargs` `wget`/`curl`) which are commonly available and installed in most distributions. If you are not sure if you have them all, just run `genome_updater.sh` and it will tell you if something is missing (otherwise the it will show the help page). 
+ - genome_updater is portable and depends on the GNU Core Utilities + few additional tools (`awk` `bc` `find` `join` `md5sum` `parallel` `sed` `tar` `wget`/`curl`) which are commonly available and installed in most distributions. If you are not sure if you have them all, just run `genome_updater.sh` and it will tell you if something is missing (otherwise the it will show the help page). To test if all genome_updater functions are running properly on your system: @@ -197,96 +198,99 @@ or ## Parameters - ┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐ - │ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘ - └─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─ - v0.5.1 - - Database options: - -d Database (comma-separated entries) - [genbank, refseq] - - Organism options: - -g Organism group(s) (comma-separated entries, empty for all) - [archaea, bacteria, fungi, human, invertebrate, metagenomes, - other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral] - Default: "" - -T Taxonomic identifier(s) (comma-separated entries, empty for all). - Example: "562" (for -M ncbi) or "s__Escherichia coli" (for -M gtdb) - Default: "" - - File options: - -f file type(s) (comma-separated entries) - [genomic.fna.gz, assembly_report.txt, protein.faa.gz, genomic.gbff.gz] - More formats at https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt - Default: assembly_report.txt - - Filter options: - -c refseq category (comma-separated entries, empty for all) - [reference genome, representative genome, na] - Default: "" - -l assembly level (comma-separated entries, empty for all) - [complete genome, chromosome, scaffold, contig] - Default: "" - -D Start date (>=), based on the sequence release date. Format YYYYMMDD. - Default: "" - -E End date (<=), based on the sequence release date. Format YYYYMMDD. - Default: "" - -F custom filter for the assembly summary in the format colA:val1|colB:valX,valY (case insensitive). 
- Example: -F "2:PRJNA12377,PRJNA670754|14:Partial" (AND between cols, OR between values) - Column info at https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt - Default: "" - - Taxonomy options: - -M Taxonomy. gtdb keeps only assemblies in GTDB (R207). ncbi keeps only latest assemblies (version_status). - [ncbi, gtdb] - Default: "ncbi" - -A Keep a limited number of assemblies for each selected taxa (leaf nodes). 0 for all. - Selection by ranks are also supported with rank:number (e.g genus:3) - [species, genus, family, order, class, phylum, kingdom, superkingdom] - Selection order based on: RefSeq Category, Assembly level, Relation to type material, Date. - Default: 0 - -a Keep the current version of the taxonomy database in the output folder - - Run options: - -o Output/Working directory - Default: ./tmp.XXXXXXXXXX - -t Threads to parallelize download and some file operations - Default: 1 - -k Dry-run mode. No sequence data is downloaded or updated - just checks for available sequences and changes - -i Fix only mode. Re-downloads incomplete or failed data from a previous run. Can also be used to change files (-f). - -m Check MD5 of downloaded files - - Report options: - -u Updated assembly accessions report - (Added/Removed, assembly accession, url) - -r Updated sequence accessions report - (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid) - Only available when file format assembly_report.txt is selected and successfully downloaded - -p Reports URLs successfuly downloaded and failed (url_failed.txt url_downloaded.txt) - - Misc. options: - -b Version label - Default: current timestamp (YYYY-MM-DD_HH-MM-SS) - -e External "assembly_summary.txt" file to recover data from. Mutually exclusive with -d / -g - Default: "" - -B Alternative version label to use as the current version. Mutually exclusive with -i. - Can be used to rollback to an older version or to create multiple branches from a base version. 
- Default: "" - -R Number of attempts to retry to download files in batches - Default: 3 - -n Conditional exit status based on number of failures accepted, otherwise will Exit Code = 1. - Example: -n 10 will exit code 1 if 10 or more files failed to download - [integer for file number, float for percentage, 0 = off] - Default: 0 - -L Downloader - [wget, curl] - Default: wget - -x Allow the deletion of regular extra files (not symbolic links) found in the output folder - -s Silent output - -w Silent output with download progress only - -V Verbose log - -Z Print debug information and run in debug mode +``` +┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐ +│ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘ +└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─ + v0.6.0 + +Database options: + -d Database (comma-separated entries) + [genbank, refseq] + +Organism options: + -g Organism group(s) (comma-separated entries, empty for all) + [archaea, bacteria, fungi, human, invertebrate, metagenomes, + other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral] + Default: "" + -T Taxonomic identifier(s) (comma-separated entries, empty for all). + Example: "562" (for -M ncbi) or "s__Escherichia coli" (for -M gtdb) + Default: "" + +File options: + -f file type(s) (comma-separated entries) + [genomic.fna.gz, assembly_report.txt, protein.faa.gz, genomic.gbff.gz] + More formats at https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt + Default: assembly_report.txt + +Filter options: + -c refseq category (comma-separated entries, empty for all) + [reference genome, representative genome, na] + Default: "" + -l assembly level (comma-separated entries, empty for all) + [complete genome, chromosome, scaffold, contig] + Default: "" + -D Start date (>=), based on the sequence release date. Format YYYYMMDD. + Default: "" + -E End date (<=), based on the sequence release date. Format YYYYMMDD. 
+ Default: "" + -F custom filter for the assembly summary in the format colA:val1|colB:valX,valY (case insensitive). + Example: -F "2:PRJNA12377,PRJNA670754|14:Partial" (AND between cols, OR between values) + Column info at https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt + Default: "" + +Taxonomy options: + -M Taxonomy. gtdb keeps only assemblies in GTDB (R207). ncbi keeps only latest assemblies (version_status). + [ncbi, gtdb] + Default: "ncbi" + -A Keep a limited number of assemblies for each selected taxa (leaf nodes). 0 for all. + Selection by ranks are also supported with rank:number (e.g genus:3) + [species, genus, family, order, class, phylum, kingdom, superkingdom] + Selection order based on: RefSeq Category, Assembly level, Relation to type material, Date. + Default: 0 + -a Keep the current version of the taxonomy database in the output folder + +Run options: + -o Output/Working directory + Default: ./tmp.XXXXXXXXXX + -t Threads to parallelize download and some file operations + Default: 1 + -k Dry-run mode. No sequence data is downloaded or updated - just checks for available sequences and changes + -i Fix only mode. Re-downloads incomplete or failed data from a previous run. Can also be used to change files (-f). + -m Check MD5 of downloaded files + +Report options: + -u Updated assembly accessions report + (Added/Removed, assembly accession, url) + -r Updated sequence accessions report + (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid) + Only available when file format assembly_report.txt is selected and successfully downloaded + -p Reports URLs successfuly downloaded and failed (url_failed.txt url_downloaded.txt) + +Misc. options: + -b Version label + Default: current timestamp (YYYY-MM-DD_HH-MM-SS) + -e External "assembly_summary.txt" file to recover data from. Mutually exclusive with -d / -g + Default: "" + -B Alternative version label to use as the current version. 
Mutually exclusive with -i. + Can be used to rollback to an older version or to create multiple branches from a base version. + Default: "" + -R Number of attempts to retry to download files in batches + Default: 3 + -n Conditional exit status based on number of failures accepted, otherwise will Exit Code = 1. + Example: -n 10 will exit code 1 if 10 or more files failed to download + [integer for file number, float for percentage, 0 = off] + Default: 0 + -N Output files in folders like NCBI ftp structure (e.g. files/GCF/000/499/605/GCF_000499605.1_EMW001_assembly_report.txt) + -L Downloader + [wget, curl] + Default: wget + -x Allow the deletion of regular extra files (not symbolic links) found in the output folder + -s Silent output + -w Silent output with download progress only + -V Verbose log + -Z Print debug information and run in debug mode +``` ## References: diff --git a/genome_updater.sh b/genome_updater.sh index 992cee1..a27ae33 100755 --- a/genome_updater.sh +++ b/genome_updater.sh @@ -25,7 +25,7 @@ IFS=$' ' # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN # THE SOFTWARE. 
-version="0.5.2" +version="0.6.0" # Define ncbi_base_url or use local files (for testing) local_dir=${local_dir:-} @@ -91,6 +91,42 @@ download_retry_md5(){ # parameter: ${1} url, ${2} output file, ${3} url MD5, ${4 return 1; # failed to check md5 after all attempts } +path_output() # parameter: ${1} file/url +{ + f=$(basename ${1}); + path="${files_dir}"; + if [[ "${ncbi_folders}" -eq 1 ]]; then + path="${path}${f:0:3}/${f:4:3}/${f:7:3}/${f:10:3}/"; + fi + echo "${path}"; +} +export -f path_output + +link_version() # parameter: ${1} current_output_prefix, ${2} new_output_prefix, ${3} file +{ + path_out=$(path_output ${3}) + if [[ -f "${1}${path_out}${3}" ]]; then + mkdir -p "${2}${path_out}"; + ln -s -r "${1}${path_out}${3}" "${2}${path_out}"; + fi +} +export -f link_version #export it to be accessible to the parallel call + +list_local_files() # parameter: ${1} prefix, ${2} 1 to list list all, "" list only '-not -empty' +{ + # Returns list of local files, without folder structure + if [[ "${ncbi_folders}" -eq 0 ]]; then + depth="-maxdepth 1"; + else + depth="-mindepth 4"; + fi + param="-not -empty" + if [[ ! -z "${2:-}" ]]; then + param="" + fi + find "${1}${files_dir}" ${depth} ${param} -type f,l -printf "%f\n" +} + unpack() # parameter: ${1} file, ${2} output folder[, ${3} files to unpack] { tar xf "${1}" -C "${2}" "${3}" @@ -530,18 +566,19 @@ export -f print_progress #export it to be accessible to the parallel call check_file_folder() # parameter: ${1} url, ${2} log (0->before download/1->after download) - returns 0 (ok) / 1 (error) { file_name=$(basename ${1}) + path_name="${target_output_prefix}$(path_output ${file_name})${file_name}" # Check if file exists and if it has a size greater than zero (-s) - if [ ! -s "${target_output_prefix}${files_dir}${file_name}" ]; then + if [ ! 
-s "${path_name}" ]; then if [ "${2}" -eq 1 ]; then echolog "${file_name} download failed [${1}]" "0"; fi # Remove file if exists (only zero-sized files) - rm -vf "${target_output_prefix}${files_dir}${file_name}" >> "${log_file}" 2>&1 + rm -vf "${path_name}" >> "${log_file}" 2>&1 return 1 else if [ "${verbose_log}" -eq 1 ]; then if [ "${2}" -eq 0 ]; then - echolog "${file_name} file found on the output folder [${target_output_prefix}${files_dir}${file_name}]" "0" + echolog "${file_name} file found on the output folder [${path_name}]" "0" else - echolog "${file_name} downloaded successfully [${1} -> ${target_output_prefix}${files_dir}${file_name}]" "0" + echolog "${file_name} downloaded successfully [${1} -> ${path_name}]" "0" fi fi return 0 @@ -564,11 +601,12 @@ check_md5_ftp() # parameter: ${1} url - returns 0 (ok) / 1 (error) echolog "${file_name} MD5checksum file not available [${md5checksums_url}] - FILE KEPT" "0" return 0 else - file_md5=$(md5sum ${target_output_prefix}${files_dir}${file_name} | cut -f1 -d' ') + path_name="${target_output_prefix}$(path_output ${file_name})${file_name}" # local file path and name + file_md5=$(md5sum ${path_name} | cut -f1 -d' ') if [ "${file_md5}" != "${ftp_md5}" ]; then echolog "${file_name} MD5 not matching [${md5checksums_url}] - FILE REMOVED" "0" # Remove file only when MD5 doesn't match - rm -v "${target_output_prefix}${files_dir}${file_name}" >> ${log_file} 2>&1 + rm -v "${path_name}" >> ${log_file} 2>&1 return 1 else if [ "${verbose_log}" -eq 1 ]; then @@ -595,7 +633,9 @@ download() # parameter: ${1} url, ${2} job number, ${3} total files, ${4} url_su dl=1 fi if [ "${dl}" -eq 1 ]; then # If file is not yet on folder, download it - download_url "${1}" "${target_output_prefix}${files_dir}" + path_out="${target_output_prefix}$(path_output ${1})" + mkdir -p "${path_out}" + download_url "${1}" "${path_out}" if ! check_file_folder ${1} "1"; then # Check if file was downloaded ex=1 elif ! 
check_md5_ftp ${1}; then # Check file md5 @@ -668,13 +708,13 @@ remove_files() # parameter: ${1} file, ${2} fields [assembly_accesion,url] OR fi fi deleted_files=0 while read f; do - fname="${target_output_prefix}${files_dir}${f}" + path_name="${target_output_prefix}$(path_output ${f})${f}" # Only delete if delete option is enable or if it's a symbolic link (from updates) - if [[ -L "${fname}" || "${delete_extra_files}" -eq 1 ]]; then - rm "${target_output_prefix}${files_dir}${f}" -v >> ${log_file} + if [[ -L "${path_name}" || "${delete_extra_files}" -eq 1 ]]; then + rm "${path_name}" -v >> ${log_file} deleted_files=$((deleted_files + 1)) else - echolog "kept '${fname}'" "0" + echolog "kept '${path_name}'" "0" fi done <<< "${filelist}" echo ${deleted_files} @@ -682,14 +722,13 @@ remove_files() # parameter: ${1} file, ${2} fields [assembly_accesion,url] OR fi check_missing_files() # ${1} file, ${2} fields [assembly_accesion,url], ${3} extension - returns assembly accession, url and filename { - # Just returns if file doesn't exist or if it's zero size - list_files ${1} ${2} ${3} | xargs -P "${threads}" --no-run-if-empty -n3 sh -c 'if [ ! 
-s "'"${target_output_prefix}${files_dir}"'${2}" ]; then echo "${0}'$'\t''${1}'$'\t''${2}"; fi' + join -1 3 -2 1 <(list_files ${1} ${2} ${3} | sort -k 3,3 -t$'\t') <(list_local_files "${target_output_prefix}" | sort) -t$'\t' -v 1 -o "1.1,1.2,1.3" } check_complete_record() # parameters: ${1} file, ${2} field [assembly accession, url], ${3} extension - returns assembly accession, url { expected_files=$(list_files ${1} ${2} ${3} | sort -k 3,3) - join -1 3 -2 1 <(echo "${expected_files}" | sort -k 3,3) <(ls -1 "${target_output_prefix}${files_dir}" | sort) -t$'\t' -o "1.1" -v 1 | sort | uniq | # Check for accessions with at least one missing file + join -1 3 -2 1 <(echo "${expected_files}" | sort -k 3,3) <(list_local_files "${target_output_prefix}" | sort) -t$'\t' -o "1.1" -v 1 | sort | uniq | # Check for accessions with at least one missing file join -1 1 -2 1 <(echo "${expected_files}" | cut -f 1,2 | sort | uniq) - -t$'\t' -v 1 # Extract just assembly accession and url for complete entries (no missing files) } @@ -701,8 +740,8 @@ output_assembly_accession() # parameters: ${1} file, ${2} field [assembly access output_sequence_accession() # parameters: ${1} file, ${2} field [assembly accession, url], ${3} extension, ${4} mode (A/R), ${5} assembly_summary (for taxid) { join <(list_files ${1} ${2} "assembly_report.txt" | sort -k 1,1) <(check_complete_record ${1} ${2} ${3} | sort -k 1,1) -t$'\t' -o "1.1,1.3" | # List assembly accession and filename for all assembly_report.txt with complete record (no missing files) - returns assembly accesion, filename - join - <(sort -k 1,1 ${5}) -t$'\t' -o "1.1,1.2,2.6" | # Get taxid {1} assembly accesion, {2} filename {3} taxid - parallel --tmpdir ${working_dir} --colsep "\t" -j ${threads} -k 'grep "^[^#]" "'"${target_output_prefix}${files_dir}"'{2}" | tr -d "\r" | cut -f 5,7,9 | sed "s/^/{1}\\t/" | sed "s/$/\\t{3}/"' | # Retrieve info from assembly_report.txt and add assemby accession in the beggining and taxid at the end + join - 
<(sort -k 1,1 ${5}) -t$'\t' -o "1.1,1.2,2.6" | # Get taxid {1} assembly accession, {2} filename {3} taxid + parallel --tmpdir ${working_dir} --colsep "\t" -j ${threads} -k 'grep "^[^#]" "${target_output_prefix}$(path_output {2}){2}" | tr -d "\r" | cut -f 5,7,9 | sed "s/^/{1}\\t/" | sed "s/$/\\t{3}/"' | # Retrieve info from assembly_report.txt and add assemby accession in the beggining and taxid at the end sed "s/^/${4}\t/" # Add mode A/R at the end } @@ -812,6 +851,7 @@ function showhelp { echo $' -B Alternative version label to use as the current version. Mutually exclusive with -i.\n\tCan be used to rollback to an older version or to create multiple branches from a base version.\n\tDefault: ""' echo $' -R Number of attempts to retry to download files in batches \n\tDefault: 3' echo $' -n Conditional exit status based on number of failures accepted, otherwise will Exit Code = 1.\n\tExample: -n 10 will exit code 1 if 10 or more files failed to download\n\t[integer for file number, float for percentage, 0 = off]\n\tDefault: 0' + echo $' -N Output files in folders like NCBI ftp structure (e.g. files/GCF/000/499/605/GCF_000499605.1_EMW001_assembly_report.txt)' echo $' -L Downloader\n\t[wget, curl]\n\tDefault: wget' echo $' -x Allow the deletion of regular extra files (not symbolic links) found in the output folder' echo $' -s Silent output' @@ -842,6 +882,7 @@ url_list=0 dry_run=0 just_fix=0 conditional_exit=0 +ncbi_folders=0 silent=0 silent_progress=0 debug_mode=0 @@ -856,7 +897,7 @@ downloader_tool="wget" # Check for required tools tool_not_found=0 -tools=( "awk" "bc" "find" "join" "md5sum" "parallel" "sed" "tar" "xargs" "wget" ) +tools=( "awk" "bc" "find" "join" "md5sum" "parallel" "sed" "tar" "wget" ) for t in "${tools[@]}" do if [ ! 
-x "$(command -v ${t})" ]; then @@ -867,7 +908,7 @@ done if [ "${tool_not_found}" -eq 1 ]; then exit 1; fi # Parse -o and -B first to detect possible updates -getopts_list="aA:b:B:c:d:D:e:E:f:F:g:hikl:L:mM:n:o:prR:st:T:uVwxZ" +getopts_list="aA:b:B:c:d:D:e:E:f:F:g:hikl:L:mM:n:No:prR:st:T:uVwxZ" OPTIND=1 # Reset getopts # Parses working_dir from "$@" while getopts "${getopts_list}" opt; do @@ -933,6 +974,7 @@ while getopts "${getopts_list}" opt "${args[@]}"; do m) check_md5=1 ;; M) tax_mode=${OPTARG} ;; n) conditional_exit=${OPTARG} ;; + N) ncbi_folders=1 ;; o) working_dir=${OPTARG} ;; p) url_list=1 ;; r) updated_sequence_accession=1 ;; @@ -1128,7 +1170,7 @@ else fi working_dir="$(readlink -m ${working_dir})/" files_dir="files/" -export files_dir working_dir +export files_dir working_dir ncbi_folders default_assembly_summary=${working_dir}assembly_summary.txt history_file=${working_dir}history.tsv @@ -1339,7 +1381,8 @@ else # update/fix echolog "Checking for extra files in the current version [${current_label}]" "1" extra=$(tmp_file "extra.tmp") - join <(ls -1 "${current_output_prefix}${files_dir}" | sort) <(list_files "${current_assembly_summary}" "1,20" "${file_formats}" | cut -f 3 | sed -e 's/.*\///' | sort) -v 1 > "${extra}" + # List local files, "1" to list also empty files + join <(list_local_files "${current_output_prefix}" "1" | sort) <(list_files "${current_assembly_summary}" "1,20" "${file_formats}" | cut -f 3 | sed -e 's/.*\///' | sort) -v 1 > "${extra}" extra_files=$(count_lines_file "${extra}") if [ "${extra_files}" -gt 0 ]; then echolog " - ${extra_files} extra files" "1" @@ -1405,7 +1448,7 @@ else # update/fix # Link versions echolog "Linking versions [${current_label} --> ${new_label}]" "1" # Only link existing files relative to the current version - list_files "${current_assembly_summary}" "1,20" "${file_formats}" | cut -f 3 | xargs -P "${threads}" -I{} bash -c 'if [[ -f '"${current_output_prefix}${files_dir}{}"' ]]; then ln -s -r 
'"${current_output_prefix}${files_dir}{}"' '"${new_output_prefix}${files_dir}"'; fi' + list_files "${current_assembly_summary}" "1,20" "${file_formats}" | cut -f 3 | parallel -P "${threads}" link_version "${current_output_prefix}" "${new_output_prefix}" "{}" echolog " - Done" "1" echolog "" "1" # set version - update default assembly summary @@ -1475,6 +1518,12 @@ else # update/fix fi if [ "${dry_run}" -eq 0 ]; then + + # Clean possible empty folders in NCBI structure after update + if [[ "${ncbi_folders}" -eq 1 ]]; then + find "${target_output_prefix}${files_dir}" -type d -empty -delete + fi + if [ "${download_taxonomy}" -eq 1 ]; then echolog "Downloading taxonomy database [${tax_mode}]" "1" if [[ "${tax_mode}" == "ncbi" ]]; then @@ -1496,7 +1545,8 @@ if [ "${dry_run}" -eq 0 ]; then fi expected_files=$(( $(count_lines_file "${default_assembly_summary}")*(n_formats+1) )) # From assembly summary * file formats - current_files=$(ls "${target_output_prefix}${files_dir}" | wc -l | cut -f1 -d' ') # From current folder + current_files=$(list_local_files "${target_output_prefix}" | wc -l | cut -f1 -d' ') # From current folder + # If is in fixing mode, remove kept extra files from calculation if [[ "${extra_files}" -gt 0 && "${just_fix}" -eq 1 ]]; then current_files=$(( current_files-extra_files )) diff --git a/tests/download_test_set.sh b/tests/download_test_set.sh index 9f37fef..28ecd93 100755 --- a/tests/download_test_set.sh +++ b/tests/download_test_set.sh @@ -40,8 +40,10 @@ cut -f 1,6 ${outfld}genomes/*/assembly_summary_*.txt ${outfld}genomes/*/*/assemb wget --quiet --show-progress --output-document "${outfld}new_taxdump.tar.gz" "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz" tar xf "${outfld}new_taxdump.tar.gz" -C "${outfld}" taxidlineage.dmp rankedlineage.dmp mkdir -p "${outfld}pub/taxonomy/new_taxdump/" +cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "^${1}[^0-9]" "'${outfld}'taxidlineage.dmp"' > 
"${outfld}pub/taxonomy/new_taxdump/taxidlineage.dmp" cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "[^0-9]${1}[^0-9]" "'${outfld}'taxidlineage.dmp"' >> "${outfld}pub/taxonomy/new_taxdump/taxidlineage.dmp" -cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "^${1}[^0-9]" "'${outfld}'rankedlineage.dmp"' >> "${outfld}pub/taxonomy/new_taxdump/rankedlineage.dmp" +cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "^${1}[^0-9]" "'${outfld}'rankedlineage.dmp"' > "${outfld}pub/taxonomy/new_taxdump/rankedlineage.dmp" + find "${outfld}pub/taxonomy/new_taxdump/" -printf "%P\n" | tar -czf "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz" --no-recursion -C "${outfld}pub/taxonomy/new_taxdump/" -T - md5sum "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz" > "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5" rm "${outfld}new_taxdump.tar.gz" "${outfld}taxidlineage.dmp" "${outfld}rankedlineage.dmp" "${outfld}pub/taxonomy/new_taxdump/taxidlineage.dmp" "${outfld}pub/taxonomy/new_taxdump/rankedlineage.dmp" diff --git a/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz b/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz index 3f71f0b..8d82ae4 100644 Binary files a/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz and b/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz differ diff --git a/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5 b/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5 index 4fc09cb..22c17b7 100644 --- a/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5 +++ b/tests/files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5 @@ -1 +1 @@ -a9b0b848349863ab9413d44400f99336 files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz +9ad4e65d12140e0045d3e01f853b8f29 files/pub/taxonomy/new_taxdump/new_taxdump.tar.gz diff --git a/tests/integration_offline.bats b/tests/integration_offline.bats index 924ecf8..96df12d 100644 --- a/tests/integration_offline.bats +++ 
b/tests/integration_offline.bats @@ -435,6 +435,7 @@ setup_file() { # Check if report was printed and has all lines reported assert_file_exist ${outdir}${label}/*_assembly_accession.txt + assert_file_not_empty ${outdir}${label}/*_assembly_accession.txt assert_equal $(count_lines_file ${outdir}${label}/*_assembly_accession.txt) $(count_lines_file ${outdir}assembly_summary.txt) } @@ -446,6 +447,18 @@ setup_file() { # Check if report was printed assert_file_exist ${outdir}${label}/*_sequence_accession.txt + assert_file_not_empty ${outdir}${label}/*_sequence_accession.txt +} + +@test "Report sequence accession with ncbi folder structure" { + outdir=${outprefix}report-sequence-accession-ncbi-folders/ + label="test" + run ./genome_updater.sh -N -d refseq -b ${label} -o ${outdir} -r + sanity_check ${outdir} ${label} + + # Check if report was printed + assert_file_exist ${outdir}${label}/*_sequence_accession.txt + assert_file_not_empty ${outdir}${label}/*_sequence_accession.txt } @test "Report urls" { @@ -604,7 +617,7 @@ setup_file() { sanity_check ${outdir} ${label} # Remove files to simulate failure - rm ${outdir}${label}/files/* + rm -rf ${outdir}${label}/files/* # Dry-run FIX run ./genome_updater.sh -d refseq -b ${label} -o ${outdir} -k -i @@ -639,6 +652,29 @@ setup_file() { sanity_check ${outdir} ${label} } +@test "Mode UPDATE ncbi folders" { + outdir=${outprefix}mode-update-ncbi-folders/ + label="test" + + # Dry-run NEW + run ./genome_updater.sh -N -d refseq -b ${label} -o ${outdir} -k + assert_success + assert_dir_not_exist ${outdir} + + # Real run NEW + run ./genome_updater.sh -N -d refseq -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} + + # Dry-run UPDATE (use another organism group to simulate change) + label="update" + run ./genome_updater.sh -N -d refseq -g archaea,fungi -b ${label} -o ${outdir} -k + assert_success + + # Real run FIX + run ./genome_updater.sh -N -d refseq -g archaea,fungi -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} +} 
+ @test "Mode auto UPDATE" { outdir=${outprefix}mode-auto-update/ label="test" @@ -657,7 +693,7 @@ setup_file() { run ./genome_updater.sh -o ${outdir} -b ${label} -k assert_success - # Real run (nothin to update, but carry parameters) + # Real run (nothing to update, but carry parameters) run ./genome_updater.sh -o ${outdir} -b ${label} sanity_check ${outdir} ${label} @@ -677,6 +713,44 @@ setup_file() { assert_success } +@test "Mode auto UPDATE ncbi folders" { + outdir=${outprefix}mode-auto-update-ncbi-folders/ + label="test" + + # Dry-run NEW + run ./genome_updater.sh -N -d refseq -b ${label} -o ${outdir} -g archaea -k + assert_success + assert_dir_not_exist ${outdir} + + # Real run NEW + run ./genome_updater.sh -N -d refseq -b ${label} -o ${outdir} -g archaea + sanity_check ${outdir} ${label} + + # Dry-run UPDATE (use same parameters) + label="update" + run ./genome_updater.sh -N -o ${outdir} -b ${label} -k + assert_success + + # Real run (nothing to update, but carry parameters) + run ./genome_updater.sh -N -o ${outdir} -b ${label} + sanity_check ${outdir} ${label} + + # Dry-run UPDATE + label="update2" + run ./genome_updater.sh -N -o ${outdir} -b ${label} -g "" -d refseq,genbank -u -k + assert_success + + # Real run FIX, remove org (get all), add database, add bool report + run ./genome_updater.sh -N -o ${outdir} -b ${label} -g "" -d refseq,genbank -u + sanity_check ${outdir} ${label} + + assert_file_exist ${outdir}${label}/*_assembly_accession.txt + + # Check log for updates + grep "0 updated, [1-9][0-9]* removed, [1-9][0-9]* new entries" ${outdir}${label}/*.log # >&3 + assert_success +} + @test "Tax. 
Mode GTDB" { outdir=${outprefix}tax-gtdb/ label="test" @@ -703,3 +777,44 @@ setup_file() { run ./genome_updater.sh -o ${outdir} -b ${label} -e ${files_dir}simulated/assembly_summary_invalid_xCF.txt assert_failure } + +@test "NCBI folders" { + outdir=${outprefix}ncbi-folders/ + label="1-refseq" + run ./genome_updater.sh -N -d refseq -g archaea -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} + + # refseq base folder is created, no genbank + assert_dir_exist "${outdir}${label}/files/GCF/" + assert_dir_not_exist "${outdir}${label}/files/GCA/" + + # Add genbank + label="2-refseq-genbank" + run ./genome_updater.sh -N -d refseq,genbank -g archaea -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} + + # refseq and genbank base folders are created + assert_dir_exist "${outdir}${label}/files/GCF/" + assert_dir_exist "${outdir}${label}/files/GCA/" + + # Remove refseq + label="3-genbank" + run ./genome_updater.sh -N -d genbank -g archaea -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} + + assert_dir_not_exist "${outdir}${label}/files/GCF/" + assert_dir_exist "${outdir}${label}/files/GCA/" + + # no empty folders + assert_equal $(find "${outdir}${label}/files/" -type d -empty | wc -l | cut -f1 -d' ') 0 + + # Update without -N, do not consider folder structute and download again to base files folde + # Remove refseq + label="4-no-ncbi-folders" + run ./genome_updater.sh -d genbank -g archaea -b ${label} -o ${outdir} + sanity_check ${outdir} ${label} + + # refseq and genbank are no longer + assert_dir_not_exist "${outdir}${label}/files/GCF/" + assert_dir_not_exist "${outdir}${label}/files/GCA/" +} \ No newline at end of file diff --git a/tests/integration_online.bats b/tests/integration_online.bats index bc7bce6..b03d01a 100644 --- a/tests/integration_online.bats +++ b/tests/integration_online.bats @@ -45,12 +45,12 @@ setup_file() { # Protozoa in refseq is the smallest available assembly_summary at the time of writing this test (01.2022) # 5820 genus 
Plasmodium label_genus="genus" - run ./genome_updater.sh -d refseq -g protozoa -T 5820 -b ${label_genus} -t ${threads} -o ${outdir} + run ./genome_updater.sh -N -d refseq -g protozoa -T 5820 -b ${label_genus} -t ${threads} -o ${outdir} sanity_check ${outdir} ${label_genus} # 5794 phylum Apicomplexa label_phylum="phylum" - run ./genome_updater.sh -d refseq -g protozoa -T 5794 -b ${label_phylum} -t ${threads} -o ${outdir} + run ./genome_updater.sh -N -d refseq -g protozoa -T 5794 -b ${label_phylum} -t ${threads} -o ${outdir} sanity_check ${outdir} ${label_phylum} # More files filtering by phylum than genus diff --git a/tests/libs/bats b/tests/libs/bats index 210acf3..6636e2c 160000 --- a/tests/libs/bats +++ b/tests/libs/bats @@ -1 +1 @@ -Subproject commit 210acf3a8ed318ddedad3137c15451739beba7d4 +Subproject commit 6636e2c2ef5ffe361535cb45fc61682c5ef46b71 diff --git a/tests/libs/bats-assert b/tests/libs/bats-assert index e0de84e..78fa631 160000 --- a/tests/libs/bats-assert +++ b/tests/libs/bats-assert @@ -1 +1 @@ -Subproject commit e0de84e9c011223e7f88b7ccf1c929f4327097ba +Subproject commit 78fa631d1370562d2cd4a1390989e706158e7bf0 diff --git a/tests/libs/bats-file b/tests/libs/bats-file index 17fa557..805ffb7 160000 --- a/tests/libs/bats-file +++ b/tests/libs/bats-file @@ -1 +1 @@ -Subproject commit 17fa557f6fe28a327933e3fa32efef1d211caa5a +Subproject commit 805ffb74fe085fd7be4fe43d787387b20d4bd7b6 diff --git a/tests/utils.bash b/tests/utils.bash index f27f3d2..a388d36 100644 --- a/tests/utils.bash +++ b/tests/utils.bash @@ -9,11 +9,11 @@ count_lines_file(){ # $1 file } count_files() { # $1 outdir, $2 label - ls_files ${1} ${2} | wc -l | cut -f1 -d' ' + find_files ${1} ${2} | wc -l | cut -f1 -d' ' } -ls_files() { # $1 outdir, $2 label - ls -1 ${1}${2}/files/* +find_files() { # $1 outdir, $2 label + find ${1}${2}/files/ -type f,l } sanity_check() { # $1 outdir, $2 label, [$3 number of file types] @@ -37,7 +37,7 @@ sanity_check() { # $1 outdir, $2 label, [$3 number of 
file types] # Check file count based on assembly_summary assert_equal $(count_files ${1} ${2}) $(($(count_lines_file ${1}assembly_summary.txt) * ${nfiles})) # Check files in folder (if any) - for file in $(ls_files ${1} ${2}); do + for file in $(find_files ${1} ${2}); do assert_file_not_empty $file done
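
Note on the `-N` folder layout: the `path_output` function added in this diff builds the NCBI-style subfolder from fixed character offsets in the file name (the `GCF`/`GCA` prefix, then three 3-digit blocks of the accession number). The snippet below is a minimal standalone sketch of that mapping, useful for checking the expected location of a file by hand. The helper name `ncbi_path` and its second parameter are hypothetical and not part of genome_updater; the script itself reads `files_dir` and `ncbi_folders` from the environment instead.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the substring offsets used by path_output in this diff.
# ncbi_path is a hypothetical helper, not a genome_updater function.

ncbi_path() { # parameter: ${1} file name or url, ${2} base folder (e.g. "files/")
    local f
    f=$(basename "${1}")
    # GCF_000499605.1_... -> GCF / 000 / 499 / 605
    # offset 0: prefix, offsets 4, 7, 10: three-digit blocks after the underscore
    echo "${2}${f:0:3}/${f:4:3}/${f:7:3}/${f:10:3}/"
}

# Example from the -N help text
ncbi_path "GCF_000499605.1_EMW001_assembly_report.txt" "files/"
# prints: files/GCF/000/499/605/
```

Because the offsets are fixed, the sketch (like `path_output`) assumes file names that start with a standard `GCF_`/`GCA_` assembly accession; other names would still produce a path, just not a meaningful one.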