This repository has been archived by the owner on May 21, 2024. It is now read-only.

ENH: Update NCBITaxonomyDirFmt to accommodate data-version file #73

Closed
wants to merge 4 commits into from


Sann5
Contributor

@Sann5 Sann5 commented Jan 17, 2024

About this repo

  • q2-types-genomics is a qiime2 plugin that defines semantic types for other plugins.
  • This PR updates an existing directory format by adding a new file, a file format for it, and the corresponding validation and tests.

What's new

  • Closes ENH: Update NCBITaxonomyDirFmt to accommodate data-version file #72
  • NCBITaxonomyDirFmt now contains 4 files instead of 3.
    • nodes.dmp
    • names.dmp
    • prot.accession2taxid.gz
    • new -> version.tsv
  • Why is this useful? The version of the first three files may affect downstream results, so it needs to be recorded. With a version.tsv file in the directory format, this information is captured in the provenance of downstream artifacts.
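
To make the expected shape of version.tsv concrete, here is a minimal Python sketch of the kind of check a validator could perform. It is not the validation code from this PR. The function name `validate_version_tsv` is hypothetical, and the column names and the date/time formats are assumptions taken from the commands in the "Run it locally" section.

```python
import csv
from datetime import datetime
from pathlib import Path

# Assumed layout, taken from the shell commands in this PR description.
EXPECTED_HEADER = ["file_name", "date", "time"]
EXPECTED_FILES = {"nodes.dmp", "names.dmp", "prot.accession2taxid.gz"}


def validate_version_tsv(path):
    """Raise ValueError if the file does not match the assumed layout.

    Hypothetical helper, not part of q2-types-genomics.
    """
    with Path(path).open(newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))

    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError(f"Expected header {EXPECTED_HEADER}, got {rows[:1]}")

    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 3:
            raise ValueError(
                f"Line {line_no}: expected 3 columns, got {len(row)}"
            )
        name, date, time = row
        # strptime raises ValueError on a malformed date or time.
        datetime.strptime(f"{date} {time}", "%d/%m/%Y %H:%M:%S")
        seen.add(name)

    missing = EXPECTED_FILES - seen
    if missing:
        raise ValueError(f"No version entry for: {sorted(missing)}")
```

Calling `validate_version_tsv("ncbi_tax_data/version.tsv")` returns silently for a well-formed file and raises `ValueError` otherwise.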

Set up an environment

# For Linux:
# export MY_OS="linux"
# For macOS:
export MY_OS="osx"
wget "https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
conda env create -n q2-shotgun --file "qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"
rm "qiime2-shotgun-2023.9-py38-${MY_OS}-conda.yml"

Run it locally

  1. First, clone the repo and checkout the PR branch:
# Remove the packaged q2-types-genomics and q2-types so you can install q2-types from GitHub and your local q2-types-genomics.
conda activate q2-shotgun
conda remove q2-types-genomics q2-types
pip install git+https://github.com/qiime2/q2-types.git
git clone git@github.com:bokulich-lab/q2-types-genomics.git
cd q2-types-genomics
gh pr checkout 73
pip install -e .
  2. Let's get you some data to play with:
cd wherever_you_want_to_download_the_data_to

The next commands will download ~15 GB of data.

# Download *.dmp
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip -j taxdmp.zip names.dmp nodes.dmp -d ncbi_tax_data
rm taxdmp.zip

# Download prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz -P ncbi_tax_data

# Make the version.tsv file
echo -e "file_name\tdate\ttime" > ncbi_tax_data/version.tsv
# Note: `ls -D` (custom time format) is BSD/macOS-specific; GNU ls uses --time-style instead.
ls -l -D "%d/%m/%Y %H:%M:%S" ncbi_tax_data | awk '{print $8, $6, $7}' | grep -E '(nodes\.dmp|names\.dmp|prot\.accession2taxid\.gz)' | tr ' ' '\t' >> ncbi_tax_data/version.tsv
  3. Test it out!
qiime tools import --input-path ncbi_tax_data --output-path ncbi_tax_data.qza --type "ReferenceDB[NCBITaxonomy]"
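
The version.tsv step above depends on the `ls -D` option, which is not available in GNU ls. As a rough, platform-independent alternative, the same table could be produced from file modification times in Python. This is only a sketch: `write_version_tsv` is a hypothetical helper, not part of this PR. It assumes the three data files already exist in the target directory, and it writes the rows in a fixed order rather than the alphabetical order `ls` would give.

```python
from datetime import datetime
from pathlib import Path

# The three files whose modification times are recorded.
TRACKED_FILES = ("nodes.dmp", "names.dmp", "prot.accession2taxid.gz")


def write_version_tsv(data_dir):
    """Write data_dir/version.tsv from the tracked files' mtimes.

    Hypothetical helper; mirrors the file_name/date/time layout and the
    dd/mm/YYYY and HH:MM:SS formats used by the shell commands above.
    """
    data_dir = Path(data_dir)
    lines = ["file_name\tdate\ttime"]
    for name in TRACKED_FILES:
        # Local-time modification timestamp of the file.
        mtime = datetime.fromtimestamp((data_dir / name).stat().st_mtime)
        date = mtime.strftime("%d/%m/%Y")
        time = mtime.strftime("%H:%M:%S")
        lines.append(f"{name}\t{date}\t{time}")
    out_path = data_dir / "version.tsv"
    out_path.write_text("\n".join(lines) + "\n")
    return out_path
```

For the directory used in this walkthrough, the call would be `write_version_tsv("ncbi_tax_data")`, run before the `qiime tools import` step.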

Running the tests

pytest -W ignore -vv --pyargs q2_types_genomics

@Sann5 Sann5 added the enhancement New feature or request label Jan 17, 2024
@Sann5 Sann5 self-assigned this Jan 17, 2024

codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3b993ed) 96.77% compared to head (ac2141c) 96.91%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   96.77%   96.91%   +0.14%     
==========================================
  Files          42       42              
  Lines        1548     1620      +72     
==========================================
+ Hits         1498     1570      +72     
  Misses         50       50              


@Sann5 Sann5 requested a review from VinzentRisch January 17, 2024 09:25
@Sann5
Contributor Author

Sann5 commented Jan 17, 2024

Hey @VinzentRisch, can you give this one a review? Cheers!

@VinzentRisch

Hey @Sann5, everything looks good to me. The tests run and the data can be imported without any issues. 🎉
I just had some problems getting the right env. You added some new formats to q2-types, and those formats are not in the 2023.9 distribution of QIIME 2, so I had to install q2-types directly from GitHub.
There is also a typo in your import command: it should be ReferenceDB[NCBITaxonomy], not ReferenceDB[TaxonomyNCBI].
But once I figured those two things out, everything went smoothly. 😄


@VinzentRisch VinzentRisch left a comment


👍

@Sann5
Contributor Author

Sann5 commented Jan 17, 2024

@VinzentRisch

I just had some problems with getting the right env. You added some new formats to q2-types and those formats are not in the 2023.9 distribution of QIIME2 so I had to install q2-types directly from GitHub.
And there is a typo in your import command. It should be ReferenceDB[NCBITaxonomy] and not ReferenceDB[TaxonomyNCBI].

Crap! Thank you for checking and thanks for the review :). I'll update the PR message accordingly.

@Sann5
Contributor Author

Sann5 commented Jan 17, 2024

@misialq do you want to take a quick look before I squash-merge it?

@misialq
Contributor

misialq commented Jan 17, 2024

Yup, thanks, I'll check it out and ping you 🙌

@Sann5
Contributor Author

Sann5 commented Jan 23, 2024

We have decided not to go forward with this extension of the semantic type because the files (names.dmp and nodes.dmp) are updated very frequently, and the last-modified date is already contained in the artifact without explicitly adding a new file for it.

However, this branch will be pushed upstream in case we want to reuse some of the code later.

@Sann5 Sann5 closed this Jan 23, 2024