Older database support? #63

Open
wwood opened this issue Apr 8, 2024 · 5 comments

Comments

@wwood

wwood commented Apr 8, 2024

Hi,

I was keen to try out Metabuli, and was using an R207 pre-built database, but it now seems that this has been superseded by the R214 one. For the sake of reproducibility, are old databases kept around?

One small suggestion also: it would be helpful if the GTDB db had the version coded directly into its filename, rather than simply gtdb.tar.gz. I realise that this info is in the README and is visible once the tgz is unpacked, but it would still be helpful so that the download link changes with each version.

TIA, ben

@jaebeom-kim
Member

Dear Ben,

Thank you for trying Metabuli :)

Let me discuss keeping older databases on the cloud server and write an answer about it later.

Regarding your suggestion, I think making the version information more visible is the point.
Renaming the DB link can be a solution, but it requires changing the source code and making a new release/bioconda whenever the DB is updated.
Users would also need to update their Metabuli to download DBs from the new links.
So, I'm thinking about a different way to improve the version visibility.

Sorry for not giving direct answers.
I will post updates on my progress in addressing your concerns.

Thank you,
Jaebeom

@jaebeom-kim
Member

By the way, I'd recommend using the R214 one.
In the R207 DB, K-mers from lowercased letters in a human genome were not extracted correctly.
So, it harmed Metabuli's performance for decontaminating human reads.
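
To illustrate the kind of problem described above, here is a minimal, generic sketch. It is not Metabuli's actual extraction code, which is not shown in this thread. It assumes plain nucleotide k-mers and the common soft-masking convention in which lowercase letters mark repeats or low-complexity regions in a reference genome. The function names are hypothetical.

```python
def kmers_case_sensitive(seq: str, k: int) -> list[str]:
    """Keep only k-mers made entirely of uppercase A/C/G/T.

    Any window touching a lowercase (soft-masked) base is dropped.
    """
    valid = set("ACGT")
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    ]


def kmers_case_insensitive(seq: str, k: int) -> list[str]:
    """Uppercase the sequence first so soft-masked bases are kept."""
    return kmers_case_sensitive(seq.upper(), k)


# Toy sequence with a soft-masked stretch in the middle (made up for illustration).
seq = "ACGTacgtAC"
print(len(kmers_case_sensitive(seq, 4)))    # windows overlapping lowercase bases are lost
print(len(kmers_case_insensitive(seq, 4)))  # all windows are retained
```

In this toy case the case-sensitive version silently loses every window that overlaps the lowercase stretch, which is the general way a soft-masked genome can end up under-represented in a k-mer database.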

@wwood
Author

wwood commented Apr 10, 2024

Thanks for the quick reply.

> Renaming the DB link can be a solution, but it requires changing the source code and making a new release/bioconda whenever the DB is updated.
> Users would also need to update their Metabuli to download DBs from the new links.
> So, I'm thinking about a different way to improve the version visibility.

There are other solutions. For instance, MetaPhlAn keeps an "mpa_latest" file which points to the versioned database files. SingleM uses a library we made, https://github.com/centre-for-microbiome-research/zenodo_backpack, which operates through Zenodo, where there is a concept of a DOI for a series and a DOI for a specific database version.
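
As a rough sketch of the pointer-file idea: the tool always fetches one stable "latest" file, and that file names the versioned archive to download. This is only an illustration under stated assumptions, not MetaPhlAn's actual file format or Metabuli's download code. The pointer contents, the archive name `gtdb_r214.tar.gz`, and the `example.org` URL are all hypothetical.

```python
from urllib.parse import urljoin


def resolve_latest(pointer_text: str, base_url: str) -> str:
    """Return the URL of the versioned archive named in a 'latest' pointer file.

    The first non-empty, non-comment line is taken as the archive file name.
    """
    for line in pointer_text.splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            return urljoin(base_url, name)
    raise ValueError("pointer file does not name a database archive")


# Hypothetical pointer contents; in practice this text would be downloaded
# from a fixed location that never changes between database releases.
pointer = "# updated whenever a new database is published\ngtdb_r214.tar.gz\n"
print(resolve_latest(pointer, "https://example.org/databases/"))
```

With this layout the source code only hard-codes the pointer location, so publishing a new database means updating the pointer file rather than cutting a new software release, and older versioned archives can stay at their own URLs.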

It would be great if each tool developer didn't have to spend time solving these problems and there was a broadly adopted, standardised solution.

> By the way, I'd recommend using the R214 one.
> In the R207 DB, K-mers from lowercased letters in a human genome were not extracted correctly.
> So, it harmed Metabuli's performance for decontaminating human reads.

Thanks. Is there any other reason not to use the R207 one? My samples do not contain human reads.

@jaebeom-kim
Member

Thank you for providing great examples for the task!
If your samples don't contain human reads, R207 is totally fine.
The only other difference is that R207 has a smaller number of genomes.

@wwood
Author

wwood commented Jun 5, 2024

Thanks @jaebeom-kim.

Congratulations on the Metabuli paper.
