Dataset size, nb of files, etc. should be optional #558

Open
dbujold opened this issue Sep 16, 2022 · 8 comments
dbujold commented Sep 16, 2022

My datasets constantly change in size, number of files, and number of participants. The content is also a mix of many file types. I think these types of information should be left optional in the interface.
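For illustration, here is a minimal sketch of what treating these descriptors as optional could look like if a DATS-style record were modelled in code; the class and field names below are hypothetical and not the actual DATS schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetDescriptor:
    """Illustrative DATS-style record; field names here are hypothetical."""
    title: str
    description: str
    # Descriptors that change between rolling releases are optional,
    # so a record can simply omit them instead of pinning a stale value.
    size_gb: Optional[float] = None
    file_count: Optional[int] = None
    participant_count: Optional[int] = None

# A rolling-release dataset can then be described without file-level details.
record = DatasetDescriptor(
    title="Example cohort",
    description="Whole genomes and exomes under controlled access",
)
```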

@dbujold dbujold added the enhancement New feature or request label Sep 16, 2022
@emmetaobrien emmetaobrien self-assigned this Sep 29, 2022
emmetaobrien (Collaborator) commented

Longer term, our intent is to deal with the issue of changing datasets through more clearly defined version management, in which any change in any of those factors would be represented as a distinct version of the dataset.

dbujold commented Nov 1, 2022

I understand. So this implies that for a dataset expanding with frequent (daily/weekly) releases, the DATS document will need to be updated and versioned accordingly?

emmetaobrien (Collaborator) commented

That would be the expectation with the current model, yes.
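Under that model, each release would yield a distinct version of the DATS document. A rough sketch of what that bookkeeping might look like, purely as an assumption about how it could be automated (the helper and the keys below are hypothetical, not part of DATS or CONP tooling):

```python
import copy
import datetime

def new_release(dats_doc: dict, size_gb: float, file_count: int) -> dict:
    """Hypothetical helper: derive a distinct versioned document per release."""
    doc = copy.deepcopy(dats_doc)
    major, minor = (int(x) for x in doc.get("version", "1.0").split("."))
    doc["version"] = f"{major}.{minor + 1}"           # every release is a new version
    doc["dateReleased"] = datetime.date.today().isoformat()
    doc["size"] = size_gb                             # illustrative keys, not the DATS schema
    doc["fileCount"] = file_count
    return doc

v1 = {"title": "Example cohort", "version": "1.0", "size": 120.0, "fileCount": 5400}
v2 = new_release(v1, size_gb=125.5, file_count=5480)  # weekly release -> version "1.1"
```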

dbujold commented Nov 2, 2022

I think it would be nice to have a way to support projects with rolling releases as well. Such projects sometimes want to describe their cohort and dataset content in a standardized way, without going into the specifics of how many files there are, what size they are, etc.

emmetaobrien commented Nov 2, 2022

Exactly how much data are you envisioning storing on CONP, and of what sort? Our processing involves building fixed links to every distinct file, so that needs redoing for anything that changes from release to release.
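To make that cost concrete, here is a toy sketch (an assumption for illustration, not the actual CONP pipeline) of why fixed per-file links would have to be rebuilt whenever a release adds, removes, or changes files:

```python
import hashlib

def build_link_manifest(files: dict, base_url: str) -> dict:
    """Toy sketch: map each file name to a fixed, content-addressed link."""
    return {
        name: f"{base_url}/{hashlib.sha256(data).hexdigest()}"
        for name, data in files.items()
    }

release_1 = {"genome_001.vcf": b"AAA", "genome_002.vcf": b"CCC"}
release_2 = {**release_1, "genome_003.vcf": b"GGG"}   # one new file in the next release

# Any change between releases means the manifest of links has to be redone.
assert build_link_manifest(release_1, "https://example.org/data") != \
       build_link_manifest(release_2, "https://example.org/data")
```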

dbujold commented Nov 4, 2022

Right now we have two cohorts of >5000 participants, with thousands of whole genomes, whole exomes, etc. But the data is under controlled access, which means the files wouldn't be indexed by CONP. It's the dataset provenance that we're aiming to describe, rather than its content.

bryancaron commented

Hi David, I was discussing briefly with Emmet this morning. Are the datasets you have in mind those from the BQC19 which we have discussed in the context of distribution through NeuroHub, or different datasets? Thanks!

dbujold commented Nov 7, 2022

Hi Bryan, this one and others. We have a few cohorts supported in Bento currently, often in a rolling-release kind of way. We prepare a DATS file to annotate the datasets, but we're not always able to provide precise details about the dataset content.
