Dataset size, nb of files, etc. should be optional #558
Comments
Longer term, our intent is to deal with the issue of changing datasets by more clearly defined version management, in which any change in any of those factors would be represented as a distinct different version of the dataset. |
I understand. So this implies that, for an expanding dataset with frequent (daily/weekly) releases, the DATS document will need to be updated and versioned with each release? |
That would be the expectation with the current model, yes. |
I think it would be nice to have a way to support projects with rolling releases as well. Such projects sometimes want to describe their cohort and dataset content in a standardized way, without going into specifics such as how many files there are or how large they are. |
Exactly how much data are you envisioning storing on CONP, and of what sort? Our processing involves building fixed links to every distinct file, so that needs redoing for anything that changes from release to release. |
Right now we have two cohorts of >5000 participants, with thousands of whole genomes, whole exomes, etc. But data is under controlled access, which means files wouldn't be indexed by CONP. It's the dataset provenance that we're aiming to describe, rather than its content. |
Hi David, I was discussing briefly with Emmet this morning. Are the datasets you have in mind those from the BQC19 which we have discussed in the context of distribution through NeuroHub, or different datasets? Thanks! |
Hi Bryan, this one and others. We currently support a few cohorts in Bento, often on a rolling-release basis. We prepare a DATS file to annotate the datasets, but we're not always able to provide precise details about the dataset content. |
My datasets constantly change in size, number of files, and number of participants. The content is also a mix of many file types. I think these types of information should be left optional in the interface.
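To illustrate what "optional" could mean here, below is a rough sketch of a descriptor check in which the volatile fields may be omitted but are still validated when present. This is not the actual DATS schema or the CONP validator: the field names (`title`, `description`, `number_of_files`, `number_of_participants`) and the `validate_descriptor` helper are hypothetical and only stand in for the real ones.

```python
# Hypothetical field names for illustration; not the real DATS/CONP schema.
REQUIRED_FIELDS = {"title", "description"}
OPTIONAL_COUNT_FIELDS = {"number_of_files", "number_of_participants"}


def validate_descriptor(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the descriptor passes."""
    problems = [
        f"missing required field: {name}"
        for name in sorted(REQUIRED_FIELDS)
        if name not in doc
    ]
    # Optional fields may be absent, but if given they must be sensible counts.
    for name in sorted(OPTIONAL_COUNT_FIELDS):
        if name in doc:
            value = doc[name]
            if not isinstance(value, int) or isinstance(value, bool) or value < 0:
                problems.append(f"invalid value for optional field: {name}")
    return problems
```

With a check along these lines, a rolling-release project could submit only the provenance-level fields, while a fixed release could still declare exact counts.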