Dataset size, nb of files, etc. should be optional #558

Open
dbujold opened this issue Sep 16, 2022 · 8 comments
dbujold commented Sep 16, 2022

My datasets constantly change in size, number of files, and number of participants. The content is also a mix of many file types. I think these types of information should be left optional in the interface.
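For illustration, here is a minimal sketch of what treating these descriptors as optional could look like if a DATS-style record were modelled in code; the class and field names below are hypothetical and not the actual DATS schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetDescriptor:
    """Illustrative DATS-style record; field names here are hypothetical."""
    title: str
    description: str
    # Descriptors that change between rolling releases are optional,
    # so a record can simply omit them instead of pinning a stale value.
    size_gb: Optional[float] = None
    file_count: Optional[int] = None
    participant_count: Optional[int] = None

# A rolling-release dataset can then be described without file-level details.
record = DatasetDescriptor(
    title="Example cohort",
    description="Whole genomes and exomes under controlled access",
)
```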

@dbujold dbujold added the enhancement New feature or request label Sep 16, 2022
@emmetaobrien emmetaobrien self-assigned this Sep 29, 2022
emmetaobrien (Collaborator) commented

Longer term, our intent is to deal with the issue of changing datasets through more clearly defined version management, in which any change in any of those factors would be represented as a distinct version of the dataset.

dbujold commented Nov 1, 2022

I understand. So this implies that for a dataset expanding with frequent (daily/weekly) releases, the DATS document will need to be updated and versioned accordingly?

emmetaobrien (Collaborator) commented

That would be the expectation with the current model, yes.
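Under that model, each release would yield a distinct version of the DATS document. A rough sketch of what that bookkeeping might look like, purely as an assumption about how it could be automated (the helper and the keys below are hypothetical, not part of DATS or CONP tooling):

```python
import copy
import datetime

def new_release(dats_doc: dict, size_gb: float, file_count: int) -> dict:
    """Hypothetical helper: derive a distinct versioned document per release."""
    doc = copy.deepcopy(dats_doc)
    major, minor = (int(x) for x in doc.get("version", "1.0").split("."))
    doc["version"] = f"{major}.{minor + 1}"           # every release is a new version
    doc["dateReleased"] = datetime.date.today().isoformat()
    doc["size"] = size_gb                             # illustrative keys, not the DATS schema
    doc["fileCount"] = file_count
    return doc

v1 = {"title": "Example cohort", "version": "1.0", "size": 120.0, "fileCount": 5400}
v2 = new_release(v1, size_gb=125.5, file_count=5480)  # weekly release -> version "1.1"
```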

dbujold commented Nov 2, 2022

I think it would be nice to have a way to support projects with rolling releases as well. Such projects sometimes want to describe their cohort and dataset content in a standardized way, without going into the specifics of how many files there are, what size they are, etc.

emmetaobrien commented Nov 2, 2022

Exactly how much data are you envisioning storing on CONP, and of what sort? Our processing involves building fixed links to every distinct file, so that needs redoing for anything that changes from release to release.
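To make that cost concrete, here is a toy sketch (an assumption for illustration, not the actual CONP pipeline) of why fixed per-file links would have to be rebuilt whenever a release adds, removes, or changes files:

```python
import hashlib

def build_link_manifest(files: dict, base_url: str) -> dict:
    """Toy sketch: map each file name to a fixed, content-addressed link."""
    return {
        name: f"{base_url}/{hashlib.sha256(data).hexdigest()}"
        for name, data in files.items()
    }

release_1 = {"genome_001.vcf": b"AAA", "genome_002.vcf": b"CCC"}
release_2 = {**release_1, "genome_003.vcf": b"GGG"}   # one new file in the next release

# Any change between releases means the manifest of links has to be redone.
assert build_link_manifest(release_1, "https://example.org/data") != \
       build_link_manifest(release_2, "https://example.org/data")
```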

dbujold commented Nov 4, 2022

Right now we have two cohorts of >5000 participants, with thousands of whole genomes, whole exomes, etc. But the data is under controlled access, which means the files wouldn't be indexed by CONP. It's the dataset provenance that we're aiming to describe, rather than its content.

bryancaron commented

Hi David, I was discussing briefly with Emmet this morning. Are the datasets you have in mind those from the BQC19 which we have discussed in the context of distribution through NeuroHub, or different datasets? Thanks!

dbujold commented Nov 7, 2022

Hi Bryan, this one and others. We have a few cohorts supported in Bento currently, often in a rolling-release kind of way. We prepare a DATS file to annotate the datasets, but we're not always able to provide precise details about the dataset content.
