Replies: 8 comments
-
Update from @Phlya: having one file with all artifacts wouldn't play well with Snakemake, since it expects the results of each step to be stored in a separate file. Presumably, a similar issue would arise with Nextflow, since it is also designed around the idea that each step creates its own files.
-
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files Touching flag files makes it possible to enforce pipeline steps that do not produce new files - that works just fine... The problem is parallelization: I'm not sure pipelines would allow simultaneous access to the same file, so only sample-to-sample parallelization would be straightforward. The multiresolution nature of cooler feeds into the same issue - there is no easy way to parallelize analyses across several resolutions (without extracting individual coolers, at least). We should check whether pipelines have special means of dealing with HDF5, e.g. allowing parallel reads. So I don't think that a single container is extremely problematic - we just need to keep some things in separate files: separate samples, of course; resolutions, maybe?
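For reference, a minimal Snakefile sketch of the flag-file approach. Snakemake's touch() output marker is real; the rule name, paths, and the shell command are placeholders for whatever tool appends its results into a shared per-sample container instead of writing its own output file.

```python
rule compute_artifact:
    input:
        cool="coolers/{sample}.1000.mcool"
    output:
        # the flag only records that the step finished; the actual results
        # go into the shared container, not into a per-step output file
        touch("flags/{sample}.artifact.done")
    shell:
        # placeholder command: any tool that writes into the shared container
        "run_artifact_tool {input.cool} --container artifacts/{wildcards.sample}"
```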
-
I think parallel reads from hdf5 are no problem, it's just writing that is an issue. So reading from coolers with many jobs should be fine (I certainly do that without any issues). But I guess we can't use an hdf5 file for the output storage because of this problem. Maybe Snakemake's grouping feature can help, but that only works with rules, not samples, and I also don't think it guarantees the jobs are not running at the same time if the memory/number of cores requested are <1/2 of those available on the node. After all, I think the only option there is using flag files, but that would give us the flexibility to use any convenient container that supports parallel writes. Which are?.. Is it OK to write in parallel to a zip file? If so, maybe .npz?
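For what it's worth, a small sketch of the "many readers, one cooler" pattern described above. cooler.Cooler and .matrix().fetch() are the real cooler API; the file name, resolution, and regions are made up for illustration.

```python
from multiprocessing import Pool

import cooler

# hypothetical multires cooler and regions of interest
CLR_URI = "WT.1000.mcool::resolutions/10000"
REGIONS = ["chr1:0-5,000,000", "chr2:0-5,000,000", "chr3:0-5,000,000"]

def fetch_region(region):
    # each worker opens its own read-only handle, so parallel reads are safe
    clr = cooler.Cooler(CLR_URI)
    return clr.matrix(balance=True).fetch(region)

if __name__ == "__main__":
    with Pool(3) as pool:
        matrices = pool.map(fetch_region, REGIONS)
```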
-
@Phlya, I'm afraid parallel writes to a zip are even less feasible than for hdf5. Zip is a very simple format...
-
Sad. I guess folder+files is the most robust and simple way, but I am not a fan of it... I don't like reading thousands of files at a time (which is what I do with pileups a lot, and it takes ages).
-
Yes, I kinda hate files too. That said, a storage/naming scheme can make working with files much easier. Why would the users of cooltools need to read thousands of files, though?
-
I definitely produce hundreds and often thousands of pileups for each project... Maybe that's just my approach. But when doing them by distance, for multiple samples, for different regions, for different resolutions, you get to pretty high numbers, e.g. one of my
-
That's why I was thinking about it again while waiting to read pileups from one of the bigger subfolders in there... which takes many minutes, at least on a cluster with a networked FS.
-
Issue: cooltools (and, potentially, pairtools) generate a variety of computational "artifacts", i.e. derivatives of primary datasets, e.g. P(s), compartments, saddleplots, insulation scores, etc. Currently, we lack a consistent way to name and store these datasets and their metadata on disk. This results in messy and inconsistent project folders, missing metadata and hours of time wasted on ad-hoc code that matches artifacts with their primary datasets. The lack of a consistent naming scheme also hinders further development of reporting scripts.
Proposal: come up with (a) a storage format and (b) a naming schema that would automate storage, discovery, and access to computational artifacts.
Potential solutions.
(A) File format. We need some kind of a container that can store computational artifacts of various kinds (tables, texts, binary arrays, etc...) and provide random access and append/rewrite functionality. Potential solutions:
My personal favorite is a zero-compression (aka STORE) zip file. It is a widely accepted format (MS Office formats are zip files!) and can be accessed from the command line, Python, and R. Like a folder with files, it offers random access and append/rewrite functionality, but it also has the advantage of being easily transferable between machines (admittedly, this is not a very strong advantage).
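A rough illustration of the zero-compression zip container idea, using only the standard library and numpy; the container name, member paths, and metadata fields are made up for the example.

```python
import io
import json
import zipfile

import numpy as np

# mode="a" appends to an existing container or creates a new one;
# ZIP_STORED means no compression, so members are just byte ranges
with zipfile.ZipFile("WT.1000.mcool.arts", mode="a",
                     compression=zipfile.ZIP_STORED) as zf:
    # append a binary array under a tool-specific path
    buf = io.BytesIO()
    np.save(buf, np.random.rand(100, 100))
    zf.writestr("pileups/by_distance/0-100kb.npy", buf.getvalue())
    # keep metadata as plain text next to the data
    zf.writestr("pileups/meta.json",
                json.dumps({"primary": "WT.1000.mcool", "resolution": 10000}))
```

Reading a single member back is random access: `zipfile.ZipFile("WT.1000.mcool.arts").read("pileups/meta.json")`.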
HDF5 can serve as a key-value store as well. The downside is that it treats all datasets as arrays, doesn't work well with NFS (according to @mimakaev) and requires special CLI tools/libraries to manipulate.
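For comparison, a sketch of the HDF5-as-key-value-store option via h5py; as noted above, everything ends up as arrays/attributes and you need h5py or the HDF5 CLI tools to inspect it. Names and fields are illustrative only.

```python
import h5py
import numpy as np

with h5py.File("WT.1000.mcool.arts.h5", "a") as f:
    grp = f.require_group("saddleplot")
    # first write; re-running would need require_dataset or deleting the key
    grp.create_dataset("matrix", data=np.random.rand(50, 50))
    # metadata lives in HDF5 attributes
    grp.attrs["primary"] = "WT.1000.mcool"
    grp.attrs["resolution"] = 10000
```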
Various databases/key-value stores are another alternative, but it's not clear to me why they would serve better for
(B) Schema.
My initial proposal is that, for each primary dataset, we would create a file or folder with a name derived from the filename of the dataset. E.g., if the primary dataset is called 'WT.1000.mcool', the artifact file/folder would be called 'WT.1000.mcool.arts' or something like that. Probably, the most important point is that there should be a single, well-defined procedure that matches the artifact file/folder with its primary dataset and vice versa.
Then, inside the artifact container, each computational tool would claim its own folder, presumably named after the tool itself. The structure of the files inside that folder would be left up to the tools' creators. We could, however, suggest some default schema that would standardize metadata storage and fields. A sketch of one possible matching procedure follows below.
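One possible version of the matching procedure proposed above: artifact container name = primary dataset name + ".arts", with one subfolder per tool. The ".arts" suffix and the helper names are hypothetical, just to make the mapping concrete.

```python
from pathlib import Path

ARTIFACT_SUFFIX = ".arts"

def artifact_container(primary: str) -> Path:
    """WT.1000.mcool -> WT.1000.mcool.arts"""
    return Path(primary + ARTIFACT_SUFFIX)

def primary_dataset(container: str) -> Path:
    """WT.1000.mcool.arts -> WT.1000.mcool (the inverse mapping)"""
    if not container.endswith(ARTIFACT_SUFFIX):
        raise ValueError(f"not an artifact container: {container}")
    return Path(container[: -len(ARTIFACT_SUFFIX)])

def tool_dir(primary: str, tool: str) -> Path:
    """Each tool claims its own subfolder inside the container."""
    return artifact_container(primary) / tool

# e.g. tool_dir("WT.1000.mcool", "saddle") -> WT.1000.mcool.arts/saddle
```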
Ideas/suggestions?..
This issue generalizes #38.