Replies: 8 comments
-
Update from @Phlya: having one file with all artifacts wouldn't play well with Snakemake, since it expects the results of each step to be stored in a separate file. Presumably, a similar issue would arise with Nextflow, since it is also designed around the idea that each step creates its own files.
-
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files Touching flag files makes it possible to enforce pipeline steps that do not produce new files - that works just fine... The problem is parallelization: I'm not sure pipelines would allow simultaneous access to the same file, so only sample-to-sample parallelization would be straightforward. The multiresolution nature of cooler feeds into the same issue - there is no easy way to parallelize analyses across several resolutions (without extracting individual coolers, at least). We should check whether pipelines have special means of dealing with HDF5, e.g. allowing parallel reads. So I don't think that a single container is extremely problematic - we just need to keep some things in separate files: separate samples, of course; resolutions, maybe?
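For reference, a minimal Snakefile sketch of the flag-file approach. Snakemake's touch() output marker is real; the rule name, paths, and the shell command are placeholders for whatever tool appends its results into a shared per-sample container instead of writing its own output file.

```python
rule compute_artifact:
    input:
        cool="coolers/{sample}.1000.mcool"
    output:
        # the flag only records that the step finished; the actual results
        # go into the shared container, not into a per-step output file
        touch("flags/{sample}.artifact.done")
    shell:
        # placeholder command: any tool that writes into the shared container
        "run_artifact_tool {input.cool} --container artifacts/{wildcards.sample}"
```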
-
I think parallel reads from hdf5 are no problem, it's just writing that is an issue. So reading from coolers with many jobs should be fine (I certainly do that without any issues). But I guess we can't use an hdf5 file for the output storage because of this problem. Maybe Snakemake's grouping feature can help, but that only works with rules, not samples, and I also don't think it guarantees the jobs are not running at the same time if the memory/number of cores requested are <1/2 of those available on the node. After all, I think the only option there is using flag files, but that would give us the flexibility to use any convenient container that supports parallel writes. Which are?.. Is it OK to write in parallel to a zip file? If so, maybe .npz?
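For what it's worth, a small sketch of the "many readers, one cooler" pattern described above. cooler.Cooler and .matrix().fetch() are the real cooler API; the file name, resolution, and regions are made up for illustration.

```python
from multiprocessing import Pool

import cooler

# hypothetical multires cooler and regions of interest
CLR_URI = "WT.1000.mcool::resolutions/10000"
REGIONS = ["chr1:0-5,000,000", "chr2:0-5,000,000", "chr3:0-5,000,000"]

def fetch_region(region):
    # each worker opens its own read-only handle, so parallel reads are safe
    clr = cooler.Cooler(CLR_URI)
    return clr.matrix(balance=True).fetch(region)

if __name__ == "__main__":
    with Pool(3) as pool:
        matrices = pool.map(fetch_region, REGIONS)
```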
-
@Phlya, I'm afraid parallel writes to a zip are even less feasible than for hdf5. Zip is a very simple format...
-
Sad. I guess folder+files is the most robust and simple way, but I am not a fan of it... I don't like reading thousands of files at a time (which is what I do with pileups a lot, and it takes ages).
-
Yes, I kinda hate files too. That said, a storage/naming scheme can make working with files much easier. Why would the users of cooltools need to read thousands of files, though?
-
I definitely produce hundreds and often thousands of pileups for each project... Maybe that's just my approach. But when doing them by distance, for multiple samples, for different regions, for different resolutions, you get to pretty high numbers, e.g. one of my
-
That's why I was thinking about it again while waiting to read pileups from one of the bigger subfolders in there... which takes many minutes, at least on a cluster with a networked FS.
-
Issue: cooltools (and, potentially, pairtools) generate a variety of computational "artifacts", i.e. derivatives of primary datasets, e.g. P(s), compartments, saddleplots, insulation scores, etc. Currently, we lack a consistent way to name and store these datasets and their metadata on disk. This results in messy and inconsistent project folders, missing metadata and hours of time wasted on ad-hoc code that matches artifacts with their primary datasets. The lack of a consistent naming scheme also hinders further development of reporting scripts.
Proposal: come up with (a) a storage format and (b) a naming schema that would automate storage, discovery, and access to computational artifacts.
Potential solutions.
(A) File format. We need some kind of a container that can store computational artifacts of various kinds (tables, texts, binary arrays, etc...) and provide random access and append/rewrite functionality. Potential solutions:
My personal favorite is a zero-compression (aka STORE) zip file. It is a widely accepted format (MS Office formats are zip files!) and can be accessed from the command line, Python, and R. Like a folder with files, it offers random access and append/rewrite functionality, but it also has the advantage of being easily transferable between machines (admittedly, this is not a very strong advantage).
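A rough illustration of the zero-compression zip container idea, using only the standard library and numpy; the container name, member paths, and metadata fields are made up for the example.

```python
import io
import json
import zipfile

import numpy as np

# mode="a" appends to an existing container or creates a new one;
# ZIP_STORED means no compression, so members are just byte ranges
with zipfile.ZipFile("WT.1000.mcool.arts", mode="a",
                     compression=zipfile.ZIP_STORED) as zf:
    # append a binary array under a tool-specific path
    buf = io.BytesIO()
    np.save(buf, np.random.rand(100, 100))
    zf.writestr("pileups/by_distance/0-100kb.npy", buf.getvalue())
    # keep metadata as plain text next to the data
    zf.writestr("pileups/meta.json",
                json.dumps({"primary": "WT.1000.mcool", "resolution": 10000}))
```

Reading a single member back is random access: `zipfile.ZipFile("WT.1000.mcool.arts").read("pileups/meta.json")`.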
HDF5 can serve as a key-value store as well. The downside is that it treats all datasets as arrays, doesn't work well with NFS (according to @mimakaev) and requires special CLI tools/libraries to manipulate.
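For comparison, a sketch of the HDF5-as-key-value-store option via h5py; as noted above, everything ends up as arrays/attributes and you need h5py or the HDF5 CLI tools to inspect it. Names and fields are illustrative only.

```python
import h5py
import numpy as np

with h5py.File("WT.1000.mcool.arts.h5", "a") as f:
    grp = f.require_group("saddleplot")
    # first write; re-running would need require_dataset or deleting the key
    grp.create_dataset("matrix", data=np.random.rand(50, 50))
    # metadata lives in HDF5 attributes
    grp.attrs["primary"] = "WT.1000.mcool"
    grp.attrs["resolution"] = 10000
```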
Various databases/key-value stores are another alternative, but it's not clear to me why they would serve better for
(B) Schema.
My initial proposal is that, for each primary dataset, we would create a file or folder with a name derived from the filename of the dataset. E.g., if the primary dataset is called 'WT.1000.mcool', the artifact file/folder would be called 'WT.1000.mcool.arts' or something like that. Probably, the most important point is that there should be a single, well-defined procedure that matches the artifact file/folder with its primary dataset and vice versa.
Then, inside the artifact container, each computational tool would claim its own folder, presumably named after the tool itself. The structure of the files inside that folder would be left up to the tools' creators. We could, however, suggest some default schema that would standardize metadata storage and fields. A sketch of one possible matching procedure follows below.
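One possible version of the matching procedure proposed above: artifact container name = primary dataset name + ".arts", with one subfolder per tool. The ".arts" suffix and the helper names are hypothetical, just to make the mapping concrete.

```python
from pathlib import Path

ARTIFACT_SUFFIX = ".arts"

def artifact_container(primary: str) -> Path:
    """WT.1000.mcool -> WT.1000.mcool.arts"""
    return Path(primary + ARTIFACT_SUFFIX)

def primary_dataset(container: str) -> Path:
    """WT.1000.mcool.arts -> WT.1000.mcool (the inverse mapping)"""
    if not container.endswith(ARTIFACT_SUFFIX):
        raise ValueError(f"not an artifact container: {container}")
    return Path(container[: -len(ARTIFACT_SUFFIX)])

def tool_dir(primary: str, tool: str) -> Path:
    """Each tool claims its own subfolder inside the container."""
    return artifact_container(primary) / tool

# e.g. tool_dir("WT.1000.mcool", "saddle") -> WT.1000.mcool.arts/saddle
```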
Ideas/suggestions?..
This issue generalizes #38.