allowing multiple pack storage locations #123

Closed
zhubonan opened this issue Nov 6, 2021 · 5 comments
Labels
wontfix This will not be worked on

Comments

@zhubonan
Collaborator

zhubonan commented Nov 6, 2021

One problem I face with my current AiiDA-based workflow is the growing size of the repository versus the finite size of the fast SSD storage. This can happen quite quickly if I have to run a few "large" calculations for which a lot of data is needed during post-processing and is provenance-critical. In theory, most of the files stored by AiiDA are not frequently accessed and are perfectly fine sitting on a slow storage location, e.g. a spinning disk or an NFS mount. On the other hand, having the whole repository on a slow storage location can slow down the daemon and workflows.

I think this package can offer a natural solution to this problem. Here, the loose "objects" can be written onto a fast-to-write disk. Read-only access to the "fully" packed packs no longer benefits from fast disk speed, so they can be moved onto slower storage if needed, e.g.:

  • loose files -> objectstore folder on fast SSD
  • not fully packed pack file -> objectstore folder on fast SSD
  • full pack file with only read access -> additional folders on slow storage location

At the moment, all of the (integer-numbered) packs are stored under the packs folder; would it be possible to allow multiple storage locations to be used (for fully "packed" ones)? I think it should just be a matter of iterating over the storage locations and checking whether the file exists, or a dictionary of pack ids and their locations could be built when the Container class is instantiated to reduce the overhead.
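For illustration, here is a minimal sketch of the second option; none of these names exist in disk-objectstore, they are purely hypothetical. The idea is to build a pack-id-to-path map over a list of storage locations once, when the container is instantiated, and use it on every read:

```python
import os


class MultiLocationPackIndex:
    """Hypothetical helper: map pack ids to files spread over several locations."""

    def __init__(self, pack_locations):
        # pack_locations: directories that may contain pack files, fastest storage first.
        self._index = {}
        for location in pack_locations:
            if not os.path.isdir(location):
                continue
            for name in os.listdir(location):
                if name.isdigit():
                    # First hit wins, so earlier (faster) locations take precedence.
                    self._index.setdefault(int(name), os.path.join(location, name))

    def get_pack_path(self, pack_id):
        try:
            return self._index[pack_id]
        except KeyError:
            raise FileNotFoundError(f"pack {pack_id} not found in any configured location")


# Example usage (paths are made up):
# index = MultiLocationPackIndex(["/ssd/container/packs", "/nfs/container/packs"])
# path = index.get_pack_path(0)
```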

Please let me know what you think about this idea. Thanks!

@zhubonan
Collaborator Author

Proof of concept PR #126

Pinging @giovannipizzi @chrisjsewell

@giovannipizzi
Member

giovannipizzi commented Dec 8, 2021

After discussion with @zhubonan and @chrisjsewell the following design could be envisaged:

  • add a new subfolder inside the container, called archived-packs
  • add a table in the SQLite DB, say ArchivedPacks, that has just two columns, pack_id and location (there should be a unique constraint on the pack_id column): the existence of a pack id in this table means the pack should not be looked for in the packs subfolder, but in the location subfolder, which by default is archived-packs
  • we should provide an API (and possibly also dostore command-line commands) to move a pack to the archived directory, possibly with a custom name, checking that it does not overlap with known names like sandbox and loose (or, each folder should be inside archived-packs/<LOCATION>, where <LOCATION> is the value of the location column). This would take care of moving the pack in a way that is aware that the destination might be on a different file system: e.g. first check that the pack is sealed (see issue "suggestion: consider using record markers in the packs" #124; we should define the concept of a "sealed" pack, only move sealed packs, and disallow adding to such a pack afterwards); then copy it over; then (after checking the MD5 to ensure the pack was successfully copied?) add the entry to the ArchivedPacks table; then (maybe as a maintenance operation) remove the pack from packs and only keep the archived version
  • there should be some command-line way to know where a pack is, and/or which archived locations exist
  • in the reading part, when an object is in a pack, if the pack is also in the ArchivedPacks table, then it is loaded from there and not from the packs/ folder (see the sketch after this list)
    • one note: the function returning the pack to write to should also avoid recreating a pack named 0 if that pack exists and is archived
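A minimal sketch of what the proposed ArchivedPacks table and the read-path lookup could look like, assuming the container keeps using an SQLAlchemy-backed SQLite database; the model and helper below are illustrative only, not existing disk-objectstore code:

```python
import os

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ArchivedPack(Base):
    """One row per archived pack (hypothetical model for the proposed design)."""

    __tablename__ = "archived_packs"

    id = Column(Integer, primary_key=True)
    # Unique constraint: a pack can live in at most one archived location.
    pack_id = Column(Integer, unique=True, nullable=False)
    # Container subfolder where the pack now lives; 'archived-packs' by default.
    location = Column(String, nullable=False, default="archived-packs")


def get_pack_path(session, container_root, pack_id):
    """Return the pack path, preferring the archived location if the pack was archived."""
    archived = session.query(ArchivedPack).filter_by(pack_id=pack_id).one_or_none()
    if archived is not None:
        return os.path.join(container_root, archived.location, str(pack_id))
    return os.path.join(container_root, "packs", str(pack_id))
```

With this layout, archiving a pack amounts to copying the file into the target folder and inserting the corresponding (pack_id, location) row.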

As a power user, I can then create folders inside archived-packs and mount them from some remote location.
In this way, archiving will allow moving big data to other locations.

In addition, there should be a function to check that all packs are actually there (e.g. to catch the case where one of the archived folders is not mounted; ideally it could also verify a checksum for further validation). The simple check of file existence should hopefully be fast, and should be done every time a new container instance is created; if a pack is missing, an exception is thrown? A sketch of such a check is given below.
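A hedged sketch of such a consistency check, run when the container is opened; the row layout and the optional stored checksum are assumptions on top of the design above:

```python
import hashlib
import os


def validate_archived_packs(archived_rows, container_root, verify_checksums=False):
    """archived_rows: iterable of (pack_id, location, expected_md5) tuples from the ArchivedPacks table."""
    for pack_id, location, expected_md5 in archived_rows:
        path = os.path.join(container_root, location, str(pack_id))
        # Cheap existence check, catching e.g. an archived folder whose mount is missing.
        if not os.path.isfile(path):
            raise FileNotFoundError(
                f"Archived pack {pack_id} not found at '{path}': is the location mounted?"
            )
        if verify_checksums and expected_md5 is not None:
            # Optional, much slower full validation against the stored checksum.
            md5 = hashlib.md5()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    md5.update(chunk)
            if md5.hexdigest() != expected_md5:
                raise ValueError(f"Checksum mismatch for archived pack {pack_id}")
```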

Finally, it should be easy for the user to archive the packs. E.g. one could have a command dostore archive-packs --keep-last=2 [--location=nfs], where --location might be optional and we might have a default location like archive; the command would take all the packs, keep the last 2 in the packs/ folder, and "move" all the rest (the sealed ones) to the archived-packs folder as described above. A possible shape for such a command is sketched below.
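A possible shape for that command, sketched with click; the command name, options, and folder layout follow the proposal above, but the implementation is only illustrative and compresses the sealing check, MD5 verification, and ArchivedPacks bookkeeping into a comment:

```python
import os
import shutil

import click


@click.command("archive-packs")
@click.argument("container_root", type=click.Path(exists=True, file_okay=False))
@click.option("--keep-last", default=2, show_default=True,
              help="Number of most recent packs to keep in packs/.")
@click.option("--location", default="archive", show_default=True,
              help="Subfolder of archived-packs/ that receives the packs.")
def archive_packs(container_root, keep_last, location):
    """Move all but the most recent KEEP_LAST packs into archived-packs/<location>."""
    packs_dir = os.path.join(container_root, "packs")
    target_dir = os.path.join(container_root, "archived-packs", location)
    os.makedirs(target_dir, exist_ok=True)

    pack_ids = sorted(int(name) for name in os.listdir(packs_dir) if name.isdigit())
    to_archive = pack_ids[:-keep_last] if keep_last > 0 else pack_ids
    for pack_id in to_archive:
        # A real implementation would first check that the pack is sealed, copy it,
        # verify the MD5, register it in the ArchivedPacks table, and only then
        # delete the original (see the design discussion above).
        shutil.move(os.path.join(packs_dir, str(pack_id)), os.path.join(target_dir, str(pack_id)))
        click.echo(f"archived pack {pack_id} -> {target_dir}")


if __name__ == "__main__":
    archive_packs()
```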

@zhubonan
Collaborator Author

zhubonan commented Dec 8, 2021

@giovannipizzi Thanks for the summary!

One potential issue I can think of with this is that if the user has multiple profiles, and hence multiple repositories, one can potentially make mistakes when mounting the correct folder inside archived-packs for the right disk-objectstore container. If such a mistake is made, my impression is that the current implementation would return an incorrect stream?

At the moment the packs are stored as numbered files, e.g. 1, 2, 3; would it make sense to add some kind of identifier to the pack file names, such as 1_<uuid-of-container>, to avoid potential errors?

@giovannipizzi
Member

Good point, thanks! Either that, or have a JSON file in the folder that gives this information. But I agree
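A minimal sketch of that marker-file idea: the container writes a small JSON file with its identifier into each archived location when the location is registered, and verifies it before serving packs from there. The file name, keys, and helper names below are all hypothetical:

```python
import json
import os

MARKER_NAME = "container_marker.json"  # hypothetical file name


def write_marker(location, container_id):
    """Record which container an archived-packs location belongs to."""
    with open(os.path.join(location, MARKER_NAME), "w", encoding="utf8") as handle:
        json.dump({"container_id": container_id}, handle)


def check_marker(location, container_id):
    """Refuse to use a mounted location that belongs to a different container."""
    with open(os.path.join(location, MARKER_NAME), encoding="utf8") as handle:
        marker = json.load(handle)
    if marker.get("container_id") != container_id:
        raise ValueError(
            f"Location '{location}' belongs to container {marker.get('container_id')}, "
            f"not {container_id}: wrong mount?"
        )


# Example usage (path and UUID are made up):
# write_marker("/nfs/archived-packs/archive", "6f0c...-uuid-of-the-container")
# check_marker("/nfs/archived-packs/archive", "6f0c...-uuid-of-the-container")
```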

@giovannipizzi
Member

After re-discussion with @zhubonan we realized that the logic described here is probably too complex. Probably the easiest is to just mount the packs subfolder at a different location. This should typically be sufficient for most use cases. I will therefore close this as a wontfix.

@giovannipizzi giovannipizzi added the wontfix This will not be worked on label Jul 6, 2023