Commit

Addressing comments by Dominik
giovannipizzi committed Apr 28, 2020
1 parent 774c8d4 commit 70790e4
Showing 1 changed file with 22 additions and 2 deletions.
24 changes: 22 additions & 2 deletions 003_efficient_object_store_for_repository/readme.md
@@ -12,15 +12,15 @@
## Background
AiiDA writes the "content" of each node in two places: attributes in the database, and files
(that do not need fast query) in a disk repository.
These files include for instance raw inputs and otputs of a job calculation, but also other binary or
These files include for instance raw inputs and outputs of a job calculation, but also other binary or
textual information best stored directly as a file (some notable examples: pseudopotential files,
numpy arrays, crystal structures in CIF format).

Currently, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name
is based on the node UUID with two levels of sharding
(that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`,
the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`).
Files of a nodes are stored within the node repository folder,
Files of a node are stored within the node repository folder,
possibly within a folder structure.
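
As an illustration of the two-level sharding described above, here is a minimal sketch of how the folder name can be derived from a node UUID. The function name `sharded_folder` is hypothetical and is not taken from the AiiDA code base; only the resulting layout is taken from the text.

```python
import uuid
from pathlib import PurePosixPath


def sharded_folder(node_uuid: str) -> PurePosixPath:
    """Return the relative folder of a node, using two levels of sharding.

    Hypothetical helper for illustration only (not the AiiDA API).
    """
    # Validate the input and normalise it to the lowercase hyphenated form;
    # uuid.UUID raises ValueError for a malformed string.
    canonical = str(uuid.UUID(node_uuid))
    # First two characters, next two characters, then the remainder.
    return PurePosixPath(canonical[:2], canonical[2:4], canonical[4:])


print(sharded_folder("4af3dd55-a1fd-44ec-b874-b00e19ec5adf"))
# 4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf
```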

While quite efficient when retrieving a single file
@@ -247,6 +247,26 @@ the different requirements, and represent what can be found in the current imple
As a note, seeking a file to a given position is what one typically does when watching a
video and jumping to a different section.

- Packing in general, at this stage, is left to the user. We can decide (at the object-store level, or probably
  better at the AiiDA level) to suggest that the user repack, or to trigger the repacking automatically.
  This feature can be introduced at a later stage. For instance, the first version we roll out could just suggest
  in the docs to repack periodically.
  This could be a good approach, also because it ties the repacking to the backing up (at the moment,
  backups probably need to be executed using appropriate scripts that back up the DB index and the repository
  in the "right order", possibly using SQLite functions to get a dump).
  As a note, even if repacking is never done, the situation is no worse than the current one in AiiDA, and actually
  a bit better: getting the list of files for a node without files would no longer need to access the disk,
  and similarly empty folders would no longer be created for nodes without files.

  In a second phase, we can print suggestions to repack, e.g. when restarting the daemon,
  for instance if the number of loose objects is too large.
We can also provide `verdi` commands for this.
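
  A check of this kind could look like the following minimal sketch. It assumes, purely for illustration, that loose objects are stored as plain files in a single flat folder; the function name `repack_suggestion` and the default threshold are hypothetical and are not part of the object store or of `verdi`.

```python
import os
from typing import Optional


def repack_suggestion(loose_dir: str, threshold: int = 1000) -> Optional[str]:
    """Return a hint to repack if there are too many loose objects, else None.

    Assumption: every regular file directly inside `loose_dir` is one loose object.
    """
    with os.scandir(loose_dir) as entries:
        count = sum(1 for entry in entries if entry.is_file())
    if count > threshold:
        return f"{count} loose objects found (threshold: {threshold}); consider repacking."
    return None
```

  The caller (for instance the code that restarts the daemon) would print the returned message only when it is not `None`.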

Finally, if we are confident that this approach works fine, we can also automate the repacking. We need to be careful
that two different processes don't start packing at the same time, and that the user is aware that packing will be
triggered, that it might take some time, and that the packing process should not be killed

@greschd

greschd Apr 28, 2020

Member

I think the packing procedure should provide some guarantee (if at all possible) that the repository will remain in a valid state even if the procedure is killed. This could of course happen even if the user triggers it manually.

(this might be inconvenient, and this is why I would think twice before implementing an automatic repacking).
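
  The two concerns above (only one process packing at a time, and a valid state even if packing is killed) can be sketched with standard-library primitives. This is not the implementation of the object store: the names `packing_lock` and `write_pack_atomically`, the lock-file approach, and the `.tmp` suffix are all assumptions made for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def packing_lock(lock_path):
    """Hold an exclusive lock file while packing (hypothetical helper).

    O_CREAT | O_EXCL makes the creation atomic: os.open raises
    FileExistsError if another process already created the lock file.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def write_pack_atomically(final_path, chunks):
    """Write a pack to a temporary file, then rename it into place.

    If the process is killed before os.replace, `final_path` is untouched
    and only a leftover temporary file remains.
    """
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data reached the disk
    os.replace(tmp_path, final_path)  # atomic when both paths are on the same filesystem
```

  Note that with this simple scheme a process killed while holding the lock leaves a stale lock file behind, which would need to be detected and cleaned up separately (for example by checking the recorded PID).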

### Why a custom implementation of the library
We have been investigating if existing codes could be used for the current purpose.
