From 774c8d426f0a9236e0e78ba1f8f9f4fd81bb690d Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Fri, 17 Apr 2020 22:07:11 +0200 Subject: [PATCH 1/8] First proposal of the AEP for the object store of AiiDA --- .gitignore | 2 + .../readme.md | 345 ++++++++++++++++++ README.md | 1 + 3 files changed, 348 insertions(+) create mode 100644 003_efficient_object_store_for_repository/readme.md diff --git a/.gitignore b/.gitignore index 7efedc5..1d9cc9b 100644 --- a/.gitignore +++ b/.gitignore @@ -222,3 +222,5 @@ tags [._]*.un~ # End of https://www.gitignore.io/api/vim,python,pycharm + +.DS_Store diff --git a/003_efficient_object_store_for_repository/readme.md b/003_efficient_object_store_for_repository/readme.md new file mode 100644 index 0000000..81734f6 --- /dev/null +++ b/003_efficient_object_store_for_repository/readme.md @@ -0,0 +1,345 @@ +# Efficient object store for the AiiDA repository + +| AEP number | 003 | +|------------|--------------------------------------------------------------| +| Title | Efficient object store for the AiiDA repository | +| Authors | [Giovanni Pizzi](mailto:giovanni.pizzi@epfl.ch) (giovannipizzi)| +| Champions | [Giovanni Pizzi](mailto:giovanni.pizzi@epfl.ch) (giovannipizzi), [Francisco Ramirez](mailto:francisco.ramirez@epfl.ch) (ramirezfranciscof), [Sebastiaan P. Huber](mailto:sebastiaan.huber@epfl.ch) (sphuber) | +| Type | S - Standard Track AEP | +| Created | 17-Apr-2020 | +| Status | submitted | + +## Background +AiiDA writes the "content" of each node in two places: attributes in the database, and files +(that do not need fast query) in a disk repository. +These files include for instance raw inputs and otputs of a job calculation, but also other binary or +textual information best stored directly as a file (some notable examples: pseudopotential files, +numpy arrays, crystal structures in CIF format). + +Currently, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name +is based on the node UUID with two levels of sharding +(that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, +the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). +Files of a nodes are stored within the node repository folder, +possibly within a folder structure. + +While quite efficient when retrieving a single file +(keeping too many files or subfolders in the same folder is +inefficient), the current implementation suffers of a number of problems +when starting to have hundreds of thousands of nodes or more +(we already have databases with about ten million files). + +In particular: +- there is no way to compress files unless the AiiDA plugin does it; +- if a code generates hundreds of files, each of them is stored as an individual file; +- there are many empty folders that are generated for each node, even when no file needs to be stored. + +We emphasize that having many inodes (files or folders) is a big problem. In particular: +- a common filesystem (e.g. EXT4) has a maximum number of inodes that can be stored, and we already have + experience of AiiDA repositories that hit this limit. The disk can be reformatted to increase this limit, + but this is clearly not an option that can be easily suggested to users. +- When performing "bulk" operations on many files (two common cases are when exporting a portion of the DB, + or when performing a backup), accessing hundreds of thousands of files is extremely slow. 
Disk caches + (in Linux, for instance) make this process much faster, but this works only as long as the system has a lot + of RAM, and the files have been accessed recently. Note that this is not an issue of AiiDA. Even just running + `rsync` on a folder with hundreds of thousands of files (not recently accessed) can take minutes or even hours + just to check if the files need to be updated, while if the same content is in a single big file, the operation + would take much less (possibly even less than a second for a lot of small files). As a note, this is the reason + why at the moment we provide a backup script to perform incremental backups of the repository (checking the + timestamp in the AiiDA database and only transferring node repository folders of new nodes), but the goal would be to rely instead on standard tools like `rsync`. + + +## Proposed Enhancement +The goal of this project is to have a very efficient implementation of an "object store" that: +- works directly on a disk folder; +- ideally, does not require a service to be running in the background, + to avoid to have to ask users to run yet another service to use AiiDA; +- and addresses a number of performance issues mentioned above and discussed more in detail below. + +**NOTE**: This AEP does not address the actual implementation of the object store in AiiDA, but +rather the implementation of an object store to solve a number of performance issues. +The description of the integration with AiiDA will be discussed in a different AEP +(see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written before this AEP). + +## Detailed Explanation +The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, +given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, +and then allows to retrieve it with that key. + +The libary also supports efficient bulk write and read operations +to cover the cases of bulk operations in AiiDA (exporting, importing, +backups). + +Here we describe some design criteria for such an object store. +We also provide an implementation that follows these criteria in the +[disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) +repository, where efficiency has also been benchmarked. + +### Performance guidelines and requirements + +#### Packing and loose objects +The core idea behind the object store implementation is that, instead of writing a lot of small files, these +are packed into a few big files, and an index of some sort keeps track of where each object is in the big file. +This is for instance what also git does internally. + +However, one cannot write directly into a pack file: this would be inefficient, especially because of a key requirement: +multiple Unix processes must be able to write efficiently and *concurrently*, i.e. at the same time, without data +corruption (for instance, the AiiDA daemon is composed of multiple +concurrent workers accessing the repository). +Therefore, as also `git` does, we introduce the concept of `loose` objects (where each object is stored as a different +file) and of `packed` objects when these are written as part of a bigger pack file. + +The key concept is that we want maximum performance when writing a new object. +Therefore, each new object is written, by default, in loose format. +Periodically, a packing operation can be executed, taking loose objects and moving them into packs. 
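
To make the loose-object idea concrete, here is a minimal conceptual sketch (standard library only, not the actual implementation; the folder layout and function name are hypothetical): each writer streams its bytes to a private temporary file and then atomically renames it into place, which is why many concurrent daemon workers can write at the same time without ever exposing a half-written object. It anticipates the sandbox-and-rename and sharding choices detailed in the next section.

```python
import os
import tempfile
import uuid


def write_loose(container: str, data: bytes) -> str:
    """Conceptual sketch: store ``data`` as a loose object and return its key.

    A random UUID is used as the key here, as in this first version of the
    proposal (a content hash is discussed later as an alternative).
    """
    key = uuid.uuid4().hex

    # Write to a sandbox folder on the same filesystem first...
    sandbox = os.path.join(container, "sandbox")
    os.makedirs(sandbox, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=sandbox)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)

    # ...then move it into place atomically, under a one-level sharded path:
    # a crash never leaves a partial loose object, and no central index is
    # needed to know that the object exists.
    loose_dir = os.path.join(container, "loose", key[:2])
    os.makedirs(loose_dir, exist_ok=True)
    os.replace(tmp_path, os.path.join(loose_dir, key[2:]))
    return key
```

Packing then consists of moving the content of these loose files into a few big pack files, as described below.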
+ +Accessing an object should only require that the user of the library provides its key, and the concept of packing should +be hidden to the user (at least when retrieving, while it can be exposed for maintenance operations like repacking). + +#### Efficiency decisions +Here are some possible decisions. These are based on a compromise between +the different requirements, and represent what can be found in the current implementation in the +[disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) package. + +- As discussed above, objects are written by default as loose objects, with one file per object. + They are also stored uncompressed. This gives maximum performance when writing a file (e.g. while storing + a new node). Moreover, it ensures that many writers can write at the same time without data corruption. + +- Loose objects are stored with a one-level sharding format: `4a/f3dd55a1fd44ecb874b00e19ec5adf`. + Current experience (with AiiDA) shows that it is actually not so good to use two + levels of nesting, that was employed to avoid to have too many files in the same folder. And anyway the core idea of this implementationis that when there are too many loose objects, the user will pack them. + The number of characters in the first part can be chosen in the library, but a good compromise + after testing is 2 (the default, and also the value used internally by git). + +- Loose objects are actually written first to a sandbox folder (in the same filesystem), + and then moved in place with the correct UUID only when the file is closed, with an atomic operation. + This should prevent having leftover objects if the process dies. + In this way, we can rely simply on the fact that an object present in the folder of loose objects signifies that the object exists, without needing a centralised place to keep track of them. + + Having the latter would be a potential performance bottleneck, as if there are many concurrent + writers, the object store must guarantee that the central place is kept updated. + +- Packing can be triggered by the user periodically, whenever the user wants. + It should be ideally possible to pack while the object store + is in use, without the need to stop its use (which would in turn + require to stop the use of AiiDA and the deamons during these operations). + This is possible in the current implementation, but might temporarily impact read performance while repacking (which is probably acceptable). + Instead, it is *not* required that packing can be performed in parallel by multiple processes + (on the contrary, the current implementation actually tries to prevent multiple processes trying to perform + write operations on packs concurrently). + +- In practice, this operation takes all loose objects and puts them in a controllable number + of packs. The name of the packs is given by the first few letters of the UUID + (by default: 2, so 256 packs in total; configurable). A value of 2 is a good balance + between the size of each pack (on average, ~4GB/pack for a 1TB repository) and + the number of packs (having many packs means that, even when performing bulk access, + many different files need to be open, which slows down performance). + +- Pack files are just concatenation of bytes of the packed objects. Any new object + is appended to the pack (thanks to the efficiency of opening a file for appending). + The information for the offset and length of each pack is kept in a single SQLite + database for the whole set of objects, as we describe below. 
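
The pack layout can be sketched with a few lines of standard-library Python (again purely illustrative, not the real code): appending an object to a pack is a single append-only write, and retrieving it afterwards is a lookup of `(offset, length)` in the SQLite index followed by a seek and a bounded read. The real index also records a compression flag and the uncompressed size, as described below.

```python
import os
import sqlite3


def open_index(container: str) -> sqlite3.Connection:
    """Open (and create if needed) the single SQLite index of packed objects."""
    conn = sqlite3.connect(os.path.join(container, "packs.idx"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS obj "
        "(uuid TEXT PRIMARY KEY, pack TEXT, offset INTEGER, length INTEGER)"
    )
    conn.commit()
    return conn


def append_to_pack(container: str, key: str, data: bytes, conn: sqlite3.Connection) -> None:
    pack_name = key[:2]  # pack chosen from the first characters of the key, as described above
    pack_path = os.path.join(container, "packs", pack_name)
    os.makedirs(os.path.dirname(pack_path), exist_ok=True)
    with open(pack_path, "ab") as pack:
        # Only a single packer process writes to packs, so this offset is stable.
        offset = pack.seek(0, os.SEEK_END)
        pack.write(data)
    conn.execute("INSERT INTO obj VALUES (?, ?, ?, ?)", (key, pack_name, offset, len(data)))
    conn.commit()


def read_packed(container: str, key: str, conn: sqlite3.Connection) -> bytes:
    pack_name, offset, length = conn.execute(
        "SELECT pack, offset, length FROM obj WHERE uuid = ?", (key,)
    ).fetchone()
    with open(os.path.join(container, "packs", pack_name), "rb") as pack:
        pack.seek(offset)
        return pack.read(length)
```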
+ +- Packed objects can optionally be compressed. Note that compression is on a per-object level. + This allows much greater flexibility, and avoid e.g. to recompress files that are already compressed. + One could also think to clever logic ot heuristics to try to compress a file, but then store it + uncompressed if it turns out that the compression ratio is not worth the time + needed to further uncompress it later. + +- API exists both to get and write a single object but also, *importantly*, to write directly + to pack files (this cannot be done by multiple processes at the same time, though), + and to read in bulk a given number of objects. + This is particularly convenient when using the object store for bulk import and + export, and very fast. Also, it could be useful when getting all files of one or more nodes. Actually, for export files, one could internally + use the same object-store implementation to store files in the export file. + + During normal operation, however, as discussed above, we expect the library user to write loose objects, + to be repacked periodically (e.g. once a week, or when needed). + + Some reference results for bulk operations in the current implementation: + Storing 100'000 small objects directly to the packs takes about 10s. + The time to retrieve all of them is ~2.2s when using a single bulk call, + compared to ~44.5s when using 100'000 independent calls (still probably acceptable). + Moreover, even getting, in 10 bulk calls, 10 random chunks of the objects (eventually + covering the whole set of 100'000 objects) only takes ~3.4s. This should demonstrate + that exporting a subset of the graph should be efficient (and the object store format + could be used also inside the export file). Also, this should be compared to minutes to hours + when storing each object as individual files. + +- All operations internally (storing to a loose object, storing to a pack, reading + from a loose object or from a pack, compression) should happen via streaming + (at least internally, and there should be a public facing API for this). + So, even when dealing with huge files, these never fill the RAM (e.g. when reading + or writing a multi-GB file, the memory usage has been tested to be capped at ~150MB). + Convenience methods are available, anyway, to get directly an object content, if + the user wants this for simplicty, and knows that the content fits in RAM. + +#### Further design choices + +- Each given object will get a random UUID as a key (its generation cost is negligible, about + 4 microseconds per UUID). + It's up to the caller to track this into a filename or a folder structure. + The UUID is generated by the implementation and cannot be passed from the outside. + This guarantees random distribution of objects in packs, and avoids to have to + check for objects already existing (that can be very expensive). + +- Pack naming and strategy is not determined by the user. Anyway it would be difficult + to provide easy options to the user to customize the behavior, while implementing + a packing strategy that is efficient. Moreover, with the current packing strategy, + it is immediate to know in which pack to check without having to keep also an index + of the packs (this, however, would be possible to implement in case we want to extend the behavior, + since anyway we have an index file). But at the moment it does not seem necessary. 
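
To illustrate the streaming requirement and the per-object compression mentioned above, here is a small self-contained sketch (a hypothetical helper, not the library's API): the data is pushed through `zlib.compressobj` in fixed-size chunks, so memory stays bounded even for multi-GB objects, and the helper returns both the number of bytes actually written and the original size, i.e. exactly the `length` and `size` metadata that the index described next has to record for a compressed object.

```python
import zlib
from typing import BinaryIO, Tuple

CHUNK = 64 * 1024  # process 64 kB at a time: memory use stays bounded


def stream_compress(source: BinaryIO, dest: BinaryIO) -> Tuple[int, int]:
    """Compress ``source`` into ``dest`` chunk by chunk.

    Returns ``(stored_length, uncompressed_size)``, i.e. the two sizes that
    the pack index stores as ``length`` and ``size`` for a compressed object.
    """
    compressor = zlib.compressobj()
    stored_length = 0
    uncompressed_size = 0
    while True:
        chunk = source.read(CHUNK)
        if not chunk:
            break
        uncompressed_size += len(chunk)
        out = compressor.compress(chunk)
        stored_length += len(out)
        dest.write(out)
    tail = compressor.flush()
    stored_length += len(tail)
    dest.write(tail)
    return stored_length, uncompressed_size
```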
+ +- For each object, the SQLite database contains the following fields, that can be considered + to be the "metadata" of the packed object: its key (`uuid`), the `offset` (starting + position of the bytestream in the pack file), the `length` (number of bytes to read), + a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, + and the `size` of the actual data (equal to `length` if `compressed` is false, + otherwise the size of the uncompressed stream, useful for statistics for instance, or + to inform the reader beforehand of how much data will be returned, before it starts + reading, so the reader can decide to store in RAM the whole object or to process it + in a streaming way). + +- A single index file is used. Having one pack index per file, while reducing a bit + the size of the index (one could skip storing the first part of the UUID, determined + by the pack naming) turns out not to be very effective. Either one would keep all + indexes open (but then one quickly hits the maximum number of open files, that e.g. + on Mac OS is of the order of ~256), or open the index, at every request, that risks to + be quite inefficient (not only to open, but also to load the DB, perform the query, + return the results, and close again the file). Also for bulk requests, anyway, this + would prevent making very few DB requests (unless you keep all files open, that + again is not feasible). + +- We decided to use SQLite because of the following reasons: + - it's a database, so it's efficient in searching for the metadata + of a given object; + - it does not require a running server; + - in [WAL mode](https://www.sqlite.org/wal.html), allows many concurrent readers and one writer, + useful to allow to continue normal operations by many Unix processes during repacking; + - we access it via the SQLAlchemy library that anyway is already + a dependency of AiiDA, and is actually the only dependency of the + current object-store implementation. + - it's quite widespread and so the libary to read and write should be reliable. + +- Deletion can just occur efficiently as either a deletion of the loose object, or + a removal from the index file (if the object is already packed). Later repacking of the packs can be used to recover + the disk space still occupied in the pack files. + +- The object store does not need to provide functionality to modify + a node. In AiiDA files of nodes are typically only added, and once + added they are immutable (and very rarely they can be deleted). + + If modification is needed, this can be achieved by creation of a new + object and deletion of the old one, since this is an extremely + rare operation (actually it should never happen). + +- The current packing format is `rsync`-friendly (that is one of the original requirements). + `rsync` has a clever rolling algorithm that can divides each file in blocks and + detects if the same block is already in the destination file, even at a different position. + Therefore, if content is appended to a pac file, or even if a pack is "repacked" (e.g. reordering + objects inside it, or removing deleted objects) this does not prevent efficient + rsync transfer (this has been tested in the implementation). + +- Appending content to a single file does not prevent the Linux disk cache to work efficiently. + Indeed, the caches are per blocks/pages in linux, not per file. + Concatenating to files does not impact performance on cache efficiency. 
What is costly is opening a file as the filesystem + has to provide some guarantees e.g. on concurrent access. + As a note, seeking a file to a given position is what one typically does when watching a + video and jumping to a different section. + +### Why a custom implementation of the library +We have been investigating if existing codes could be used for the current purpose. + +However, in our preliminary analysis we couldn't find an existing robust software that was satisfying all criteria. +In particular: + +- existing object storage implementations (e.g. Open Stack Swift, or others that typically provide a + S3 or similar API) are not suitable since 1) they require a server running, and we would like to + avoid the complexity of asking users to run yet another server, and most importantly 2) they usually work + via a REST API that is extremely inefficient when retrieving a lot of small objects (latencies can + be even of tens of ms, that is clearly unacceptable if we want to retrieve millions of objects). + +- the closest tool we could find is the git object store, for which there exists also a pure-python + implementation ([dulwich](https://github.com/dulwich/dulwich)). We have been benchmarking it, but + the feeling is that it is designed to do something else, and adapting to our needs might not be worth it. + Some examples of issues we envision: managing re-packing (dulwich can only pack loose objects, but not repack existing packs); this can be done via git itself, but then we need to ensuring that when repacking, objects are not garbage-collected and deleted because they are not referenced within git; + it's not possible (apparently) to decide if we want to compress an object or not (e.g., in case we want maximum performance at the cost of disk space), but they are always compressed. + Also we did not test concurrency access and packing of the git repository, which requires some stress test to assess if it works. + +- One possible solution could be a format like HDF5. However we need a very efficient hash-table like access + (given a key, get the object), while HDF5 is probably designed to append data to the last dimension of + a multidimensional array. Variables exist, but (untested) probably they cannot scale well + at the scale of tens of millions. + +- One option we didn't investigate is the mongodb object store. Note that this would require running a server + though (but could be acceptable if for other reasons it becomes advantageous). + +Finally, as a note, we stress that an efficient implementation in about 1'000 lines of code has been +already implemented, so the complexity of writing an object store library is not huge (at the cost, however, +of having to stress test ourselves that the implementation is bug-free). + +### UUIDs vs SHA hashes + +One thing to discuss is whether to replace a random UUID with a strong hash of the content +(e.g. SHA1 as git does, or some stronger hash). +The clear advantage is that one would get "for free" the possibility to deduplicate identical +files. Moreover, computation of hashes even for large files is very fast (comparable to the generation of a UUID) +even for GB files. + +However, this poses a few of potential problem: + +- if one wants to work in streaming mode, the hash could be computed only at the *end*, + after the whole stream is processed. While this is still ok for loose objects, this is a problem + when writing directly to packs. 
One possible workaround could be to decide that objects are added + to random packs, and store the pack as an additional column in the index file. + +- Even worse, it is not possible to know if the object already exists before it has been completely received + (and therefore written somewhere, because it might not fit in memory). Then one would need to perform a search + if an object with the same hash exists, and possibly discard the file (that might actually have been already written to a pack). + Finally, one needs to do this in a way that works also for concurrent writes, and does not fail if two processes + write objects with the same SHA at the same time. + +- One could *partially* address the problem, for bulk writes, by asking the user to provide the hash beforehand. However, + I think it is a not a good idea; instead, it is better keep stronger guarantees at the library level: + otherwise one has to have logic to decide what to do if the hash provided by the user turns out to be wrong. + +- Instead, the managing of hashing could be done at a higher level, by the AiiDA repository implementation: anyway + the repository needs to keep track of the filename and relative path of each file (within the node repository), + and the corresponding object-store UUIDs. The implementation could then also compute and store the hash in some + efficient way, and then keep a table mapping SHA hashes to the UUIDs in the object store, and take care of the + deduplication. + This wouldn't add too much cost, and would keep a separation of concerns so that the object-store implementation can be simpler, + give higher guarantees, be more robust, and make it simpler to guarantee that data is not corrupted even for concurrent access. + +## Pros and Cons + +### Pros +* Is very efficient also with hundreds of thousands of objects (bulk reads can read all objects in a couple of seconds). +* Is rsync-friendly, so one can suggest to use directly rsync for backups instead of custom scripts. +* An implementation exists, seems to be very efficient, and is + relatively short (and already has tests with 100% coverage, + including concurrent writes, reads, and one repacking process, and checking on three platforms: linux, mac os, and windows). +* It does not add any additional python depenency to AiiDA, and it + does not require a service running. +* It can be used also for the internal format of AiiDA export files, + and this should be very efficient, even to later retrieve just + a subset of the files. +* Implements compression in a transparent way. +* It shouldn't be slower than the current AiiDA implementation in writing + during normal operation, since objects are stored as loose objects. + Actually, it might be faster for nodes without files, as no disk + access will be needed anymore. + +### Cons +* Extreme care is needed to convince ourselves that there are no + bugs and no risk of corrupting or losing the users' data. +* Object metadata must be tracked by the caller in some other database. + Similarly, deduplication via SHA hashes is not implemented and need to be + implemented by the caller. +* It is not possible anymore to access directly the folder of a node by opening + a bash shell and using `cd` to go the folder, + e.g. to quickly check the content. + However, we have `verdi node repo dump` + so direct access should not be needed anymore, and actually this + might be good to prevent that people corrupt by + mistake the repository. 
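
To make the bulk-read figures quoted above more concrete, the following sketch (building on the toy index of the earlier example, so again not the real implementation) shows why a single bulk call is so much cheaper than many independent calls: the requested keys are grouped per pack, each pack file is opened only once, and objects are read in increasing offset order; a real implementation would additionally batch the metadata lookups themselves into a few SQL queries.

```python
import os
import sqlite3
from collections import defaultdict
from typing import Dict, List


def read_many(container: str, keys: List[str], conn: sqlite3.Connection) -> Dict[str, bytes]:
    """Conceptual bulk read: one open() per pack, objects read in offset order."""
    per_pack = defaultdict(list)
    for key in keys:
        pack_name, offset, length = conn.execute(
            "SELECT pack, offset, length FROM obj WHERE uuid = ?", (key,)
        ).fetchone()
        per_pack[pack_name].append((offset, length, key))

    results = {}
    for pack_name, entries in per_pack.items():
        with open(os.path.join(container, "packs", pack_name), "rb") as pack:
            for offset, length, key in sorted(entries):  # sequential access pattern
                pack.seek(offset)
                results[key] = pack.read(length)
    return results
```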
diff --git a/README.md b/README.md index 239e253..1002e94 100644 --- a/README.md +++ b/README.md @@ -17,6 +17,7 @@ ecosystem and to document the decision making process. | 000 | active | [AEP guidelines](000_aep_guidelines/readme.md) | | 001 | active | [Drop support for Python 2.7](001_drop_python2/) | | 002 | active | [AiiDA Dependency Management](002_dependency_management/) | +| 003 | submitted | [Efficient object store for the AiiDA repository](003_efficient_object_store_for_repository/) | ## Submitting an AEP The submission process is described in the [AEP guidelines](000_aep_guidelines/readme.md) which also act as a template for new AEPs. From 70790e460ea31d8a1b4d8de7b96b9a081712e4e5 Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Tue, 28 Apr 2020 12:01:33 +0200 Subject: [PATCH 2/8] Addressing comments by Dominik --- .../readme.md | 24 +++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/003_efficient_object_store_for_repository/readme.md b/003_efficient_object_store_for_repository/readme.md index 81734f6..7fb6be8 100644 --- a/003_efficient_object_store_for_repository/readme.md +++ b/003_efficient_object_store_for_repository/readme.md @@ -12,7 +12,7 @@ ## Background AiiDA writes the "content" of each node in two places: attributes in the database, and files (that do not need fast query) in a disk repository. -These files include for instance raw inputs and otputs of a job calculation, but also other binary or +These files include for instance raw inputs and outputs of a job calculation, but also other binary or textual information best stored directly as a file (some notable examples: pseudopotential files, numpy arrays, crystal structures in CIF format). @@ -20,7 +20,7 @@ Currently, each of these files is directly stored in a folder structure, where e is based on the node UUID with two levels of sharding (that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). -Files of a nodes are stored within the node repository folder, +Files of a node are stored within the node repository folder, possibly within a folder structure. While quite efficient when retrieving a single file @@ -247,6 +247,26 @@ the different requirements, and represent what can be found in the current imple As a note, seeking a file to a given position is what one typically does when watching a video and jumping to a different section. +- Packing in general, at this stage, is left to the user. We can decide (at the object-store level, or probably + better at the AiiDA level) to suggest the user to repack, or to trigger the repacking automatically. + This can be a feature introduced at a second time. For instance, the first version we roll out could just suggest + to repack periodically in the docs to repack. + This could be a good approach, also to bind the repacking with the backing up (at the moment, + probably backups need to be executed using appropriate scripts to backup the DB index and the repository + in the "right order", and possibly using SQLite functions to get a dump). + As a note, even if repacking is never done, the situation is anyway as the current one in AiiDA, and actually + a bit better because getting the list of files for a node without files wouldn't need anymore to access the disk, + and similarly there wouldn't be anymore empty folders created for nodes without files. + + In a second phase, we can print suggestions, e.g. 
when restarting the daemon, + that suggests to repack, for instance if the number of loose objects is too large. + We can also provide `verdi` commands for this. + + Finally, if we are confident that this approach works fine, we can also automate the repacking. We need to be careful + that two different processes don't start packing at the same time, and that the user is aware that packing will be + triggered, that it might take some time, and that the packing process should not be killed + (this might be inconvenient, and this is why I would think twice before implementing an automatic repacking). + ### Why a custom implementation of the library We have been investigating if existing codes could be used for the current purpose. From c9f86f7c2ad359ad8236415af9c63da278e6a6f7 Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Tue, 28 Apr 2020 13:10:00 +0200 Subject: [PATCH 3/8] Addressing comments by Sebastiaan --- .../readme.md | 31 +++++++++++++++++-- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/003_efficient_object_store_for_repository/readme.md b/003_efficient_object_store_for_repository/readme.md index 7fb6be8..67d5bdf 100644 --- a/003_efficient_object_store_for_repository/readme.md +++ b/003_efficient_object_store_for_repository/readme.md @@ -140,11 +140,33 @@ the different requirements, and represent what can be found in the current imple The information for the offset and length of each pack is kept in a single SQLite database for the whole set of objects, as we describe below. -- Packed objects can optionally be compressed. Note that compression is on a per-object level. +- Packed objects can optionally be compressed. Note that compression is on a + per-object level. The fact an object is compressed is stored in the index. When + the users ask for an object, they always get back the uncompressed version (so + they don't have to worry if objects are compressed or not when retrieving them). This allows much greater flexibility, and avoid e.g. to recompress files that are already compressed. One could also think to clever logic ot heuristics to try to compress a file, but then store it uncompressed if it turns out that the compression ratio is not worth the time needed to further uncompress it later. + At the moment, one compression will be chosen and used by default (currently + zlib, but in issues it has been suggested to use more modern formats like + `xz`, or even better [snappy](https://github.com/google/snappy) that is very fast and designed for purposes + like this). + In the future, if we want, it would be easy to make the compression library + an option, and store this in the JSON that contains the settings of the + container. + +- A note on compression: the user can always compress objects first, and then store + a compressed version of them and take care of remembering if an object + was stored compressed or not. 
However, the implementation of compression + directly in the object store, as described in the previous point, has the two advantages that compression is done only while packing, + so there is no performance hit while just storing a new object, + and that is completely transparent to the user (while packing, the user can + decide to compress data or not; then, when retrieving an object from of the + object store, there will not be any difference - except possibly speed in + retrieving the data - because the API to retrieve the objects will be the same, + irrespective of whether the object has been stored as compressed or not; + and data is always returned uncompressed). - API exists both to get and write a single object but also, *importantly*, to write directly to pack files (this cannot be done by multiple processes at the same time, though), @@ -188,7 +210,10 @@ the different requirements, and represent what can be found in the current imple a packing strategy that is efficient. Moreover, with the current packing strategy, it is immediate to know in which pack to check without having to keep also an index of the packs (this, however, would be possible to implement in case we want to extend the behavior, - since anyway we have an index file). But at the moment it does not seem necessary. + since anyway we have an index file). But at the moment it does not seem necessary. Possible future changes of the internal packing format should not + affect the users of the library, since users only ask to get an object by UUID, + and in general they do not need to know if they are getting a loose object, + a packed object, from which pack, ... - For each object, the SQLite database contains the following fields, that can be considered to be the "metadata" of the packed object: its key (`uuid`), the `offset` (starting @@ -226,7 +251,7 @@ the different requirements, and represent what can be found in the current imple the disk space still occupied in the pack files. - The object store does not need to provide functionality to modify - a node. In AiiDA files of nodes are typically only added, and once + an object. In AiiDA, files of nodes are typically only added, and once added they are immutable (and very rarely they can be deleted). If modification is needed, this can be achieved by creation of a new From 5af4c7db327f8fefef4c707fff35ac0a5a0ae337 Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Tue, 28 Apr 2020 13:28:55 +0200 Subject: [PATCH 4/8] Addressing comments by Espen --- .../readme.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/003_efficient_object_store_for_repository/readme.md b/003_efficient_object_store_for_repository/readme.md index 67d5bdf..0d9257d 100644 --- a/003_efficient_object_store_for_repository/readme.md +++ b/003_efficient_object_store_for_repository/readme.md @@ -50,8 +50,8 @@ We emphasize that having many inodes (files or folders) is a big problem. In par ## Proposed Enhancement -The goal of this project is to have a very efficient implementation of an "object store" that: -- works directly on a disk folder; +The goal of this project is to have a very efficient implementation of an "object store" (or, more correctly, a key-value store) that: +- works directly on a disk folder (i.e. 
only requires access to a folder on a disk); - ideally, does not require a service to be running in the background, to avoid to have to ask users to run yet another service to use AiiDA; - and addresses a number of performance issues mentioned above and discussed more in detail below. @@ -120,6 +120,8 @@ the different requirements, and represent what can be found in the current imple writers, the object store must guarantee that the central place is kept updated. - Packing can be triggered by the user periodically, whenever the user wants. + Here, and in the following, packing means bundling loose objects in a few + "pack" files, possibly (optionally) compressing the objects. It should be ideally possible to pack while the object store is in use, without the need to stop its use (which would in turn require to stop the use of AiiDA and the deamons during these operations). @@ -265,6 +267,10 @@ the different requirements, and represent what can be found in the current imple objects inside it, or removing deleted objects) this does not prevent efficient rsync transfer (this has been tested in the implementation). +- Since it works on a filesystem backend, this would allow to use also other + tools, e.g. [`rclone`](https://rclone.org), to move the data to some different + backend (a "real" object store, Google Drive, or anything else). + - Appending content to a single file does not prevent the Linux disk cache to work efficiently. Indeed, the caches are per blocks/pages in linux, not per file. Concatenating to files does not impact performance on cache efficiency. What is costly is opening a file as the filesystem @@ -377,7 +383,11 @@ However, this poses a few of potential problem: ### Cons * Extreme care is needed to convince ourselves that there are no - bugs and no risk of corrupting or losing the users' data. + bugs and no risk of corrupting or losing the users' data. This is clearly + non trivial and requires a lot of work. Note, however, that if packing is + not performed, the performance will be the same as the one currently of AiiDA, + that stores essentially only loose objects. Risks of data corruption need + to be carefully assessed mostly while packing. * Object metadata must be tracked by the caller in some other database. Similarly, deduplication via SHA hashes is not implemented and need to be implemented by the caller. From 4dae9b328aae002e6b6954fa87d57d726577475a Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Fri, 10 Dec 2021 15:06:28 +0100 Subject: [PATCH 5/8] Updating the AEP with the current state (of disk-objectstore at version 0.6.0) --- .../readme.md | 166 ++++++------------ 1 file changed, 57 insertions(+), 109 deletions(-) diff --git a/006_efficient_object_store_for_repository/readme.md b/006_efficient_object_store_for_repository/readme.md index 0d9257d..dba4774 100644 --- a/006_efficient_object_store_for_repository/readme.md +++ b/006_efficient_object_store_for_repository/readme.md @@ -1,22 +1,22 @@ # Efficient object store for the AiiDA repository -| AEP number | 003 | +| AEP number | 006 | |------------|--------------------------------------------------------------| | Title | Efficient object store for the AiiDA repository | | Authors | [Giovanni Pizzi](mailto:giovanni.pizzi@epfl.ch) (giovannipizzi)| -| Champions | [Giovanni Pizzi](mailto:giovanni.pizzi@epfl.ch) (giovannipizzi), [Francisco Ramirez](mailto:francisco.ramirez@epfl.ch) (ramirezfranciscof), [Sebastiaan P. 
Huber](mailto:sebastiaan.huber@epfl.ch) (sphuber) | +| Champions | [Giovanni Pizzi](mailto:giovanni.pizzi@epfl.ch) (giovannipizzi), [Francisco Ramirez](mailto:francisco.ramirez@epfl.ch) (ramirezfranciscof), [Sebastiaan P. Huber](mailto:sebastiaan.huber@epfl.ch) (sphuber) [Chris J. Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) | | Type | S - Standard Track AEP | | Created | 17-Apr-2020 | -| Status | submitted | +| Status | implemented | ## Background -AiiDA writes the "content" of each node in two places: attributes in the database, and files +AiiDA 0.x and 1.x writes the "content" of each node in two places: attributes in the database, and files (that do not need fast query) in a disk repository. These files include for instance raw inputs and outputs of a job calculation, but also other binary or textual information best stored directly as a file (some notable examples: pseudopotential files, numpy arrays, crystal structures in CIF format). -Currently, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name +In AiiDA 0.x and 1.x, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name is based on the node UUID with two levels of sharding (that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). @@ -50,18 +50,22 @@ We emphasize that having many inodes (files or folders) is a big problem. In par ## Proposed Enhancement -The goal of this project is to have a very efficient implementation of an "object store" (or, more correctly, a key-value store) that: +The goal of this proposal is to have a very efficient implementation of an "object store" (or, more correctly, a key-value store) that: - works directly on a disk folder (i.e. only requires access to a folder on a disk); - ideally, does not require a service to be running in the background, to avoid to have to ask users to run yet another service to use AiiDA; - and addresses a number of performance issues mentioned above and discussed more in detail below. -**NOTE**: This AEP does not address the actual implementation of the object store in AiiDA, but -rather the implementation of an object store to solve a number of performance issues. +**NOTE**: This AEP does not address the actual implementation of the object store within AiiDA, but +rather the implementation of an object-store-like service as an independent package, to solve a number of performance issues. This is now implemented as part of the [disk-objecstore](https://github.com/aiidateam/disk-objectstore) package that will be used by AiiDA from version 2.0. + The description of the integration with AiiDA will be discussed in a different AEP (see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written before this AEP). ## Detailed Explanation + +**NOTE**: This document discusses the reasoning behind the implementation of the `disk-objectstore`. The implementation details reflect what is in the libary as of version 0.6.0 (as of late 2021). It should be *not* considered as a documentation of the `disk-objectstore` package (as the implementation choices might be adapted in the future), but rather as a reference of the reason for the introduction of the package, of the design decisions, and of why they were made. 
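
For orientation, a minimal usage example of the resulting package follows; the method names reflect the `disk-objectstore` API as documented around version 0.6, but the exact signatures and defaults should be checked against the package documentation rather than taken from this sketch.

```python
from disk_objectstore import Container

container = Container("/some/path/on/disk")
container.init_container()  # creates the folder structure; needed only once

# Normal operation: each object is written as a loose object. The returned key
# is the content hash, so identical content is deduplicated transparently.
hashkey = container.add_object(b"some file content")
assert container.get_object_content(hashkey) == b"some file content"

# Periodic maintenance: bundle loose objects into packs (optionally compressed).
container.pack_all_loose(compress=True)
assert container.get_object_content(hashkey) == b"some file content"
```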
+ The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, and then allows to retrieve it with that key. @@ -82,8 +86,8 @@ The core idea behind the object store implementation is that, instead of writing are packed into a few big files, and an index of some sort keeps track of where each object is in the big file. This is for instance what also git does internally. -However, one cannot write directly into a pack file: this would be inefficient, especially because of a key requirement: -multiple Unix processes must be able to write efficiently and *concurrently*, i.e. at the same time, without data +However, one cannot write directly into a pack file: this would be inefficient and also not robust, especially because of a key requirement: +multiple Unix processes (AiiDA verdi shells, the various damon workers, ...) must be able to write efficiently and *concurrently*, i.e. at the same time, without data corruption (for instance, the AiiDA daemon is composed of multiple concurrent workers accessing the repository). Therefore, as also `git` does, we introduce the concept of `loose` objects (where each object is stored as a different @@ -91,13 +95,13 @@ file) and of `packed` objects when these are written as part of a bigger pack fi The key concept is that we want maximum performance when writing a new object. Therefore, each new object is written, by default, in loose format. -Periodically, a packing operation can be executed, taking loose objects and moving them into packs. +Periodically, a packing operation can be executed (in a first version this will be triggered manually by the users and properly documented; in the future, we might think to automatic triggering this based on performance heuristics), moving loose objects into packs. Accessing an object should only require that the user of the library provides its key, and the concept of packing should be hidden to the user (at least when retrieving, while it can be exposed for maintenance operations like repacking). #### Efficiency decisions -Here are some possible decisions. These are based on a compromise between +Here follows a discussion of some of the decisions that were made in the implementation. These are based on a compromise between the different requirements, and represent what can be found in the current implementation in the [disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) package. @@ -117,25 +121,22 @@ the different requirements, and represent what can be found in the current imple In this way, we can rely simply on the fact that an object present in the folder of loose objects signifies that the object exists, without needing a centralised place to keep track of them. Having the latter would be a potential performance bottleneck, as if there are many concurrent - writers, the object store must guarantee that the central place is kept updated. + writers, the object store must guarantee that the central place is kept updated (and e.g. using a SQLite database for this - as it's used for the packs - is not a good solution because only one writer at a time can act on a SQLite database, and all others have to wait and risk to timeout). - Packing can be triggered by the user periodically, whenever the user wants. 
Here, and in the following, packing means bundling loose objects in a few "pack" files, possibly (optionally) compressing the objects. - It should be ideally possible to pack while the object store + A key requirement is that it should be possible to pack while the object store is in use, without the need to stop its use (which would in turn require to stop the use of AiiDA and the deamons during these operations). This is possible in the current implementation, but might temporarily impact read performance while repacking (which is probably acceptable). Instead, it is *not* required that packing can be performed in parallel by multiple processes (on the contrary, the current implementation actually tries to prevent multiple processes trying to perform - write operations on packs concurrently). + write operations on packs at the same time: i.e., only a single packer process should perform the operation at any given time). -- In practice, this operation takes all loose objects and puts them in a controllable number - of packs. The name of the packs is given by the first few letters of the UUID - (by default: 2, so 256 packs in total; configurable). A value of 2 is a good balance - between the size of each pack (on average, ~4GB/pack for a 1TB repository) and - the number of packs (having many packs means that, even when performing bulk access, - many different files need to be open, which slows down performance). +- In practice, this packing operation takes all loose objects and puts them in a controllable number + of packs. The name of the packs is given by an integer. A new pack is created when all previous ones are "full", where full is defined when the pack size goes beyond a threshold (by default 4GB/pack. + This size is a good compromise: it's similar to a "movie" file, so not too big to deal with (can fit e.g. in a disk, people are used to deal with files of a few GBs) so it's not too big, but it's big enough so that even for TB-large repositories, the number of pakcs is of the order of a few tens, and therefore this solves the issue of having millions of files. - Pack files are just concatenation of bytes of the packed objects. Any new object is appended to the pack (thanks to the efficiency of opening a file for appending). @@ -143,20 +144,19 @@ the different requirements, and represent what can be found in the current imple database for the whole set of objects, as we describe below. - Packed objects can optionally be compressed. Note that compression is on a - per-object level. The fact an object is compressed is stored in the index. When + per-object level. The information on whether an object is compressed or not is stored in the index. When the users ask for an object, they always get back the uncompressed version (so they don't have to worry if objects are compressed or not when retrieving them). - This allows much greater flexibility, and avoid e.g. to recompress files that are already compressed. - One could also think to clever logic ot heuristics to try to compress a file, but then store it + This allows much greater flexibility, and avoid e.g. to decide to avoid to recompress files that are already compressed or where compression would give little to no benefit. + In the future, one could also think to clever logic or heuristics to try to compress a file, but then store it uncompressed if it turns out that the compression ratio is not worth the time needed to further uncompress it later. 
At the moment, one compression will be chosen and used by default (currently zlib, but in issues it has been suggested to use more modern formats like `xz`, or even better [snappy](https://github.com/google/snappy) that is very fast and designed for purposes like this). - In the future, if we want, it would be easy to make the compression library - an option, and store this in the JSON that contains the settings of the - container. + The compression library is already an option, and which one to use is stored in the JSON file that contains the settings of the + object-store container. - A note on compression: the user can always compress objects first, and then store a compressed version of them and take care of remembering if an object @@ -186,8 +186,7 @@ the different requirements, and represent what can be found in the current imple compared to ~44.5s when using 100'000 independent calls (still probably acceptable). Moreover, even getting, in 10 bulk calls, 10 random chunks of the objects (eventually covering the whole set of 100'000 objects) only takes ~3.4s. This should demonstrate - that exporting a subset of the graph should be efficient (and the object store format - could be used also inside the export file). Also, this should be compared to minutes to hours + that exporting a subset of the graph should be efficient. Also, this should be compared to minutes to hours when storing each object as individual files. - All operations internally (storing to a loose object, storing to a pack, reading @@ -200,42 +199,29 @@ the different requirements, and represent what can be found in the current imple #### Further design choices -- Each given object will get a random UUID as a key (its generation cost is negligible, about - 4 microseconds per UUID). - It's up to the caller to track this into a filename or a folder structure. - The UUID is generated by the implementation and cannot be passed from the outside. - This guarantees random distribution of objects in packs, and avoids to have to - check for objects already existing (that can be very expensive). +- The key of an object will be its hash. For each container, one can decide which hash algorithm to use; the default one is `sha256` that offers a good compromise between speed and avoiding risk of collisions. Once an object is stored, it's the responsibility of the `disk-objectstore` library to return the hash of the object that was just stored. + +- Using a hash means that we automatically get deduplication of content: if an object is asked to be written, once the stream is received, if the library detects that the object is already present, it will still return the hash key but not store it twice. So, from the point of view of the end application (AiiDA), it does not need to know that deduplication is performed: it just has to send a sequence of bytes, and store the corresponding hash key returned by the library + +- The hashing library can be decided for each container and is configurable; however, for performance reasons, in AiiDA it will be better to decide and stick to one only algorithm: this will allow to compare e.g. two different repositories (e.g. when sharing data and/or syncing) and establish if data already exists by just comparing the hashes. If different hash algorithms are instead used by the two containers, one needs to do a full data transfer of the whole container, to discover if new data needs to be transfered or not. - Pack naming and strategy is not determined by the user. 
Anyway it would be difficult to provide easy options to the user to customize the behavior, while implementing - a packing strategy that is efficient. Moreover, with the current packing strategy, - it is immediate to know in which pack to check without having to keep also an index - of the packs (this, however, would be possible to implement in case we want to extend the behavior, - since anyway we have an index file). But at the moment it does not seem necessary. Possible future changes of the internal packing format should not - affect the users of the library, since users only ask to get an object by UUID, + a packing strategy that is efficient. Which object is in which pack is tracked in the SQLite database. + But at the moment it does not seem necessary. Possible future changes of the internal packing format should not + affect the users of the library, since users only ask to get an object by hash key, and in general they do not need to know if they are getting a loose object, a packed object, from which pack, ... - For each object, the SQLite database contains the following fields, that can be considered - to be the "metadata" of the packed object: its key (`uuid`), the `offset` (starting + to be the "metadata" of the packed object: its key (`hashkey`), the `offset` (starting position of the bytestream in the pack file), the `length` (number of bytes to read), a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, and the `size` of the actual data (equal to `length` if `compressed` is false, otherwise the size of the uncompressed stream, useful for statistics for instance, or to inform the reader beforehand of how much data will be returned, before it starts reading, so the reader can decide to store in RAM the whole object or to process it - in a streaming way). - -- A single index file is used. Having one pack index per file, while reducing a bit - the size of the index (one could skip storing the first part of the UUID, determined - by the pack naming) turns out not to be very effective. Either one would keep all - indexes open (but then one quickly hits the maximum number of open files, that e.g. - on Mac OS is of the order of ~256), or open the index, at every request, that risks to - be quite inefficient (not only to open, but also to load the DB, perform the query, - return the results, and close again the file). Also for bulk requests, anyway, this - would prevent making very few DB requests (unless you keep all files open, that - again is not feasible). + in a streaming way). In addition, it tracks the number of the pack in which the object is stored. - We decided to use SQLite because of the following reasons: - it's a database, so it's efficient in searching for the metadata @@ -245,12 +231,12 @@ the different requirements, and represent what can be found in the current imple useful to allow to continue normal operations by many Unix processes during repacking; - we access it via the SQLAlchemy library that anyway is already a dependency of AiiDA, and is actually the only dependency of the - current object-store implementation. - - it's quite widespread and so the libary to read and write should be reliable. + current object-store implementation; + - it's quite widespread and so the libary to read and write should be reliable; in addition, SQLite has a clear [long-term support planning](https://www.sqlite.org/lts.html). 
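
As an illustration of the metadata just listed, this is roughly how the index table could be declared with SQLAlchemy; the column names mirror the fields described above, but the exact model (table name, types, indices) is an implementation detail of the package and may differ from this sketch.

```python
from sqlalchemy import Boolean, Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class PackedObject(Base):
    """One row per packed object: the 'metadata' discussed above."""

    __tablename__ = "db_object"

    id = Column(Integer, primary_key=True)
    hashkey = Column(String, unique=True, index=True, nullable=False)  # the object key
    compressed = Column(Boolean, nullable=False)  # whether the stored byte stream is compressed
    size = Column(Integer, nullable=False)        # uncompressed size of the object
    offset = Column(Integer, nullable=False)      # where the byte stream starts in the pack file
    length = Column(Integer, nullable=False)      # number of bytes actually stored in the pack
    pack_id = Column(Integer, nullable=False)     # which integer-named pack contains the object


# The whole index lives in a single SQLite file inside the container, e.g.:
engine = create_engine("sqlite:///packs.idx")
Base.metadata.create_all(engine)
```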
- Deletion can just occur efficiently as either a deletion of the loose object, or a removal from the index file (if the object is already packed). Later repacking of the packs can be used to recover - the disk space still occupied in the pack files. + the disk space still occupied in the pack files. It is hard to find a better strategy that does not require manual repacking but gives all other guarantees especially for fast live operations (as a reference, also essentially all databases do the same and have a "vacuum" operation that is in a sense equivalent to the concept of repacking here). - The object store does not need to provide functionality to modify an object. In AiiDA, files of nodes are typically only added, and once @@ -263,7 +249,7 @@ the different requirements, and represent what can be found in the current imple - The current packing format is `rsync`-friendly (that is one of the original requirements). `rsync` has a clever rolling algorithm that can divides each file in blocks and detects if the same block is already in the destination file, even at a different position. - Therefore, if content is appended to a pac file, or even if a pack is "repacked" (e.g. reordering + Therefore, if content is appended to a pack file, or even if a pack is "repacked" (e.g. reordering objects inside it, or removing deleted objects) this does not prevent efficient rsync transfer (this has been tested in the implementation). @@ -273,7 +259,7 @@ the different requirements, and represent what can be found in the current imple - Appending content to a single file does not prevent the Linux disk cache to work efficiently. Indeed, the caches are per blocks/pages in linux, not per file. - Concatenating to files does not impact performance on cache efficiency. What is costly is opening a file as the filesystem + Concatenating to files does not impact performance on cache efficiency. What is costly is opening a file, as the filesystem has to provide some guarantees e.g. on concurrent access. As a note, seeking a file to a given position is what one typically does when watching a video and jumping to a different section. @@ -281,19 +267,23 @@ the different requirements, and represent what can be found in the current imple - Packing in general, at this stage, is left to the user. We can decide (at the object-store level, or probably better at the AiiDA level) to suggest the user to repack, or to trigger the repacking automatically. This can be a feature introduced at a second time. For instance, the first version we roll out could just suggest - to repack periodically in the docs to repack. - This could be a good approach, also to bind the repacking with the backing up (at the moment, + to repack periodically in the docs. + This could be a good approach, also to bind the repacking with the backups (at the moment, probably backups need to be executed using appropriate scripts to backup the DB index and the repository in the "right order", and possibly using SQLite functions to get a dump). - As a note, even if repacking is never done, the situation is anyway as the current one in AiiDA, and actually - a bit better because getting the list of files for a node without files wouldn't need anymore to access the disk, - and similarly there wouldn't be anymore empty folders created for nodes without files. 
+ +* As a note: even if repacking is never done by the user, the situation is anyway improved with respect to the current one in AiiDA: + * an index of files associated with an AiiDA node will now be stored in the AiiDA DB, so getting the list of files associated to a node without content will not need anymore to access the disk; + + * there wouldn't be anymore empty folders created for nodes without files; + + * automatic deduplication of the data is now done transparently. In a second phase, we can print suggestions, e.g. when restarting the daemon, that suggests to repack, for instance if the number of loose objects is too large. - We can also provide `verdi` commands for this. + We will also provide `verdi` commands to facilitate the user in these maintenance operations. - Finally, if we are confident that this approach works fine, we can also automate the repacking. We need to be careful + Finally, in the future if we are confident that this approach works fine, we can also automate the repacking. We need to be careful that two different processes don't start packing at the same time, and that the user is aware that packing will be triggered, that it might take some time, and that the packing process should not be killed (this might be inconvenient, and this is why I would think twice before implementing an automatic repacking). @@ -325,61 +315,21 @@ In particular: - One option we didn't investigate is the mongodb object store. Note that this would require running a server though (but could be acceptable if for other reasons it becomes advantageous). -Finally, as a note, we stress that an efficient implementation in about 1'000 lines of code has been -already implemented, so the complexity of writing an object store library is not huge (at the cost, however, -of having to stress test ourselves that the implementation is bug-free). - -### UUIDs vs SHA hashes - -One thing to discuss is whether to replace a random UUID with a strong hash of the content -(e.g. SHA1 as git does, or some stronger hash). -The clear advantage is that one would get "for free" the possibility to deduplicate identical -files. Moreover, computation of hashes even for large files is very fast (comparable to the generation of a UUID) -even for GB files. - -However, this poses a few of potential problem: - -- if one wants to work in streaming mode, the hash could be computed only at the *end*, - after the whole stream is processed. While this is still ok for loose objects, this is a problem - when writing directly to packs. One possible workaround could be to decide that objects are added - to random packs, and store the pack as an additional column in the index file. - -- Even worse, it is not possible to know if the object already exists before it has been completely received - (and therefore written somewhere, because it might not fit in memory). Then one would need to perform a search - if an object with the same hash exists, and possibly discard the file (that might actually have been already written to a pack). - Finally, one needs to do this in a way that works also for concurrent writes, and does not fail if two processes - write objects with the same SHA at the same time. - -- One could *partially* address the problem, for bulk writes, by asking the user to provide the hash beforehand. However, - I think it is a not a good idea; instead, it is better keep stronger guarantees at the library level: - otherwise one has to have logic to decide what to do if the hash provided by the user turns out to be wrong. 
- -- Instead, the managing of hashing could be done at a higher level, by the AiiDA repository implementation: anyway - the repository needs to keep track of the filename and relative path of each file (within the node repository), - and the corresponding object-store UUIDs. The implementation could then also compute and store the hash in some - efficient way, and then keep a table mapping SHA hashes to the UUIDs in the object store, and take care of the - deduplication. - This wouldn't add too much cost, and would keep a separation of concerns so that the object-store implementation can be simpler, - give higher guarantees, be more robust, and make it simpler to guarantee that data is not corrupted even for concurrent access. - ## Pros and Cons ### Pros * Is very efficient also with hundreds of thousands of objects (bulk reads can read all objects in a couple of seconds). * Is rsync-friendly, so one can suggest to use directly rsync for backups instead of custom scripts. -* An implementation exists, seems to be very efficient, and is +* A library (`disk-objectstore`) has already been implemented, seems to be very efficient, and is relatively short (and already has tests with 100% coverage, including concurrent writes, reads, and one repacking process, and checking on three platforms: linux, mac os, and windows). * It does not add any additional python depenency to AiiDA, and it does not require a service running. -* It can be used also for the internal format of AiiDA export files, - and this should be very efficient, even to later retrieve just - a subset of the files. * Implements compression in a transparent way. * It shouldn't be slower than the current AiiDA implementation in writing during normal operation, since objects are stored as loose objects. Actually, it might be faster for nodes without files, as no disk - access will be needed anymore. + access will be needed anymore, and automatically provides deduplication of files. ### Cons * Extreme care is needed to convince ourselves that there are no @@ -388,9 +338,7 @@ However, this poses a few of potential problem: not performed, the performance will be the same as the one currently of AiiDA, that stores essentially only loose objects. Risks of data corruption need to be carefully assessed mostly while packing. -* Object metadata must be tracked by the caller in some other database. - Similarly, deduplication via SHA hashes is not implemented and need to be - implemented by the caller. +* Object metadata must be tracked by the caller in some other database (e.g. AiiDA will have, for each node, a list of filenames and the corresponding hash key in the disk-objectstore). * It is not possible anymore to access directly the folder of a node by opening a bash shell and using `cd` to go the folder, e.g. to quickly check the content. 
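To illustrate the point above about metadata being tracked by the caller: the object store only maps hash keys to byte streams, while the caller keeps the mapping from filenames to hash keys. The sketch below assumes a store exposing `add_object(content) -> hashkey` and `get_object_content(hashkey) -> bytes` methods, in the spirit of what is described in this AEP; the exact method names of the final API may differ.

```python
def store_node_files(object_store, files):
    """Store the raw content of a node's files and return the {relative_path: hashkey}
    mapping that the caller (e.g. the AiiDA database) has to keep track of.

    ``object_store`` is assumed to expose ``add_object(content: bytes) -> str``,
    returning the hash key of the stored object.
    """
    return {path: object_store.add_object(content) for path, content in files.items()}


def read_node_file(object_store, file_index, path):
    """Retrieve the content of one file of a node, given the stored {path: hashkey} index.

    ``object_store`` is assumed to expose ``get_object_content(hashkey) -> bytes``.
    """
    return object_store.get_object_content(file_index[path])
```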
From c4a5563fb5f2eb906309b9cd1a6843f6adefe865 Mon Sep 17 00:00:00 2001 From: Sebastiaan Huber Date: Tue, 14 Dec 2021 10:34:29 +0100 Subject: [PATCH 6/8] Fix formatting - One sentence per line and one line per sentence - Consistent enumeration symbols - Escape special markdown characters --- .../readme.md | 384 +++++++----------- 1 file changed, 142 insertions(+), 242 deletions(-) diff --git a/006_efficient_object_store_for_repository/readme.md b/006_efficient_object_store_for_repository/readme.md index dba4774..7065b02 100644 --- a/006_efficient_object_store_for_repository/readme.md +++ b/006_efficient_object_store_for_repository/readme.md @@ -10,283 +10,202 @@ | Status | implemented | ## Background -AiiDA 0.x and 1.x writes the "content" of each node in two places: attributes in the database, and files -(that do not need fast query) in a disk repository. -These files include for instance raw inputs and outputs of a job calculation, but also other binary or -textual information best stored directly as a file (some notable examples: pseudopotential files, -numpy arrays, crystal structures in CIF format). - -In AiiDA 0.x and 1.x, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name -is based on the node UUID with two levels of sharding -(that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, -the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). -Files of a node are stored within the node repository folder, -possibly within a folder structure. - -While quite efficient when retrieving a single file -(keeping too many files or subfolders in the same folder is -inefficient), the current implementation suffers of a number of problems -when starting to have hundreds of thousands of nodes or more -(we already have databases with about ten million files). +AiiDA 0.x and 1.x writes the "content" of each node in two places: attributes in the database, and files (that do not need fast query) in a disk repository. +These files include for instance raw inputs and outputs of a job calculation, but also other binary or textual information best stored directly as a file (some notable examples: pseudopotential files, numpy arrays, crystal structures in CIF format). + +In AiiDA 0.x and 1.x, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name is based on the node UUID with two levels of sharding (that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). +Files of a node are stored within the node repository folder, possibly within a folder structure. + +While quite efficient when retrieving a single file (keeping too many files or subfolders in the same folder is inefficient), the current implementation suffers of a number of problems when starting to have hundreds of thousands of nodes or more (we already have databases with about ten million files). In particular: - there is no way to compress files unless the AiiDA plugin does it; - if a code generates hundreds of files, each of them is stored as an individual file; - there are many empty folders that are generated for each node, even when no file needs to be stored. -We emphasize that having many inodes (files or folders) is a big problem. In particular: -- a common filesystem (e.g. EXT4) has a maximum number of inodes that can be stored, and we already have - experience of AiiDA repositories that hit this limit. 
The disk can be reformatted to increase this limit, - but this is clearly not an option that can be easily suggested to users. -- When performing "bulk" operations on many files (two common cases are when exporting a portion of the DB, - or when performing a backup), accessing hundreds of thousands of files is extremely slow. Disk caches - (in Linux, for instance) make this process much faster, but this works only as long as the system has a lot - of RAM, and the files have been accessed recently. Note that this is not an issue of AiiDA. Even just running - `rsync` on a folder with hundreds of thousands of files (not recently accessed) can take minutes or even hours - just to check if the files need to be updated, while if the same content is in a single big file, the operation - would take much less (possibly even less than a second for a lot of small files). As a note, this is the reason - why at the moment we provide a backup script to perform incremental backups of the repository (checking the - timestamp in the AiiDA database and only transferring node repository folders of new nodes), but the goal would be to rely instead on standard tools like `rsync`. +We emphasize that having many inodes (files or folders) is a big problem. +In particular: +- a common filesystem (e.g. EXT4) has a maximum number of inodes that can be stored, and we already have experience of AiiDA repositories that hit this limit. + The disk can be reformatted to increase this limit, but this is clearly not an option that can be easily suggested to users. +- When performing "bulk" operations on many files (two common cases are when exporting a portion of the DB, or when performing a backup), accessing hundreds of thousands of files is extremely slow. + Disk caches (in Linux, for instance) make this process much faster, but this works only as long as the system has a lot of RAM, and the files have been accessed recently. + Note that this is not an issue of AiiDA. + Even just running `rsync` on a folder with hundreds of thousands of files (not recently accessed) can take minutes or even hours just to check if the files need to be updated, while if the same content is in a single big file, the operation would take much less (possibly even less than a second for a lot of small files). + As a note, this is the reason why at the moment we provide a backup script to perform incremental backups of the repository (checking the timestamp in the AiiDA database and only transferring node repository folders of new nodes), but the goal would be to rely instead on standard tools like `rsync`. ## Proposed Enhancement The goal of this proposal is to have a very efficient implementation of an "object store" (or, more correctly, a key-value store) that: - works directly on a disk folder (i.e. only requires access to a folder on a disk); -- ideally, does not require a service to be running in the background, - to avoid to have to ask users to run yet another service to use AiiDA; +- ideally, does not require a service to be running in the background, to avoid to have to ask users to run yet another service to use AiiDA; - and addresses a number of performance issues mentioned above and discussed more in detail below. -**NOTE**: This AEP does not address the actual implementation of the object store within AiiDA, but -rather the implementation of an object-store-like service as an independent package, to solve a number of performance issues. 
This is now implemented as part of the [disk-objecstore](https://github.com/aiidateam/disk-objectstore) package that will be used by AiiDA from version 2.0. +**NOTE**: This AEP does not address the actual implementation of the object store within AiiDA, but rather the implementation of an object-store-like service as an independent package, to solve a number of performance issues. +This is now implemented as part of the [disk-objecstore](https://github.com/aiidateam/disk-objectstore) package that will be used by AiiDA from version 2.0. -The description of the integration with AiiDA will be discussed in a different AEP -(see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written before this AEP). +The description of the integration with AiiDA will be discussed in a different AEP (see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written beforet his AEP). ## Detailed Explanation -**NOTE**: This document discusses the reasoning behind the implementation of the `disk-objectstore`. The implementation details reflect what is in the libary as of version 0.6.0 (as of late 2021). It should be *not* considered as a documentation of the `disk-objectstore` package (as the implementation choices might be adapted in the future), but rather as a reference of the reason for the introduction of the package, of the design decisions, and of why they were made. +**NOTE**: This document discusses the reasoning behind the implementation of the `disk-objectstore`. +The implementation details reflect what is in the libary as of version 0.6.0 (as of late 2021). +It should be *not* considered as a documentation of the `disk-objectstore` package (as the implementation choices might be adapted in the future), but rather as a reference of the reason for the introduction of the package, of the design decisions, and of why they were made. -The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, -given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, -and then allows to retrieve it with that key. +The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, and then allows to retrieve it with that key. -The libary also supports efficient bulk write and read operations -to cover the cases of bulk operations in AiiDA (exporting, importing, -backups). +The libary also supports efficient bulk write and read operations to cover the cases of bulk operations in AiiDA (exporting, importing, backups). Here we describe some design criteria for such an object store. -We also provide an implementation that follows these criteria in the -[disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) -repository, where efficiency has also been benchmarked. +We also provide an implementation that follows these criteria in the [disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) repository, where efficiency has also been benchmarked. ### Performance guidelines and requirements #### Packing and loose objects -The core idea behind the object store implementation is that, instead of writing a lot of small files, these -are packed into a few big files, and an index of some sort keeps track of where each object is in the big file. 
+The core idea behind the object store implementation is that, instead of writing a lot of small files, these are packed into a few big files, and an index of some sort keeps track of where each object is in the big file. This is for instance what also git does internally. -However, one cannot write directly into a pack file: this would be inefficient and also not robust, especially because of a key requirement: -multiple Unix processes (AiiDA verdi shells, the various damon workers, ...) must be able to write efficiently and *concurrently*, i.e. at the same time, without data -corruption (for instance, the AiiDA daemon is composed of multiple +However, one cannot write directly into a pack file: this would be inefficient and also not robust, especially because of a key requirement: multiple Unix processes (AiiDA verdi shells, the various damon workers, ...) must be able to write efficiently and *concurrently*, i.e. at the same time, without data corruption (for instance, the AiiDA daemon is composed of multiple concurrent workers accessing the repository). -Therefore, as also `git` does, we introduce the concept of `loose` objects (where each object is stored as a different -file) and of `packed` objects when these are written as part of a bigger pack file. +Therefore, as also `git` does, we introduce the concept of `loose` objects (where each object is stored as a different file) and of `packed` objects when these are written as part of a bigger pack file. The key concept is that we want maximum performance when writing a new object. Therefore, each new object is written, by default, in loose format. Periodically, a packing operation can be executed (in a first version this will be triggered manually by the users and properly documented; in the future, we might think to automatic triggering this based on performance heuristics), moving loose objects into packs. -Accessing an object should only require that the user of the library provides its key, and the concept of packing should -be hidden to the user (at least when retrieving, while it can be exposed for maintenance operations like repacking). +Accessing an object should only require that the user of the library provides its key, and the concept of packing should be hidden to the user (at least when retrieving, while it can be exposed for maintenance operations like repacking). #### Efficiency decisions -Here follows a discussion of some of the decisions that were made in the implementation. These are based on a compromise between -the different requirements, and represent what can be found in the current implementation in the -[disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) package. +Here follows a discussion of some of the decisions that were made in the implementation. +These are based on a compromise between the different requirements, and represent what can be found in the current implementation in the [disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) package. -- As discussed above, objects are written by default as loose objects, with one file per object. - They are also stored uncompressed. This gives maximum performance when writing a file (e.g. while storing - a new node). Moreover, it ensures that many writers can write at the same time without data corruption. +- As discussed above, objects are written by default as loose objects, with one file per object. + They are also stored uncompressed. This gives maximum performance when writing a file (e.g. while storing a new node). 
+ Moreover, it ensures that many writers can write at the same time without data corruption. - Loose objects are stored with a one-level sharding format: `4a/f3dd55a1fd44ecb874b00e19ec5adf`. - Current experience (with AiiDA) shows that it is actually not so good to use two - levels of nesting, that was employed to avoid to have too many files in the same folder. And anyway the core idea of this implementationis that when there are too many loose objects, the user will pack them. - The number of characters in the first part can be chosen in the library, but a good compromise - after testing is 2 (the default, and also the value used internally by git). + Current experience (with AiiDA) shows that it is actually not so good to use two levels of nesting, that was employed to avoid to have too many files in the same folder. + And anyway the core idea of this implementationis that when there are too many loose objects, the user will pack them. + The number of characters in the first part can be chosen in the library, but a good compromise after testing is 2 (the default, and also the value used internally by git). -- Loose objects are actually written first to a sandbox folder (in the same filesystem), - and then moved in place with the correct UUID only when the file is closed, with an atomic operation. +- Loose objects are actually written first to a sandbox folder (in the same filesystem), and then moved in place with the correct UUID only when the file is closed, with an atomic operation. This should prevent having leftover objects if the process dies. In this way, we can rely simply on the fact that an object present in the folder of loose objects signifies that the object exists, without needing a centralised place to keep track of them. - Having the latter would be a potential performance bottleneck, as if there are many concurrent - writers, the object store must guarantee that the central place is kept updated (and e.g. using a SQLite database for this - as it's used for the packs - is not a good solution because only one writer at a time can act on a SQLite database, and all others have to wait and risk to timeout). + Having the latter would be a potential performance bottleneck, as if there are many concurrent writers, the object store must guarantee that the central place is kept updated (and e.g. using a SQLite database for this - as it's used for the packs - is not a good solution because only one writer at a time can act on a SQLite database, and all others have to wait and risk to timeout). - Packing can be triggered by the user periodically, whenever the user wants. - Here, and in the following, packing means bundling loose objects in a few - "pack" files, possibly (optionally) compressing the objects. - A key requirement is that it should be possible to pack while the object store - is in use, without the need to stop its use (which would in turn - require to stop the use of AiiDA and the deamons during these operations). + Here, and in the following, packing means bundling loose objects in a few "pack" files, possibly (optionally) compressing the objects. + A key requirement is that it should be possible to pack while the object store is in use, without the need to stop its use (which would in turn require to stop the use of AiiDA and the deamons during these operations). This is possible in the current implementation, but might temporarily impact read performance while repacking (which is probably acceptable). 
- Instead, it is *not* required that packing can be performed in parallel by multiple processes - (on the contrary, the current implementation actually tries to prevent multiple processes trying to perform - write operations on packs at the same time: i.e., only a single packer process should perform the operation at any given time). + Instead, it is *not* required that packing can be performed in parallel by multiple processes (on the contrary, the current implementation actually tries to prevent multiple processes trying to perform write operations on packs at the same time: i.e., only a single packer process should perform the operation at any given time). -- In practice, this packing operation takes all loose objects and puts them in a controllable number - of packs. The name of the packs is given by an integer. A new pack is created when all previous ones are "full", where full is defined when the pack size goes beyond a threshold (by default 4GB/pack. +- In practice, this packing operation takes all loose objects and puts them in a controllable number of packs. + The name of the packs is given by an integer. + A new pack is created when all previous ones are "full", where full is defined when the pack size goes beyond a threshold (by default 4GB/pack). This size is a good compromise: it's similar to a "movie" file, so not too big to deal with (can fit e.g. in a disk, people are used to deal with files of a few GBs) so it's not too big, but it's big enough so that even for TB-large repositories, the number of pakcs is of the order of a few tens, and therefore this solves the issue of having millions of files. -- Pack files are just concatenation of bytes of the packed objects. Any new object - is appended to the pack (thanks to the efficiency of opening a file for appending). - The information for the offset and length of each pack is kept in a single SQLite - database for the whole set of objects, as we describe below. +- Pack files are just concatenation of bytes of the packed objects. + Any new object is appended to the pack (thanks to the efficiency of opening a file for appending). + The information for the offset and length of each pack is kept in a single SQLite database for the whole set of objects, as we describe below. -- Packed objects can optionally be compressed. Note that compression is on a - per-object level. The information on whether an object is compressed or not is stored in the index. When - the users ask for an object, they always get back the uncompressed version (so - they don't have to worry if objects are compressed or not when retrieving them). +- Packed objects can optionally be compressed. + Note that compression is on a per-object level. + The information on whether an object is compressed or not is stored in the index. + When the users ask for an object, they always get back the uncompressed version (so they don't have to worry if objects are compressed or not when retrieving them). This allows much greater flexibility, and avoid e.g. to decide to avoid to recompress files that are already compressed or where compression would give little to no benefit. - In the future, one could also think to clever logic or heuristics to try to compress a file, but then store it - uncompressed if it turns out that the compression ratio is not worth the time - needed to further uncompress it later. 
- At the moment, one compression will be chosen and used by default (currently - zlib, but in issues it has been suggested to use more modern formats like - `xz`, or even better [snappy](https://github.com/google/snappy) that is very fast and designed for purposes - like this). - The compression library is already an option, and which one to use is stored in the JSON file that contains the settings of the - object-store container. - -- A note on compression: the user can always compress objects first, and then store - a compressed version of them and take care of remembering if an object - was stored compressed or not. However, the implementation of compression - directly in the object store, as described in the previous point, has the two advantages that compression is done only while packing, - so there is no performance hit while just storing a new object, - and that is completely transparent to the user (while packing, the user can - decide to compress data or not; then, when retrieving an object from of the - object store, there will not be any difference - except possibly speed in - retrieving the data - because the API to retrieve the objects will be the same, - irrespective of whether the object has been stored as compressed or not; - and data is always returned uncompressed). - -- API exists both to get and write a single object but also, *importantly*, to write directly - to pack files (this cannot be done by multiple processes at the same time, though), - and to read in bulk a given number of objects. - This is particularly convenient when using the object store for bulk import and - export, and very fast. Also, it could be useful when getting all files of one or more nodes. Actually, for export files, one could internally - use the same object-store implementation to store files in the export file. - - During normal operation, however, as discussed above, we expect the library user to write loose objects, - to be repacked periodically (e.g. once a week, or when needed). - - Some reference results for bulk operations in the current implementation: - Storing 100'000 small objects directly to the packs takes about 10s. - The time to retrieve all of them is ~2.2s when using a single bulk call, - compared to ~44.5s when using 100'000 independent calls (still probably acceptable). - Moreover, even getting, in 10 bulk calls, 10 random chunks of the objects (eventually - covering the whole set of 100'000 objects) only takes ~3.4s. This should demonstrate - that exporting a subset of the graph should be efficient. Also, this should be compared to minutes to hours - when storing each object as individual files. - -- All operations internally (storing to a loose object, storing to a pack, reading - from a loose object or from a pack, compression) should happen via streaming - (at least internally, and there should be a public facing API for this). - So, even when dealing with huge files, these never fill the RAM (e.g. when reading - or writing a multi-GB file, the memory usage has been tested to be capped at ~150MB). - Convenience methods are available, anyway, to get directly an object content, if - the user wants this for simplicty, and knows that the content fits in RAM. + In the future, one could also think to clever logic or heuristics to try to compress a file, but then store it uncompressed if it turns out that the compression ratio is not worth the time needed to further uncompress it later. 
+ At the moment, one compression will be chosen and used by default (currently zlib, but in issues it has been suggested to use more modern formats like `xz`, or even better [snappy](https://github.com/google/snappy) that is very fast and designed for purposes like this). + The compression library is already an option, and which one to use is stored in the JSON file that contains the settings of the object-store container. + +- A note on compression: the user can always compress objects first, and then store a compressed version of them and take care of remembering if an object was stored compressed or not. + However, the implementation of compression directly in the object store, as described in the previous point, has the two advantages that compression is done only while packing, so there is no performance hit while just storing a new object, and that is completely transparent to the user (while packing, the user can decide to compress data or not; then, when retrieving an object from of the object store, there will not be any difference - except possibly speed in retrieving the data - because the API to retrieve the objects will be the same, irrespective of whether the object has been stored as compressed or not; and data is always returned uncompressed). + +- API exists both to get and write a single object but also, *importantly*, to write directly to pack files (this cannot be done by multiple processes at the same time, though), and to read in bulk a given number of objects. + This is particularly convenient when using the object store for bulk import and export, and very fast. + Also, it could be useful when getting all files of one or more nodes. + Actually, for export files, one could internally use the same object-store implementation to store files in the export file. + + During normal operation, however, as discussed above, we expect the library user to write loose objects, to be repacked periodically (e.g. once a week, or when needed). + + Some reference results for bulk operations in the current implementation: Storing 100'000 small objects directly to the packs takes about 10s. + The time to retrieve all of them is \~2.2s when using a single bulk call, compared to \~44.5s when using 100'000 independent calls (still probably acceptable). + Moreover, even getting, in 10 bulk calls, 10 random chunks of the objects (eventually covering the whole set of 100'000 objects) only takes \~3.4s. + This should demonstrate that exporting a subset of the graph should be efficient. + Also, this should be compared to minutes to hours when storing each object as individual files. + +- All operations internally (storing to a loose object, storing to a pack, reading from a loose object or from a pack, compression) should happen via streaming (at least internally, and there should be a public facing API for this). + So, even when dealing with huge files, these never fill the RAM (e.g. when reading or writing a multi-GB file, the memory usage has been tested to be capped at \~150MB). + Convenience methods are available, anyway, to get directly an object content, if the user wants this for simplicty, and knows that the content fits in RAM. #### Further design choices -- The key of an object will be its hash. For each container, one can decide which hash algorithm to use; the default one is `sha256` that offers a good compromise between speed and avoiding risk of collisions. 
Once an object is stored, it's the responsibility of the `disk-objectstore` library to return the hash of the object that was just stored. +- The key of an object will be its hash. + For each container, one can decide which hash algorithm to use; the default one is `sha256` that offers a good compromise between speed and avoiding risk of collisions. + Once an object is stored, it's the responsibility of the `disk-objectstore` library to return the hash of the object that was just stored. -- Using a hash means that we automatically get deduplication of content: if an object is asked to be written, once the stream is received, if the library detects that the object is already present, it will still return the hash key but not store it twice. So, from the point of view of the end application (AiiDA), it does not need to know that deduplication is performed: it just has to send a sequence of bytes, and store the corresponding hash key returned by the library +- Using a hash means that we automatically get deduplication of content: if an object is asked to be written, once the stream is received, if the library detects that the object is already present, it will still return the hash key but not store it twice. + So, from the point of view of the end application (AiiDA), it does not need to know that deduplication is performed: it just has to send a sequence of bytes, and store the corresponding hash key returned by the library. -- The hashing library can be decided for each container and is configurable; however, for performance reasons, in AiiDA it will be better to decide and stick to one only algorithm: this will allow to compare e.g. two different repositories (e.g. when sharing data and/or syncing) and establish if data already exists by just comparing the hashes. If different hash algorithms are instead used by the two containers, one needs to do a full data transfer of the whole container, to discover if new data needs to be transfered or not. +- The hashing library can be decided for each container and is configurable; however, for performance reasons, in AiiDA it will be better to decide and stick to one only algorithm. + This will allow to compare e.g. two different repositories (e.g. when sharing data and/or syncing) and establish if data already exists by just comparing the hashes. + If different hash algorithms are instead used by the two containers, one needs to do a full data transfer of the whole container, to discover if new data needs to be transfered or not. -- Pack naming and strategy is not determined by the user. Anyway it would be difficult - to provide easy options to the user to customize the behavior, while implementing - a packing strategy that is efficient. Which object is in which pack is tracked in the SQLite database. - But at the moment it does not seem necessary. Possible future changes of the internal packing format should not - affect the users of the library, since users only ask to get an object by hash key, - and in general they do not need to know if they are getting a loose object, - a packed object, from which pack, ... +- Pack naming and strategy is not determined by the user. + Anyway it would be difficult to provide easy options to the user to customize the behavior, while implementing a packing strategy that is efficient. + Which object is in which pack is tracked in the SQLite database. + But at the moment it does not seem necessary. 
+ Possible future changes of the internal packing format should not affect the users of the library, since users only ask to get an object by hash key, and in general they do not need to know if they are getting a loose object, a packed object, from which pack, ... -- For each object, the SQLite database contains the following fields, that can be considered - to be the "metadata" of the packed object: its key (`hashkey`), the `offset` (starting - position of the bytestream in the pack file), the `length` (number of bytes to read), - a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, - and the `size` of the actual data (equal to `length` if `compressed` is false, - otherwise the size of the uncompressed stream, useful for statistics for instance, or - to inform the reader beforehand of how much data will be returned, before it starts - reading, so the reader can decide to store in RAM the whole object or to process it - in a streaming way). In addition, it tracks the number of the pack in which the object is stored. +- For each object, the SQLite database contains the following fields, that can be considered to be the "metadata" of the packed object: its key (`hashkey`), the `offset` (starting + position of the bytestream in the pack file), the `length` (number of bytes to read), a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, and the `size` of the actual data (equal to `length` if `compressed` is false, otherwise the size of the uncompressed stream, useful for statistics for instance, or to inform the reader beforehand of how much data will be returned, before it starts reading, so the reader can decide to store in RAM the whole object or to process it in a streaming way). + In addition, it tracks the number of the pack in which the object is stored. - We decided to use SQLite because of the following reasons: - - it's a database, so it's efficient in searching for the metadata - of a given object; + - it's a database, so it's efficient in searching for the metadata of a given object; - it does not require a running server; - - in [WAL mode](https://www.sqlite.org/wal.html), allows many concurrent readers and one writer, - useful to allow to continue normal operations by many Unix processes during repacking; - - we access it via the SQLAlchemy library that anyway is already - a dependency of AiiDA, and is actually the only dependency of the - current object-store implementation; + - in [WAL mode](https://www.sqlite.org/wal.html), allows many concurrent readers and one writer, useful to allow to continue normal operations by many Unix processes during repacking; + - we access it via the SQLAlchemy library that anyway is already a dependency of AiiDA, and is actually the only dependency of the current object-store implementation; - it's quite widespread and so the libary to read and write should be reliable; in addition, SQLite has a clear [long-term support planning](https://www.sqlite.org/lts.html). -- Deletion can just occur efficiently as either a deletion of the loose object, or - a removal from the index file (if the object is already packed). Later repacking of the packs can be used to recover - the disk space still occupied in the pack files. 
It is hard to find a better strategy that does not require manual repacking but gives all other guarantees especially for fast live operations (as a reference, also essentially all databases do the same and have a "vacuum" operation that is in a sense equivalent to the concept of repacking here). +- Deletion can just occur efficiently as either a deletion of the loose object, or a removal from the index file (if the object is already packed). + Later repacking of the packs can be used to recover the disk space still occupied in the pack files. + It is hard to find a better strategy that does not require manual repacking but gives all other guarantees especially for fast live operations (as a reference, also essentially all databases do the same and have a "vacuum" operation that is in a sense equivalent to the concept of repacking here). -- The object store does not need to provide functionality to modify - an object. In AiiDA, files of nodes are typically only added, and once - added they are immutable (and very rarely they can be deleted). +- The object store does not need to provide functionality to modify an object. + In AiiDA, files of nodes are typically only added, and once added they are immutable (and very rarely they can be deleted). - If modification is needed, this can be achieved by creation of a new - object and deletion of the old one, since this is an extremely - rare operation (actually it should never happen). + If modification is needed, this can be achieved by creation of a new object and deletion of the old one, since this is an extremely rare operation (actually it should never happen). -- The current packing format is `rsync`-friendly (that is one of the original requirements). - `rsync` has a clever rolling algorithm that can divides each file in blocks and - detects if the same block is already in the destination file, even at a different position. - Therefore, if content is appended to a pack file, or even if a pack is "repacked" (e.g. reordering - objects inside it, or removing deleted objects) this does not prevent efficient - rsync transfer (this has been tested in the implementation). +- The current packing format is `rsync`-friendly (that is one of the original requirements). + `rsync` has a clever rolling algorithm that can divides each file in blocks and detects if the same block is already in the destination file, even at a different position. + Therefore, if content is appended to a pack file, or even if a pack is "repacked" (e.g. reordering objects inside it, or removing deleted objects) this does not prevent efficient rsync transfer (this has been tested in the implementation). -- Since it works on a filesystem backend, this would allow to use also other - tools, e.g. [`rclone`](https://rclone.org), to move the data to some different - backend (a "real" object store, Google Drive, or anything else). +- Since it works on a filesystem backend, this would allow to use also other tools, e.g. [`rclone`](https://rclone.org), to move the data to some different backend (a "real" object store, Google Drive, or anything else). - Appending content to a single file does not prevent the Linux disk cache to work efficiently. Indeed, the caches are per blocks/pages in linux, not per file. - Concatenating to files does not impact performance on cache efficiency. What is costly is opening a file, as the filesystem - has to provide some guarantees e.g. on concurrent access. 
- As a note, seeking a file to a given position is what one typically does when watching a - video and jumping to a different section. - -- Packing in general, at this stage, is left to the user. We can decide (at the object-store level, or probably - better at the AiiDA level) to suggest the user to repack, or to trigger the repacking automatically. - This can be a feature introduced at a second time. For instance, the first version we roll out could just suggest - to repack periodically in the docs. - This could be a good approach, also to bind the repacking with the backups (at the moment, - probably backups need to be executed using appropriate scripts to backup the DB index and the repository - in the "right order", and possibly using SQLite functions to get a dump). - -* As a note: even if repacking is never done by the user, the situation is anyway improved with respect to the current one in AiiDA: - * an index of files associated with an AiiDA node will now be stored in the AiiDA DB, so getting the list of files associated to a node without content will not need anymore to access the disk; + Concatenating to files does not impact performance on cache efficiency. + What is costly is opening a file, as the filesystem has to provide some guarantees e.g. on concurrent access. + As a note, seeking a file to a given position is what one typically does when watching a video and jumping to a different section. + +- Packing in general, at this stage, is left to the user. + We can decide (at the object-store level, or probably better at the AiiDA level) to suggest the user to repack, or to trigger the repacking automatically. + This can be a feature introduced at a second time. + For instance, the first version we roll out could just suggest to repack periodically in the docs. + This could be a good approach, also to bind the repacking with the backups (at the moment, probably backups need to be executed using appropriate scripts to backup the DB index and the repository in the "right order", and possibly using SQLite functions to get a dump). + +- As a note: even if repacking is never done by the user, the situation is anyway improved with respect to the current one in AiiDA: + - an index of files associated with an AiiDA node will now be stored in the AiiDA DB, so getting the list of files associated to a node without content will not need anymore to access the disk; - * there wouldn't be anymore empty folders created for nodes without files; + - there wouldn't be anymore empty folders created for nodes without files; - * automatic deduplication of the data is now done transparently. + - automatic deduplication of the data is now done transparently. - In a second phase, we can print suggestions, e.g. when restarting the daemon, - that suggests to repack, for instance if the number of loose objects is too large. + In a second phase, we can print suggestions, e.g. when restarting the daemon, that suggests to repack, for instance if the number of loose objects is too large. We will also provide `verdi` commands to facilitate the user in these maintenance operations. - Finally, in the future if we are confident that this approach works fine, we can also automate the repacking. 
We need to be careful - that two different processes don't start packing at the same time, and that the user is aware that packing will be - triggered, that it might take some time, and that the packing process should not be killed - (this might be inconvenient, and this is why I would think twice before implementing an automatic repacking). + Finally, in the future if we are confident that this approach works fine, we can also automate the repacking. + We need to be careful that two different processes don't start packing at the same time, and that the user is aware that packing will be triggered, that it might take some time, and that the packing process should not be killed (this might be inconvenient, and this is why I would think twice before implementing an automatic repacking). ### Why a custom implementation of the library We have been investigating if existing codes could be used for the current purpose. @@ -294,55 +213,36 @@ We have been investigating if existing codes could be used for the current purpo However, in our preliminary analysis we couldn't find an existing robust software that was satisfying all criteria. In particular: -- existing object storage implementations (e.g. Open Stack Swift, or others that typically provide a - S3 or similar API) are not suitable since 1) they require a server running, and we would like to - avoid the complexity of asking users to run yet another server, and most importantly 2) they usually work - via a REST API that is extremely inefficient when retrieving a lot of small objects (latencies can - be even of tens of ms, that is clearly unacceptable if we want to retrieve millions of objects). - -- the closest tool we could find is the git object store, for which there exists also a pure-python - implementation ([dulwich](https://github.com/dulwich/dulwich)). We have been benchmarking it, but - the feeling is that it is designed to do something else, and adapting to our needs might not be worth it. - Some examples of issues we envision: managing re-packing (dulwich can only pack loose objects, but not repack existing packs); this can be done via git itself, but then we need to ensuring that when repacking, objects are not garbage-collected and deleted because they are not referenced within git; - it's not possible (apparently) to decide if we want to compress an object or not (e.g., in case we want maximum performance at the cost of disk space), but they are always compressed. +- existing object storage implementations (e.g. Open Stack Swift, or others that typically provide a S3 or similar API) are not suitable since 1) they require a server running, and we would like to avoid the complexity of asking users to run yet another server, and most importantly 2) they usually work via a REST API that is extremely inefficient when retrieving a lot of small objects (latencies can be even of tens of ms, that is clearly unacceptable if we want to retrieve millions of objects). + +- the closest tool we could find is the git object store, for which there exists also a pure-python implementation ([dulwich](https://github.com/dulwich/dulwich)). + We have been benchmarking it, but the feeling is that it is designed to do something else, and adapting to our needs might not be worth it. 
+ Some examples of issues we envision: managing re-packing (dulwich can only pack loose objects, but not repack existing packs); this can be done via git itself, but then we need to ensuring that when repacking, objects are not garbage-collected and deleted because they are not referenced within git. + It's not possible (apparently) to decide if we want to compress an object or not (e.g., in case we want maximum performance at the cost of disk space), but they are always compressed. Also we did not test concurrency access and packing of the git repository, which requires some stress test to assess if it works. -- One possible solution could be a format like HDF5. However we need a very efficient hash-table like access - (given a key, get the object), while HDF5 is probably designed to append data to the last dimension of - a multidimensional array. Variables exist, but (untested) probably they cannot scale well - at the scale of tens of millions. +- One possible solution could be a format like HDF5. + However we need a very efficient hash-table like access (given a key, get the object), while HDF5 is probably designed to append data to the last dimension of a multidimensional array. Variables exist, but (untested) probably they cannot scale well at the scale of tens of millions. -- One option we didn't investigate is the mongodb object store. Note that this would require running a server - though (but could be acceptable if for other reasons it becomes advantageous). +- One option we didn't investigate is the mongodb object store. + Note that this would require running a server though (but could be acceptable if for other reasons it becomes advantageous). ## Pros and Cons ### Pros * Is very efficient also with hundreds of thousands of objects (bulk reads can read all objects in a couple of seconds). * Is rsync-friendly, so one can suggest to use directly rsync for backups instead of custom scripts. -* A library (`disk-objectstore`) has already been implemented, seems to be very efficient, and is - relatively short (and already has tests with 100% coverage, - including concurrent writes, reads, and one repacking process, and checking on three platforms: linux, mac os, and windows). -* It does not add any additional python depenency to AiiDA, and it - does not require a service running. +* A library (`disk-objectstore`) has already been implemented, seems to be very efficient, and is relatively short (and already has tests with 100% coverage, including concurrent writes, reads, and one repacking process, and checking on three platforms: linux, mac os, and windows). +* It does not add any additional python depenency to AiiDA, and it does not require a service running. * Implements compression in a transparent way. -* It shouldn't be slower than the current AiiDA implementation in writing - during normal operation, since objects are stored as loose objects. - Actually, it might be faster for nodes without files, as no disk - access will be needed anymore, and automatically provides deduplication of files. +* It shouldn't be slower than the current AiiDA implementation in writing during normal operation, since objects are stored as loose objects. + Actually, it might be faster for nodes without files, as no disk access will be needed anymore, and automatically provides deduplication of files. ### Cons -* Extreme care is needed to convince ourselves that there are no - bugs and no risk of corrupting or losing the users' data. This is clearly - non trivial and requires a lot of work. 
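For reference, the intended workflow (write loose objects during normal operation, pack periodically, read transparently by hash key) could look as in the sketch below. It assumes the Python API of the `disk-objectstore` package; the method names used here (`init_container`, `add_object`, `get_object_content`, `pack_all_loose`, `get_objects_content`) are indicative and should be checked against the documentation of the installed version.

```python
from disk_objectstore import Container

# Sketch of the intended usage; method names may differ in a given release.
container = Container("/path/to/container")
container.init_container()  # prepares the folder layout, the config JSON and the SQLite index
                            # (only needed the first time a container is created)

# Normal operation: every new object is written as a loose object (fast, safe under
# concurrent writers); the returned key is the hash of the content.
hashkey = container.add_object(b"raw content of an input file")

# Reading is transparent: the caller only provides the key, and does not need to know
# whether the object is loose or packed, compressed or not.
assert container.get_object_content(hashkey) == b"raw content of an input file"

# Periodic maintenance: move loose objects into a few big pack files,
# optionally compressing them.
container.pack_all_loose(compress=True)

# Bulk read (e.g. during an export): much faster than one call per object.
contents = container.get_objects_content([hashkey])  # {hashkey: content}
```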
Note, however, that if packing is - not performed, the performance will be the same as the one currently of AiiDA, - that stores essentially only loose objects. Risks of data corruption need - to be carefully assessed mostly while packing. +* Extreme care is needed to convince ourselves that there are no bugs and no risk of corrupting or losing the users' data. + This is clearly non trivial and requires a lot of work. + Note, however, that if packing is not performed, the performance will be the same as the one currently of AiiDA, that stores essentially only loose objects. + Risks of data corruption need to be carefully assessed mostly while packing. * Object metadata must be tracked by the caller in some other database (e.g. AiiDA will have, for each node, a list of filenames and the corresponding hash key in the disk-objectstore). -* It is not possible anymore to access directly the folder of a node by opening - a bash shell and using `cd` to go the folder, - e.g. to quickly check the content. - However, we have `verdi node repo dump` - so direct access should not be needed anymore, and actually this - might be good to prevent that people corrupt by - mistake the repository. +* It is not possible anymore to access directly the folder of a node by opening a bash shell and using `cd` to go the folder, e.g. to quickly check the content. + However, we have `verdi node repo dump` so direct access should not be needed anymore, and actually this might be good to prevent that people corrupt by mistake the repository. From c6dcc023ce013b2bf9a8165d8b6e69f6c94e24e3 Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Wed, 15 Dec 2021 19:48:14 +0100 Subject: [PATCH 7/8] Apply suggestions from code review Co-authored-by: Sebastiaan Huber --- .../readme.md | 89 ++++++++++--------- 1 file changed, 45 insertions(+), 44 deletions(-) diff --git a/006_efficient_object_store_for_repository/readme.md b/006_efficient_object_store_for_repository/readme.md index 7065b02..1878ff0 100644 --- a/006_efficient_object_store_for_repository/readme.md +++ b/006_efficient_object_store_for_repository/readme.md @@ -10,13 +10,13 @@ | Status | implemented | ## Background -AiiDA 0.x and 1.x writes the "content" of each node in two places: attributes in the database, and files (that do not need fast query) in a disk repository. +AiiDA 0.x and 1.x write the "content" of each node in two places: attributes in the database, and files (that do not need fast query) in a file repository on the local file system. These files include for instance raw inputs and outputs of a job calculation, but also other binary or textual information best stored directly as a file (some notable examples: pseudopotential files, numpy arrays, crystal structures in CIF format). -In AiiDA 0.x and 1.x, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name is based on the node UUID with two levels of sharding (that is, if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`). +In AiiDA 0.x and 1.x, each of these files is directly stored in a folder structure, where each node "owns" a folder whose name is based on the node UUID with two levels of sharding; if the node UUID is `4af3dd55-a1fd-44ec-b874-b00e19ec5adf`, the folder will be `4a/f3/dd55-a1fd-44ec-b874-b00e19ec5adf`. Files of a node are stored within the node repository folder, possibly within a folder structure. 
-While quite efficient when retrieving a single file (keeping too many files or subfolders in the same folder is inefficient), the current implementation suffers of a number of problems when starting to have hundreds of thousands of nodes or more (we already have databases with about ten million files). +While quite efficient when retrieving a single file (keeping too many files or subfolders in the same folder is inefficient), the current implementation suffers from a number of problems when a large number of files are stored. In particular: - there is no way to compress files unless the AiiDA plugin does it; @@ -36,24 +36,24 @@ In particular: ## Proposed Enhancement The goal of this proposal is to have a very efficient implementation of an "object store" (or, more correctly, a key-value store) that: -- works directly on a disk folder (i.e. only requires access to a folder on a disk); +- works directly on a folder on the local file system (could optionally be a remote file system that is mounted locally); - ideally, does not require a service to be running in the background, to avoid having to ask users to run yet another service to use AiiDA; - and addresses a number of performance issues mentioned above and discussed in more detail below. **NOTE**: This AEP does not address the actual implementation of the object store within AiiDA, but rather the implementation of an object-store-like service as an independent package, to solve a number of performance issues. This is now implemented as part of the [disk-objectstore](https://github.com/aiidateam/disk-objectstore) package that will be used by AiiDA from version 2.0. -The description of the integration with AiiDA will be discussed in a different AEP (see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written beforet his AEP). +The description of the integration with AiiDA will be discussed in a different AEP (see [PR #7](https://github.com/aiidateam/AEP/pull/7) for some preliminary discussion, written before this AEP). ## Detailed Explanation **NOTE**: This document discusses the reasoning behind the implementation of the `disk-objectstore`. The implementation details reflect what is in the library as of version 0.6.0 (as of late 2021). -It should be *not* considered as a documentation of the `disk-objectstore` package (as the implementation choices might be adapted in the future), but rather as a reference of the reason for the introduction of the package, of the design decisions, and of why they were made. +It should *not* be considered as a documentation of the `disk-objectstore` package (as the implementation choices might be adapted in the future), but rather as a reference of the reason for the introduction of the package, of the design decisions, and of why they were made. -The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, and then allows to retrieve it with that key. +The goal of this AEP is to define some design decisions for a library (an "object store", to be used internally by AiiDA) that, given a stream of bytes, stores it as efficiently as possible somewhere, assigns a unique key (a "name") to it, and then allows it to be retrieved using that key. -The libary also supports efficient bulk write and read operations to cover the cases of bulk operations in AiiDA (exporting, importing, backups).
+The library also supports efficient bulk write and read operations to cover the cases of bulk operations in AiiDA (exporting, importing and backups). Here we describe some design criteria for such an object store. We also provide an implementation that follows these criteria in the [disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) repository, where efficiency has also been benchmarked. @@ -64,12 +64,11 @@ We also provide an implementation that follows these criteria in the [disk-objec The core idea behind the object store implementation is that, instead of writing a lot of small files, these are packed into a few big files, and an index of some sort keeps track of where each object is in the big file. This is for instance what also git does internally. -However, one cannot write directly into a pack file: this would be inefficient and also not robust, especially because of a key requirement: multiple Unix processes (AiiDA verdi shells, the various damon workers, ...) must be able to write efficiently and *concurrently*, i.e. at the same time, without data corruption (for instance, the AiiDA daemon is composed of multiple -concurrent workers accessing the repository). -Therefore, as also `git` does, we introduce the concept of `loose` objects (where each object is stored as a different file) and of `packed` objects when these are written as part of a bigger pack file. +However, one cannot write directly into a pack file: this would be inefficient and also not robust, especially because of a key requirement: multiple Unix processes (interactive shells, daemon workers, Jupyter notebooks, etc.) must be able to write efficiently and *concurrently* without data corruption. +Therefore, as also `git` does, we introduce the concept of "loose" objects (where each object is stored as a different file) and of "packed" objects, objects that are concatenated into a small number of pack files. The key concept is that we want maximum performance when writing a new object. -Therefore, each new object is written, by default, in loose format. +Therefore, each new object is written, by default, as a loose object. Periodically, a packing operation can be executed (in a first version this will be triggered manually by the users and properly documented; in the future, we might think of automatically triggering this based on performance heuristics), moving loose objects into packs. Accessing an object should only require that the user of the library provides its key, and the concept of packing should be hidden from the user (at least when retrieving, while it can be exposed for maintenance operations like repacking). @@ -79,11 +78,12 @@ Here follows a discussion of some of the decisions that were made in the impleme These are based on a compromise between the different requirements, and represent what can be found in the current implementation in the [disk-objectstore](https://github.com/giovannipizzi/disk-objectstore) package. - As discussed above, objects are written by default as loose objects, with one file per object. - They are also stored uncompressed. This gives maximum performance when writing a file (e.g. while storing a new node). + They are also stored uncompressed. + This gives maximum performance when writing a file (e.g. while storing a new node). Moreover, it ensures that many writers can write at the same time without data corruption. - Loose objects are stored with a one-level sharding format: `4a/f3dd55a1fd44ecb874b00e19ec5adf`.
- Current experience (with AiiDA) shows that it is actually not so good to use two levels of nesting, that was employed to avoid to have too many files in the same folder. + Current experience (with AiiDA) shows that it is actually not so good to use two levels of nesting, which was employed to avoid having too many files in the same folder. And anyway the core idea of this implementation is that when there are too many loose objects, the user will pack them. The number of characters in the first part can be chosen in the library, but a good compromise after testing is 2 (the default, and also the value used internally by git). @@ -93,8 +93,8 @@ These are based on a compromise between the different requirements, and represen Having the latter would be a potential performance bottleneck, as if there are many concurrent writers, the object store must guarantee that the central place is kept updated (and e.g. using a SQLite database for this - as it's used for the packs - is not a good solution because only one writer at a time can act on a SQLite database, and all others have to wait and risk timing out). -- Packing can be triggered by the user periodically, whenever the user wants. - Here, and in the following, packing means bundling loose objects in a few "pack" files, possibly (optionally) compressing the objects. +- Packing can be triggered by the user periodically, whenever they want. + Here, and in the following, packing means bundling loose objects in a few "pack" files, optionally compressing the objects. A key requirement is that it should be possible to pack while the object store is in use, without the need to stop its use (which would in turn require stopping the use of AiiDA and the daemons during these operations). This is possible in the current implementation, but might temporarily impact read performance while repacking (which is probably acceptable). Instead, it is *not* required that packing can be performed in parallel by multiple processes (on the contrary, the current implementation actually tries to prevent multiple processes trying to perform write operations on packs at the same time: i.e., only a single packer process should perform the operation at any given time). @@ -102,18 +102,19 @@ These are based on a compromise between the different requirements, and represen - In practice, this packing operation takes all loose objects and puts them in a controllable number of packs. The name of the packs is given by an integer. A new pack is created when all previous ones are "full", where full is defined when the pack size goes beyond a threshold (by default 4GB/pack). - This size is a good compromise: it's similar to a "movie" file, so not too big to deal with (can fit e.g. in a disk, people are used to deal with files of a few GBs) so it's not too big, but it's big enough so that even for TB-large repositories, the number of pakcs is of the order of a few tens, and therefore this solves the issue of having millions of files. + This size is a good compromise: it's similar to a "movie" file, so not too big to deal with (it can fit e.g. on a single disk, and people are used to dealing with files of a few GBs). + At the same time it is big enough so that even for TB-large repositories, the number of packs is of the order of a few tens, and therefore this solves the issue of having millions of files. -- Pack files are just concatenation of bytes of the packed objects. - Any new object is appended to the pack (thanks to the efficiency of opening a file for appending).
+- Pack files are just a concatenation of bytes of the packed objects. + Any new object is appended to the pack, making use of the efficiency of opening a file for appending. The information on the offset and length of each packed object is kept in a single SQLite database for the whole set of objects, as we describe below. - Packed objects can optionally be compressed. Note that compression is on a per-object level. The information on whether an object is compressed or not is stored in the index. When the users ask for an object, they always get back the uncompressed version (so they don't have to worry if objects are compressed or not when retrieving them). - This allows much greater flexibility, and avoid e.g. to decide to avoid to recompress files that are already compressed or where compression would give little to no benefit. - In the future, one could also think to clever logic or heuristics to try to compress a file, but then store it uncompressed if it turns out that the compression ratio is not worth the time needed to further uncompress it later. + This allows much greater flexibility: for example, one can avoid recompressing files that are already compressed or for which compression would give little to no benefit. + In the future, one could also think of clever logic or heuristics to try to compress a file, but then store it uncompressed if it turns out that the compression ratio is not worth the time needed to further uncompress it later. At the moment, one compression will be chosen and used by default (currently zlib, but in issues it has been suggested to use more modern formats like `xz`, or even better [snappy](https://github.com/google/snappy) that is very fast and designed for purposes like this). The compression library is already an option, and which one to use is stored in the JSON file that contains the settings of the object-store container. @@ -121,7 +122,7 @@ These are based on a compromise between the different requirements, and represen However, the implementation of compression directly in the object store, as described in the previous point, has the two advantages that compression is done only while packing, so there is no performance hit while just storing a new object, and that it is completely transparent to the user (while packing, the user can decide to compress data or not; then, when retrieving an object from the object store, there will not be any difference - except possibly speed in retrieving the data - because the API to retrieve the objects will be the same, irrespective of whether the object has been stored as compressed or not; and data is always returned uncompressed). - An API exists both to read and write a single object and also, *importantly*, to write directly to pack files (this cannot be done by multiple processes at the same time, though), and to read in bulk a given number of objects. - This is particularly convenient when using the object store for bulk import and export, and very fast. + This is particularly convenient when using the object store for bulk import and export, and it is very efficient. Also, it could be useful when getting all files of one or more nodes. Actually, for export files, one could internally use the same object-store implementation to store files in the export file.
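+
+As an illustration of how the single-object and bulk APIs fit together, here is a short usage sketch based on the `disk-objectstore` package (roughly as of version 0.6; the method names are taken from that package and might change in later versions):
+
+```python
+from disk_objectstore import Container
+
+# A container is just a folder on the local file system; no server is needed.
+container = Container("/tmp/test-container")
+container.init_container(clear=True)
+
+# Single-object API: write a loose object and read it back via its hash key.
+hashkey = container.add_object(b"content of one file")
+assert container.get_object_content(hashkey) == b"content of one file"
+
+# Bulk API: write many objects directly to a pack, then read them back in bulk.
+hashkeys = container.add_objects_to_pack([b"object %d" % i for i in range(1000)])
+contents = container.get_objects_content(hashkeys)  # dict: hash key -> bytes
+
+# Maintenance operation: move any remaining loose objects into packs.
+container.pack_all_loose()
+```
+
+The bulk calls are what makes export, import and backup efficient, since they avoid paying the per-file open/close cost for every single object.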
@@ -131,22 +132,22 @@ These are based on a compromise between the different requirements, and represen The time to retrieve all of them is \~2.2s when using a single bulk call, compared to \~44.5s when using 100'000 independent calls (still probably acceptable). Moreover, even getting, in 10 bulk calls, 10 random chunks of the objects (eventually covering the whole set of 100'000 objects) only takes \~3.4s. This should demonstrate that exporting a subset of the graph should be efficient. - Also, this should be compared to minutes to hours when storing each object as individual files. + Also, this should be compared to the minutes up to hours it will take to store all objects as individual files. - All operations internally (storing to a loose object, storing to a pack, reading from a loose object or from a pack, compression) should happen via streaming (at least internally, and there should be a public facing API for this). So, even when dealing with huge files, these never fill the RAM (e.g. when reading or writing a multi-GB file, the memory usage has been tested to be capped at \~150MB). - Convenience methods are available, anyway, to get directly an object content, if the user wants this for simplicty, and knows that the content fits in RAM. + Convenience methods are available, anyway, to directly get the entire content of an object, if the user wants this for simplicity, and knows that the content fits in RAM. #### Further design choices - The key of an object will be its hash. - For each container, one can decide which hash algorithm to use; the default one is `sha256` that offers a good compromise between speed and avoiding risk of collisions. + For each container, one can decide which hash algorithm to use; the default one is `sha256`, which offers a good compromise between speed and avoiding risk of collisions. Once an object is stored, it's the responsibility of the `disk-objectstore` library to return the hash of the object that was just stored. - Using a hash means that we automatically get deduplication of content: if an object is asked to be written, once the stream is received, if the library detects that the object is already present, it will still return the hash key but not store it twice. So, from the point of view of the end application (AiiDA), it does not need to know that deduplication is performed: it just has to send a sequence of bytes, and store the corresponding hash key returned by the library. -- The hashing library can be decided for each container and is configurable; however, for performance reasons, in AiiDA it will be better to decide and stick to one only algorithm. +- The hashing library can be decided for each container and is configurable; however, for performance reasons, in AiiDA it will be better to decide and stick to only one algorithm. This will allow to compare e.g. two different repositories (e.g. when sharing data and/or syncing) and establish if data already exists by just comparing the hashes. If different hash algorithms are instead used by the two containers, one needs to do a full data transfer of the whole container, to discover if new data needs to be transfered or not. @@ -154,10 +155,10 @@ These are based on a compromise between the different requirements, and represen Anyway it would be difficult to provide easy options to the user to customize the behavior, while implementing a packing strategy that is efficient. Which object is in which pack is tracked in the SQLite database. But at the moment it does not seem necessary. 
Possible future changes of the internal packing format should not affect the users of the library, since users only ask to get an object by hash key, and in general they do not need to know if they are getting a loose object, a packed object, from which pack, ... + Possible future changes of the internal packing format should not affect the users of the library, since users only ask to get an object by hash key, and in general they do not need to know if they are getting a loose object, a packed object, from which pack, etc. -- For each object, the SQLite database contains the following fields, that can be considered to be the "metadata" of the packed object: its key (`hashkey`), the `offset` (starting - position of the bytestream in the pack file), the `length` (number of bytes to read), a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, and the `size` of the actual data (equal to `length` if `compressed` is false, otherwise the size of the uncompressed stream, useful for statistics for instance, or to inform the reader beforehand of how much data will be returned, before it starts reading, so the reader can decide to store in RAM the whole object or to process it in a streaming way). +- For each object, the SQLite database contains the following fields, that can be considered to be the "metadata" of the packed object: its `hashkey`, the `offset` (starting position of the bytestream in the pack file), the `length` (number of bytes to read), a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, and the `size` of the actual data (which is equal to `length` if `compressed` is false, or the size of the uncompressed stream otherwise). +The latter is useful for statistics for instance, or to inform the reader beforehand of how much data will be returned, before it starts reading, so the reader can decide to store the whole object in RAM or to process it in a streaming way. (A simplified read-path sketch based on these fields is shown further below.) In addition, it tracks the number of the pack in which the object is stored. - We decided to use SQLite because of the following reasons: - it does not require a running server; - in [WAL mode](https://www.sqlite.org/wal.html), allows many concurrent readers and one writer, useful to allow many Unix processes to continue normal operations during repacking; - we access it via the SQLAlchemy library that anyway is already a dependency of AiiDA, and is actually the only dependency of the current object-store implementation; - it's quite widespread and so the libary to read and write should be reliable; in addition, SQLite has a clear [long-term support planning](https://www.sqlite.org/lts.html). + - it's quite widespread and so the library to read and write should be reliable; + - SQLite has a clear [long-term support planning](https://www.sqlite.org/lts.html). -- Deletion can just occur efficiently as either a deletion of the loose object, or a removal from the index file (if the object is already packed). +- Deletion can be performed efficiently by simply deleting the loose object, or removing the entry from the index file if the object is already packed. Later repacking of the packs can be used to recover the disk space still occupied in the pack files.
It is hard to find a better strategy that does not require manual repacking but gives all other guarantees especially for fast live operations (as a reference, also essentially all databases do the same and have a "vacuum" operation that is in a sense equivalent to the concept of repacking here). - The object store does not need to provide functionality to modify an object. In AiiDA, files of nodes are typically only added, and once added they are immutable (and very rarely they can be deleted). - If modification is needed, this can be achieved by creation of a new object and deletion of the old one, since this is an extremely rare operation (actually it should never happen). + If modification is needed, this can be achieved by creation of a new object and deletion of the old one, but this is expected to be an extremely rarely needed operation. -- The current packing format is `rsync`-friendly (that is one of the original requirements). - `rsync` has a clever rolling algorithm that can divides each file in blocks and detects if the same block is already in the destination file, even at a different position. +- The current packing format is `rsync`-friendly, which is one of the original requirements. + `rsync` has a clever rolling algorithm that divides each file into blocks and detects if the same block is already in the destination file, even at a different position. Therefore, if content is appended to a pack file, or even if a pack is "repacked" (e.g. reordering objects inside it, or removing deleted objects) this does not prevent efficient rsync transfer (this has been tested in the implementation). -- Since it works on a filesystem backend, this would allow to use also other tools, e.g. [`rclone`](https://rclone.org), to move the data to some different backend (a "real" object store, Google Drive, or anything else). +- Since the `disk-objectstore` works on a normal file system, it is also possible to use other tools, e.g. [`rclone`](https://rclone.org), to move the data to some different backend (a "real" object store, Google Drive, or anything else). - Appending content to a single file does not prevent the Linux disk cache from working efficiently. Indeed, the caches are per blocks/pages in Linux, not per file. - Concatenating to files does not impact performance on cache efficiency. + Appending to files, therefore, does not degrade cache efficiency. What is costly is opening a file, as the filesystem has to provide some guarantees e.g. on concurrent access. - As a note, seeking a file to a given position is what one typically does when watching a video and jumping to a different section. + As a note, seeking to a given position in a file is what one typically does when watching a video and jumping to a different section, which is an efficient operation. - Packing in general, at this stage, is left to the user. We can decide (at the object-store level, or probably better at the AiiDA level) to suggest that the user repack, or to trigger the repacking automatically. @@ -195,9 +197,9 @@ These are based on a compromise between the different requirements, and represen This could be a good approach, also to bind the repacking with the backups (at the moment, probably backups need to be executed using appropriate scripts to back up the DB index and the repository in the "right order", and possibly using SQLite functions to get a dump).
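+
+To tie together the pack-file layout and the SQLite metadata fields described in the points above, here is a simplified read-path sketch using only the Python standard library. It is illustrative only: the table and column names are assumptions made for this example, not the actual `disk-objectstore` schema, and the real implementation streams data instead of reading it in one go.
+
+```python
+import sqlite3
+import zlib
+
+
+def read_packed_object(index_path: str, pack_dir: str, hashkey: str) -> bytes:
+    """Retrieve one packed object given its hash key (illustrative sketch only)."""
+    db = sqlite3.connect(index_path)
+    try:
+        # Hypothetical schema: one row of metadata per packed object.
+        row = db.execute(
+            "SELECT pack_id, offset, length, compressed FROM objects WHERE hashkey = ?",
+            (hashkey,),
+        ).fetchone()
+    finally:
+        db.close()
+    if row is None:
+        raise KeyError(hashkey)
+    pack_id, offset, length, compressed = row
+    with open(f"{pack_dir}/{pack_id}", "rb") as pack:
+        pack.seek(offset)          # jump directly to the object's bytes in the pack
+        data = pack.read(length)   # read exactly `length` bytes
+    return zlib.decompress(data) if compressed else data
+```
+
+The point of the sketch is that, once the metadata row is known, a single seek plus a single read is enough to get the object back, regardless of how many other objects live in the same pack file.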
- As a note: even if repacking is never done by the user, the situation is anyway improved with respect to the current one in AiiDA: - - an index of files associated with an AiiDA node will now be stored in the AiiDA DB, so getting the list of files associated to a node without content will not need anymore to access the disk; + - an index of files associated with an AiiDA node will now be stored in the AiiDA DB, so getting the list of files associated with a node, without their content, will no longer require disk access; - - there wouldn't be anymore empty folders created for nodes without files; + - empty folders will no longer be created for nodes without files; - automatic deduplication of the data is now done transparently. @@ -209,15 +211,14 @@ These are based on a compromise between the different requirements, and represen ### Why a custom implementation of the library We have been investigating if existing codes could be used for the current purpose. - However, in our preliminary analysis we couldn't find an existing robust software that was satisfying all criteria. In particular: -- existing object storage implementations (e.g. Open Stack Swift, or others that typically provide a S3 or similar API) are not suitable since 1) they require a server running, and we would like to avoid the complexity of asking users to run yet another server, and most importantly 2) they usually work via a REST API that is extremely inefficient when retrieving a lot of small objects (latencies can be even of tens of ms, that is clearly unacceptable if we want to retrieve millions of objects). +- existing object storage implementations (e.g. OpenStack Swift, or others that typically provide an S3 or similar API) are not suitable since 1) they require a server running, and we would like to avoid the complexity of asking users to run yet another server, and most importantly 2) they usually work via a REST API that is extremely inefficient when retrieving a lot of small objects (latencies can easily be tens of ms, which is clearly unacceptable if we want to retrieve millions of objects; see the back-of-the-envelope estimate below). -- the closest tool we could find is the git object store, for which there exists also a pure-python implementation ([dulwich](https://github.com/dulwich/dulwich)). +- the closest tool we could find is the git object store, for which there exists also a pure Python implementation ([dulwich](https://github.com/dulwich/dulwich)). We have been benchmarking it, but the feeling is that it is designed to do something else, and adapting to our needs might not be worth it. - Some examples of issues we envision: managing re-packing (dulwich can only pack loose objects, but not repack existing packs); this can be done via git itself, but then we need to ensuring that when repacking, objects are not garbage-collected and deleted because they are not referenced within git. + Some examples of issues we envision: managing re-packing (dulwich can only pack loose objects, but not repack existing packs); this can be done via git itself, but then we need to ensure that when repacking, objects are not garbage-collected and deleted because they are not referenced within git. It's not possible (apparently) to decide if we want to compress an object or not (e.g., in case we want maximum performance at the cost of disk space), but objects are always compressed. Also, we did not test concurrent access and packing of the git repository, which requires some stress tests to assess whether it works.
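+
+To put the latency argument in numbers, a back-of-the-envelope estimate (the 10 ms per request is an illustrative assumption at the low end of "tens of ms", not a benchmark of any specific service):
+
+```python
+n_objects = 10_000_000   # order of magnitude of files in a large AiiDA repository
+latency_s = 0.010        # assumed ~10 ms per REST round trip
+total_hours = n_objects * latency_s / 3600
+print(f"{total_hours:.1f} hours")  # ~27.8 hours spent on request latency alone
+```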
@@ -233,7 +234,7 @@ In particular: * Is very efficient also with hundreds of thousands of objects (bulk reads can read all objects in a couple of seconds). * Is rsync-friendly, so one can suggest to use directly rsync for backups instead of custom scripts. * A library (`disk-objectstore`) has already been implemented, seems to be very efficient, and is relatively short (and already has tests with 100% coverage, including concurrent writes, reads, and one repacking process, and checking on three platforms: linux, mac os, and windows). -* It does not add any additional python depenency to AiiDA, and it does not require a service running. +* It does not add any additional Python dependency to AiiDA, and it does not require a service running. * Implements compression in a transparent way. * It shouldn't be slower than the current AiiDA implementation in writing during normal operation, since objects are stored as loose objects. Actually, it might be faster for nodes without files, as no disk access will be needed anymore, and automatically provides deduplication of files. From f7dd57c8898bb4b12f6438c1207a0cfd4725101f Mon Sep 17 00:00:00 2001 From: Giovanni Pizzi Date: Wed, 15 Dec 2021 19:50:40 +0100 Subject: [PATCH 8/8] Apply suggestions from code review --- 006_efficient_object_store_for_repository/readme.md | 1 - 1 file changed, 1 deletion(-) diff --git a/006_efficient_object_store_for_repository/readme.md b/006_efficient_object_store_for_repository/readme.md index 1878ff0..bd3d225 100644 --- a/006_efficient_object_store_for_repository/readme.md +++ b/006_efficient_object_store_for_repository/readme.md @@ -154,7 +154,6 @@ These are based on a compromise between the different requirements, and represen - Pack naming and strategy is not determined by the user. Anyway it would be difficult to provide easy options to the user to customize the behavior, while implementing a packing strategy that is efficient. Which object is in which pack is tracked in the SQLite database. - But at the moment it does not seem necessary. Possible future changes of the internal packing format should not affect the users of the library, since users only ask to get an object by hash key, and in general they do not need to know if they are getting a loose object, a packed object, from which pack, etc. - For each object, the SQLite database contains the following fields, that can be considered to be the "metadata" of the packed object: its `hashkey`, the `offset` (starting position of the bytestream in the pack file), the `length` (number of bytes to read), a boolean `compressed` flag, meaning if the bytestream has been zlib-compressed, and the `size` of the actual data (which is equal to `length` if `compressed` is false, or the size of the uncompressed stream otherwise).