Add Archive format AEP #21
Conversation
I'm sure there is room for improvement, but I think this is now in a state ready for review.
I think this is a good start. I've made a few suggestions to improve the format and content.
Edit: I accidentally submitted this comment as "comment"; it should be "changes requested".
005_exportformat/readme.md
Outdated
|------------|------------------------------------------------------------------|
| Title      | Implement a new archive export format                            |
| Authors    | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |
| Champions  | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |
Please identify a champion for this AEP, it cannot be yourself.
@giovannipizzi do you want to champion? Is there a specific definition for a champion? Is it just someone else who thinks it is a good idea?
005_exportformat/readme.md
Outdated
## Proposed Enhancement

The goal of this project is to first develop a set of agreed requirements for a new archive format, followed by a concrete implementation of the format and accompanying export and import functions.
Please revise the description to focus on the proposed enhancement; we can add detailed notes on its implementation in a later section. This should not be a meta-description of the AEP process itself.
005_exportformat/readme.md
Outdated
The alternative approach would be to use the newly implemented "packfile" object-store, with a coordinating SQLite database.
The pros and cons of this approach have been previously assessed in <https://github.com/aiidateam/AEP/pull/11>.

### Archive compression
I think that the ability to compress the archive should be discussed as part of the requirements. I would also rephrase it so that it is clear that the "compression" is the requirement, not the fact that one can compress the whole archive; that is always a given.
005_exportformat/readme.md
Outdated
### Archive compression

For portability, it is desirable that the full archive be contained within a single zipped file.
Again, I think we should rephrase this, because I can always place some directories or files in a zip-file.
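To make the "single-file" point concrete, individual members of a zip archive can be read without extracting the whole file. This is only an illustrative sketch; the member names (`metadata.json`, `nodes/...`) are hypothetical and not the AEP specification:

```python
import io
import json
import zipfile

# Build a toy single-file archive in memory (names are hypothetical).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("metadata.json", json.dumps({"export_version": "1.0"}))
    zf.writestr("nodes/xxyy-zz", b"node file content")

# Read a single member without extracting the whole archive.
with zipfile.ZipFile(buf) as zf:
    metadata = json.loads(zf.read("metadata.json"))

print(metadata["export_version"])  # → 1.0
```

The point being that "everything in one zip" is trivially achievable; the open question in the thread is whether compression itself should be a stated requirement.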
## Pros and Cons

For implementing a new format.
It seems to me that this AEP is trying to combine two different arguments: 1. whether a new format should be developed, and 2. what that format would be. It is easy to combine these two questions by simply making "keep the current format" one of the alternatives.
In that sense, this AEP should identify a specific proposed solution as a result of the earlier discussion, and the "Pros and Cons" of that specific solution should be discussed here.
Yeah, I wasn't quite sure what the initial scope and "completeness" of these AEPs had to be for them to be merged in the submitted state.
The question of "whether to keep the current format", as you say, is somewhat tied to already having decided on a specific solution, which in turn requires that all potential solutions (including Zarr 😉) are fully assessed, and most likely requires some prototyping of one or more of the solutions, with tests/benchmarking to compare against the current format, etc.
Does all this work need to be completed before a submission is merged? Does it stay as an open PR until this is all done, or is it enough to initially provide an adequate justification for commencing the work?
So far we have just kept AEPs in the PR stage until they appeared ready. I would be open to merging drafts with subsequent updates, but that should probably be discussed in plenum.
I think the outcome of the plenum discussion was to bring this AEP into an agreeable state concerning the requirements as soon as possible and then merge it with the status "submitted", with subsequent updates, but @ltalirz is also going to have another look at this.
Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
005_exportformat/readme.md
Outdated
* `data.json` contains all requisite information from the SQL database.
* `nodes` contains the "object store" files per node, organised by UUID: `xxyy-zz...`.

Particularly for large export archives, writing to (export) and reading from (import) `data.json` represents a significant performance bottleneck for these processes, both in terms of memory usage and process speed.
Maybe we could add here the two principal reasons:
- The file repository is inefficiently stored: each file is written as its own uncompressed file, which requires a lot of inodes, and just indexing all the file content is slow.
- For the validity of `data.json` to be determined, the entire content has to be read into memory, which very quickly becomes a limiting factor for export size.
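The second point can be illustrated with the standard library: parsing a monolithic JSON document requires consuming the whole stream before it can return or even confirm validity (the toy data below is hypothetical):

```python
import io
import json

# A single monolithic JSON document, like data.json (toy data).
doc = json.dumps(
    {"nodes": [{"uuid": f"{i:04d}", "attrs": {"x": i}} for i in range(1000)]}
)

# json.load must materialise the entire document in memory before returning.
data = json.load(io.StringIO(doc))
print(len(data["nodes"]))  # → 1000

# A truncated document only fails at the very end of parsing:
try:
    json.loads(doc[:-1])
except json.JSONDecodeError:
    print("invalid")  # → invalid
```

So both memory use and validation time scale with the full size of `data.json`, regardless of how little of it an importer actually needs.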
005_exportformat/readme.md
Outdated
An additional feature to consider would be delta increments, such that an existing archive file could be amended during an export.
This may allow for a push/pull interface for "syncing" an archive file to a particular AiiDA profile.
I am not sure, but are you referring here to what @giovannipizzi described that he would like to see as a feature? Because I am not sure this is exactly what he meant. What I think he meant was that we should ideally have push/pull functionality between AiiDA databases, and one way of implementing this was by means of export archives. Say you have a database: you create an export archive of it and import it into a second database. Then you create additional nodes in db1 and you want to "push" them to db2. The mechanism would be to compute the diff, export those nodes into an archive, and then just import that archive into db2. So I don't think we would require directly updating the archives themselves, as this section of the AEP seems to suggest.
Yep, I guess @giovannipizzi can clarify this.
005_exportformat/readme.md
Outdated
When writing an export file, these issues should not necessarily be present.

[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.
It is unclear, though, if this would actually provide any performance gains, since in many cases the full JSON will still need to be loaded into memory.
What is the advantage of "streamable JSON" if it still requires reading everything in memory 😕
Perhaps I undersold it here lol. I guess you use it when "searching" for records, e.g. by iterating over each JSON "piece" (loading only that piece into memory) and checking if it contains the key(s) you are looking for.
I will give a deeper dive on this though.
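For reference, the record-by-record scan described above might look like this with JSON Lines, where only one line is ever held in memory at a time (the record keys are hypothetical):

```python
import io
import json

# Toy JSON Lines data: one complete JSON record per line (hypothetical keys).
jsonl = io.StringIO(
    '{"uuid": "aaaa", "node_type": "data.int"}\n'
    '{"uuid": "bbbb", "node_type": "process.calc"}\n'
    '{"uuid": "cccc", "node_type": "data.int"}\n'
)

# Scan record by record, keeping only matching UUIDs; the full file is
# never parsed as one document.
matches = [
    rec["uuid"]
    for line in jsonl
    if (rec := json.loads(line))["node_type"] == "data.int"
]
print(matches)  # → ['aaaa', 'cccc']
```

This helps for filtering/searching passes; as noted in the thread, it does not help when the entire dataset genuinely has to end up in memory anyway.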
Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
Points to address from meeting:
For small archives, this is most probably the best solution.
However, when considering large archives, single large JSON files are an extremely poor database format;
they must be read in full to access any data, and they do not support concurrency or ACID (atomicity, consistency, isolation, durability) transactions.
I would remove the note on ACID - the JSON would be dumped during export by a single process, and never edited, so I don't think this comment applies here.
To support this, one could consider extending the current format to move towards a "NoSQL" database type implementation, splitting the JSON into multiple files (see for example [MongoDB](https://en.wikipedia.org/wiki/MongoDB)).

For example, node-level JSONs could be stored in the disk object store, together with a minimal index of UUID -> hashkey mappings.
Small technical note: I think I was suggesting this, but in the end I checked, and just writing/reading a multiline JSON is much faster. If you want to keep an index (that would just increase performance, but everything would work even without it), it can simply be the byte offset at which the JSON line of a given node (with a given UUID) starts.
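A minimal sketch of that byte-offset idea, assuming one JSON record per line and a UUID -> offset index built while writing (the record layout is illustrative only, not the AEP format):

```python
import io
import json

# Toy multiline JSON: one node per line (illustrative records).
records = [
    {"uuid": "aaaa", "label": "first"},
    {"uuid": "bbbb", "label": "second"},
]

buf = io.BytesIO()
index = {}
for rec in records:
    # Record the byte offset where this node's JSON line starts.
    index[rec["uuid"]] = buf.tell()
    buf.write(json.dumps(rec).encode() + b"\n")

# Random access: seek straight to the recorded offset, read one line.
buf.seek(index["bbbb"])
node = json.loads(buf.readline())
print(node["label"])  # → second
```

As the comment notes, the index is purely an optimisation: without it, a linear scan over the lines still recovers every record.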
At a node level, dumping data into a JSON and writing it to disk would also likely be faster than recreating database tables that must handle indexes, ACID, avoiding concurrency problems, etc.
When writing an export file, these issues should not necessarily be present.

[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.
As mentioned above, more than streaming, I think a multiline JSON would be a very good technical solution.
* It is a very stable and robust format with a clear long-term support plan until at least 2050 (see <https://www.sqlite.org/lts.html>).

The main drawback of using SQL is that it is a binary format and so inherently not directly human-readable.
The format specification must be known before reading, and the SQLite version should also be preserved within the archive.
I would add that:
- we can limit ourselves to SQLite 3 (and readers will know if the format is different);
- there are GUIs to graphically inspect and edit SQLite files, so this problem is mitigated.
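One way the "preserve the SQLite version" suggestion could be realised is by recording it in a metadata table at write time, using SQLite's built-in `sqlite_version()` SQL function; the table and column names below are hypothetical, not the AEP specification:

```python
import sqlite3

# Sketch: a writer records the SQLite library version inside the archive
# database itself; a reader can then check compatibility before parsing.
# Table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive_meta (key TEXT PRIMARY KEY, value TEXT)")
conn.execute(
    "INSERT INTO archive_meta VALUES ('sqlite_version', sqlite_version())"
)

(version,) = conn.execute(
    "SELECT value FROM archive_meta WHERE key = 'sqlite_version'"
).fetchone()
major = int(version.split(".")[0])
assert major == 3  # e.g. readers could limit support to SQLite 3
conn.close()
```

This keeps the compatibility information inside the single archive file, rather than relying on external documentation.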
Co-authored-by: Giovanni Pizzi <gio.piz@gmail.com>
I believe I've now addressed all comments, so merging this in "draft" status (see #23).
- `b4b4053` Giovanni Pizzi <giovanni.pizzi@epfl.ch>, Wed Dec 15 20:20:05 2021 +0100: AEP 006 - Efficient object store for the AiiDA repository (aiidateam#11)
- `0a5675d` Sebastiaan Huber <mail@sphuber.net>, Fri Sep 10 18:16:30 2021 +0200: Update README.md (aiidateam#26)
- `5b45258` Sebastiaan Huber <mail@sphuber.net>, Fri Sep 10 18:14:31 2021 +0200: AEP 004: Infrastructure to import completed calculation jobs (aiidateam#12)
- `4855195` Chris Sewell <chrisj_sewell@hotmail.com>, Sun Jan 10 15:20:52 2021 +0000: Add Archive format AEP (aiidateam#21)