-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stream directly to packed
during verdi archive import
#6417
base: main
Are you sure you want to change the base?
Conversation
573218d
to
5a2aea5
Compare
for more information, see https://pre-commit.ci
Thanks a lot! I'm happy to provide you with more technical details. In particular I think we need to properly take care of locking, it will not be automatic. But most importantly, it's essential to go via streams and not via get_object, as you say it will put all in memory and it will crash for few large objects in a machine with limited memory (think to a 4gb file on a small machine). With streams properly used, memory is limited to few MB. I think I had to solve the same problem of using properly context managers in disk object store, so we can look at the issue together. Also, the performance benefit for me is just not yet enough to merit making aiida more complex on this, I was hoping something better? If we really only gain 10%, let's discuss but we maybe drop it. If instead we can play a bit and get significant benefits, then it becomes relevant. Maybe instead of working directly on aiida, we can do a mock example where we just stream the files from the AiiDA zip file to a disk objectstore with the two approaches, to see if there is a major improvement or not. Happy to discuss this further |
I decided to go to the lowest level and do a bit of time comparison, to see if it's worth looking into writing (optionally) directly to packs when importing (that will however mean you can't do much on the profile, but it's OK when importing large DBs e.g. right after setting up a profile, or in special cases when people know what they want to do). And I hope I didn't burn my laptop HD doing these tests :-D (as a reference I used a SDD, not a rotating HD, and this was on a Intel 2019 MacBook Pro, with 32GB of RAM). TL;DR: I think it's worth to do it, and if done properly (do not read objects in memory but use properly streams), you can get a speed up of ~3x (between 1.7x and 7x depending on what you measure). Results of my analysis, where I consider the The code that I used (possibly changing the values of the variables import pathlib
import zipfile
import disk_objectstore as dostore
import time
directly_to_packs = False
#file = pathlib.Path('test.aiida')
file = pathlib.Path('MC3D-provenance-migrated.aiida')
add_max = 1000000 # Just if you want to try the code on a few objects first
container = dostore.Container('test_container')
container.init_container(clear=True)
subfolder = 'repo/'
if False:
with zipfile.ZipFile(file) as myzip:
added_counter = 0
for fname in myzip.namelist():
if not fname.startswith(subfolder):
continue
added_counter += 1
print(f"Number of files in the .aiida file: {added_counter}")
t = time.time()
with zipfile.ZipFile(file) as myzip:
added_counter = 0
if directly_to_packs:
# Create list of file paths - TODO: batch in groups - TODO: check if `.namelist` reads in the order files are written on disk in the .zip
fhs = []
for fname in myzip.namelist():
if not fname.startswith(subfolder):
continue
fhs.append(myzip.open(fname))
added_counter += 1
if added_counter >= add_max:
break
container.add_streamed_objects_to_pack(fhs)
# Close all now
for fh in fhs:
fh.close()
else:
for fname in myzip.namelist():
if not fname.startswith(subfolder):
continue
added_counter += 1
if added_counter >= add_max:
break
# In addition to different speed, here we are putting all in memory! Not very good
with myzip.open(fname) as fh:
data = fh.read()
container.add_object(data)
elapsed = time.time() - t
print(f"Elapsed time: {elapsed} s")
print(container.count_objects()) ResultsAs a note, the total number of files/objects in the .aiida file: 837443 Limiting to 100,000 objects onlyFor time reasons, I first do most test limiting only to the first 100,000 objects. Directly adding to packsNote: it's not compressing anything, the heuristics is not implemented in the
Final check on dostore: Output of
As an order of magnitude, running once full repacking with heuristics. import disk_objectstore as dostore
import time
container = dostore.Container('test_container')
t = time.time()
container.repack(compress_mode=dostore.CompressMode.AUTO)
elapsed = time.time() - t
print(f"Elapsed time: {elapsed} s") Result:
(note: if you do the repacking with As a comparison, compressing EVERYTHING directly from the beginning, with the test speed script (setting
(The heuristics takes time as it has to try partial compression, with the hope of avoiding to compress certain files and thus speeding up any further reading from those - particularly valuable for large already compressed files, in the other cases it's not so important) To loose directly (without any final repacking)
Final check on do store:
As an order of magnitude, running once
After repacking:
Summary on 100,000 objects
All objectsDirectly adding to packs
Final check on dostore: Output of
To loose directly (without any final repacking)
Final check on dostore: Output of
Summary on all 837,443 objects
As an order of magnitude, running once
Therefore: Considering writing + packing (and compressing only in the first case): (2474+1119)/379 = 9.5x slower without directly to packs (but a bit unfair as in the first case I'm also compressing) After repacking:
ConclusionTo conclude, I suggest that you try to do another test properly using the methods above (specifically IMPORTANT NOTE: it would be very important to check that the order in which you get the list of files (not only directly with |
Wow, thanks a lot, @giovannipizzi! This is extremely useful! Despite the initial timings, I've also been quite convinced that the problem was the early implementation and that it would be worth it, if done properly. This definitely gives me a good state to continue with it. I'll report back here. |
This is a very early proof-of-concept implementation of streaming repository files directly to
packed
during an archive import.Especially for large archives, in particular, with many (small) repository files, this could provide significant speed-up as it reduces the required file open and close operations when writing to the target repository. While many such operations to obtain the streams from the
ZipfileBackendRepository
can probably not be avoided, they can be significantly reduced by streaming topacked
instead ofloose
, as the latter would require the same number again as for reading the contents, while the former would require just one per pack (~4gb file).I just ran the command for the
acwf-verification_unaries-verification-PBE-v1-results_quantum_espresso-SSSP-1.3-PBE-precision.aiida archive (2gb, ~60k repository files)
and the
MC3D-provenance.aiida archive (12gb, ~840k repository files)
In the first case, the total runtime of the
verdi archive import
command was 176s and 107s when streaming toloose
orpacked
respectively, while in the second case they were 787s and 728s. Here, I was expecting a larger speed-up, but this might originate from the non-ideal preliminary implementation and benchmarking. Overall, I think this could be a promising feature when done properly.A few notes regarding the implementation:
ZipFileBackendRepository
) viarepository_from.list_objects()
and then retrieve the contents asBytesIO
viarepository_from.get_object_content()
. I didn't use therepository_from.iter_object_streams()
as that returnsZipExtFiles
which couldn't be opened outside the scope of a context manager, and, importantly, when passing toContainer.add_streamed_objects_to_pack()
.ZipfileBackendRepository
that returns the hashes and contents in batches (rather than one by one, such asiter_object_streams()
) so that might be something we have to implement from scratch for this feature, either based on the number of files, or their total size for each batch, using sensible defaults.The remaining todos to have a proper implementation (ties in with the notes above):
ZipFileBackendRepository
anddisk_objectstore
Container
methods if applicable and available (e.g. I saw there is alsoadd_objects_to_pack
in addition toadd_streamed_objects_to_pack
)packed
can provide (only timing the repository file additions)Also note that I added the
verdi profile flush
command as a convenience command to clean the data from a profile, which I use now during development. This can be moved into a separate PR if we deem it useful, or eventually be removed from this PR if not.Pinging @khsrali as he voiced his interest in being involved in the
disk_objectstore
, @agoscinski as we looked into this together for a while, @mbercx to keep in the loop, and @sphuber and @giovannipizzi with expertise on the backend implementations who can likely provide crucial pointers. I'm still familiarizing myself with the backend repository implementations and thedisk_objectstore
, so any such pointers are more than welcome. I'll keep working on these to-dos on my branch, updating the PR here along the way.