Decide if we should call `fullsync` on Mac (and if we should expose the decision to sync to the user) #54

giovannipizzi · 2020-07-12T06:44:45Z

On Mac, fsync is not really enough; see the docs:

For applications that require tighter guarantees about the integrity of
their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC
fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of writes
should use F_FULLFSYNC to ensure that their data is written in the order
they expect. Please see fcntl(2) for more detail.

Decide if we want to give these guarantees on Mac

For reference, see e.g. this implementation

And this does not seem to be done by Python yet (?)
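As a rough illustration of what such a guarantee could look like from Python, here is a minimal sketch. It assumes that `fcntl.F_FULLFSYNC` is exposed by the `fcntl` module on macOS and falls back to a plain `os.fsync` everywhere else; the helper name `flush_to_disk` is hypothetical and not taken from the repository.

```python
import os
import tempfile

try:
    import fcntl  # POSIX only; not available on Windows
except ImportError:
    fcntl = None


def flush_to_disk(fileno):
    """Flush a file descriptor as durably as the platform allows.

    Hypothetical helper: returns the name of the mechanism that was used.
    """
    # F_FULLFSYNC is only defined on platforms that support it (macOS).
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None) if fcntl is not None else None
    if full_fsync is not None:
        try:
            fcntl.fcntl(fileno, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems reject F_FULLFSYNC; fall back to fsync below.
            pass
    os.fsync(fileno)
    return "fsync"


with tempfile.NamedTemporaryFile() as handle:
    handle.write(b"some data")
    handle.flush()  # push Python's own buffer to the OS first
    method = flush_to_disk(handle.fileno())
```

Note that `handle.flush()` has to come first: neither fsync nor F_FULLFSYNC knows about data still sitting in Python's userspace buffer.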


giovannipizzi · 2020-07-12T14:12:15Z

This is partially addressed in 95719c2

Still to be investigated:

- should we use FULLSYNC instead of fsync also for the folder (probably yes)?
- assess the performance hit for writing loose files and decide if we want to use it also in that case
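For the folder question, a minimal sketch of syncing a directory entry on POSIX systems follows. It uses a plain `os.fsync` on a directory file descriptor; whether F_FULLFSYNC can be substituted for it on a macOS directory is exactly the open point above, so that is not shown. The function name `sync_folder` is hypothetical.

```python
import errno
import os
import tempfile


def sync_folder(path):
    """Ask the OS to flush a directory entry to disk (POSIX only).

    Hypothetical helper: returns True if the sync was performed and
    False if the filesystem does not support syncing directories.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
        return True
    except OSError as exc:
        # Some filesystems report EINVAL for fsync on a directory.
        if exc.errno != errno.EINVAL:
            raise
        return False
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "loose_object"), "wb") as handle:
        handle.write(b"content")
        handle.flush()
        os.fsync(handle.fileno())  # sync the file content itself
    folder_synced = sync_folder(folder)  # then sync the directory entry
```

Syncing the folder matters because a newly created file is only reachable after a crash if the directory entry pointing to it has also reached the disk.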

giovannipizzi · 2020-07-12T14:32:23Z

Regarding performance: running the test `test_loose_write`, which writes 1000 loose objects, takes 7.1s with `_MACOS_ALWAYS_USE_FULLSYNC = True`, while with `False` it takes only 715ms, so there is a factor of 10! For the moment I will probably not do the FULLSYNC.
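A comparison of this kind can be reproduced outside the test suite with a small timing loop. The sketch below is not the `test_loose_write` test itself: it writes small files into a temporary folder and uses a plain `os.fsync` rather than F_FULLFSYNC, so the absolute numbers will differ from the ones quoted above. The function name and the payload size are assumptions.

```python
import os
import tempfile
import time


def time_loose_writes(num_objects, do_fsync):
    """Return the seconds needed to write num_objects small files."""
    payload = b"x" * 1024  # arbitrary 1 KiB payload
    with tempfile.TemporaryDirectory() as folder:
        start = time.perf_counter()
        for idx in range(num_objects):
            path = os.path.join(folder, "obj_{}".format(idx))
            with open(path, "wb") as handle:
                handle.write(payload)
                if do_fsync:
                    handle.flush()
                    os.fsync(handle.fileno())
        return time.perf_counter() - start


elapsed_synced = time_loose_writes(100, do_fsync=True)
elapsed_unsynced = time_loose_writes(100, do_fsync=False)
print("with fsync: {:.3f}s, without: {:.3f}s".format(elapsed_synced, elapsed_unsynced))
```

The ratio depends heavily on the disk and filesystem, so it is worth running on the actual target hardware before drawing conclusions.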

giovannipizzi · 2020-07-13T22:48:56Z

Maybe we should simply let the user decide, for all methods that add objects, whether they want to do an fsync or not.

giovannipizzi · 2020-08-26T21:50:28Z

In 471ac9b we have added some control over when to do fsync. It might be good to extend this to the calls in `utils.py` that write to packs, and in general to all places, so that it is more general and the user has control.
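To make the idea concrete, here is a minimal sketch of a writer that hands the decision to the caller through a `do_fsync` flag. The function names and signatures are hypothetical and only illustrate the pattern; they are not the API of the repository.

```python
import hashlib
import os
import tempfile


def add_object(folder, content, do_fsync=True):
    """Write content to a file named after its sha256 hash (hypothetical API)."""
    hashkey = hashlib.sha256(content).hexdigest()
    with open(os.path.join(folder, hashkey), "wb") as handle:
        handle.write(content)
        if do_fsync:
            # The caller asked for durability: pay the cost of a sync.
            handle.flush()
            os.fsync(handle.fileno())
    return hashkey


def add_objects(folder, contents, do_fsync=True):
    """Write many objects, forwarding the caller's choice to every write."""
    return [add_object(folder, content, do_fsync=do_fsync) for content in contents]


with tempfile.TemporaryDirectory() as folder:
    # A bulk import that can be restarted from scratch may skip the sync.
    hashkeys = add_objects(folder, [b"first", b"second"], do_fsync=False)
    written = sorted(os.listdir(folder))
```

Defaulting the flag to `True` keeps the safe behaviour unless the caller explicitly opts out.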

This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes #92
- When there are many objects to list (currently set to 9500 objects, 10x the ones we could fit in a single IN SQL statement), performing many queries is slow, so we just resort to listing all objects and doing an efficient intersection (if the hash keys are sorted, both iterators can be looped over only once - fixes #93)
- Since VACUUMing the DB is very important for efficiency, when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB. VACUUM is also called after repacking. Fixes #94
- We now implement a function to perform a full repack of the repository (fixes #12). This is important and needed to reclaim space after deleting an object
- For efficiency, we have moved the logic from an `export` function (still existing but deprecated) to an `import_objects` function
- Still for efficiency, functions like `pack_all_loose` and `import_objects` now provide an option to perform a fsync to disk or not (see also #54 - there are still however calls that always use - or don't use - fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows choosing whether to commit the changes to the SQLite DB or not (delegating the responsibility to the caller, but this is important e.g. in `import_objects`: calling `commit` only once at the very end gives a factor of 2 speedup for very big repos).
- A number of functions, including (but not exclusively) `import_objects`, provide a callback to e.g. show a progress bar.
- a `CallbackStreamWrapper` has been implemented, allowing a callback to be provided (e.g. for a progress bar) when streaming a big file.
- a new hash algorithm is now supported (`sha1`) in addition to the default `sha256` (fixes #82). This is faster even if a bit less robust. This was also needed to completely test some features in `import_objects`, where the logic is optimised if both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because then all export files that will be generated will use this algorithm and importing will be more efficient.
- tests have been added for all new functionality, achieving again 100% coverage

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:

```
6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.
```

This can be compared to:

- ~10 minutes to copy the whole 90GB, or ~15 minutes to read all and validate the packs. We will never be able to be faster than just copying the pack files, and we are only 3x slower.
- ~2 days to just list all files in the old legacy AiiDA repo (or all objects if they are loose), and this does not take into account the time to rewrite everything, probably comparable.

So we are almost 2 orders of magnitude faster than before.
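The "efficient intersection" of two sorted sequences of hash keys mentioned in this message can be sketched as a single simultaneous pass over both iterators. This is a generic illustration of the technique, not the code from the merge; it assumes both inputs are sorted in ascending order and contain no duplicates.

```python
def sorted_intersection(left, right):
    """Yield the items present in both sorted, duplicate-free iterables.

    Each iterable is consumed at most once, so this also works on
    generators that stream hash keys from disk or from a DB cursor.
    """
    left_iter, right_iter = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted iterator
    a, b = next(left_iter, done), next(right_iter, done)
    while a is not done and b is not done:
        if a == b:
            yield a
            a, b = next(left_iter, done), next(right_iter, done)
        elif a < b:
            a = next(left_iter, done)  # left is behind: advance it
        else:
            b = next(right_iter, done)  # right is behind: advance it


requested = ["1a", "3c", "7f", "9e"]
stored = ["0b", "3c", "5d", "9e", "ff"]
common = list(sorted_intersection(requested, stored))
```

The cost is linear in the total number of keys, which is why it can beat issuing many separate `IN` queries once the list of requested objects gets long.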

giovannipizzi · 2023-06-21T21:27:27Z

I'm closing this as I think the current implementation has a good compromise between efficiency and performance. We can reopen if we see issues.

giovannipizzi added this to the Nice to have milestone Jul 12, 2020

giovannipizzi changed the title ~~Decide if we should call fullsync on Mac~~ Decide if we should call fullsync on Mac (and if we should expose the decision to sync to the user) Jul 16, 2020

giovannipizzi modified the milestones: Nice to have, Robustness and new features Aug 26, 2020

giovannipizzi mentioned this issue Aug 26, 2020

Collection of a number of important efficiency improvements #96

Merged

giovannipizzi closed this as completed Jun 21, 2023
