VACUUMing the DB and performance on listing via the index #94

Closed
giovannipizzi opened this issue Jul 31, 2020 · 2 comments · Fixed by #96
giovannipizzi commented Jul 31, 2020

  1. When the DB is created a bit at a time, the index ends up scattered across the file. On a big DB whose file is not in the OS disk cache, this causes a huge performance hit (~1000 it/s instead of ~600000 it/s) on any listing that uses the covering index on the hash key (e.g. `SELECT hashkey FROM db_object ORDER BY hashkey`). For instance, run the query right after flushing the caches with `sudo su -c "echo 3 > /proc/sys/vm/drop_caches"`:
    from disk_objectstore import Container
    import tqdm
    
    container = Container('/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/')
    session = container._get_cached_session()
    
    hashkeys_q = session.execute("SELECT hashkey FROM db_object ORDER BY hashkey")
    list(tqdm.tqdm(hashkeys_q, total=6714808))
  2. If the DB file fits in memory, just running `cat packs.idx > /dev/null` makes all subsequent operations (like listing ordered by hash key) fast again (~600000 it/s vs ~1000 it/s); a Python sketch of this pre-fetch is given after the list.
  3. A better long-term solution is to call VACUUM on the DB, as this defragments the DB and its indexes (note: the content of the SQLite file, not how it is laid out on the filesystem). VACUUM is also needed to reclaim space after deleting entries. Note that, after VACUUM, the first iteration of `SELECT hashkey FROM db_object ORDER BY hashkey` runs at ~270000 it/s instead of the ~600000 it/s of later (cached) runs, but that is already fast enough. Also, looking at how much data goes into the OS disk cache, it seems only ~550MB of the 1.2GB SQLite file actually needs to be read and kept in the cache (probably the size of the index on the hash key).
  4. Performing `SELECT count(*) FROM db_object` decides to use the index on the hash key, which, if the DB is 'fragmented', is very slow as described above (it is much faster to count, in Python, the results of `SELECT id FROM db_object ORDER BY id`). Therefore we can (see also the sketch after this list):
    • change how we count objects
    • have a trick to call `cat` on the whole file to pre-fetch the data (but only if it all fits in RAM!)
    • have a maintenance operation that calls VACUUM (probably the better choice, to combine with other maintenance operations)
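A minimal sketch of the workarounds from points 2 and 4, assuming the pack index is the `packs.idx` SQLite file inside the container folder (the exact path and file layout are assumptions here):

```python
import sqlite3

# Assumed location of the SQLite pack index inside the container folder.
DB_PATH = '/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs.idx'

# Point 2: read the whole file once so it ends up in the OS page cache
# (equivalent to `cat packs.idx > /dev/null`). Only sensible if it fits in RAM.
with open(DB_PATH, 'rb') as handle:
    while handle.read(16 * 1024 * 1024):
        pass

# Point 4: count objects via the integer primary key instead of count(*),
# which would use the (possibly fragmented) covering index on the hash key.
conn = sqlite3.connect(DB_PATH)
num_objects = sum(1 for _ in conn.execute("SELECT id FROM db_object ORDER BY id"))
print(f"Number of objects: {num_objects}")
conn.close()
```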

Mentioning also #92, as the performance when looping over results sorted by hash key will depend strongly on either caching the file or VACUUMing it first (which, however, is slow).

@giovannipizzi giovannipizzi added this to the Nice to have milestone Jul 31, 2020
giovannipizzi (Member, Author) commented:

According to this SQLite page, the following SQL can be used to estimate how 'fragmented' the DB is and if you need to call VACUUM:

CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*)
  FROM s AS s1, s AS s2
 WHERE s1.rowid+1=s2.rowid;
DROP TABLE s;
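A minimal sketch to run this estimate from Python against a container's SQLite file (the `packs.idx` path is an assumption; note that the `dbstat` virtual table is only available if the SQLite library was compiled with SQLITE_ENABLE_DBSTAT_VTAB):

```python
import sqlite3

# Path to the container's SQLite pack index (assumed name/location).
conn = sqlite3.connect('/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs.idx')
cur = conn.cursor()
cur.execute("CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT)")
cur.execute("INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path")
cur.execute(
    "SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*) "
    "FROM s AS s1, s AS s2 WHERE s1.rowid+1=s2.rowid"
)
# Fraction of pages stored sequentially: close to 1.0 means little
# fragmentation, low values suggest a VACUUM is worthwhile.
print(cur.fetchone()[0])
cur.execute("DROP TABLE s")
conn.close()
```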

As a reference, here are the values on the big SDB DB (~6.8M entries, 1.2GB):

  • Before VACUUM: 0.191833845709181
  • After VACUUM: 0.882292652715071

giovannipizzi (Member, Author) commented:

To VACUUM a Container `c`, this can be done:

engine = c._get_cached_session().get_bind()
engine.execute("VACUUM")
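Note that `Engine.execute()` only exists in SQLAlchemy 1.x; here is a sketch of the same operation assuming SQLAlchemy 2.x, where the statement must run outside any implicit transaction:

```python
from sqlalchemy import text

engine = c._get_cached_session().get_bind()
# AUTOCOMMIT isolation avoids wrapping the statement in a transaction,
# since SQLite refuses to VACUUM from within one.
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(text("VACUUM"))
```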

chrisjsewell added a commit that referenced this issue Oct 2, 2020
This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- Objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes #92
- When there are many objects to list (currently set to 9500 objects, 10x the number we could fit in a single `IN` SQL statement), performing many queries is slow, so we simply resort to listing all objects and doing an efficient intersection: if the hash keys are sorted, both iterators need to be looped over only once (fixes #93; a sketch of this single-pass intersection is shown after this list).
- Since VACUUMing the DB is very important for efficiency when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB; VACUUM is also called after repacking. Fixes #94
- We now implement a function to perform a full repack of the repository (fixes #12). This is important and needed to reclaim space after deleting an object.
- For efficiency, we have moved the logic from an `export` function (still existing, but deprecated) to an `import_objects` function.
- Still for efficiency, functions like `pack_all_loose` and `import_objects` now provide an option to fsync to disk or not (see also #54; there are still, however, calls that always use, or don't use, fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows choosing whether to commit the changes to the SQLite DB or not (delegating the responsibility to the caller; this is important e.g. in `import_objects`, where calling `commit` only once at the very end gives a factor-of-2 speedup for very big repos).
- A number of functions, including (but not only) `import_objects`, accept a callback to e.g. show a progress bar.
- A `CallbackStreamWrapper` has been implemented, allowing a callback (e.g. for a progress bar) to be provided when streaming a big file.
- A new hash algorithm (`sha1`) is now supported in addition to the default `sha256` (fixes #82). It is faster, even if a bit less robust. It was also needed to fully test some features of `import_objects`, where the logic is optimised when both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because all generated export files will then use this algorithm and importing will be more efficient.
- Tests have been added for all new functionality, achieving again 100% coverage.
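As an illustration of the single-pass intersection over sorted hash keys mentioned above (for #93), here is a minimal sketch; this is not the library's actual implementation, just the idea:

```python
from typing import Iterable, Iterator

def sorted_intersection(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield the hash keys present in both inputs, assuming both are sorted.

    Each input is consumed exactly once, so it also works on results
    streamed from SQLite without materializing them in memory.
    """
    left_iter, right_iter = iter(left), iter(right)
    try:
        l_key, r_key = next(left_iter), next(right_iter)
        while True:
            if l_key == r_key:
                yield l_key
                l_key, r_key = next(left_iter), next(right_iter)
            elif l_key < r_key:
                l_key = next(left_iter)
            else:
                r_key = next(right_iter)
    except StopIteration:
        return

# Example: requested hash keys vs. the sorted keys present in the DB.
print(list(sorted_intersection(['0a1', '3b2', 'ff0'], ['0a1', '1c4', '3b2', '9e9'])))
# -> ['0a1', '3b2']
```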

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:
```
6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.
```

This can be compared to:

- ~10 minutes to copy the whole 90GB, or ~15 minutes to read and validate all the packs. We will never be faster than just copying the pack files, and we are only ~3x slower.
- ~2 days just to list all files in the old legacy AiiDA repo (or all objects, if they are loose); this does not take into account the time to rewrite everything, which is probably comparable. So we are almost two orders of magnitude faster than before.