VACUUMing the DB and performance on listing via the index #94
According to this SQLite page, the following SQL can be used to estimate how 'fragmented' the DB is and whether you need to call VACUUM:

```sql
CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*)
  FROM s AS s1, s AS s2
 WHERE s1.rowid+1=s2.rowid;
DROP TABLE s;
```

As a reference, here is the value on the big SDB DB (~6.8M entries, 1.2GB) before VACUUMing:
To vacuum a Container:

```python
# c is an open Container instance
engine = c._get_cached_session().get_bind()
engine.execute("VACUUM")
```
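For reference, the same check and VACUUM can also be driven with the standard-library `sqlite3` module, without going through the container's SQLAlchemy session. This is only a sketch: the DB path is a placeholder, the 0.9 threshold is an arbitrary example, and the `dbstat` query only works if the SQLite build has the dbstat virtual table enabled (`SQLITE_ENABLE_DBSTAT_VTAB`).

```python
import sqlite3

DB_PATH = "/path/to/container/packs.idx"  # placeholder: the container's SQLite file

# Autocommit mode: VACUUM cannot run inside an open transaction.
conn = sqlite3.connect(DB_PATH, isolation_level=None)

# Fraction of logically consecutive pages that are also physically consecutive
# (1.0 means fully defragmented). Same query as in the comment above.
conn.executescript(
    """
    CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
    INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
    """
)
fraction = conn.execute(
    "SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*) "
    "FROM s AS s1, s AS s2 WHERE s1.rowid+1=s2.rowid"
).fetchone()[0]
conn.execute("DROP TABLE s")

if fraction is not None and fraction < 0.9:  # arbitrary example threshold
    conn.execute("VACUUM")
conn.close()
```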
chrisjsewell added a commit that referenced this issue on Oct 2, 2020:
This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together. In particular:

- Objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes #92
- When there are many objects to list (currently set to 9500 objects, 10x the ones we could fit in a single IN SQL statement), performing many queries is slow, so we just resort to listing all objects and doing an efficient intersection (if the hash keys are sorted, both iterators can be looped over only once - fixes #93)
- Since VACUUMing the DB is very important for efficiency, when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB (see the sketch after this message). VACUUM is also called after repacking. Fixes #94
- We now implement a function to perform a full repack of the repository (fixes #12). This is important and needed to reclaim space after deleting an object.
- For efficiency, we have moved the logic from an `export` function (still existing but deprecated) to an `import_objects` function.
- Still for efficiency, functions like `pack_all_loose` and `import_objects` now provide an option to perform an fsync to disk or not (see also #54 - there are still, however, calls that always use - or don't use - fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows choosing whether or not to commit the changes to the SQLite DB (delegating the responsibility to the caller; this is important e.g. in `import_objects`: calling `commit` only once at the very end gives a factor of 2 speedup for very big repos).
- A number of functions, including (but not exclusively) `import_objects`, provide a callback to e.g. show a progress bar.
- A `CallbackStreamWrapper` has been implemented, allowing a callback (e.g. for a progress bar) to be provided when streaming a big file.
- A new hash algorithm is now supported (`sha1`) in addition to the default `sha256` (fixes #82). This is faster, even if a bit less robust. It was also needed to fully test some features in `import_objects`, where the logic is optimised if both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because all export files that will be generated will then use this algorithm and importing will be more efficient.
- Tests have been added for all new functionality, achieving again 100% coverage.

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~50 minutes:

```
6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.
```

This can be compared to:

- ~10 minutes to copy the whole 90GB, or ~15 minutes to read all and validate the packs. We will never be able to be faster than just copying the pack files, and we are only 3x slower.
- ~2 days to just list all files in the old legacy AiiDA repo (or all objects if they are loose), and this does not take into account the time to rewrite everything, probably comparable. So we are almost 2 orders of magnitude faster than before.
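As a rough illustration of how the new clean-up option could be invoked from Python: this is a sketch, not the library's documented API; the folder path is a placeholder and the `vacuum=` keyword name is an assumption inferred from the description above.

```python
from disk_objectstore import Container

container = Container("/path/to/container")  # placeholder path to an existing container

# Remove leftover loose files/temporaries and, per the merge notes above, optionally
# VACUUM the SQLite index; the keyword name `vacuum` is assumed here.
container.clean_storage(vacuum=True)
```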
The performance when listing via the index (e.g. `SELECT hashkey from db_object ORDER BY hashkey`) depends strongly on whether the SQLite file is in the OS disk cache. E.g. test to perform the query above right after flushing the caches with `sudo su -c "echo 3 > /proc/sys/vm/drop_caches"`: it drops to ~1000it/s. Two things make it fast again:

- `cat packs.idx > /dev/null` will make all the rest of the operations (like listing ordered by hashkey) fast again (600000it/s vs 1000it/s); see the prefetch sketch right after this list.
- `VACUUM` on the DB, as this will defragment the DB and the indexes (note: the content of the SQLite file, not how it's written on the filesystem). Also, this is needed when deleting entries to reclaim space. Note that the first iteration on `SELECT hashkey from db_object ORDER BY hashkey` will run at ~270000it/s instead of the later times (with caching) when it will run at 600000it/s, but that is already fast enough. Also, looking at how much data goes into the OS disk cache, it seems it only actually needs to read and keep in the cache 550MB out of the 1.2GB SQLite file (probably the size of the index on `hashkey`).
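A minimal sketch of the pre-fetch idea from Python (equivalent to the `cat packs.idx > /dev/null` above): stream the index file once so the OS page cache is warm. The path is a placeholder, and this is only worthwhile if the file fits in RAM.

```python
def prefetch(path: str, chunk_size: int = 16 * 1024 * 1024) -> None:
    """Read the file sequentially and discard the data, to warm the OS page cache."""
    with open(path, "rb") as handle:
        while handle.read(chunk_size):
            pass

prefetch("/path/to/container/packs.idx")  # placeholder path to the container's SQLite file
```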
Moreover, `SELECT count(*) from db_object` decides to use the index on the hash key, which, if the DB is 'fragmented', is very slow as above (it's much faster to count in Python the length of `SELECT id from db_object ORDER BY id`; see the sketch at the end of this issue). Therefore we can `cat` the whole file to pre-fetch the data (but only if it all fits in RAM!), or VACUUM the DB first.

Mentioning also #92, as the performance when looping over sorted results by hash key will be strongly dependent on either caching the file, or VACUUMing it first (that, however, is slow).
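For the `count(*)` point above, a sketch of the workaround; the path is a placeholder and the queries are the ones quoted in the issue, nothing here is library code.

```python
import sqlite3

conn = sqlite3.connect("/path/to/container/packs.idx")  # placeholder path

# Slow on a fragmented DB: SQLite answers count(*) by walking the hashkey index.
# n_slow = conn.execute("SELECT count(*) FROM db_object").fetchone()[0]

# Faster in practice here: iterate the id-ordered result and count the rows in Python.
n_objects = sum(1 for _ in conn.execute("SELECT id FROM db_object ORDER BY id"))
print(n_objects)
conn.close()
```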