VACUUMing the DB and performance on listing via the index #94

Closed
giovannipizzi opened this issue Jul 31, 2020 · 2 comments · Fixed by #96
giovannipizzi commented Jul 31, 2020

  1. When the DB is created a bit at a time, the index ends up scattered across the file. On a big DB whose file is not in the OS disk cache, this causes a huge performance hit (~1000 it/s instead of ~600000 it/s) on any listing that uses the covering index on the hash key (e.g. `SELECT hashkey FROM db_object ORDER BY hashkey`). For instance, run the query right after flushing the caches with `sudo su -c "echo 3 > /proc/sys/vm/drop_caches"`:
    from disk_objectstore import Container
    import tqdm
    
    container = Container('/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/')
    session = container._get_cached_session()
    
    hashkeys_q = session.execute("SELECT hashkey FROM db_object ORDER BY hashkey")
    list(tqdm.tqdm(hashkeys_q, total=6714808))
  2. If the DB file fits in memory, just running `cat packs.idx > /dev/null` makes all subsequent operations (like listing ordered by hash key) fast again (~600000 it/s vs ~1000 it/s); a Python sketch of this pre-fetch is given after the list.
  3. A better long-term solution is to call VACUUM on the DB, as this defragments the DB and its indexes (note: the content of the SQLite file, not how it is laid out on the filesystem). VACUUM is also needed to reclaim space after deleting entries. Note that, after VACUUM, the first iteration of `SELECT hashkey FROM db_object ORDER BY hashkey` runs at ~270000 it/s instead of the ~600000 it/s of later (cached) runs, but that is already fast enough. Also, looking at how much data goes into the OS disk cache, it seems only ~550MB of the 1.2GB SQLite file actually needs to be read and kept in the cache (probably the size of the index on the hash key).
  4. Performing `SELECT count(*) FROM db_object` decides to use the index on the hash key, which, if the DB is 'fragmented', is very slow as described above (it is much faster to count, in Python, the results of `SELECT id FROM db_object ORDER BY id`). Therefore we can (see also the sketch after this list):
    • change how we count objects
    • have a trick to call `cat` on the whole file to pre-fetch the data (but only if it all fits in RAM!)
    • have a maintenance operation that calls VACUUM (probably the better choice, to combine with other maintenance operations)
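A minimal sketch of the workarounds from points 2 and 4, assuming the pack index is the `packs.idx` SQLite file inside the container folder (the exact path and file layout are assumptions here):

```python
import sqlite3

# Assumed location of the SQLite pack index inside the container folder.
DB_PATH = '/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs.idx'

# Point 2: read the whole file once so it ends up in the OS page cache
# (equivalent to `cat packs.idx > /dev/null`). Only sensible if it fits in RAM.
with open(DB_PATH, 'rb') as handle:
    while handle.read(16 * 1024 * 1024):
        pass

# Point 4: count objects via the integer primary key instead of count(*),
# which would use the (possibly fragmented) covering index on the hash key.
conn = sqlite3.connect(DB_PATH)
num_objects = sum(1 for _ in conn.execute("SELECT id FROM db_object ORDER BY id"))
print(f"Number of objects: {num_objects}")
conn.close()
```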

Mentioning also #92, as the performance when looping over results sorted by hash key will depend strongly on either caching the file or VACUUMing it first (which, however, is slow).

@giovannipizzi giovannipizzi added this to the Nice to have milestone Jul 31, 2020
giovannipizzi (Member, Author) commented:

According to this SQLite page, the following SQL can be used to estimate how 'fragmented' the DB is and if you need to call VACUUM:

CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT);
INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path;
SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*)
  FROM s AS s1, s AS s2
 WHERE s1.rowid+1=s2.rowid;
DROP TABLE s;
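A minimal sketch to run this estimate from Python against a container's SQLite file (the `packs.idx` path is an assumption; note that the `dbstat` virtual table is only available if the SQLite library was compiled with SQLITE_ENABLE_DBSTAT_VTAB):

```python
import sqlite3

# Path to the container's SQLite pack index (assumed name/location).
conn = sqlite3.connect('/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs.idx')
cur = conn.cursor()
cur.execute("CREATE TEMP TABLE s(rowid INTEGER PRIMARY KEY, pageno INT)")
cur.execute("INSERT INTO s(pageno) SELECT pageno FROM dbstat ORDER BY path")
cur.execute(
    "SELECT sum(s1.pageno+1==s2.pageno)*1.0/count(*) "
    "FROM s AS s1, s AS s2 WHERE s1.rowid+1=s2.rowid"
)
# Fraction of pages stored sequentially: close to 1.0 means little
# fragmentation, low values suggest a VACUUM is worthwhile.
print(cur.fetchone()[0])
cur.execute("DROP TABLE s")
conn.close()
```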

As a reference, here are the values on the big SDB DB (~6.8M entries, 1.2GB):

  • Before VACUUM: 0.191833845709181
  • After VACUUM: 0.882292652715071

giovannipizzi (Member, Author) commented:

To VACUUM a Container `c`, this can be done:

engine = c._get_cached_session().get_bind()
engine.execute("VACUUM")
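Note that `Engine.execute()` only exists in SQLAlchemy 1.x; here is a sketch of the same operation assuming SQLAlchemy 2.x, where the statement must run outside any implicit transaction:

```python
from sqlalchemy import text

engine = c._get_cached_session().get_bind()
# AUTOCOMMIT isolation avoids wrapping the statement in a transaction,
# since SQLite refuses to VACUUM from within one.
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(text("VACUUM"))
```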

chrisjsewell added a commit that referenced this issue Oct 2, 2020
This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- Objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes #92
- When there are many objects to list (currently set to 9500 objects, 10x the number we could fit in a single `IN` SQL statement), performing many queries is slow, so we simply resort to listing all objects and doing an efficient intersection: if the hash keys are sorted, both iterators need to be looped over only once (fixes #93; a sketch of this single-pass intersection is shown after this list).
- Since VACUUMing the DB is very important for efficiency when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB; VACUUM is also called after repacking. Fixes #94
- We now implement a function to perform a full repack of the repository (fixes #12). This is important and needed to reclaim space after deleting an object.
- For efficiency, we have moved the logic from an `export` function (still existing, but deprecated) to an `import_objects` function.
- Still for efficiency, functions like `pack_all_loose` and `import_objects` now provide an option to fsync to disk or not (see also #54; there are still, however, calls that always use, or don't use, fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows choosing whether to commit the changes to the SQLite DB or not (delegating the responsibility to the caller; this is important e.g. in `import_objects`, where calling `commit` only once at the very end gives a factor-of-2 speedup for very big repos).
- A number of functions, including (but not only) `import_objects`, accept a callback to e.g. show a progress bar.
- A `CallbackStreamWrapper` has been implemented, allowing a callback (e.g. for a progress bar) to be provided when streaming a big file.
- A new hash algorithm (`sha1`) is now supported in addition to the default `sha256` (fixes #82). It is faster, even if a bit less robust. It was also needed to fully test some features of `import_objects`, where the logic is optimised when both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because all generated export files will then use this algorithm and importing will be more efficient.
- Tests have been added for all new functionality, achieving again 100% coverage.
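As an illustration of the single-pass intersection over sorted hash keys mentioned above (for #93), here is a minimal sketch; this is not the library's actual implementation, just the idea:

```python
from typing import Iterable, Iterator

def sorted_intersection(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield the hash keys present in both inputs, assuming both are sorted.

    Each input is consumed exactly once, so it also works on results
    streamed from SQLite without materializing them in memory.
    """
    left_iter, right_iter = iter(left), iter(right)
    try:
        l_key, r_key = next(left_iter), next(right_iter)
        while True:
            if l_key == r_key:
                yield l_key
                l_key, r_key = next(left_iter), next(right_iter)
            elif l_key < r_key:
                l_key = next(left_iter)
            else:
                r_key = next(right_iter)
    except StopIteration:
        return

# Example: requested hash keys vs. the sorted keys present in the DB.
print(list(sorted_intersection(['0a1', '3b2', 'ff0'], ['0a1', '1c4', '3b2', '9e9'])))
# -> ['0a1', '3b2']
```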

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:
```
6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.
```

This can be compared to:

- ~10 minutes to copy the whole 90GB, or ~15 minutes to read and validate all the packs. We will never be faster than just copying the pack files, and we are only ~3x slower.
- ~2 days just to list all files in the old legacy AiiDA repo (or all objects, if they are loose); this does not take into account the time to rewrite everything, which is probably comparable. So we are almost two orders of magnitude faster than before.