
Implement function to do a full repack #12

Closed
giovannipizzi opened this issue Apr 13, 2020 · 8 comments · Fixed by #96
Comments

@giovannipizzi
Member

This would recreate each pack only with the known objects (e.g. if something was deleted, see also #5)

@giovannipizzi
Member Author

This becomes even more important with the hashkey implementation, because it can often happen (at least in the very first implementation I wrote in #26) that one tries to write an already existing object directly to a pack, which leaves regions in the file that are unaccounted for. This should probably be merged with #13: during a full repack, it would also make a lot of sense to first validate everything.
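
As a rough illustration of what "validate everything" could mean for a single pack (a sketch only: it assumes sha256 hash keys as in #26, the `get_object_content` accessor, and the same private session accessor used in the script in the next comment; the actual validation logic is tracked in #13):

```python
import hashlib

from disk_objectstore import Container
from disk_objectstore.models import Obj


def validate_pack_objects(container, pack_id, hash_name='sha256'):
    """Re-read every object recorded for `pack_id` and check that its content hash
    matches its hash key; raise on the first mismatch."""
    session = container._get_cached_session()
    hashkeys = [row[0] for row in
                session.query(Obj.hashkey).filter(Obj.pack_id == pack_id).all()]
    for hashkey in hashkeys:
        content = container.get_object_content(hashkey)
        if hashlib.new(hash_name, content).hexdigest() != hashkey:
            raise ValueError('Object {} is corrupt!'.format(hashkey))


if __name__ == '__main__':
    validate_pack_objects(Container('/tmp/test-container'), 0)
```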

@giovannipizzi
Member Author

For reference, here is a simple preliminary script to check which regions of a pack are unaccounted for (for now it just raises on overlaps):

```python
import os

from sqlalchemy import func

from disk_objectstore import Container
from disk_objectstore.models import Obj


def get_known_size(container, pack_id):
    """Return (total size ON DISK accounted for in the DB, actual pack file size)."""
    session = container._get_cached_session()
    query = session.query(func.sum(Obj.length).label('total_size')).filter(
        Obj.pack_id == pack_id).group_by(Obj.pack_id)
    return query.all()[0][0], os.path.getsize(container._get_pack_path_from_pack_id(pack_id))


def check_pack_holes(container, pack_id):
    """Return the list of (start, end) byte ranges of the pack not covered by any known object."""
    session = container._get_cached_session()
    known_locations = session.query(Obj).filter(
        Obj.pack_id == pack_id, Obj.length != 0).with_entities(
        Obj.offset, Obj.length).order_by(Obj.offset).all()

    #print('start known loc', known_locations[:10])
    #print('end known loc', known_locations[-10:])

    unaccounted_ranges = []

    current_pos = 0
    for offset, length in known_locations:
        if offset < current_pos:
            raise ValueError("Overlap! {} vs {}".format(offset, current_pos))
        if offset > current_pos:
            unaccounted_ranges.append((current_pos, offset))
        current_pos = offset + length

    # Unaccounted region after the final object, if any:
    total_pack_size = os.path.getsize(container._get_pack_path_from_pack_id(pack_id))
    if current_pos != total_pack_size:
        unaccounted_ranges.append((current_pos, total_pack_size))

    return unaccounted_ranges


if __name__ == "__main__":
    container = Container('/tmp/test-container')
    unaccounted_ranges = check_pack_holes(container, 0)
    known_size, file_size = get_known_size(container, 0)
    unaccounted_size = sum(end - start for start, end in unaccounted_ranges)
    #print('start unaccounted ranges', unaccounted_ranges[:10])
    #print('end unaccounted ranges', unaccounted_ranges[-10:])
    print("{} unaccounted ranges of a total of {} bytes (~{} MB)".format(
        len(unaccounted_ranges), unaccounted_size, unaccounted_size // 1024 // 1024))
    print("Pack file size: {} (~{} MB), known occupied size: {}, "
          "difference (should be the same as above): {} (~{} MB)".format(
              file_size, file_size // 1024 // 1024, known_size,
              file_size - known_size, (file_size - known_size) // 1024 // 1024))

    print("OK they are the same!" if unaccounted_size == file_size - known_size
          else "ERROR they are different!")
```

@giovannipizzi
Member Author

Note: with #85 fixed, this is less urgent, but still important:

  • for the default behaviour of add_objects_to_pack with no_holes=False (see the illustration below)
  • when deleting objects (which leave 'holes' behind)
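
As a minimal illustration of the first point (a sketch only: it assumes the `add_objects_to_pack(content_list, no_holes=False)` usage referenced in this thread plus an `init_container` call, and it reuses the `check_pack_holes` helper from the script in the earlier comment):

```python
from disk_objectstore import Container

# Hypothetical demo container, separate from any real data.
container = Container('/tmp/test-container-holes')
container.init_container(clear=True)

data = [b'some content']
container.add_objects_to_pack(data, no_holes=False)  # first write: recorded in the SQLite DB
container.add_objects_to_pack(data, no_holes=False)  # duplicate: bytes written, then left unaccounted

# check_pack_holes is the helper defined in the script of the earlier comment
print(check_pack_holes(container, 0))  # expect one unaccounted (start, end) range
```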

@giovannipizzi
Member Author

Note also that it would be ideal to:

  • rewrite a temporary new pack (say tmp-pack, or -1, since that id is never used) with the data repacked (in the same order, without the need to recompute or recompress anything)
  • validate it
  • update the DB with the new pack id and offsets
  • move the -1 pack into the place of the original pack id -> this way, a subsequent rsync would detect only the changes and be very fast
  • re-update the DB with the correct pack id
  • move away the old pack

However, the issue is that it is not obvious how to perform the last steps in a way that does not risk losing data if something unexpected happens in between, and that allows easy recovery in case of problems (there is always a moment in which both the old and the new pack file exist, and it is not clear which of the two the DB is pointing to). A rough sketch of the sequence is given below.
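
Here is that sketch, for illustration only: `write_repacked_pack`, `validate_pack` and `repoint_objects` are hypothetical placeholders (not the package API), the `.repackbackup` name mirrors the file that appears in the measurements further down, and the crash-safety concern above is deliberately not solved here.

```python
import os


def write_repacked_pack(container, pack_id, destination):
    """Placeholder: stream the known objects of `pack_id` into `destination`, in disk order."""
    raise NotImplementedError


def validate_pack(container, pack_path):
    """Placeholder: re-read and re-hash every object of the new pack (cf. #13)."""
    raise NotImplementedError


def repoint_objects(container, pack_id, new_pack_id):
    """Placeholder: update pack_id (and, when moving to the new file, offsets) in the DB."""
    raise NotImplementedError


def repack_pack_sketch(container, pack_id):
    """Illustrative sequence for the steps listed above; not crash-safe."""
    pack_path = container._get_pack_path_from_pack_id(pack_id)
    tmp_path = container._get_pack_path_from_pack_id(-1)  # the temporary '-1' pack

    write_repacked_pack(container, pack_id, tmp_path)  # same order, no recompression
    validate_pack(container, tmp_path)

    repoint_objects(container, pack_id, new_pack_id=-1)    # new pack id and offsets
    os.replace(pack_path, pack_path + '.repackbackup')     # keep the old pack aside
    os.replace(tmp_path, pack_path)                        # rsync later only sees the delta
    repoint_objects(container, -1, new_pack_id=pack_id)    # restore the correct pack id
```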

@giovannipizzi
Member Author

Please check the performance notes in #92, which also include scripts I wrote to measure performance: they are very fast and essentially already implement the core of the repacking work in quite an efficient way.

@giovannipizzi
Member Author

Important: Also call VACUUM on the SQLite DB
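
For reference, a minimal sketch of calling VACUUM on the container's SQLite index directly (assuming the index is the `packs.idx` file inside the container folder used in the examples above; VACUUM must run outside a transaction, hence `isolation_level=None`):

```python
import sqlite3

# Hypothetical path: the SQLite index of the test container used above.
db_path = '/tmp/test-container/packs.idx'

conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode, so VACUUM is allowed
conn.execute('VACUUM')  # rewrites the DB file, reclaiming space left by deleted rows
conn.close()
```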

giovannipizzi added a commit to giovannipizzi/disk-objectstore that referenced this issue Jul 29, 2020
Addresses aiidateam#12

This is already quite safe to call, but I didn't cover all corner cases (also those related to power loss etc.).
@giovannipizzi
Member Author

giovannipizzi commented Jul 29, 2020

Notes on the performance of the implementation of 53ad44c

(TL;DR: it's very good).

I tried to repack a single pack (number 0) of the big SDB DB: it contains 205454 objects, used to be 4295689889 bytes (~4 GB), and is 4178255084 bytes after repacking.

Here is the time (note that the DB was already in the OS disk cache):

In [6]: t = time.time() ; c.repack_pack(0) ; print(time.time() - t)
107.0739016532898

So, less than 2 minutes. Creating the new pack is very fast; some time is needed to update the DB, but overall it seems more than reasonable.

Notes on rsync after the repack

Trying rsync between the two files (note: I use an SSH connection to localhost because otherwise rsync does not do a delta transfer on the same machine; see e.g. the --no-whole-file option and this page).

Note also that I am doing the rsync in the opposite direction (old onto new):

rsync -v localhost:/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs/0  0
sent 517,247 bytes  received 1,147,952,445 bytes  15,211,519.10 bytes/sec
total size is 4,295,689,889  speedup is 3.74

Second time:

rsync -v localhost:/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs/0  0
0

sent 524,459 bytes  received 262,284 bytes  12,792.57 bytes/sec
total size is 4,295,689,889  speedup is 5,460.09

NOTE: it still takes quite some time: rsync checks the two files because their timestamps differ.
With the -a option (starting again from different files), the timestamp is also synced, and the second run is much faster.

rsync -av localhost:/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs/0  0
receiving incremental file list
0

sent 517,247 bytes  received 1,147,952,467 bytes  15,415,700.86 bytes/sec
total size is 4,295,689,889  speedup is 3.74

Second time:

rsync -av localhost:/scratch/TEST-DISK-OBJECTSTORE/test-newrepo-sdb/packs/0  0
sent 20 bytes  received 55 bytes  16.67 bytes/sec
total size is 4,295,689,889  speedup is 57,275,865.19

(i.e. it's instantaneous)

Rsync in the 'right' direction

For completeness, here are the results when doing the rsync in the 'right' direction (overwriting the old file with the new repacked one); they are very similar to those above:

rsync -av localhost:/scratch/TEST-DISK-OBJECTSTORE/test-repack-sdb/packs/0.repackbackup  0
receiving incremental file list
0.repackbackup

sent 524,459 bytes  received 1,064,887,442 bytes  14,695,336.57 bytes/sec
total size is 4,178,255,084  speedup is 3.92

Second time (instantaneous):

pizzi@theospc36:/scratch/TEST-DISK-OBJECTSTORE/test-repack-sdb/packs$ rsync -av localhost:/scratch/TEST-DISK-OBJECTSTORE/test-repack-sdb/packs/0.repackbackup  0
receiving incremental file list

sent 20 bytes  received 68 bytes  19.56 bytes/sec
total size is 4,178,255,084  speedup is 47,480,171.41

@giovannipizzi
Member Author

This issue is fixed by 53ad44c and will be closed when that is merged (it is now in a branch together with other related performance issues, see #92, because to make this routine performant we follow the patterns discussed there).

chrisjsewell added a commit that referenced this issue Oct 2, 2020
This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes #92 
- When there are many objects to list (currently set to 9500 objects, 10x the number we could fit in a single SQL IN statement), performing many queries is slow, so we just resort to listing all objects and doing an efficient intersection (if the hash keys are sorted, both iterators can be looped over only once; see the sketch after this list - fixes #93)
- Since VACUUMing the DB is very important for efficiency when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB. VACUUM is also called after repacking. Fixes #94
- We now implement a function to perform a full repack of the repository (fixes #12). This is important and needed to reclaim space after deleting objects
- For efficiency, we have moved the logic from an `export` function (still existing but deprecated) to an `import_objects` function
- Still for efficiency, functions like `pack_all_loose` and `import_objects` now provide an option to perform an fsync to disk or not (see also #54 - there are still, however, calls that always use - or don't use - fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows one to choose whether to commit the changes to the SQLite DB or not (delegating the responsibility to the caller; this is important e.g. in `import_objects`: calling `commit` only once at the very end gives a factor of 2 speedup for very big repos).
- A number of functions, including (but not exclusively) `import_objects`, provide a callback to e.g. show a progress bar.
- a `CallbackStreamWrapper` has been implemented, allowing one to provide a callback (e.g. for a progress bar) when streaming a big file.
- a new hash algorithm (`sha1`) is now supported in addition to the default `sha256` (fixes #82). This is faster, even if a bit less robust. This was also needed to fully test some features of `import_objects`, where the logic is optimised when both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because all export files that will be generated will then use this algorithm and importing will be more efficient.
- tests have been added for all new functionality, achieving again 100% coverage
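
As an illustration of the sorted-merge intersection mentioned in the list above (a sketch of the idea only, not the package's actual implementation):

```python
def sorted_intersection(left, right):
    """Yield the hash keys present in both sorted iterables, looping over each only once."""
    left_iter, right_iter = iter(left), iter(right)
    try:
        l_item, r_item = next(left_iter), next(right_iter)
        while True:
            if l_item == r_item:
                yield l_item
                l_item, r_item = next(left_iter), next(right_iter)
            elif l_item < r_item:
                l_item = next(left_iter)
            else:
                r_item = next(right_iter)
    except StopIteration:
        return


# Example with two sorted lists of (shortened) hash keys:
print(list(sorted_intersection(['0a1', '3bc', '9ff'], ['3bc', '7de', '9ff'])))  # ['3bc', '9ff']
```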

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:
```
6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.
```

This can be compared to:

- ~10 minutes to copy the whole 90GB, or ~15 minutes to read all and validate the packs. We will never be able to be faster than just copying the pack files, and we are only 3x slower.
- ~2 days just to list all files in the old legacy AiiDA repo (or all objects, if they are loose); this does not take into account the time to rewrite everything, which is probably comparable. So we are almost 2 orders of magnitude faster than before.