
Collection of a number of important efficiency improvements #96

Merged
merged 12 commits into develop from fix_92_efficiency_big
Oct 2, 2020

Conversation

giovannipizzi
Member

This PR collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

  • Objects are now sorted and returned in the order in which they are stored on disk, with an important performance benefit. Fixes Efficiency issue when looping over many objects #92
  • When there are many objects to list (currently more than 9500, i.e. 10x the number of hash keys we can fit in a single `IN` SQL statement), issuing many queries is slow, so we simply list all objects and compute an efficient intersection (if the hash keys are sorted, both iterators need to be looped over only once; see the sketch after this list). Fixes Efficient intersections and getting all elements #93
  • Since VACUUMing the DB is very important for efficiency when the DB does not fit fully in the disk cache, clean_storage now provides an option to VACUUM the DB. VACUUM is also called after repacking. Fixes VACUUMing the DB and performance on listing via the index #94
  • We now implement a function to perform a full repack of the repository (fixes Implement function to do a full repack #12). This is important and needed to reclaim space after deleting an object.
  • For efficiency, we have moved the logic from the export function (still existing, but deprecated) to a new import_objects function.
  • Still for efficiency, functions like pack_all_loose and import_objects now provide an option to fsync to disk or not (see also Decide if we should call fullsync on Mac (and if we should expose the decision to sync to the user) #54 - there are still, however, calls that always use - or don't use - fsync and full_fsync on Mac). Also, add_objects_to_pack now allows choosing whether or not to commit the changes to the SQLite DB (delegating that responsibility to the caller; this is important e.g. in import_objects, where calling commit only once at the very end gives a factor-of-2 speedup for very big repos).
  • A number of functions, including (but not exclusively) import_objects, accept a callback, e.g. to show a progress bar.
  • A CallbackStreamWrapper has been implemented, allowing a callback (e.g. for a progress bar) to be provided when streaming a big file.
  • A new hash algorithm (sha1) is now supported in addition to the default sha256 (fixes Add support for multiple hashing algorithms? #82). It is faster, even if a bit less robust. This was also needed to fully test some features of import_objects, whose logic is optimised when both containers use the same algorithm. By default it is still better to use sha256 everywhere, also because all export files that will be generated will then use this algorithm and importing will be more efficient.
  • Tests have been added for all new functionality, again achieving 100% coverage.
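
As a rough illustration of the single-pass idea mentioned in the list above (a minimal sketch; sorted_intersection and the variable names are mine, not the actual helper in disk_objectstore.utils):

def sorted_intersection(left, right):
    """Yield the items present in both sorted, duplicate-free iterables.

    Each iterator is advanced at most once past every element, so computing
    the whole intersection costs a single pass over both inputs.
    """
    left_iter, right_iter = iter(left), iter(right)
    try:
        l_item, r_item = next(left_iter), next(right_iter)
        while True:
            if l_item == r_item:
                yield l_item
                l_item, r_item = next(left_iter), next(right_iter)
            elif l_item < r_item:
                # advance only the iterator that lags behind
                l_item = next(left_iter)
            else:
                r_item = next(right_iter)
    except StopIteration:
        return  # one input is exhausted: no further common items are possible

# e.g. list(sorted_intersection(['a1', 'b2', 'c3'], ['b2', 'c3', 'd4'])) == ['b2', 'c3']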

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes:

6714808it [00:24, 274813.02it/s]
All hashkeys listed in 24.444787740707397s.
Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s]
Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s]
Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s]
Everything re-exported in 2960.980943918228s.

I think this is an excellent result. This can be compared to:

  • ~10 minutes to copy the whole 90GB, or ~15 minutes to read all and validate the packs. We will never be able to be faster than just copying the pack files, and we are only 3x slower.
  • ~2 days to just list all files in the old legacy AiiDA repo (or all objects, if they are loose), and this does not take into account the time to rewrite everything, which is probably comparable. So we are almost 2 orders of magnitude faster than before.

The main reason why functions like `get_objects_stream*` do not
return objects in the same order as requested is efficiency: this is achieved
by reorganising the objects per pack, and reading each pack file in sequential
order.

During the code changes, this was no longer happening.
This now ensures that packed objects are returned in disk order (a test has also been added).
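
A minimal sketch of the idea (the names are illustrative, not the actual internals of the container): group the requested hash keys by pack and sort them by offset, so that every pack file is opened once and read front to back:

from collections import defaultdict

def iterate_in_disk_order(metadata):
    """`metadata` maps hashkey -> (pack_id, offset, length), e.g. as read from the SQLite index.

    Yield (hashkey, pack_id, offset, length) pack by pack, by increasing offset,
    so that each pack file is read sequentially instead of seeking back and forth.
    """
    per_pack = defaultdict(list)
    for hashkey, (pack_id, offset, length) in metadata.items():
        per_pack[pack_id].append((offset, length, hashkey))
    for pack_id in sorted(per_pack):
        for offset, length, hashkey in sorted(per_pack[pack_id]):
            yield hashkey, pack_id, offset, length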

Moreover, adding a lot of objects one by one is much less efficient than
doing a bulk insert with the lower-level, non-ORM functionality of SQLAlchemy.
For this reason, all data is now committed in bulk to the DB at the end of each
pack file. Some comments on the performance are in issue aiidateam#92.
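
For reference, a hedged sketch of what such a non-ORM bulk insert looks like with SQLAlchemy (the Obj model, the column names and the DB path below are illustrative, not the actual ones in this package):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Obj(Base):  # illustrative model, not the real one
    __tablename__ = "db_object"
    id = Column(Integer, primary_key=True)
    hashkey = Column(String, unique=True, nullable=False)
    offset = Column(Integer, nullable=False)
    length = Column(Integer, nullable=False)

engine = create_engine("sqlite:///packs.idx")  # hypothetical index path
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

rows = [
    {"hashkey": "ab12...", "offset": 0, "length": 123},
    {"hashkey": "cd34...", "offset": 123, "length": 456},
]
# A single executemany-style INSERT instead of thousands of ORM flushes,
# followed by a single commit at the very end.
session.execute(Obj.__table__.insert(), rows)
session.commit()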
It now loops over the objects to export
Addresses aiidateam#12

This is already quite safe to call, but I didn't cover all corner
cases (also related to power loss etc.).
Now we define an `import_objects` function instead, which
(when importing from a container using the same hashing algorithm)
can be much more efficient: it decides beforehand which objects to
import, and then imports only those.
This uses the efficient method of aiidateam#93 to iterate only once through two
sorted unique lists to get their intersection or left/right difference.

The performance still needs to be benchmarked.
The `export` function might be deprecated if this is more efficient.
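
A hedged sketch of the decision step (the function and variable names are mine; the real `import_objects` works on streams and packs rather than plain lists, and this shortcut only applies when both containers use the same hash algorithm):

def keys_missing_in_destination(source_hashkeys, dest_hashkeys):
    """Both inputs are sorted, duplicate-free iterables of hash keys.

    Yield the keys present in the source but not in the destination
    (the "left difference"), in a single pass over both inputs.
    """
    dest_iter = iter(dest_hashkeys)
    dest_key = next(dest_iter, None)
    for src_key in source_hashkeys:
        while dest_key is not None and dest_key < src_key:
            dest_key = next(dest_iter, None)
        if dest_key != src_key:
            yield src_key  # not in the destination: this object must be imported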
This adds a private `_vacuum()` operation and calls it:
- optionally (False by default) in `clean_storage()`
- at the end of a full repack of all packs
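
For reference, a minimal sketch of such a helper using the standard sqlite3 module (the function name and the path argument are mine, not the actual implementation):

import sqlite3

def _vacuum(index_path):
    # VACUUM rebuilds the whole database file and cannot run inside a
    # transaction, so the connection is opened in autocommit mode.
    connection = sqlite3.connect(index_path, isolation_level=None)
    try:
        connection.execute("VACUUM")
    finally:
        connection.close()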
…many hash keys

Above a certain threshold, it is just faster to iterate over all items rather than
making SQL `IN` queries on chunks of `_IN_SQL_MAX_LENGTH = 950` hash keys (e.g. if we
have millions of objects).
For now the threshold is set to `_MAX_CHUNK_ITERATE_LENGTH = 9500`, but this needs to be
benchmarked.

Also, this is still missing tests and a real benchmark.
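
A hedged sketch of the threshold logic (the table and column names are illustrative, not the actual schema):

_IN_SQL_MAX_LENGTH = 950
_MAX_CHUNK_ITERATE_LENGTH = 9500

def query_metadata(conn, hashkeys):
    """Yield (hashkey, offset, length) rows for the given hash keys.

    `conn` is an sqlite3 connection to the pack index.
    """
    hashkeys = sorted(hashkeys)
    if len(hashkeys) <= _MAX_CHUNK_ITERATE_LENGTH:
        # Few keys: issue IN queries on chunks of at most 950 bound parameters.
        for start in range(0, len(hashkeys), _IN_SQL_MAX_LENGTH):
            chunk = hashkeys[start:start + _IN_SQL_MAX_LENGTH]
            placeholders = ", ".join("?" * len(chunk))
            yield from conn.execute(
                "SELECT hashkey, offset, length FROM db_object "
                "WHERE hashkey IN ({})".format(placeholders),
                chunk,
            )
    else:
        # Many keys: scan the whole table once, in sorted order, and
        # intersect with the requested keys in Python.
        wanted = set(hashkeys)
        query = "SELECT hashkey, offset, length FROM db_object ORDER BY hashkey"
        for row in conn.execute(query):
            if row[0] in wanted:
                yield row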
… writing methods

These changes are motivated by benchmarks on a large repository ("SDB")
with 6.8M nodes.
- Added a callback_instance fixture to facilitate tests
- Added complete coverage of the new callback-related functions
  in the utils module
- Started to implement coverage of some callback-related functions
  in the container module.
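
For illustration, this is roughly what a progress-bar callback could look like, assuming a callback(action, value) protocol with 'init' / 'update' / 'close' actions (this protocol and the value layout are my reading of the PR, so treat the sketch as an assumption):

from tqdm import tqdm

class TqdmCallback:
    """Hypothetical adapter turning callback calls into a tqdm progress bar."""

    def __init__(self):
        self._bar = None

    def __call__(self, action, value):
        if action == "init":
            # value is assumed to be a dict with 'total' and 'description'
            self._bar = tqdm(total=value["total"], desc=value.get("description", ""))
        elif action == "update":
            self._bar.update(value)
        elif action == "close":
            self._bar.close()
            self._bar = None

# e.g. container.import_objects(other_container, callback=TqdmCallback())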
The addition of sha1 was required to get full coverage, as there
is a part of the code (in import_objects) that behaves differently depending on
whether the hash algorithms of the two containers are the same or different
(in order to optimise speed when the algorithm is the same).
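
For illustration, supporting several algorithms essentially boils down to mapping the configured algorithm name to a hashlib constructor (a sketch, not the actual get_hash_cls implementation):

import hashlib

def get_hash_cls(hash_type):
    # Map the container's configured algorithm name to a hashlib constructor.
    known = {"sha256": hashlib.sha256, "sha1": hashlib.sha1}
    try:
        return known[hash_type]
    except KeyError:
        raise ValueError("Unknown hash type '{}'".format(hash_type))

hasher = get_hash_cls("sha1")()
hasher.update(b"some object content")
print(hasher.hexdigest())  # the hash key under which the object would be stored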
@giovannipizzi
Member Author

Anybody brave enough to review this PR? :-)
I'm happy to explain the changes in a call, if it helps.

@codecov

codecov bot commented Aug 26, 2020

Codecov Report

Merging #96 into develop will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           develop       #96    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            7         7            
  Lines         1199      1502   +303     
==========================================
+ Hits          1199      1502   +303     
Impacted Files                    Coverage Δ
disk_objectstore/__init__.py      100.00% <100.00%> (ø)
disk_objectstore/container.py     100.00% <100.00%> (ø)
disk_objectstore/utils.py         100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@giovannipizzi
Member Author

@chrisjsewell If possible I prefer merging (rather than squashing or rebasing) since I've been referencing the individual commits in multiple issues

@chrisjsewell
Member

chrisjsewell commented Sep 29, 2020

Heya, gradually getting through all the code, I'm just going to update notes here as I go

  • mention container.clean_storage() in basic usage

  • add simple CLI

  • delete empty folders after container.clean_storage()

  • as a standalone library, integrated metadata storage would be ideal (e.g. file format, file path)

  • drop python 3.5

  • use type annotations (and add mypy CI + py.typed file)

  • use * before keyword arguments

  • use pathlib over os.path

  • specify "utf8" for config JSON read/write?

  • checking the size of the pack every time is maybe not performant: can we memoize full packs? (I guess you'd have to be careful to remove a pack from the set, if you removed anything from it):

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4833  0.4833  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0484  0.0484  0.4670  0.4670  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0180  0.0000  0.2115  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1005  0.0000  0.1005  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0145  0.0000  0.0799  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
  • _get_objects_stream_meta_generator should yield a named tuple or dataclass (see https://pypi.org/project/dataclasses/ and https://github.com/python-attrs/attrs); also the meta key should be a dataclass rather than a dict (see the sketch after this list).

  • and actually anywhere you are returning dicts, it would be better to return a dataclass (I'm a big fan of TypeScript!), like count_objects and get_total_size

  • It feels like _get_objects_stream_meta_generator could be refactored somewhat to remove the duplication (pytest is right, it is too complex lol). Perhaps the number of retries for loose_not_found should be configurable.

  • # I need to use a comma because I want to create a tuple
    loose_objects.difference_update((hashkey,))
    just use a set 🤷 {hashkey}

  • change get_hash -> get_hash_cls
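
As an illustration of the dataclass suggestion above (the field names are a guess based on the meta dict discussed in this PR, not the actual keys):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectMeta:
    """Illustrative replacement for the 'meta' dict returned by the stream generators."""
    type: str                      # e.g. 'loose' or 'packed'
    size: int                      # uncompressed size in bytes
    pack_id: Optional[int] = None  # only set for packed objects
    pack_compressed: Optional[bool] = None

meta = ObjectMeta(type="packed", size=1024, pack_id=0, pack_compressed=True)
print(meta.size)  # attribute access (and type checking) instead of meta['size']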

Documentation notes:

Current experience (with AiiDA) shows that it's actually not so good to use two
levels of nesting.

out of interest, why?

the move operation should be hopefully a fast atomic operation on most filesystems

are there any you know of where this is not the case?

add tox.ini:

[tox]
envlist = py37

[testenv]
usedevelop=True

[testenv:py{35,36,37,38}]
description = run pytest
extras = dev
commands = pytest {posargs:--cov=disk_objectstore}

[testenv:py{36,37,38}-pre-commit]
description = run pre-commit
extras = dev
commands = pre-commit {posargs:run}

[testenv:py{35,36,37,38}-ipython]
description = start an ipython console
deps =
    ipython
commands = ipython

Profiling snapshot (for later discussion):

$ tox -e py37 -- --benchmark-only --benchmark-cprofile=cumtime

tests/test_benchmark.py::test_has_objects
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0048  0.0048  0.4209  0.4209  disk-objectstore/disk_objectstore/container.py:778(has_objects)
10001   0.0322  0.0000  0.4123  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
6       0.0000  0.0000  0.0896  0.0149  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3503(__iter__)
6       0.0000  0.0000  0.0886  0.0148  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3528(_execute_and_instances)
8       0.0000  0.0000  0.0874  0.0109  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/base.py:943(execute)
7       0.0000  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/sql/elements.py:296(_execute_on_connection)
7       0.0001  0.0000  0.0872  0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages


tests/test_benchmark.py::test_list_all_loose
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_list_all_packed
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.0000  0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_loose_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0019  0.0019  0.1003  0.1003  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
1001    0.0046  0.0000  0.0927  0.0001  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
1000    0.0321  0.0000  0.0321  0.0000  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_read
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0089  0.0089  0.1700  0.1700  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
10001   0.0300  0.0000  0.1407  0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
10001   0.0146  0.0000  0.0677  0.0000  disk-objectstore/disk_objectstore/utils.py:922(detect_where_sorted)
20004   0.0051  0.0000  0.0498  0.0000  ~:0(<built-in method builtins.next>)
10001   0.0030  0.0000  0.0447  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1006(__iter__)
10001   0.0071  0.0000  0.0417  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1320(fetchone)
20000   0.0099  0.0000  0.0253  0.0000  disk-objectstore/disk_objectstore/utils.py:385(_update_pos)
10000   0.0093  0.0000  0.0245  0.0000  disk-objectstore/disk_objectstore/utils.py:365(__init__)
10001   0.0028  0.0000  0.0237  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1213(_fetchone_impl)
10001   0.0209  0.0000  0.0209  0.0000  ~:0(<method 'fetchone' of 'sqlite3.Cursor' objects>)
10000   0.0059  0.0000  0.0204  0.0000  disk-objectstore/disk_objectstore/utils.py:400(read)
20000   0.0154  0.0000  0.0154  0.0000  ~:0(<method 'tell' of '_io.BufferedReader' objects>)
10000   0.0080  0.0000  0.0109  0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_loose_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0018  0.0018  0.5730  0.5730  disk-objectstore/tests/test_benchmark.py:28(write_loose)
1000    0.0039  0.0000  0.5710  0.0006  disk-objectstore/disk_objectstore/container.py:842(add_object)
1000    0.0051  0.0000  0.5671  0.0006  disk-objectstore/disk_objectstore/container.py:851(add_streamed_object)
1000    0.0150  0.0000  0.4302  0.0004  disk-objectstore/disk_objectstore/utils.py:158(__exit__)
1000    0.0079  0.0000  0.2387  0.0002  disk-objectstore/disk_objectstore/utils.py:852(safe_flush_to_disk)
2000    0.0052  0.0000  0.1639  0.0001  disk-objectstore/disk_objectstore/utils.py:870(<lambda>)
2000    0.1587  0.0001  0.1587  0.0001  ~:0(<built-in method posix.fsync>)
1000    0.0051  0.0000  0.1083  0.0001  disk-objectstore/disk_objectstore/utils.py:141(__enter__)
2000    0.1044  0.0001  0.1044  0.0001  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_write
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0.0000  0.0000  0.4961  0.4961  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
1       0.0522  0.0522  0.4945  0.4945  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
10001   0.0193  0.0000  0.2252  0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
20003   0.1072  0.0000  0.1072  0.0000  ~:0(<built-in method posix.stat>)
10002   0.0150  0.0000  0.0848  0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)

@chrisjsewell chrisjsewell merged commit 64326c7 into aiidateam:develop Oct 2, 2020
@giovannipizzi giovannipizzi deleted the fix_92_efficiency_big branch November 11, 2020 14:35