Collection of a number of important efficiency improvements #96
Conversation
Functions like `get_objects_stream*` intentionally do not return objects in the requested order: for efficiency, objects are reorganised per pack so that each pack file is read sequentially. During the code changes this was no longer happening; this ensures again that packed objects are returned in disk order (a test has also been added). Moreover, adding a lot of objects one by one is much less efficient than a bulk insert using the lower-level, non-ORM functionality of SQLAlchemy. For this reason, all data is now committed to the DB in bulk at the end of each pack file. Some comments on the performance are in issue aiidateam#92.
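As a hedged illustration of this bulk-insert approach (the table and column names below are assumptions for the example, not the actual disk-objectstore schema):

```python
# Sketch: bulk-insert pack-index rows with SQLAlchemy Core instead of adding
# ORM instances one by one. Rows are accumulated while the pack file is
# written, then inserted (and committed) once per pack file.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("sqlite:///packs.idx")  # hypothetical index file
metadata = MetaData()
db_object = Table(
    "db_object", metadata,  # illustrative schema
    Column("id", Integer, primary_key=True),
    Column("hashkey", String, unique=True),
    Column("offset", Integer),
    Column("length", Integer),
    Column("pack_id", Integer),
)
metadata.create_all(engine)

rows = [
    {"hashkey": "ab12...", "offset": 0, "length": 10, "pack_id": 0},
    {"hashkey": "cd34...", "offset": 10, "length": 24, "pack_id": 0},
]
with engine.begin() as conn:  # one transaction, one commit per pack file
    conn.execute(db_object.insert(), rows)  # executemany-style bulk insert
```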
Callback tests still missing
It now loops over the objects to export
Addresses aiidateam#12. This is already quite safe to call, but I haven't covered all the corner cases (e.g. those related to power loss).
Now we define an `import_objects` function instead that, when importing from a container using the same hashing algorithm, can be much more efficient: it decides beforehand which objects to import, and then imports only those. This uses the efficient method of aiidateam#93, iterating only once through two sorted unique lists to get their intersection or left/right difference. The performance still needs to be benchmarked; the `export` function might be deprecated if this proves more efficient.
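The single-pass idea is sketched below; the real `detect_where_sorted` in `disk_objectstore/utils.py` (visible in the profiling output later in this thread) almost certainly differs in its exact signature and return values:

```python
# Sketch: walk two sorted, duplicate-free iterables once, classifying each
# key as present only on the left, only on the right, or in both. This
# yields the intersection and left/right differences in one O(n+m) pass.
def detect_where_sorted(left, right):
    """Yield (key, where), with `where` in {'leftonly', 'rightonly', 'both'}."""
    left_iter, right_iter = iter(left), iter(right)
    l, r = next(left_iter, None), next(right_iter, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l < r):
            yield l, "leftonly"
            l = next(left_iter, None)
        elif l is None or r < l:
            yield r, "rightonly"
            r = next(right_iter, None)
        else:  # equal keys: present in both containers
            yield l, "both"
            l, r = next(left_iter, None), next(right_iter, None)

# Objects to import are those present only in the source container:
to_import = [key for key, where in detect_where_sorted(["a", "b", "d"], ["b", "c"])
             if where == "leftonly"]  # -> ["a", "d"]
```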
I both add a private `_vacuum()` operation, and I call it:
- optionally (False by default) in `clean_storage()`
- at the end of a full repack of all packs
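A minimal sketch of what such a `_vacuum()` helper can look like, assuming the pack index is a plain SQLite file (the real implementation may go through SQLAlchemy instead):

```python
import sqlite3

def _vacuum(db_path: str) -> None:
    """Rewrite the SQLite DB file, reclaiming free pages left by deletions."""
    conn = sqlite3.connect(db_path)
    try:
        # VACUUM cannot run inside a transaction; sqlite3 executes it directly.
        conn.execute("VACUUM")
    finally:
        conn.close()
```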
…many hash keys. Above a certain threshold, it is simply faster to iterate over all items rather than issuing SQL `IN` queries on chunks of `_IN_SQL_MAX_LENGTH = 950` (e.g. if we have millions of objects). For now the threshold is set as `_MAX_CHUNK_ITERATE_LENGTH = 9500`, but this needs to be benchmarked. Tests and a real benchmark are also still missing.
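To make the trade-off concrete, here is a hedged sketch; the constants come from the commit message, but the helper name and the query style are assumptions:

```python
_IN_SQL_MAX_LENGTH = 950          # max terms per SQL `IN (...)` clause
_MAX_CHUNK_ITERATE_LENGTH = 9500  # above this, scan the whole table instead

def find_existing_hashkeys(session, hashkeys, model):
    """Return the subset of `hashkeys` already present in the DB."""
    if len(hashkeys) <= _MAX_CHUNK_ITERATE_LENGTH:
        # Few keys: issue chunked `IN` queries.
        found = set()
        for start in range(0, len(hashkeys), _IN_SQL_MAX_LENGTH):
            chunk = hashkeys[start:start + _IN_SQL_MAX_LENGTH]
            found.update(
                row[0]
                for row in session.query(model.hashkey).filter(model.hashkey.in_(chunk))
            )
        return found
    # Millions of keys: a single sequential scan of the table beats
    # issuing thousands of chunked `IN` queries.
    wanted = set(hashkeys)
    return {row[0] for row in session.query(model.hashkey) if row[0] in wanted}
```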
… writing methods. These changes are motivated by benchmarks on a large repository ("SDB") with 6.8M nodes.
- Added a `callback_instance` fixture to facilitate tests
- Added complete coverage of the new callback-related functions in the `utils` module
- Started to implement coverage of some callback-related functions in the `container` file
The addition of sha1 was required to get full coverage: there is a part of the code (in `import_objects`) that differs depending on whether the hash algorithms of the two containers are the same or different (in order to optimize speed when the algorithm is the same).
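For context, re-hashing is what the different-algorithm path has to pay for each candidate object; a small, hedged illustration (not the library's code):

```python
import hashlib

def rehash_stream(stream, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Re-compute an object's hash key under a different algorithm.

    Needed only when source and destination containers use different
    algorithms; with identical algorithms the hash keys are directly
    comparable and no content needs to be re-read.
    """
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()
```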
Anybody brave enough to review this PR? :-)
Codecov Report

```
@@            Coverage Diff            @@
##           develop      #96    +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files            7         7
  Lines         1199      1502    +303
==========================================
+ Hits          1199      1502    +303
```

Continue to review full report at Codecov.
@chrisjsewell If possible I prefer merging (rather than squashing or rebasing), since I've been referencing the individual commits in multiple issues.
Heya, I'm gradually getting through all the code; I'll just update notes here as I go.
```
tests/test_benchmark.py::test_pack_write
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0000   0.0000   0.4833   0.4833  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
      1   0.0484   0.0484   0.4670   0.4670  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
  10001   0.0180   0.0000   0.2115   0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
  20003   0.1005   0.0000   0.1005   0.0000  ~:0(<built-in method posix.stat>)
  10002   0.0145   0.0000   0.0799   0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
```
Documentation notes:
out of interest, why?
are there any you know of where this is not the case?

Add a `tox.ini`:

```ini
[tox]
envlist = py37

[testenv]
usedevelop = True

[testenv:py{35,36,37,38}]
description = run pytest
extras = dev
commands = pytest {posargs:--cov=disk_objectstore}

[testenv:py{36,37,38}-pre-commit]
description = run pre-commit
extras = dev
commands = pre-commit {posargs:run}

[testenv:py{35,36,37,38}-ipython]
description = start an ipython console
deps =
    ipython
commands = ipython
```
Profiling snapshot (for later discussion):

```console
$ tox -e py37 -- --benchmark-only --benchmark-cprofile=cumtime
```

```
tests/test_benchmark.py::test_has_objects
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0048   0.0048   0.4209   0.4209  disk-objectstore/disk_objectstore/container.py:778(has_objects)
  10001   0.0322   0.0000   0.4123   0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
      6   0.0000   0.0000   0.0896   0.0149  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3503(__iter__)
      6   0.0000   0.0000   0.0886   0.0148  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/orm/query.py:3528(_execute_and_instances)
      8   0.0000   0.0000   0.0874   0.0109  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/base.py:943(execute)
      7   0.0000   0.0000   0.0872   0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/sql/elements.py:296(_execute_on_connection)
      7   0.0001   0.0000   0.0872   0.0125  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_list_all_loose
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0000   0.0000   0.0000   0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_list_all_packed
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0000   0.0000   0.0000   0.0000  ~:0(<method 'disable' of '_lsprof.Profiler' objects>)

tests/test_benchmark.py::test_loose_read
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0019   0.0019   0.1003   0.1003  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
   1001   0.0046   0.0000   0.0927   0.0001  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
   1000   0.0321   0.0000   0.0321   0.0000  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_read
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0089   0.0089   0.1700   0.1700  disk-objectstore/disk_objectstore/container.py:803(get_objects_content)
  10001   0.0300   0.0000   0.1407   0.0000  disk-objectstore/disk_objectstore/container.py:451(_get_objects_stream_meta_generator)
  10001   0.0146   0.0000   0.0677   0.0000  disk-objectstore/disk_objectstore/utils.py:922(detect_where_sorted)
  20004   0.0051   0.0000   0.0498   0.0000  ~:0(<built-in method builtins.next>)
  10001   0.0030   0.0000   0.0447   0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1006(__iter__)
  10001   0.0071   0.0000   0.0417   0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1320(fetchone)
  20000   0.0099   0.0000   0.0253   0.0000  disk-objectstore/disk_objectstore/utils.py:385(_update_pos)
  10000   0.0093   0.0000   0.0245   0.0000  disk-objectstore/disk_objectstore/utils.py:365(__init__)
  10001   0.0028   0.0000   0.0237   0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages/sqlalchemy/engine/result.py:1213(_fetchone_impl)
  10001   0.0209   0.0000   0.0209   0.0000  ~:0(<method 'fetchone' of 'sqlite3.Cursor' objects>)
  10000   0.0059   0.0000   0.0204   0.0000  disk-objectstore/disk_objectstore/utils.py:400(read)
  20000   0.0154   0.0000   0.0154   0.0000  ~:0(<method 'tell' of '_io.BufferedReader' objects>)
  10000   0.0080   0.0000   0.0109   0.0000  disk-objectstore/.tox/py37/lib/python3.7/site-packages

tests/test_benchmark.py::test_loose_write
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0018   0.0018   0.5730   0.5730  disk-objectstore/tests/test_benchmark.py:28(write_loose)
   1000   0.0039   0.0000   0.5710   0.0006  disk-objectstore/disk_objectstore/container.py:842(add_object)
   1000   0.0051   0.0000   0.5671   0.0006  disk-objectstore/disk_objectstore/container.py:851(add_streamed_object)
   1000   0.0150   0.0000   0.4302   0.0004  disk-objectstore/disk_objectstore/utils.py:158(__exit__)
   1000   0.0079   0.0000   0.2387   0.0002  disk-objectstore/disk_objectstore/utils.py:852(safe_flush_to_disk)
   2000   0.0052   0.0000   0.1639   0.0001  disk-objectstore/disk_objectstore/utils.py:870(<lambda>)
   2000   0.1587   0.0001   0.1587   0.0001  ~:0(<built-in method posix.fsync>)
   1000   0.0051   0.0000   0.1083   0.0001  disk-objectstore/disk_objectstore/utils.py:141(__enter__)
   2000   0.1044   0.0001   0.1044   0.0001  ~:0(<built-in method io.open>)

tests/test_benchmark.py::test_pack_write
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1   0.0000   0.0000   0.4961   0.4961  disk-objectstore/disk_objectstore/container.py:1556(add_objects_to_pack)
      1   0.0522   0.0522   0.4945   0.4945  disk-objectstore/disk_objectstore/container.py:1297(add_streamed_objects_to_pack)
  10001   0.0193   0.0000   0.2252   0.0000  disk-objectstore/disk_objectstore/container.py:242(_get_pack_id_to_write_to)
  20003   0.1072   0.0000   0.1072   0.0000  ~:0(<built-in method posix.stat>)
  10002   0.0150   0.0000   0.0848   0.0000  disk-objectstore/disk_objectstore/container.py:226(_get_pack_path_from_pack_id)
```
This PR collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together.

In particular:

- `clean_storage` now provides an option to VACUUM the DB; VACUUM is also called after repacking. Fixes "VACUUMing the DB and performance on listing via the index" #94.
- Changed the `export` function (still existing, but deprecated) to an `import_objects` function.
- `pack_all_loose` and `import_objects` provide an option to perform an fsync to disk or not (see also "Decide if we should call `fullsync` on Mac (and if we should expose the decision to sync to the user)" #54; there are still, however, calls that always use, or don't use, fsync and full_fsync on Mac). Also, `add_objects_to_pack` now allows choosing whether or not to commit the changes to the SQLite DB (delegating the responsibility to the caller; this is important e.g. in `import_objects`, where calling `commit` only once at the very end gives a factor-of-2 speedup for very big repos).
- `import_objects` provides a callback, e.g. to show a progress bar.
- A `CallbackStreamWrapper` has been implemented, allowing a callback (e.g. for a progress bar) to be provided when streaming a big file; see the sketch after this list.
- Added support for a second hash algorithm (`sha1`) in addition to the default `sha256` (fixes "Add support for multiple hashing algorithms?" #82). This is faster, even if a bit less robust. It was also needed to fully test a part of `import_objects` whose logic is optimised when both containers use the same algorithm. By default it is still better to use `sha256` everywhere, also because all export files that will be generated will then use this algorithm, and importing will be more efficient.

As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~50 minutes:
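A minimal sketch of the `CallbackStreamWrapper` idea mentioned in the list above; the real class in `disk_objectstore.utils` may use a different interface, and the `action`/`value` callback convention below is an assumption:

```python
class CallbackStreamWrapper:
    """Wrap a readable stream and report consumed bytes to a callback."""

    def __init__(self, stream, callback=None, total_length=0):
        self._stream = stream
        self._callback = callback
        if self._callback:
            # Announce a new progress "task" with its expected total size.
            self._callback(action="init", value={"total": total_length})

    def read(self, size=-1):
        data = self._stream.read(size)
        if self._callback and data:
            self._callback(action="update", value=len(data))  # bytes just read
        return data

    def close_callback(self):
        if self._callback:
            self._callback(action="close", value=None)  # e.g. finalize a progress bar


# Usage sketch with a trivial callback ("big_file.bin" is hypothetical):
def report(action, value):
    if action == "update":
        print(f"read {value} more bytes")

with open("big_file.bin", "rb") as fhandle:
    wrapped = CallbackStreamWrapper(fhandle, callback=report)
    while wrapped.read(65536):
        pass
    wrapped.close_callback()
```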
I think this is an excellent result. This can be compared to: