Release v0.5 #103

giovannipizzi · 2020-10-12T06:45:39Z

Various general (but very important) speed improvements
Adding callbacks to a number of functions (e.g. export, add_objects_to_pack, ...) to allow showing progress bars or similar indicators
Implement repacking (at least when not changing hashing or compression)
Remove export, implement import_objects function instead on the other side (more efficient)
Add support for VACUUMing operations (very important for efficiency)
Added support for multiple hashing algorithms
Added concept of (unique) container_id
Generalized the compression algorithm, multiple algorithms supported now

The main reason why the functions like `get_objects_stream*` do not return objects in the same order is for efficiency, and this is achieved by reorganising objects per pack, and reading the pack file in sequential order. During the code changes, this was not happening anymore. Ensuring now that packed objects are returned in disk order (test also added) Moreover, adding a lot of objects one to one is much less efficient than doing a bulk insert with lower-level non-ORM functionality of SQLAlchemy. For this reason, I am now committing in bulk all data to the DB at the end of each pack file. Some comments on the performance in issue aiidateam#92.

Callback tests still missing

It now loops over the objects to export

Addresses aiidateam#12 This is already quite safe to be called, but I didn't cover all corner cases also related to power loss etc.

Now we define an `import_objects` function instead, that (when importing from a container using the same hashing algorithm) can be much more efficient by deciding beforehand which objects to import, and then importing only those. This uses the efficient method of aiidateam#93 to iterate only once through two sorted unique lists to get their intersection or left/right difference. The performance still needs to be benchmarked. The `export` function might be deprecated if this is more efficient.

I both add a private _vacuum() operation, and I call it: - optionally (False by default) in clean_storage() - at the end of a full repack of all packs

…many hash keys Above a certain threshold, it's just faster to iterate on all items rather than making an SQL `IN` query on chunks of `_IN_SQL_MAX_LENGTH = 950` (e.g. if we have millions of objects). For now, this is set as `_MAX_CHUNK_ITERATE_LENGTH = 9500`, but this needs to be benchmarked. Also, this is still missing tests and a real benchmark.

They were listed in dev-requirements.txt with the correct version (and used by the tests) but they were missing in `setup.py`. They were `pytest-benchmark` and `coverage`.

… writing methods These changes are motivated by benchmarks on a large repository ("SDB") with 6.8M nodes.

- Added a callback_instance fixture to facilitate tests - Added complete coverage of the new callback-related functions in the utils module - Started to implement coverage of some callback-related function in the container file.

…al_deps Adding missing optional dev dependencies

The addition of sha1 was required to get full coverage, as there is a part of the code (in import_object) that is different if the hash algorithms of the two containers are the same or are different (in order to optimize speed if the algorithm is the same).

This merge collects a number of important efficiency improvements, and a few features that were tightly bound to these efficiency changes, so they are shipped together. In particular: - objects are now sorted and returned in the order in which they are on disk, with an important performance benefit. Fixes aiidateam#92 - When there are many objects to list (currently set to 9500 objects, 10x the ones we could fit in a single IN SQL statement), performing many queries is slow, so we just resort to listing all objects and doing an efficient intersection (if the hash keys are sorted, both iterators can be looped over only once - fixes aiidateam#93) - Since VACUUMing the DB is very important for efficiency, when the DB does not fit fully in the disk cache, `clean_storage` now provides an option to VACUUM the DB. VACUUM is also called after repacking. Fixes aiidateam#94 - We implement now a function to perform a full repack of the repository (fixes aiidateam#12). This is important and needed to reclaim space after deleting an object - For efficiency, we have moved the logic from an `export` function (still existing but deprecated) to a `import_objects` function - Still for efficiency, now functions like `pack_all_loose` and `import_objects` provide an option to perform a fsync to disk or not (see also aiidateam#54 - there are still however calls that always use - or don't use - fsync and full_fsync on Mac). Also, `add_objects_to_pack` allows now to choose if you want to commit the changes to the SQLite DB, or not (delegating the responsibility to the caller, but this is important e.g. in `import_objects`: calling `commit` only once at the very end gives a factor of 2 speedup for very big repos). - A number of functions, including (but not exclusively) `import_objects` provide a callback to e.g. show a progress bar. - a `CallbackStreamWrapper` has been implemented, allowing to provide a callback (e.g. for a progress bar) when streaming a big file. - a new hash algorithm is now supported (`sha1`) in addition to the default `sha256` (fixes aiidateam#82). This is faster even if a bit less robust. This was also needed to test completely some feature in `import_objects`, where the logic is optimised if both containers use the same algorithm. By default is still better to use everywhere sha256, also because then all export files that will be generated will use this algorithm and importing will be more efficient. - tests have been added for all new functionality, achieving again 100% coverage As a reference, with these changes, exporting the full large SDB database (6.8M nodes) takes ~ 50 minutes: ``` 6714808it [00:24, 274813.02it/s] All hashkeys listed in 24.444787740707397s. Listing objects: 100%|████████| 6714808/6714808 [00:06<00:00, 978896.65it/s] Copy objects: 100%|███████████| 6714808/6714808 [48:15<00:00, 2319.08it/s] Final flush: 100%|████████████| 63236/63236 [00:07<00:00, 8582.35it/s] Everything re-exported in 2960.980943918228s. ``` This can be compared to: - ~10 minutes to copy the whole 90GB, or ~15 minutes to read all and validate the packs. We will never be able to be faster than just copying the pack files, and we are only 3x slower. - ~2 days to just list all files in the old legacy AiiDA repo (or all objects if they are loose), and this does not take into account the time to rewrite everything, probably comparable. So we are almost 2 orders of magnitude faster than before.

Allows for the association of a container with an existing DB, or to uniquely refer to it. 🐛 This also fixes a bug, whereby config values were cached, but the cache was not cleared when re-initialising the container. To reduce the risk of such a problem, now only the whole configuration dictionary is cached, rather than each single config value.

`Container.is_initialised` is a costly operation, loading the config JSON every time. In 1d7c389, the config is now called on every call to `loose_prefix_len`, leading to a large performance degradation. This PR makes sure the `is_initialised` test is called only if the config has not already been loaded into memory.

The container configuration now accepts a variable for the compression algorithm to use. Currently, the supported values are zlib, with levels from 1 to 9, but this can be expanded in the future.

- Various general (but very important) speed improvements - Adding callbacks to a number of functions (e.g. export, add_objects_to_pack, ...) to allow showing progress bars or similar indicators - Implement repacking (at least when not changing hashing or compression) - Remove `export`, implement `import_objects` function instead on the other side (more efficient) - Add support for VACUUMing operations (very important for efficiency) - Added support for multiple hashing algorithms - Added concept of (unique) container_id - Generalized the compression algorithm, multiple algorithms supported now

giovannipizzi · 2020-10-12T06:46:08Z

Note: once it's merged in master, I will tag it, release it on pypi and merge back in develop

codecov · 2020-10-12T06:49:34Z

Codecov Report

Merging #103 (02caddd) into master (c718ffd) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master      #103    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            7         7            
  Lines         1199      1541   +342     
==========================================
+ Hits          1199      1541   +342

Impacted Files	Coverage Δ
disk_objectstore/__init__.py	`100.00% <100.00%> (ø)`
disk_objectstore/container.py	`100.00% <100.00%> (ø)`
disk_objectstore/utils.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c718ffd...02caddd. Read the comment docs.

chrisjsewell

Should this not contain an update to the CHANGELOG?

Also is there a reason for this to not be a merge of develop into master, rather than a personal fork?
TBH I think this project is too small for the whole master/develop setup; it just overcomplicates things. But I'm not going to die on that hill 😉

chrisjsewell · 2020-10-12T08:19:58Z

You might also want to consider integrating the PyPi release into GitHub actions:

  publish:

    name: Publish to PyPi
    needs: [pre-commit, tests]
    if: github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source
        uses: actions/checkout@v2
      - name: Set up Python 3.7
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Build package
        run: |
          pip install wheel
          python setup.py sdist bdist_wheel
      - name: Publish
        uses: pypa/gh-action-pypi-publish@v1.1.0
        with:
          user: __token__
          password: ${{ secrets.PYPI_KEY }}

giovannipizzi · 2020-11-04T16:39:03Z

I finally got to add the changelog. I think from now on it's enough to just mention the month (the exact date is anyway in the git history). Ready to be merged for me, please let me know when it's done so I release (thanks @chrisjsewell for the hints - I think we can improve the release cycle in the future, maybe you're right that it's more complex than needed, I'm just used to it so I try to do the same in all repos, but we can definitely improve for the future versions).

CHANGELOG.md

I am not adding it for earlier versions as they are still pre-production versions, v0.4.0 was really the first production-ready version

CHANGELOG.md

Changelog added and reviewed by Sebastiaan

giovannipizzi and others added 19 commits July 21, 2020 19:07

Adding callbacks to export and add_objects_to_pack

3cebca7

Callback tests still missing

Changing the logic for the callback on the export

acc2186

It now loops over the objects to export

Implementing basic repacking logic

53ad44c

Addresses aiidateam#12 This is already quite safe to be called, but I didn't cover all corner cases also related to power loss etc.

We also want to test that after repacking content is correct

585351e

Adding VACUUMing operations

1f71b4d

I both add a private _vacuum() operation, and I call it: - optionally (False by default) in clean_storage() - at the end of a full repack of all packs

Adding missing optional dev dependencies

690ac34

They were listed in dev-requirements.txt with the correct version (and used by the tests) but they were missing in `setup.py`. They were `pytest-benchmark` and `coverage`.

Adding parameters to decide whether to do fsync and to commit in some…

471ac9b

… writing methods These changes are motivated by benchmarks on a large repository ("SDB") with 6.8M nodes.

Adding callback tests

8df5e2b

- Added a callback_instance fixture to facilitate tests - Added complete coverage of the new callback-related functions in the utils module - Started to implement coverage of some callback-related function in the container file.

Merge pull request aiidateam#95 from giovannipizzi/fix_missing_option…

e7e3627

…al_deps Adding missing optional dev dependencies

Merge branch 'develop' into fix_92_efficiency_big

34d6671

✨ Generalize compression algorithm (aiidateam#99)

d786296

The container configuration now accepts a variable for the compression algorithm to use. Currently, the supported values are zlib, with levels from 1 to 9, but this can be expanded in the future.

giovannipizzi requested review from chrisjsewell and sphuber October 12, 2020 06:45

chrisjsewell previously requested changes Oct 12, 2020

View reviewed changes

Added Changelog

c3eabd0

giovannipizzi requested a review from chrisjsewell November 4, 2020 16:39

sphuber reviewed Nov 4, 2020

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

Adding links to PRs for change log of v0.5

02caddd

I am not adding it for earlier versions as they are still pre-production versions, v0.4.0 was really the first production-ready version

giovannipizzi requested a review from sphuber November 11, 2020 14:13

sphuber approved these changes Nov 11, 2020

View reviewed changes

CHANGELOG.md Show resolved Hide resolved

giovannipizzi merged commit 875bc98 into aiidateam:master Nov 11, 2020

giovannipizzi deleted the release_0_5 branch November 11, 2020 14:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release v0.5 #103

Release v0.5 #103

giovannipizzi commented Oct 12, 2020

giovannipizzi commented Oct 12, 2020

codecov bot commented Oct 12, 2020 •

edited

Loading

chrisjsewell left a comment

chrisjsewell commented Oct 12, 2020

giovannipizzi commented Nov 4, 2020

Release v0.5 #103

Release v0.5 #103

Conversation

giovannipizzi commented Oct 12, 2020

giovannipizzi commented Oct 12, 2020

codecov bot commented Oct 12, 2020 • edited Loading

Codecov Report

chrisjsewell left a comment

Choose a reason for hiding this comment

chrisjsewell commented Oct 12, 2020

giovannipizzi commented Nov 4, 2020

codecov bot commented Oct 12, 2020 •

edited

Loading