Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import/export progress bar #3599

Merged

Conversation

CasperWA
Copy link
Contributor

@CasperWA CasperWA commented Dec 3, 2019

Fixes #2776

This PR does a multitude of things/is a bit complex. Therefore, I have tried to keep each "thing" kept in its own commit. Upon merging, these may be squashed in a way that makes sense.
I will try to keep this commit segregation throughout the lifetime of the PR, meaning if I commit changes, they will be squashed into the respective commit, where they belong, or a new commit will be made, following a complete rebase. EDIT: I failed to do this, but at least the history is more clear now.

Some of the things introduced in this PR:

  • debug parameter for import/export.
    This is now logged at the debug level instead.
  • Meta data header summary for import/export.
  • Minor optimizations of import/export. This is done for the verdi commands, but also for the backend functions.

But most importantly, this PR creates a progress bar for import, export, and extracting export archives.

Possible updates (either in this PR or in the future):

  • Progress bar for export archive migrations.
  • Minimize waiting time "between" progress bars (to the degree it is possible in the scope of this PR).

@CasperWA
Copy link
Contributor Author

CasperWA commented Dec 3, 2019

The main thing I am not convinced about with this PR, is the design of the headers of meta data printed when im- and exporting. Feedback on this would be great, contributing commits or PR's against my branch are even more welcome.

@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch 3 times, most recently from 7d2d336 to add686f Compare December 10, 2019 20:41
@CasperWA
Copy link
Contributor Author

The main thing I am not convinced about with this PR, is the design of the headers of meta data printed when im- and exporting. Feedback on this would be great, contributing commits or PR's against my branch are even more welcome.

This has now been fixed and brought into line with the rest of AiiDA, using tabulate.

@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch from add686f to dff3e71 Compare December 10, 2019 20:44
@CasperWA
Copy link
Contributor Author

@ltalirz @sphuber @dev-zero @yakutovicha @chrisjsewell I consider this ready for review. Since this is very much a UX addition, it will probably be needed for you to checkout the branch and test ex- and importing.

@ltalirz
Copy link
Member

ltalirz commented Dec 11, 2019

So far tested it only on a small export - on first impression it looks great and seems to provide very useful information, well done!
I plan to test it on a larger export later this evening.
We should decide how to handle the code review as well.

I've also checked the new dependency tqdm.
It's used by 50k other repos and has 70 contributors, but only the maintainer had >3 commits during the last year, i.e. it is de facto a 1-person project (named Casper... coincidence? ;-) ).

I would therefore ask @CasperWA to motivate which features of tqdm warrant the addition of a new dependency (vs the various 50ish lines implementations in pure python).
This explanation should be recorded in the eventual commit message of the PR (if we keep it).

@CasperWA
Copy link
Contributor Author

I would therefore ask @CasperWA to motivate which features of tqdm warrant the addition of a new dependency (vs the various 50ish lines implementations in pure python).
This explanation should be recorded in the eventual commit message of the PR (if we keep it).

The features I expect from the progress bar:

  1. Easily implemented and manipulated to quickly show the exact information I want the user to see.
  2. Have automated statistics for an estimated time of success.
  3. Be non-invasive, i.e., it should be removable from stdout.

How tqdm does for these "requirements":

  1. It is easily manipulated, but depending on the use-case, the implementation can turn out to be more or less extensive.
    • As an example here, compare the "simple" implementation for the extract functions here, where the progress bar updates are handled by tqdm, with the more involved implementation in the general export function here, where the progress bar needs to be updated for each iteration.
  2. The statistical estimations of tqdm are calculated on-the-run without a lot of overhead. The logic for these numerical calculations seem to be "cleverly" handled by tqdm based on the inputs given: The less tqdm is left to itself, the less accurate the estimates, as a rule-of-thumb. That does not, however, render the estimates useless, I think.
    To re-implement these calculations would probably take more than what is suggested by the solutions in the stackoverflow thread, but they would make up a great start.
  3. This should be the case for all progress bars that rely on /r and /n, I guess?

The justification for tqdm is not stone-solid. It is also not without merit. I could definitely use some time to elaborate on the suggested implementations from the stackoverflow thread referenced by @ltalirz and come up with a solution that meets my requirements. But in the end, I value my time more precious, and can therefore not deny that this has influenced my addition of tqdm here.
I hope either that the dependency will find uses in other areas of the base code or a similarly fast and reliable implementation may be found in existing dependencies. It should here be said that I have looked at the one from click and found it wanting.

@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch from cb7eaf9 to d94b54e Compare December 16, 2019 11:10
@ltalirz
Copy link
Member

ltalirz commented Dec 17, 2019

Some comments from a production test.

  • What is the "Initializing" phase? Would perhaps be good to make clear what the items are that you go through here
  • Second is the "Loading group" phase.
    Here, it seems to show one bar per group. In my case, however, we are dealing with tens of thousands of groups, and the subset of the label you are showing is always the same. I think it would be better to replace this progress bar with an overall progress over all groups.

The group loading phase seems extremely slow to me - perhaps this is a good point to look into optimizing the code.

@CasperWA
Copy link
Contributor Author

CasperWA commented Dec 17, 2019

Some comments from a production test.

* What is the "Initializing" phase? Would perhaps be good to make clear what the items are that you go through here

As far as I remember, it is indeed just this: An initialization. Nothing quantifiable is truly happening, other than some memory-caching/-reserving, I believe?

Edit: A closer look shows me that I can indeed me more explicit, since we are collecting Logs and Comments and more.
Please continue to strike down on all non-informative progress bar labels that should be changed, or should be re-considered.

* Second is the "Loading group" phase.
  Here, it seems to show one bar per group. In my case, however, we are dealing with tens of thousands of groups, and the subset of the label you are showing is always the same. I think it would be better to replace this progress bar with an overall progress over all groups.

Right!
So the idea (as it is now) was to go through each group to-be-exported and show progress for each node in the group.
I guess the choice to either do this, or take a "step back" and go through all groups instead, is subject to personal preference as much as it is what one is exporting?

In the beginning I played around with having multiple progress bars as well: An overarching one and then sub-progress bars for each "part". This may be the solution for this. What do you think?

The group loading phase seems extremely slow to me - perhaps this is a good point to look into optimizing the code.

It indeed is. The code section looks like this:

    # Add all the nodes contained within the specified groups
    for group in given_groups:
        if not silent:
            group_count = group.count()
            if group_count:
                progress_bar.reset(total=group_count)
                progress_bar.set_description_str('Loading Group - LABEL={}'.format(group.label))

        for entry in group.nodes:
            if not silent:
                progress_bar.update()
                progress_bar.refresh()

            entities_starting_set[NODE_ENTITY_NAME].add(entry.uuid)
            if issubclass(entry.__class__, orm.Data):
                given_data_entry_ids.add(entry.pk)
            elif issubclass(entry.__class__, orm.ProcessNode):
                given_calculation_entry_ids.add(entry.pk)

It may be that it is the count() that is slow or something?
If you provide me with an export file I can use to test these cases, I may be able to pinpoint the issue and optimize a bit.

On another note, I must say that putting a progress bar on all of this truly exposes the slow parts of both the import and export functions - something that is quite neat for optimization purposes, but may be bothersome for users. Although I still think it is nicer than the current "print a bunch of stuff" approach :)

(For future references of this/these comment(s), these concern the export part only.)

@ltalirz
Copy link
Member

ltalirz commented Dec 17, 2019

I'm working on fixing this now.

@ltalirz
Copy link
Member

ltalirz commented Dec 17, 2019

I've opened a PR for fixing making the search over groups more performant (for the moment on top of develop).

Furthermore, here is a branch that contains this fix + some further changes to the progress bar - in particular I have removed the "refresh" from a couple of places since constant refreshing is costly and already handled in a sensible way by tqdm (it has an internal timeout set to 100ms).
Note that updating the description triggers a refresh by default; there might be a couple more places where this is not what you want and you might want to pass refresh=False.

https://github.com/ltalirz/aiida-core/tree/fix-progress-bar-rebased

Edit: here is the link for comparing changes
https://github.com/CasperWA/aiida_core/compare/fix_2776_import_export_progress_bar...ltalirz:fix-progress-bar-rebased?expand=1

@CasperWA
Copy link
Contributor Author

I've opened a PR for fixing making the search over groups more performant (for the moment on top of develop).

Cheers. I see @lekah has already come up with some improvements.

Furthermore, here is a branch that contains this fix + some further changes to the progress bar - in particular I have removed the "refresh" from a couple of places since constant refreshing is costly and already handled in a sensible way by tqdm (it has an internal timeout set to 100ms).

Right! I knew there was a reason why I chose tqdm (I actually also mentioned this reason above).

Note that updating the description triggers a refresh by default; there might be a couple more places where this is not what you want and you might want to pass refresh=False.

I knew this, and I am also not doing this some places, but mostly I am, since I "forgot" about the refresh() cost.

I will take a look at it, thanks!

@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch 2 times, most recently from 0b104a6 to 766527a Compare December 18, 2019 10:32
@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch 2 times, most recently from 0f6e53d to dce4778 Compare February 4, 2020 20:38
@CasperWA
Copy link
Contributor Author

CasperWA commented Feb 5, 2020

@ltalirz I have incorporated some changes, especially concerning the refresh parameter for the set_description_str method for the progress bar.

Furthermore, I have gone through the export function export_tree again, after the updated graph traversal tool. Some places I have optimized the code accordingly and tried to remove some older concepts that are no longer necessary (like the many different given_*_ids sets. This also means most of the QueryBuilder filtering is done on the UUID column instead of PK/id .

@ltalirz
Copy link
Member

ltalirz commented Feb 8, 2020

As a first step, I'm now running the benchmark export I used for the AGE with current develop + this PR to get some timings.

@CasperWA
Copy link
Contributor Author

CasperWA commented Feb 8, 2020

As a first step, I'm now running the benchmark export I used for the AGE with current develop + this PR to get some timings.

Great!
I would however like to say that this PR has nothing to do with optimizing the import/export - it only deals with adding a progress bar to the two - and then there are some code tweaks as well, but these are considered minor and only done because they stood out.
Since the whole import/export is going to be updated (most likely from the bottom up) over the coming year, I see no reason to continue optimizing on it - again, other than what stands out (and it "easily" optimized).

Though to that effect, I applaud the benchmark tests you are currently running, since they will act as a "before" to the import/export remake.

CasperWA added 14 commits May 13, 2020 14:08
This removes the debug parameter in the importexport module, and instead
a logger is used to print all debug messages at the debug level.
The progress bar becomes a single global variable in the
`aiida.tools.importexport.common.progress_bar` module.
It should only be handled via the functions in said module.
`what` is deprecated in favor of `entities`.
`outfile` is deprecated in favor of `filename`.

Support for the deprecated parameters should be upheld until v2.
Use a new `override_log_formatter` to temporarily override the
Formatter for all AiiDA log handlers.
Then log console output at the custom AiiDA REPORT level.

If silent, set logging level to CRITICAL.
@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch from 78df6ab to 7e2547e Compare May 13, 2020 16:38
@CasperWA CasperWA requested a review from ltalirz May 13, 2020 16:38
@CasperWA
Copy link
Contributor Author

@ltalirz and @sphuber Please see the latest commits for implementation of logging as well as how to deprecate the what and outfile export() parameters.

@CasperWA
Copy link
Contributor Author

@sphuber NOTE: I have very slightly altered the return of aiida.common.lang:type_check().

@CasperWA
Copy link
Contributor Author

This logging stuff is seriously messing up the tests now.

I think this continuous re-integration of "new" updates in an existing "old" PR is not at all desireable and should be discouraged as much as possible (in short, I'm tired of having to fix this PR due to seemingly pointless change requests).

@CasperWA
Copy link
Contributor Author

CasperWA commented May 13, 2020

Locally, only a verdi import test fails, since it's checking something from the output, which is logged and so it apparently can not capture it.
Edit: It seems to be the same with the CI tests. Skip it?

@CasperWA
Copy link
Contributor Author

I think this PR and I are properly angry with each other at this point. Please when reviewing, keep in mind that non-essential changes can be put into an issue and addressed later.

Due to the import/export functions logging summaries they can not be
captured properly for the test.
@CasperWA CasperWA force-pushed the fix_2776_import_export_progress_bar branch from c330592 to 2f217dc Compare May 14, 2020 08:09
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CasperWA

@sphuber sphuber dismissed ltalirz’s stale review May 14, 2020 08:57

changes addressed

@sphuber sphuber merged commit 9d918e0 into aiidateam:develop May 14, 2020
@CasperWA CasperWA deleted the fix_2776_import_export_progress_bar branch May 14, 2020 13:15
@sphuber sphuber added this to the v1.3.0 milestone May 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/ready-for-review PR is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

progress indicator for verdi export/import
5 participants