Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move aiida.cmdline.utils.common.get_database_summary to storage backend #5387

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Feb 23, 2022

Fixes #5384

This utility was used in verdi archive inspect and verdi storage info
to give information about the contents of an archive or normal storage
backend. Historically, a distinction used to be made between the contents
of the database and the repository, but in v2.0 these are unified by the
storage backend. The information about the repository is already
retrieved through StorageBackend.get_info but the contents of the
database were still built through the get_database_summary function.

The implementation is moved to the get_orm_entity_overview method of
the StorageBackend base class. This is done because the implementation
is currently still independent of the storage backend implementation. So
for now the implementations can simply call to this method in their
get_info implementation, but it leaves the option to override it with
a more performant implementation that doesn't go through the shared ORM
of AiiDA.

@sphuber
Copy link
Contributor Author

sphuber commented Feb 23, 2022

The things I think are definitely open for discussion are:

  • Name of the new method get_orm_entity_overview
  • Implementation of ZipfileBackendRepository.get_info Can we provide more useful information than that?

@ramirezfranciscof
Copy link
Member

Hey, thanks for considering my concerns! Indeed this seems like a quite interesting solution to the problem of balancing the agnosticity of using the ORM and allowing for further improvement for each specific backend.

However, what I have now noticed after thinking a bit about this, is that we seem to be mixing high level concepts from aiida (like nodes and links) with "low" level stats from the storage backend (size, number of files/tables, etc). Mind you, I'm not saying this is being introduced here: I'm just noticing it now that the backend has been more cleanly separated, making this aspect stick out more.

In other words, I think the number and type of nodes, groups, etc are not a "database" stat and should perhaps not be reported as such. Maybe we have another command for verdi profile content stats, or maybe in verdi storage stats we can have {'database': {}, 'repository': {}, 'high_level_content': {}} or something like that. We can still use this general idea of having a method to get this info implemented in the abstract class, and each concrete child could override it, so that still works great for this.

BTW just as a reference, database stats could be stuff like the number of tables, rows in each table, size in mb, etc. It can also be useful to have these for contrasting with the results for the more "general" content stats to see if they are consistent.

What do you think? Does this make sense?

@sphuber
Copy link
Contributor Author

sphuber commented Feb 23, 2022

I think your analysis makes sense. But if I understand it correctly, you agree that it still makes sense to have StorageBackend.get_info to return the overview of entities, but that in addition it can provide info that is specific to the actual storage implementation? So in short, we just rename the key database in the returned dictionary to something else, so that it remains open for PsqlDosBackend to return more specific info about the postgres contents.

For me that would be fine, just not sure what the other key should be named. Had similar problems with naming the method get_orm_entity_overview. I think "entity" makes sense, because that is the generic name we use to reference things stored in the database. Cannot use something like nodes or provenance graph, because that is only part of it. ORM makes sense in a way because these stats are retrieved through the ORM, or at least all of the stats can be retrieved through the ORM. So I think the name should be something with entity or ORM.

@chrisjsewell
Copy link
Member

Yeh lets not over-complicate too much 😬

Just to add context, I only added verdi database summary a year ago (which this has now morphed into) to get a quick overview of what was in a profile, and also to aid in tutorials (e.g. showing what is added to the database after running a calculation): #4737

@ramirezfranciscof
Copy link
Member

ramirezfranciscof commented Feb 23, 2022

But if I understand it correctly, you agree that it still makes sense to have StorageBackend.get_info to return the overview of entities, but that in addition it can provide info that is specific to the actual storage implementation?

Mmm 🤔, I'm still unsure of where I think would be the best place for these higher level stats, but I guess for now it could go there.

Naming is annoying, yes. I think entities for the key could be ok, and the method can just be get_orm_entities. I can review the code after the wrap up meeting now.

@sphuber sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch 2 times, most recently from 1e9d0ce to 77e39c0 Compare February 23, 2022 17:04
Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me, just maybe one comment / question below.

Also, just to check (because github interface is still a bit of a PITA to check moved content): the get_orm_entities has the exact same content that used to be in get_database_summary, right? You didn't change anything inside the method itself.

Comment on lines 204 to 205
def get_info(self, statistics: bool = False, **kwargs) -> dict:
raise NotImplementedError
return {'Objects': len(list(self.list_objects()))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why this one doesn't use super as well? (also "objects" maybe in lowercase and with a nested dict to make it consistent with the others?)

    def get_info(self, statistics: bool = False, **kwargs) -> dict:
        results = super().get_info(statistics=statistics)
        results['objects'] = {'count': len(list(self.list_objects()))}
        return results

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for me to make it lower case, but DiskObjectStoreRepositoryBackend.get_info actually uses Objects with a capital.

Not using super was just an omission, will add it now.

Regarding the get_orm_entities, yes I kept the implementation, just changing the verbose variable to statistics.

@sphuber sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch from 77e39c0 to 787e6fb Compare February 23, 2022 17:33
@ramirezfranciscof
Copy link
Member

ramirezfranciscof commented Feb 23, 2022

Fine for me to make it lower case, but DiskObjectStoreRepositoryBackend.get_info actually uses Objects with a capital.

Mmm true, it actually does a weird thing where it alternates capitalizing and lowercase...

database:
  summary:
    Users:
      count: 2
(...)
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects: # <--- You mean here
    unpacked: 0
    packed: 26101

Should we set a standard for this? Or leave it free? In which case apologies for the pedantry 😅

(BTW good to go for me now, we can think about standardizing this later, I'm only waiting for the test to pass to approve)

…kend

This utility was used in `verdi archive inspect` and `verdi storage info`
to give information about the contents of an archive or normal storage
backend. Historically, a distinction used to be made between the contents
of the database and the repository, but in v2.0 these are unified by the
storage backend. The information about the repository is already
retrieved through `StorageBackend.get_info` but the contents of the
database were still built through the `get_database_summary` function.

The implementation is moved to the `get_orm_entity_overview` method of
the `StorageBackend` base class. This is done because the implementation
is currently still independent of the storage backend implementaiton. So
for now the implementations can simply call to this method in their
`get_info` implementation, but it leaves the option to override it with
a more performant implementation that doesn't go through the shared ORM
of AiiDA.
@sphuber sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch from 787e6fb to 9f02956 Compare February 23, 2022 18:30
Comment on lines 204 to +205
def get_info(self, statistics: bool = False, **kwargs) -> dict:
raise NotImplementedError
return {'objects': {'count': len(list(self.list_objects()))}}
Copy link
Member

@ramirezfranciscof ramirezfranciscof Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still forgot to add the super here, no? Oh bah, perhaps you were not finished with the modifications yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't have one. Note this is the repository implementation, the base abstract class of which doesn't define a get_info. Only the implementation for the disk object store does.

The base class with a get_info implementation is the one for the storage backend. I think you may be confusing it with that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh ok, I see. Then my previous comment was also wrong, haha. Ok, all good now then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were confused 😉 you put get_info and maintain under SqliteBackendQueryBuilder not SqliteZipBackend
Its ok, I fixed it in #5375

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were confused 😉 you put get_info and maintain under SqliteBackendQueryBuilder not SqliteZipBackend
Its ok, I fixed it in #5375

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sphuber sphuber merged commit d0ff889 into aiidateam:develop Feb 24, 2022
@sphuber sphuber deleted the fix/5384/move-get-database-summary-to-backend branch February 24, 2022 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move the functionality of aiida.cmdline.utils.common.get_database_summary to StorageBackend.get_info
3 participants