Move `aiida.cmdline.utils.common.get_database_summary` to storage backend #5387

sphuber · 2022-02-23T13:49:35Z

This utility was used in `verdi archive inspect` and `verdi storage info`
to give information about the contents of an archive or normal storage
backend. Historically, a distinction was made between the contents
of the database and the repository, but in v2.0 these are unified by the
storage backend. The information about the repository is already
retrieved through `StorageBackend.get_info`, but the contents of the
database were still built through the `get_database_summary` function.

The implementation is moved to the `get_orm_entity_overview` method of
the `StorageBackend` base class. This is done because the implementation
is currently still independent of the storage backend implementation. So
for now the implementations can simply call this method in their
`get_info` implementation, but it leaves the option to override it with
a more performant implementation that doesn't go through the shared ORM
of AiiDA.
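
The pattern described above (a generic default in the base class that concrete backends may override) can be sketched roughly as follows. This is a toy illustration and not the AiiDA code: the class names `StorageBackendSketch` and `FastBackendSketch`, the in-memory data, the `entities` key and the `types` statistic are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Dict, List


class StorageBackendSketch:
    """Hypothetical stand-in for a storage backend base class (not the AiiDA API)."""

    def __init__(self, entities: Dict[str, List[str]]):
        # Toy data: maps an entity name to a list of type labels, one per stored entity.
        self._entities = entities

    def get_orm_entity_overview(self, statistics: bool = False) -> Dict[str, Any]:
        """Generic default that only needs the shared, backend-independent view."""
        overview: Dict[str, Any] = {}
        for name, labels in self._entities.items():
            overview[name] = {'count': len(labels)}
            if statistics:
                # Extra detail is only computed on request.
                overview[name]['types'] = dict(Counter(labels))
        return overview

    def get_info(self, statistics: bool = False) -> Dict[str, Any]:
        # Concrete backends can simply delegate to the shared implementation.
        return {'entities': self.get_orm_entity_overview(statistics=statistics)}


class FastBackendSketch(StorageBackendSketch):
    """Hypothetical backend that overrides the overview with a cheaper path."""

    def __init__(self, entities: Dict[str, List[str]], cached_counts: Dict[str, int]):
        super().__init__(entities)
        self._cached_counts = cached_counts

    def get_orm_entity_overview(self, statistics: bool = False) -> Dict[str, Any]:
        # Bypasses the generic loop, e.g. by reading precomputed counts.
        return {name: {'count': count} for name, count in self._cached_counts.items()}
```

Because `get_info` only calls `get_orm_entity_overview`, a backend that overrides the latter changes how the numbers are obtained without touching the shape of the returned dictionary.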

sphuber · 2022-02-23T14:19:33Z

The things I think are definitely open for discussion are:

- The name of the new method, get_orm_entity_overview.
- The implementation of ZipfileBackendRepository.get_info: can we provide more useful information than that?

ramirezfranciscof · 2022-02-23T15:06:05Z

Hey, thanks for considering my concerns! Indeed this seems like a quite interesting solution to the problem of balancing the agnosticity of using the ORM and allowing for further improvement for each specific backend.

However, what I have now noticed after thinking a bit about this, is that we seem to be mixing high level concepts from aiida (like nodes and links) with "low" level stats from the storage backend (size, number of files/tables, etc). Mind you, I'm not saying this is being introduced here: I'm just noticing it now that the backend has been more cleanly separated, making this aspect stick out more.

In other words, I think the number and type of nodes, groups, etc are not a "database" stat and should perhaps not be reported as such. Maybe we have another command for verdi profile content stats, or maybe in verdi storage stats we can have {'database': {}, 'repository': {}, 'high_level_content': {}} or something like that. We can still use this general idea of having a method to get this info implemented in the abstract class, and each concrete child could override it, so that still works great for this.

BTW just as a reference, database stats could be stuff like the number of tables, rows in each table, size in mb, etc. It can also be useful to have these for contrasting with the results for the more "general" content stats to see if they are consistent.

What do you think? Does this make sense?
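
As a rough illustration of the three-key layout and the consistency check suggested above, here is a minimal sketch. The function names, the `rows` sub-key and the table names are hypothetical and only serve the example; nothing here reflects what `verdi storage info` actually returns.

```python
from typing import Any, Dict, List


def build_storage_info(
    database: Dict[str, Any], repository: Dict[str, Any], entities: Dict[str, Any]
) -> Dict[str, Any]:
    """Keep low-level backend stats and high-level content stats under separate keys."""
    return {'database': database, 'repository': repository, 'high_level_content': entities}


def find_count_mismatches(info: Dict[str, Any], table_for_entity: Dict[str, str]) -> List[str]:
    """Return the entity names whose high-level count differs from the mapped table's row count."""
    rows = info['database'].get('rows', {})
    mismatches = []
    for entity, table in table_for_entity.items():
        if rows.get(table) != info['high_level_content'][entity]['count']:
            mismatches.append(entity)
    return mismatches


info = build_storage_info(
    database={'rows': {'node_table': 3, 'user_table': 2}},  # illustrative table names
    repository={'objects': {'count': 5}},
    entities={'Nodes': {'count': 3}, 'Users': {'count': 1}},
)
print(find_count_mismatches(info, {'Nodes': 'node_table', 'Users': 'user_table'}))
```

In this toy data the `Users` entry is reported, because the high-level count (1) disagrees with the row count of its table (2).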

sphuber · 2022-02-23T15:34:42Z

I think your analysis makes sense. But if I understand it correctly, you agree that it still makes sense to have StorageBackend.get_info to return the overview of entities, but that in addition it can provide info that is specific to the actual storage implementation? So in short, we just rename the key database in the returned dictionary to something else, so that it remains open for PsqlDosBackend to return more specific info about the postgres contents.

For me that would be fine, just not sure what the other key should be named. Had similar problems with naming the method get_orm_entity_overview. I think "entity" makes sense, because that is the generic name we use to reference things stored in the database. Cannot use something like nodes or provenance graph, because that is only part of it. ORM makes sense in a way because these stats are retrieved through the ORM, or at least all of the stats can be retrieved through the ORM. So I think the name should be something with entity or ORM.

chrisjsewell · 2022-02-23T15:41:51Z

Yeah, let's not over-complicate too much 😬

Just to add context, I only added verdi database summary a year ago (which this has now morphed into) to get a quick overview of what was in a profile, and also to aid in tutorials (e.g. showing what is added to the database after running a calculation): #4737

ramirezfranciscof · 2022-02-23T15:59:19Z

> But if I understand it correctly, you agree that it still makes sense to have StorageBackend.get_info to return the overview of entities, but that in addition it can provide info that is specific to the actual storage implementation?

Mmm 🤔, I'm still unsure of where I think would be the best place for these higher level stats, but I guess for now it could go there.

Naming is annoying, yes. I think entities for the key could be ok, and the method can just be get_orm_entities. I can review the code after the wrap up meeting now.

ramirezfranciscof

Good for me, just maybe one comment / question below.

Also, just to check (because github interface is still a bit of a PITA to check moved content): the get_orm_entities has the exact same content that used to be in get_database_summary, right? You didn't change anything inside the method itself.

ramirezfranciscof · 2022-02-23T17:12:10Z

aiida/tools/archive/implementations/sqlite/backend.py

    def get_info(self, statistics: bool = False, **kwargs) -> dict:
-        raise NotImplementedError
+        return {'Objects': len(list(self.list_objects()))}


Is there a reason why this one doesn't use super as well? (also "objects" maybe in lowercase and with a nested dict to make it consistent with the others?)

    def get_info(self, statistics: bool = False, **kwargs) -> dict:
        results = super().get_info(statistics=statistics)
        results['objects'] = {'count': len(list(self.list_objects()))}
        return results

Fine for me to make it lower case, but DiskObjectStoreRepositoryBackend.get_info actually uses Objects with a capital.

Not using super was just an omission, will add it now.

Regarding the get_orm_entities, yes I kept the implementation, just changing the verbose variable to statistics.

ramirezfranciscof · 2022-02-23T18:13:41Z

> Fine for me to make it lower case, but DiskObjectStoreRepositoryBackend.get_info actually uses Objects with a capital.

Mmm true, it actually does a weird thing where it alternates capitalizing and lowercase...

database:
  summary:
    Users:
      count: 2
(...)
repository:
  SHA-hash algorithm: sha256
  Compression algorithm: zlib+1
  Packs: 1
  Objects: # <--- You mean here
    unpacked: 0
    packed: 26101

Should we set a standard for this? Or leave it free? In which case apologies for the pedantry 😅

(BTW good to go for me now, we can think about standardizing this later, I'm only waiting for the test to pass to approve)

ramirezfranciscof · 2022-02-23T21:19:48Z

aiida/tools/archive/implementations/sqlite/backend.py

    def get_info(self, statistics: bool = False, **kwargs) -> dict:
-        raise NotImplementedError
+        return {'objects': {'count': len(list(self.list_objects()))}}


You still forgot to add the super here, no? Oh bah, perhaps you were not finished with the modifications yet?

It shouldn't have one. Note this is the repository implementation, the base abstract class of which doesn't define a get_info. Only the implementation for the disk object store does.

The base class with a get_info implementation is the one for the storage backend. I think you may be confusing it with that

Ahhh ok, I see. Then my previous comment was also wrong, haha. Ok, all good now then.

You were confused 😉 You put get_info and maintain under SqliteBackendQueryBuilder, not SqliteZipBackend.
It's OK, I fixed it in #5375.
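
The point about `super()` in this thread can be reproduced with a small self-contained sketch: when the parent class does not define `get_info`, a subclass has nothing to delegate to and must build the dictionary itself. All class names below are hypothetical stand-ins, not the AiiDA classes.

```python
from typing import Iterable, Iterator


class RepositoryBaseSketch:
    """Hypothetical repository base class that does NOT define get_info."""

    def list_objects(self) -> Iterator[str]:
        raise NotImplementedError


class InMemoryRepositorySketch(RepositoryBaseSketch):
    """Toy repository that stores object keys in a list."""

    def __init__(self, keys: Iterable[str]):
        self._keys = list(keys)

    def list_objects(self) -> Iterator[str]:
        yield from self._keys


class ZipRepositorySketch(InMemoryRepositorySketch):
    def get_info(self, statistics: bool = False, **kwargs) -> dict:
        # No super() call: none of the parent classes provides get_info.
        return {'objects': {'count': len(list(self.list_objects()))}}


class BrokenRepositorySketch(InMemoryRepositorySketch):
    def get_info(self, statistics: bool = False, **kwargs) -> dict:
        # Fails at call time with AttributeError, since no parent defines get_info.
        results = super().get_info(statistics=statistics)
        results['objects'] = {'count': len(list(self.list_objects()))}
        return results
```

Calling `ZipRepositorySketch(['a', 'b', 'c']).get_info()` returns the nested count dictionary, whereas `BrokenRepositorySketch(['a']).get_info()` raises `AttributeError`.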

ramirezfranciscof

LGTM!

sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch from 4ae0d77 to eecf759 Compare February 23, 2022 14:00

sphuber mentioned this pull request Feb 23, 2022

Move the functionality of aiida.cmdline.utils.common.get_database_summary to StorageBackend.get_info #5384

Closed

sphuber requested review from chrisjsewell and ramirezfranciscof February 23, 2022 14:18

sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch 2 times, most recently from 1e9d0ce to 77e39c0 Compare February 23, 2022 17:04

ramirezfranciscof suggested changes Feb 23, 2022

sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch from 77e39c0 to 787e6fb Compare February 23, 2022 17:33

sphuber requested a review from ramirezfranciscof February 23, 2022 17:44

sphuber force-pushed the fix/5384/move-get-database-summary-to-backend branch from 787e6fb to 9f02956 Compare February 23, 2022 18:30

ramirezfranciscof reviewed Feb 23, 2022

ramirezfranciscof approved these changes Feb 24, 2022

sphuber merged commit d0ff889 into aiidateam:develop Feb 24, 2022

sphuber deleted the fix/5384/move-get-database-summary-to-backend branch February 24, 2022 08:19

chrisjsewell mentioned this pull request Mar 12, 2022

🔀 MERGE: Release v2.0.0b1 #5426

Merged
