♻️ REFACTOR: Profile storage backend configuration #5320

Merged
merged 3 commits into from
Jan 25, 2022

chrisjsewell
Member

@chrisjsewell chrisjsewell commented Jan 20, 2022

This PR refactors the Profile class, and provides a complementary config migration, in two steps (commits):

Storage backend

Configuration for the storage backend is changed from:

{
      "AIIDADB_BACKEND": "django",
      "AIIDADB_ENGINE": "postgresql_psycopg2",
      "AIIDADB_PASS": "some_random_password",
      "AIIDADB_NAME": "aiidadb_qs_some_user",
      "AIIDADB_HOST": "localhost",
      "AIIDADB_PORT": "5432",
      "AIIDADB_USER": "aiida_qs_greschd",
      "AIIDADB_REPOSITORY_URI": "file:////home/some_user/.aiida/repository-quicksetup/"
}

to

{
  "storage": {
      "backend": "django",
      "config": {
          "database_engine": "postgresql_psycopg2",
          "database_password": "some_random_password",
          "database_name": "aiidadb_qs_some_user",
          "database_hostname": "localhost",
          "database_port": "5432",
          "database_username": "aiida_qs_greschd",
          "repository_uri": "file:////home/some_user/.aiida/repository-quicksetup/"
      }
  }
}
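As a rough illustration of the restructuring above, the following minimal sketch moves the flat keys into the nested layout. The key mapping is read off the before/after examples; the names `STORAGE_KEY_MAP` and `upgrade_storage` are hypothetical and are not taken from the aiida-core migration code.

```python
# Hypothetical helper: the mapping mirrors the before/after examples above.
STORAGE_KEY_MAP = {
    "AIIDADB_ENGINE": "database_engine",
    "AIIDADB_PASS": "database_password",
    "AIIDADB_NAME": "database_name",
    "AIIDADB_HOST": "database_hostname",
    "AIIDADB_PORT": "database_port",
    "AIIDADB_USER": "database_username",
    "AIIDADB_REPOSITORY_URI": "repository_uri",
}


def upgrade_storage(profile: dict) -> dict:
    """Return a copy of ``profile`` with storage settings nested under "storage"."""
    upgraded = dict(profile)  # shallow copy, so the input is left untouched
    storage = {"backend": upgraded.pop("AIIDADB_BACKEND", None), "config": {}}
    for old, new in STORAGE_KEY_MAP.items():
        if old in upgraded:
            storage["config"][new] = upgraded.pop(old)
    upgraded["storage"] = storage
    return upgraded
```

Keys that are not part of the mapping are carried over unchanged, so the function only touches the storage-related settings.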

Storage configuration should be specific to the storage backend (and in fact all operations on storage should go via the backend).
It is envisaged that eventually (as part of #5172) the storage_config will be passed directly to the backend for validation/instantiation, rather than being obtained indirectly from the (global) profile, e.g. something like:

backend_cls = get_backend_type(profile["storage_backend"])
backend = backend_cls(config=profile["storage_config"])

Rabbitmq configuration

Configuration for RabbitMQ is changed from:

{
      "broker_protocol": "amqp",
      "broker_username": "guest",
      "broker_password": "guest",
      "broker_host": "127.0.0.1",
      "broker_port": 5672,
      "broker_virtual_host": ""
}

to

{
  "process_control": {
      "backend": "rabbitmq",
      "config": {
          "broker_protocol": "amqp",
          "broker_username": "guest",
          "broker_password": "guest",
          "broker_host": "127.0.0.1",
          "broker_port": 5672,
          "broker_virtual_host": ""
      }
  }
}
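The same move can be sketched for the broker settings. This is only an illustration under stated assumptions: the key names come from the example above, while `PROCESS_KEYS` and `upgrade_process_control` are hypothetical names, not the actual migration code.

```python
# Key names taken from the example above; the helper itself is hypothetical.
PROCESS_KEYS = (
    "broker_protocol",
    "broker_username",
    "broker_password",
    "broker_host",
    "broker_port",
    "broker_virtual_host",
)


def upgrade_process_control(profile: dict) -> dict:
    """Return a copy of ``profile`` with broker settings under "process_control"."""
    upgraded = dict(profile)
    config = {key: upgraded.pop(key) for key in PROCESS_KEYS if key in upgraded}
    upgraded["process_control"] = {"backend": "rabbitmq", "config": config}
    return upgraded
```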

It is quite possible that RabbitMQ will be replaced in the future (see aiidateam/AEP#30).
This change begins to move aiida-core away from "hard-coding" its use.
It also makes the purpose of these configuration variables clearer.


This PR also removes the Profile's behaviour of stripping unknown keys from the config (which may then be written to disk). This stripping is unnecessary, and the keys may be there to aid in "lossless" upgrade/downgrade of the config.
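To see why stripping can be lossy, consider this small sketch. It is an assumption-laden illustration: `KNOWN_KEYS`, `save_and_reload`, and the `option_from_newer_version` key are all hypothetical and only stand in for "a key written by a different version of the code".

```python
import json

# Hypothetical set of keys that this version of the code understands.
KNOWN_KEYS = {"storage", "process_control"}


def save_and_reload(profile: dict, strip_unknown: bool) -> dict:
    """Simulate writing a profile to disk and reading it back."""
    if strip_unknown:
        profile = {key: value for key, value in profile.items() if key in KNOWN_KEYS}
    return json.loads(json.dumps(profile))


profile = {
    "storage": {"backend": "django", "config": {}},
    "option_from_newer_version": True,  # hypothetical key unknown to this version
}
stripped = save_and_reload(profile, strip_unknown=True)
preserved = save_and_reload(profile, strip_unknown=False)
```

With stripping enabled, the unknown key is gone after a single write, so a later upgrade cannot recover it; without stripping, the profile survives the round trip intact.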

@chrisjsewell chrisjsewell requested a review from sphuber January 20, 2022 09:59
@codecov

codecov bot commented Jan 20, 2022

Codecov Report

Merging #5320 (05aa2d1) into develop (fe1acf9) will decrease coverage by 0.03%.
The diff coverage is 79.32%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5320      +/-   ##
===========================================
- Coverage    82.13%   82.11%   -0.02%     
===========================================
  Files          533      533              
  Lines        38478    38425      -53     
===========================================
- Hits         31601    31548      -53     
  Misses        6877     6877              
Flag Coverage Δ
django 77.18% <79.32%> (-0.02%) ⬇️
sqlalchemy 76.48% <77.94%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/cmdline/commands/cmd_setup.py 56.87% <0.00%> (+5.99%) ⬆️
aiida/cmdline/commands/cmd_status.py 84.77% <ø> (ø)
aiida/cmdline/params/options/commands/setup.py 54.55% <0.00%> (-2.29%) ⬇️
aiida/cmdline/params/types/profile.py 64.45% <0.00%> (ø)
aiida/manage/configuration/__init__.py 83.73% <50.00%> (+0.13%) ⬆️
aiida/manage/configuration/config.py 89.48% <71.43%> (+0.30%) ⬆️
aiida/manage/external/postgres.py 63.10% <75.00%> (+0.14%) ⬆️
aiida/backends/utils.py 93.34% <83.34%> (-6.66%) ⬇️
aiida/manage/configuration/profile.py 89.60% <84.91%> (-6.67%) ⬇️
...iida/manage/configuration/migrations/migrations.py 93.89% <90.91%> (-0.35%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe1acf9...05aa2d1. Read the comment docs.

@sphuber
Contributor

sphuber commented Jan 21, 2022

I will hold off on a full review until you have merged the other PR and rebased this. That should make it easier. But I would already mention one point that I would consider changing, and that is the structure of the new config. I am perfectly happy with grouping it under a storage key, but I think it would make sense to have separate keys within that for the database and the repository. In short I would propose the structure:

'storage': {
    'backend': 'django',
    'database': {
        ...
    },
    'repository': {
        ...
    }
}

The reason is that in the future we want to have the possibility to support other repository implementations, and separating those configuration details from the database ones makes that clearer. I don't think we will be introducing any backends that are purely database and no longer have a separate file repository, so to me it makes sense to separate them.

In that same light, it is a bit incomplete to have the "backend" name just reference the database backend. Really it is a combination of a backend for the relational database and the file repository. Should we have a 'backend' key for both the database and repository? For the repository for now it will always be disk-objectstore, or something like that, but that would leave it open to be configured to a different backend implementation at some point.

@chrisjsewell
Member Author

@sphuber I very much disagree. As we have already discussed, a storage backend is both the database AND the repository; they are intrinsically linked. For example, you have a single migration for both.
Also, the archive is a backend which does not have a separation between the repository and database; they are a single file.
You cannot swap just repository implementations in the same backend class; this would be a separate backend type.
The fact that there is a separate repository and database is really just an implementation detail.

@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch 2 times, most recently from f6a5046 to 2dd32e2 Compare January 21, 2022 20:20
@chrisjsewell
Member Author

chrisjsewell commented Jan 21, 2022

In that same light, it is a bit incomplete to have the "backend" name just reference the database backend

yes exactly, that is incorrect, and something I will be looking to "fix" at a later date

@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch from 262b9cc to 4d032f1 Compare January 21, 2022 20:23
@chrisjsewell
Member Author

chrisjsewell commented Jan 21, 2022

I'd also note, we now have verdi storage, for which all commands act on the db+repo as a singular backend "entity".

In fact, I was going to open an issue, saying that all references to "backend" (which is just too nondescript) should be changed to "storage"

@sphuber
Contributor

sphuber commented Jan 21, 2022

a storage backend is both the database AND the repository

I agree that our "storage" comprises two components that form a whole, and that we talk in terms of the storage as one thing from the user's perspective. That being said...

The fact that there is a separate repository and database is really just an implementation detail

That may well be the case, but there are still two different parts to the implementation that need to be configured. There is nothing preventing us using the PostgreSQL database with a file repository implemented on an actual object store, like say for example S3. Would you call that backend then still "sqlalchemy"? This is incorrect and there is no reason to limit ourselves to this.

So really you seem to be agreeing with me. The storage is one piece from the user's perspective in terms of where data gets stored (so there is one key for it in the configuration), but it comprises two separate data stores whose implementations can be mixed. Of course, once data is stored, they are intrinsically linked. But you can use different backend implementations, and so it makes perfect sense to have two separate config dictionaries within the storage configuration as a whole.

@chrisjsewell
Member Author

The archive is a storage backend, it has one config variable, the path to the archive. Should that variable then go under database or repository?

@chrisjsewell
Member Author

The "django" and "sqlalchemy" backend names (again I would not personally have called them this) each point to a single backend class DjangoBackend/SqlaBackend (currently hard-coded, but in the future will likely be an entry-point name). This is what will get initialized with the storage configuration.
Unless you are insinuating that we should split these classes into two classes, then there should be one storage configuration: SqlaBackend(storage_config)

@sphuber
Contributor

sphuber commented Jan 21, 2022

Unless you are insinuating that we should split these classes into two classes, then there should be one storage configuration: SqlaBackend(storage_config)

In my view, they are already separated in different classes. The database and repository both have their own separate interface, their own implementations and their own configuration (per implementation). The storage backend is just a wrapper class that wraps the two.

The "django" and "sqlalchemy" backend names (again I would not personally have called them this) each point to a single backend class DjangoBackend/SqlaBackend (currently hard-coded, but in the future will likely be an entry-point name). This is what will get initialized with the storage configuration.

What I am saying is that the names DjangoBackend and SqlaBackend are misnomers. The Django and Sqla part only refer to the backend implementation of the relational database, but it implicitly assumes the DiskObjectStoreRepositoryBackend for the file repository. This may not be a problem now since we only have one repository implementation, but we will soon have the need for another file repository implementation that is also perfectly compatible with the SqlAlchemy implementation for the database backend.

I don't see the problem to use additional substructuring in the storage config where it makes sense. It doesn't have to be required. For example, why does the following not make sense:

{
    'profile_one': {
        'broker': {},
        'storage': {
            'type': 'aiida.storage:core.archive',
            'filepath': '/some/path/archive.aiida',
        },
        ...
    },
    'profile_two': {
        'broker': {},
        'storage': {
            'type': 'aiida.storage:core.sqla-dos',
            'database': {
                'hostname': 'localhost',
                ...
            },
            'repository': {
                'filepath_container': '/some/filepath/container'
            }
        },
        ...
    },
    'profile_three': {
        'broker': {},
        'storage': {
            'type': 'aiida.storage:core.sqla-s3',
            'database': {
                'hostname': 'localhost',
                ...
            },
            'repository': {
                'hostname': 's3.aiida.net',
                'username':
                'container': '/some/filepath/container'
            }
        },
        ...
    }
}

I think we are really saying the same, I just think it is useful and makes sense to keep the exact structure of the storage key flexible and allow additional structuring where it makes sense.

@chrisjsewell
Member Author

why does the following not make sense

because you need to load the storage backend class, to validate the schema of the storage config, they are two distinct things

@chrisjsewell
Member Author

The storage backend is just a wrapper class that wraps the two.

No. The storage backend is the interface to the storage. You either completely separate repository and database: separate versions, separate migrations, separate CLI commands, etc, or you treat them as one, you can't have it both ways.

@chrisjsewell
Member Author

What I am saying is that the names DjangoBackend and SqlaBackend are misnomers.

again, you keep fixating on the names, and I keep telling you they are the wrong names, and they should not have been named this in the first place.

Contributor

@sphuber sphuber left a comment

I really think that we are agreeing here in our long discussion. All I was saying is that I think it may make sense to allow for additional nesting under the storage_config key, that is all. It wouldn't stop anything from what you are doing here. But whatever, let's move on.

Since we are doing this, I would really do the same for the broker key and make that a dictionary. Besides that, there are just some minor suggestions and questions.

aiida/manage/configuration/profile.py
# to-do currently this is not actually used anywhere,
# because e.g. the documentation is loaded with an incomplete (dummy) configuration
# in actual usage though, this could lead to later key errors, when retrieving an attribute
if validate and not set(config.keys()).issuperset(self.REQUIRED_KEYS):
Contributor

Where does this raise an exception, then, if validate is True? If it is not needed, shouldn't we just get rid of validate?

Member Author

As it says in the to-do comment above, I think this should always be on, since otherwise you could have obscure failures, where you get a KeyError on attribute retrieval of e.g. Profile.storage_backend. The reason why it is not on at present is that in some testing fixtures and the load_documentation function, an incomplete config is supplied to Profile, so this would fail. We should maybe just add "dummy" config to them (e.g. for the rabbitmq config), so we can always run this validation.

Contributor

I read the to-do comment but I don't understand it. If you don't currently use the validate keyword anywhere, then why not just remove it? You say it is needed for the load_documentation one, but that means it is being used, so the to-do is incorrect and can be removed?

Member Author

In ba998ffbc76785c1b5209089dc6ce42c2aa29bc9, I have "fixed" places that were loading incomplete config, and turned on validation by default.
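The difference between validating up front and failing later can be sketched as follows. This is a simplified stand-in, not the actual Profile class: `ProfileSketch`, `MissingKeysError`, and the contents of `REQUIRED_KEYS` are assumptions made for illustration; only the `issuperset` check is modelled on the snippet under review.

```python
class MissingKeysError(ValueError):
    """Hypothetical error raised when a profile config is incomplete."""


class ProfileSketch:
    # Assumed required keys, for illustration only.
    REQUIRED_KEYS = ("storage", "process_control")

    def __init__(self, name: str, config: dict, validate: bool = True):
        if validate and not set(config.keys()).issuperset(self.REQUIRED_KEYS):
            missing = sorted(set(self.REQUIRED_KEYS) - set(config.keys()))
            raise MissingKeysError(f"profile {name!r} is missing required keys: {missing}")
        self.name = name
        self._config = config

    @property
    def storage_backend(self) -> str:
        # Without validation, an incomplete config only fails here, with a bare KeyError.
        return self._config["storage"]["backend"]
```

With validation on by default, an incomplete config fails at construction with a message naming the missing keys, rather than at some later attribute access.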

Contributor

Cheers, I think this still needs to be revisited at some point, but we can do that later.

@chrisjsewell
Member Author

I really think that we are agreeing here in our long discussion.

yeh just about 😄 I just wanted to stress that anything under the storage_config key will just be passed to the storage_backend as a single "entity" (and validated by it similarly), so having nesting doesn't really matter.

I would really do the same for the broker key and make that a dictionary

Indeed, I was going to do that as a separate PR 👍

@sphuber
Contributor

sphuber commented Jan 23, 2022

Indeed, I was going to do that as a separate PR +1

But wouldn't that require a new separate config migration? Might as well just do it here.

@chrisjsewell
Member Author

chrisjsewell commented Jan 23, 2022

But wouldn't that require a new separate config migration? Might as well just do it here.

Well not if you just append it to this config migration. After all people should not be using develop for actual work 😉.

It's more than just a config migration, and I don't think the two changes should be conflated; they should be separate commits.
If you want, I can create the PR for that (based on this one) and, once that is reviewed, we merge the two together as a single PR with two commits, using a merge commit.

There are some other "discussions" I want to have around that as well: I don't think it should be under a broker key, since that's too specific to rabbitmq (which we may well be replacing), and also no end users are going to understand that terminology.

I'll probably open this as a separate issue, but basically I think we can do a bit more to standardise our terminology, and make it more understandable for users:

AiiDA is a workflow engine framework, which can be abstracted into five key concerns:

  1. Storage: How do we store generated inputs, outputs, and the provenance between them.
  2. Communication: How do we communicate with compute services (such as HPCs) and transfer data to/from them.
  3. Processing: How do we run calculations and workflows (locally and externally)
  4. Developer interface: How can developers create plugins to extend aiida-core (such as aiida-quantumespresso)
  5. User interface: How do users interact with AiiDA (Python API, CLI, web-based APIs etc.)

In this abstraction, the rabbitmq configuration would come under processing/process

@sphuber
Contributor

sphuber commented Jan 23, 2022

Well not if you just append it to this config migration. After all people should not be using develop for actual work

Fair enough, but that does mean it has to happen soon, within the next week or so, since we want to be releasing very soon. If you think you'll add it before then (or want me to do it), fine to have it in separate commits or PRs.

@chrisjsewell chrisjsewell requested a review from sphuber January 24, 2022 09:55
@chrisjsewell
Member Author

Fair enough, but that does mean it has to happen soon, within the next week or so

Once you "sign-off" the current code in this PR, I'll rebase/squash into one commit, and add this on top of that (in this PR) for you to review

@sphuber
Contributor

sphuber commented Jan 24, 2022

Alright @chrisjsewell, consider this "signed off"; feel free to squash and do the second part.

@chrisjsewell
Member Author

cheers! will do it later tonight

@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch from ba998ff to fe28f48 Compare January 25, 2022 12:19
@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch from 8706dbe to ab9d156 Compare January 25, 2022 14:10
- class AbstractStorage(SingleMigration):
-     """Move the storage configuration under a top-level "storage" key.
+ class AbstractStorageAndProcess(SingleMigration):
+     """Move the storage config under a top-level "storage" key and rabbitmq config under "processing".
Contributor

This is not actually true I think, although I think that would actually be ideal:

{
    'storage': {
        'backend': 'string',
        'config': {}
    },
    'process_control': {
        'backend': 'string',
        'config': {}
    }
}

Think that is the clearest.

But at least if you keep the current layout, then please update the docstring.

Member Author

No I think that seems reasonable 👍

Member Author

done; don't say I never give you anything 😉

    if new in profile.get('storage_config', {}):
        profile[old] = profile['storage_config'].pop(new)
profile.pop('storage_config', None)
if 'storage_backend' in profile:
    profile['AIIDADB_BACKEND'] = profile.pop('storage_backend')
for key in self.process_keys:
    if key in profile.get('process_control_config', {}):
        profile[key] = profile['process_control_config'].pop(key)
Contributor

Probably add the same warning as for the storage conversion

Member Author

done

"description": "The configuration to parse to the storage backend",
"type": "object",
"properties": {
"database_engine": {
Contributor

Although it is nice to have the explicit config options for the storage and process_control backends, this is problematic in principle since the config schema will be backend specific and so dynamic. Not sure if the JSON schema allows for such a concept. For now it is ok to keep since anyway we don't have a way to dynamically change the backends, but with your idea of making that possible in the future, just wanted to highlight this.

Member Author

@chrisjsewell chrisjsewell Jan 25, 2022

Yeh, I was well aware of this incongruity, just wanted to still have this "specification" somewhere. Probably want to put this validation onto a class method on the storage/process_control backend class, e.g.

storage_cls = load_storage_cls(config["storage_backend"])
storage_cls.validate_config(config["storage_config"])
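One way the two lines above could be fleshed out is sketched below. Everything here is an assumption for illustration: the sketch classes, their `required_keys`, the `"archive"` name, and the `STORAGE_CLASSES` registry are hypothetical; only the names `load_storage_cls` and `validate_config` are borrowed from the snippet.

```python
class StorageSketch:
    """Hypothetical base class: each backend declares the keys it requires."""

    required_keys: tuple = ()

    @classmethod
    def validate_config(cls, config: dict) -> None:
        missing = [key for key in cls.required_keys if key not in config]
        if missing:
            raise ValueError(f"{cls.__name__}: missing config keys {missing}")


class PsqlStorageSketch(StorageSketch):
    required_keys = ("database_name", "database_hostname", "repository_uri")


class ArchiveStorageSketch(StorageSketch):
    required_keys = ("filepath",)


# Hypothetical registry; the thread suggests entry points may play this role later.
STORAGE_CLASSES = {"django": PsqlStorageSketch, "archive": ArchiveStorageSketch}


def load_storage_cls(name: str) -> type:
    """Look up the storage class registered under ``name``."""
    return STORAGE_CLASSES[name]
```

Because each class owns its own validation, the profile-level schema would not need to know the shape of any backend-specific config, which is the incongruity raised in the review comment.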

@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch from dad7a57 to 297da50 Compare January 25, 2022 18:43
@chrisjsewell chrisjsewell requested a review from sphuber January 25, 2022 19:43
@chrisjsewell chrisjsewell force-pushed the migrate-config-storage branch from 48817ab to 05aa2d1 Compare January 25, 2022 19:48
@sphuber sphuber merged commit 3af6f6a into aiidateam:develop Jan 25, 2022