-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
♻️ REFACTOR: Profile storage backend configuration #5320
♻️ REFACTOR: Profile storage backend configuration #5320
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #5320 +/- ##
===========================================
- Coverage 82.13% 82.11% -0.02%
===========================================
Files 533 533
Lines 38478 38425 -53
===========================================
- Hits 31601 31548 -53
Misses 6877 6877
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
ea3b6aa
to
99dd5cc
Compare
I will wait with full review once you have merged the other PR and rebased this. That should make it easier. But I would already mention one point that I would consider changing and that is the structure of the new config. I am perfectly happy with grouping it under a 'storage': {
'backend': 'django',
'database': {
...
},
'repository': {
...
}
} The reason is that in the future we want to have the possibility to support other repository implementations and by separating those configuration details from the database one, makes it clearer. I don't think we will be introducing any backends that are purely database and no longer have a separate file repository so to me it makes sense to separate them. In that same light, it is a bit incomplete to have the "backend" name just reference the database backend. Really it is a combination of a backend for the relational database and the file repository. Should we have a 'backend' key for both the |
@sphuber I very much disagree, as we have already discussed, a storage backend is both the database AND the repository, they are intrinsically linked. For example, you have a single migration for both. |
f6a5046
to
2dd32e2
Compare
yes exactly, that is incorrect, and something I will be looking to "fix" at a later date |
262b9cc
to
4d032f1
Compare
I'd also note, we now have In fact, I was going to open an issue, saying that all references to "backend" (which is just too nondescript) should be changed to "storage" |
I agree that the our "storage" comprises to components and form a whole and that we talk in term of the storage as one thing from the users perspective. That being said...
That may well be the case, but there are still two different parts to the implementation that need to be configured. There is nothing preventing us using the PostgreSQL database with a file repository implemented on an actual object store, like say for example S3. Would you call that backend then still "sqlalchemy"? This is incorrect and there is no reason to limit ourselves to this. So really you seem to be agreeing with me. The storage is one piece from the user's perspective in terms of where data gets stored (so there is one key for it in the configuration) but it comprises to separate data stores whose implementations can be mixed. Of course once data is stored, they are intrinsically linked. But you can use different backend implementations and so it makes perfect sense to have two separate config dictionaries, within the |
The archive is a storage backend, it has one config variable, the path to the archive. Should that variable then go under database or repository? |
The "django" and "sqlalchemy" backend names (again I would not personally have called them this) each point to a single backend class |
In my view, they are already separated in different classes. The database and repository both have their own separate interface, their own implementations and their own configuration (per implementation). The storage backend is just a wrapper class that wraps the two.
What I am saying is that the names I don't see the problem to use additional substructuring in the {
'profile_one': {
'broker': {},
'storage': {
'type': 'aiida.storage:core.archive',
'filepath': '/some/path/archive.aiida',
},
...
},
'profile_two': {
'broker': {},
'storage': {
'type': 'aiida.storage:core.sqla-dos'
'database': {
'hostname': 'localhost',
...
},
'repository': {
'filepath_container': '/some/filepath/container'
}
},
...
},
'profile_three': {
'broker': {},
'storage': {
'type': 'aiida.storage:core.sqla-s3'
'database': {
'hostname': 'localhost',
...
},
'repository': {
'hostname': 's3.aiida.net',
'username':
'container': '/some/filepath/container'
}
},
...
}
} I think we are really saying the same, I just think it is useful and makes sense to keep the exact structure of the |
because you need to load the storage backend class, to validate the schema of the storage config, they are two distinct things |
No. The storage backend is the interface to the storage. You either completely separate repository and database: separate versions, separate migrations, separate CLI commands, etc, or you treat them as one, you can't have it both ways. |
again, you keep fixating on the names, and I keep telling you they are the wrong names, and they should not have been named this in the first place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think that we are agreeing here in our long discussion. All I was saying is that I think it may make sense to allow for additional nesting under the storage_config
key, that is all. It wouldn't stop anything from what you are doing here. But whatever, let's move on.
Since we are doing this, I would really do the same for the broker
key and make that a dictionary. Besides that, there are just some minor suggestions and questions.
# to-do currently this is not actually used anywhere, | ||
# because e.g. the documentation is loaded with an incomplete (dummy) configuration | ||
# in actual usage though, this could lead to later key errors, when retrieving an attribute | ||
if validate and not set(config.keys()).issuperset(self.REQUIRED_KEYS): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this excepting then if validate
were to be True
? If it is not needed, shouldn't we just get rid of validate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it says in the to-do
comment above, I think this should be on always, since otherwise you could have obscure failures, where you get KeyError
attribute retrieval on e.g. Profile.storage_backend
. The reason why it is not at present, is that in some testing fixtures and the load_documentation
function, an incomplete config is supplied to Profile
, so this would fail. We should maybe just add "dummy" config to them (for e.g. rabbitmq config), so we can always run this validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the to-do comment but I don't understand it. If you don't currently use the validate
keyword anywhere, then why not just remove it. You say it is needed for the load_documentation
one, but that means it is being used and so the to-do
is incorrect and can be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ba998ffbc76785c1b5209089dc6ce42c2aa29bc9, I have "fixed" places that were loading incomplete config, and turned on validation by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers, I think this still needs to be revisited at some point, but we can do that later.
yeh just about 😄 I just wanted to stress that anything under the
Indeed, as was going to do that as a separate PR 👍 |
But wouldn't that require a new separate config migration? Might as well just do it here. |
Well not if you just append it to this config migration. After all people should not be using Its more than just a config migration, and I don't think the two changes should be conflated; they should be separate commits. There's some other "discussions" I want to have around that as well: I don't think it should be under a I'll probably open this as a separate issue, but basically I think we can do a bit more to standardise our terminology, and make it more understandable for users:
In this abstraction, the rabbitmq configuration would come under |
Fair enough, but that does mean it has to happen soon, within the next week or so, since we want to be releasing very soon. If you think you'll add it before then (or want me to do it), fine to have it in separate commits or PRs. |
Once you "sign-off" the current code in this PR, I'll rebase/squash into one commit, and add this on top of that (in this PR) for you to review |
Alright @chrisjsewell , considered this "signed-off" and feel free to do squash and do the second part |
cheers! will do it later tonight |
ba998ff
to
fe28f48
Compare
8706dbe
to
ab9d156
Compare
class AbstractStorage(SingleMigration): | ||
"""Move the storage configuration under a top-level "storage" key. | ||
class AbstractStorageAndProcess(SingleMigration): | ||
"""Move the storage config under a top-level "storage" key and rabbitmq config under "processing". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not actually true I think, although I think that would actually be ideal:
{
'storage': {
'backend': 'string',
'config': {}
},
'process_control': {
'backend': 'string',
'config': {}
}
}
Think that is the clearest.
But at least if you keep the current layout, then please update the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think that seems reasonable 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done; don't say I never give you anything 😉
if new in profile.get('storage_config', {}): | ||
profile[old] = profile['storage_config'].pop(new) | ||
profile.pop('storage_config', None) | ||
if 'storage_backend' in profile: | ||
profile['AIIDADB_BACKEND'] = profile.pop('storage_backend') | ||
for key in self.process_keys: | ||
if key in profile.get('process_control_config', {}): | ||
profile[key] = profile['process_control_config'].pop(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably add the same warning as for the storage conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
"description": "The configuration to parse to the storage backend", | ||
"type": "object", | ||
"properties": { | ||
"database_engine": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it is nice to have the explicit config options for the storage and process_control backends, this is problematic in principle since the config schema will be backend specific and so dynamic. Not sure if the JSON schema allows for such a concept. For now it is ok to keep since anyway we don't have a way to dynamically change the backends, but with your idea of making that possible in the future, just wanted to highlight this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, I was well aware of this incongruity, just wanted to still have this "specification" somewhere. Probably want to put this validation onto a class method on the storage/process_control backend class, e.g.
storage_cls = load_storage_cls(config["storage_backend"])
storage_cls.validate_config(config["storage_config"])
dad7a57
to
297da50
Compare
48817ab
to
05aa2d1
Compare
This PR refactors the
Profile
class, and provides a complimentary config migration, in two steps (commits):Storage backend
Configuration for the storage backend is changed from:
to
Storage configuration should be specific to the storage backend (and in fact all operations on storage should go via the backend).
It is envisaged that eventually (as part of #5172) the
storage_config
will be directly parsed to the backend for validation/instantiation, rather than indirectly obtaining it from the (global) profile, e.g. something like:Rabbitmq configuration
Configuration for the storage backend is changed from:
to
It is highly possible that, in the future, RabbitMQ will be replaced (see aiidateam/AEP#30).
This change begins to move aiida-core away from "hard-coding" its use.
It also makes clearer, the purpose of these configuration variables.
This PR also removes the
Profile
s behaviour, to strip unknown keys from the config (which then may be subsequently written to disk). This stripping is unnecessary, and the keys may be there to aid in "lose-less" upgrade/downgrade of the config.