Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CU-2tuwdjf Move to an identifier based config #257

Merged
merged 36 commits into from
Sep 11, 2022

Conversation

mart-r
Copy link
Collaborator

@mart-r mart-r commented Sep 5, 2022

The current config is based on (potentially) nested dictionaries with string keys.

However, the system has some potential shortcomings

  • Typos in config keys could result in exception during runtime
  • Potentially missing data is only discovered when access is attempted during runtime
  • Renaming keys is not trivial through IDE
  • Validation of data in config is done at the time of its use
  • Potentially different defaults for values in different parts of the code (via the get and getattr methods)

This PR is attempting to remedy some of the above issues by basing the config on pydantic's Basemodel.
The key benefits are

  • Data in config is validated at its creation or when it's assigned
  • Keys are used as identifiers
    • Easier to detected typos by the IDE
    • Easier to rename keys by the IDE
  • Docstring can be attached to the attributes (i.e they show up in IDEs)

Some other priorities when designing

  • Making sure that backwards compatibility is maintained
    • I.e the use of ['item'] still works for setting and getting items
  • Making as few changes as possible to the rest of the project
    • Though a few such changes were made
  • Addition of new / undefined attributes is still allowed

While I made every attempt to make sure the changes are compatible with the other parts of the project and all the tests are successful, it is entirely possible that this change does introduce some breaking changes as well.

And I am sure there are other shortcomings within this PR as well, so any comments are more than welcome.

PS:
This PR could perhaps be thought of more of a conversation starter rather than something that has to be merged.

…ed config system as well as a very short test case for new config
…sts from a separate file to the main config test file. Removed all remnants of the old config system
…ugh both the legacy getitem method as well as through the attribute based method
…well as an ampety dictionary when reading from config
Copy link
Collaborator

@w-is-h w-is-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it looks good, there are too many changes to go through everything, but I've checked the config.py and as there is backward compatibility via fakedict, all should be fine. Just to make sure maybe check hashing and make sure that whatever is changed in the config also changes the hash.

…lues as well as that they are consistent for identical configs
@mart-r
Copy link
Collaborator Author

mart-r commented Sep 7, 2022

Added some successful tests to ensure that changes to configs result in a changed hash.
And that the same config results in the same hash as well.

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you list pydantic as a dep in setup.py plz

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of small things I think.

  • can you add a test for loading config from a file i.e. testing the refactored config.parse_conifig_file
  • potentially reconsider the base FakeDict class, and reimplemented dict methods that potentially could be inherited directly.

medcat/config.py Outdated
from medcat.utils.hasher import Hasher
from typing import Optional, Iterable, Tuple, Dict, Any

LOGGER = logging.getLogger(__package__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style thing - but can you use log = logging... , consistent with the other modules

medcat/config.py Outdated
@@ -15,13 +22,73 @@ def workers(workers_override: Optional[int] = None) -> int:
return max(cpu_count() - 1, 1) if workers_override is None else workers_override


class ConfigMixin(object):
class FakeDict(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to only inherit from object (which isn't technically needed in any case as thats the default), can we not inherit directly from dict or from collections.UserDict. That has implementations for set and getitem, as well iterator methods, constructor implementations etc. etc.?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess inheriting from object is a habit from python2 where this made some difference. It's definitely not needed here.

As for inheriting from a dict (or UserDict), it really depends on whether or if we want to keep using it as such.
Right now I only implemented the methods that I could see being used in the project for dict based approach.
However, if we inherit from dict, we would need to overwrite a bunch more of the methods to make sure the beahviour would be predictable.
Plus, some parts of a regular dict would not make sense in the current context. For instance, using pop to remove one of the fields.

But then again, if we don't want to have the option of using the MixingConfig instances as a dict, we would be better off forcing that behaviour by not providing the methods overwritten by FakeDict. I've mostly kept that in for backwards compatibility and because I don't know where else parts of the config may be used as a dicts.
So I don't think the current behaviour is ideal, either.

So as I see it, we have a number of options going forward:

  • Keep full dict functionality
    • Inherit from dict (or UserDict) and overwrite a bunch of the methods
    • Raise an exception in case of methods that don't make sense
  • Keep no dict functionality
    • I.e remove FakeDict
    • Force usage of the fields/attributes
    • Where parts of the config required as dict, call asdict
  • Keep partial backwards compatibility but encourage use of the new system
    • Perhaps overwrite
    • I.e warn when the dict-based access is attempted

self.word_skipper = re.compile('^({})$'.format('|'.join(self.preprocessing['words_to_skip'])))
# Very agressive punct checker, input will be lowercased
self.punct_checker = re.compile(r'[^a-z0-9]+')
def asdict(self) -> Dict[str, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this or the above change would be needed if FakeDict or MixingConfig directly inherit from dict or UserDict

medcat/config.py Outdated


class _DefPartial(object):
"""This is a helper class to make it possible to check equality of two diffault Linking instances
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in the docstring? diffault

hash = c.get_hash()
self.assertEqual(orig_hash, hash)

def test_changeed_config_different_hash(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in test name

@w-is-h w-is-h merged commit e59c536 into CogStack:master Sep 11, 2022
mart-r added a commit to mart-r/MedCAT that referenced this pull request Jun 14, 2023
* Initial commit with new identifier based config

* Fixed issue with saving into json for idconfig

* Added support for MetaCat and NER configs with the new identifier based config system as well as a very short test case for new config

* New identity based config now allows for creation of new attributes

* Now allowing extra attributes for each of the sub-types of the config

* Moved the comments within configs to docstrings so IDEs would pick them up and show them as/when needed

* Moved identifier based config to regular config package along with tests from a separate file to the main config test file. Removed all remnants of the old config system

* Removed necessity of config merging needing dicts

* Removed most getitem config operations

* Removed unnecessary basemodel checks from new config

* Moved some further comments into docsstrings on the config level

* Added two new simple tests for legacy (getitem) support as well as the new identifier based methodd

* Moved away from Field assignments for config defaults since BaseModel handles new instance initialisation

* Added extra tests to make sure all the config keys are available through both the legacy getitem method as well as through the attribute based method

* Removed unused code from test

* Moved around some things to avoid duplication

* Removed (now) irrelevant comments from config

* Removed unused enums from config

* Added some further docstrings to config class(es)

* Being more accurate regarding setitem return type

* Removed debug from logged output

* Making sure that type validation happens during construction time for configs

* Added alternative parser so that {} could be used as an empty set as well as an ampety dictionary when reading from config

* Now validating config values when assigning as well

* Added new tests for merge and assignment validation

* Removed legacy comments

* Fixed a few type hinting issues that caused mypy to result in errors

* Fixed indentation and blank lines issues within meta cant and transformers ner configs

* Added tests that make sure the config hashes change upon change to values as well as that they are consistent for identical configs

* Added test that ensures version-specific config stuff doesn't affect the hash

* Added pydantic dependency to setup.py

* Moved config logger to class attribute

* Fixed typo in docsting

* Fixed typo in config test

* Removed inheritance from object

* Added tests for config's parse_config_file method
@mart-r mart-r deleted the configChanges branch September 8, 2023 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants