-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CU-2tuwdjf Move to an identifier based config #257
Conversation
…ed config system as well as a very short test case for new config
…em up and show them as/when needed
…sts from a separate file to the main config test file. Removed all remnants of the old config system
…e new identifier based methodd
… handles new instance initialisation
…ugh both the legacy getitem method as well as through the attribute based method
…well as an ampety dictionary when reading from config
…rmers ner configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it looks good, there are too many changes to go through everything, but I've checked the config.py and as there is backward compatibility via fakedict, all should be fine. Just to make sure maybe check hashing and make sure that whatever is changed in the config also changes the hash.
…lues as well as that they are consistent for identical configs
Added some successful tests to ensure that changes to configs result in a changed hash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you list pydantic as a dep in setup.py plz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of small things I think.
- can you add a test for loading config from a file i.e. testing the refactored config.parse_conifig_file
- potentially reconsider the base FakeDict class, and reimplemented dict methods that potentially could be inherited directly.
medcat/config.py
Outdated
from medcat.utils.hasher import Hasher | ||
from typing import Optional, Iterable, Tuple, Dict, Any | ||
|
||
LOGGER = logging.getLogger(__package__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style thing - but can you use log = logging... , consistent with the other modules
medcat/config.py
Outdated
@@ -15,13 +22,73 @@ def workers(workers_override: Optional[int] = None) -> int: | |||
return max(cpu_count() - 1, 1) if workers_override is None else workers_override | |||
|
|||
|
|||
class ConfigMixin(object): | |||
class FakeDict(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to only inherit from object (which isn't technically needed in any case as thats the default), can we not inherit directly from dict
or from collections.UserDict
. That has implementations for set and getitem, as well iterator methods, constructor implementations etc. etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess inheriting from object is a habit from python2 where this made some difference. It's definitely not needed here.
As for inheriting from a dict
(or UserDict
), it really depends on whether or if we want to keep using it as such.
Right now I only implemented the methods that I could see being used in the project for dict based approach.
However, if we inherit from dict, we would need to overwrite a bunch more of the methods to make sure the beahviour would be predictable.
Plus, some parts of a regular dict would not make sense in the current context. For instance, using pop
to remove one of the fields.
But then again, if we don't want to have the option of using the MixingConfig
instances as a dict
, we would be better off forcing that behaviour by not providing the methods overwritten by FakeDict
. I've mostly kept that in for backwards compatibility and because I don't know where else parts of the config may be used as a dicts.
So I don't think the current behaviour is ideal, either.
So as I see it, we have a number of options going forward:
- Keep full
dict
functionality- Inherit from
dict
(orUserDict
) and overwrite a bunch of the methods - Raise an exception in case of methods that don't make sense
- Inherit from
- Keep no
dict
functionality- I.e remove
FakeDict
- Force usage of the fields/attributes
- Where parts of the config required as
dict
, callasdict
- I.e remove
- Keep partial backwards compatibility but encourage use of the new system
- Perhaps overwrite
- I.e warn when the dict-based access is attempted
self.word_skipper = re.compile('^({})$'.format('|'.join(self.preprocessing['words_to_skip']))) | ||
# Very agressive punct checker, input will be lowercased | ||
self.punct_checker = re.compile(r'[^a-z0-9]+') | ||
def asdict(self) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this or the above change would be needed if FakeDict or MixingConfig directly inherit from dict
or UserDict
medcat/config.py
Outdated
|
||
|
||
class _DefPartial(object): | ||
"""This is a helper class to make it possible to check equality of two diffault Linking instances |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in the docstring? diffault
tests/test_config.py
Outdated
hash = c.get_hash() | ||
self.assertEqual(orig_hash, hash) | ||
|
||
def test_changeed_config_different_hash(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in test name
* Initial commit with new identifier based config * Fixed issue with saving into json for idconfig * Added support for MetaCat and NER configs with the new identifier based config system as well as a very short test case for new config * New identity based config now allows for creation of new attributes * Now allowing extra attributes for each of the sub-types of the config * Moved the comments within configs to docstrings so IDEs would pick them up and show them as/when needed * Moved identifier based config to regular config package along with tests from a separate file to the main config test file. Removed all remnants of the old config system * Removed necessity of config merging needing dicts * Removed most getitem config operations * Removed unnecessary basemodel checks from new config * Moved some further comments into docsstrings on the config level * Added two new simple tests for legacy (getitem) support as well as the new identifier based methodd * Moved away from Field assignments for config defaults since BaseModel handles new instance initialisation * Added extra tests to make sure all the config keys are available through both the legacy getitem method as well as through the attribute based method * Removed unused code from test * Moved around some things to avoid duplication * Removed (now) irrelevant comments from config * Removed unused enums from config * Added some further docstrings to config class(es) * Being more accurate regarding setitem return type * Removed debug from logged output * Making sure that type validation happens during construction time for configs * Added alternative parser so that {} could be used as an empty set as well as an ampety dictionary when reading from config * Now validating config values when assigning as well * Added new tests for merge and assignment validation * Removed legacy comments * Fixed a few type hinting issues that caused mypy to result in errors * Fixed indentation and blank lines issues within meta cant and transformers ner configs * Added tests that make sure the config hashes change upon change to values as well as that they are consistent for identical configs * Added test that ensures version-specific config stuff doesn't affect the hash * Added pydantic dependency to setup.py * Moved config logger to class attribute * Fixed typo in docsting * Fixed typo in config test * Removed inheritance from object * Added tests for config's parse_config_file method
The current config is based on (potentially) nested dictionaries with string keys.
However, the system has some potential shortcomings
get
andgetattr
methods)This PR is attempting to remedy some of the above issues by basing the config on pydantic's Basemodel.
The key benefits are
Some other priorities when designing
['item']
still works for setting and getting itemsWhile I made every attempt to make sure the changes are compatible with the other parts of the project and all the tests are successful, it is entirely possible that this change does introduce some breaking changes as well.
And I am sure there are other shortcomings within this PR as well, so any comments are more than welcome.
PS:
This PR could perhaps be thought of more of a conversation starter rather than something that has to be merged.