CU2e77a5x - Add a CDB merge function #373
Conversation
Thanks for the comments. I've made some of the suggested changes locally already when I started testing, and I'll upload them shortly. Regarding where this method should go - I don't see an ideal home for it currently.
I'll try and answer some of the questions as well.
About the context_vectors, and how to test merging them appropriately: it is difficult to test whether merging two CDBs will result in appropriate context_vectors calculated via a weighted average. To test this:
We have considered the case of negative samples affecting the context vectors, as they are random. Given a large enough corpus to train on this wouldn't be an issue, but it is beyond the scope of this kind of testing, so we will set the number of negative samples to 0. In a practical sense this would not be ideal, but here it is a good workaround for testing. We would also add documentation to indicate that context vectors produced via merging will not be identical to vectors generated at the same time.
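For illustration, a weighted average over per-window context vectors might look like the sketch below. The function name, the dict-of-arrays layout, and the window keys are hypothetical stand-ins, not MedCAT's actual internals; the weights come from each CDB's training count for the CUI, as described in the PR.

```python
import numpy as np

def merge_context_vectors(vec1, vec2, count1, count2):
    """Weighted average of two context-vector dicts (one vector per
    window length), weighted by each CDB's training count for the CUI.
    Window lengths present in only one CDB are copied as-is."""
    total = count1 + count2
    merged = {}
    for window in set(vec1) | set(vec2):
        if window in vec1 and window in vec2:
            merged[window] = (vec1[window] * count1 + vec2[window] * count2) / total
        else:
            # Only one CDB has this window length: keep its vector unchanged.
            merged[window] = vec1.get(window, vec2.get(window))
    return merged

# Toy example: cdb1 saw the concept 3 times, cdb2 once.
v1 = {"short": np.array([1.0, 0.0]), "long": np.array([0.5, 0.5])}
v2 = {"short": np.array([0.0, 1.0])}
merged = merge_context_vectors(v1, v2, count1=3, count2=1)
# "short": (3*[1,0] + 1*[0,1]) / 4 = [0.75, 0.25]; "long" copied from v1
```

This also shows why merged vectors won't be identical to vectors trained jointly: the weighted average only approximates the order-dependent updates real training would have made.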
Please make the GHA checks (flake8, mypy, tests) run successfully.
(Tests take a long time, so I don't like them in the hook - I run them separately.) You could then also update the top description as to what the current state of the PR is: what's been done and what's still a WIP. Perhaps convert the questions and/or TODOs to a task list with checkboxes. Though it would be great to retain stuff that fell out of scope for whatever reason.
I've fixed the tests - I had issues with flake8 locally. Flake8 6.0 doesn't seem compatible with our flake8 config file.
In requirements-dev.txt and CONTRIBUTING we've set things up with v4.0.1. I'll take a look at the entire thing later today.
Just an issue on my local machine. I'm writing up a list now of the changes I've made and of anything else left to do.
Just saying it's easier if you keep your virtual environment in sync with the project requirements :) |
Overall I think this looks good.
But I think we can improve a few small things.
tests/test_cdb.py (Outdated)
@@ -90,5 +90,44 @@ def test_cui2snames_population(self):
        with self.subTest(cui):
            self.assertIn(cui, self.undertest.cui2snames)

    def test_merge_cdb(self):
This one test seems to test a lot of different things, and it has a lot of setup in the actual test method.
What I would suggest is a new class (unittest.TestCase) where, in the class setup method, the two CDBs are generated and merged, and then various small test methods where each of the specific things is tested for.
Though in principle, these different test cases (i.e. the different changes that you're testing for within the merge) might also benefit from being separated into different classes. That way we'd be doing (and testing for) one change at a time. But I think it's fine to do it all at once here. Tests already take a long time, and I'd hate to slow them down even further. We can revisit at a later date if/when this becomes an issue.
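The suggested structure might be sketched like this, with toy stand-ins for the real code (make_cdb and merge_cdb here are hypothetical placeholders, not MedCAT functions): the expensive merge runs once in setUpClass, and each small, well-named method asserts one thing.

```python
import unittest

# Hypothetical stand-ins for building and merging CDBs.
def make_cdb(names):
    return {"name2count_train": {n: 1 for n in names}}

def merge_cdb(cdb1, cdb2):
    merged = dict(cdb1)
    counts = dict(cdb1["name2count_train"])
    for name, count in cdb2["name2count_train"].items():
        counts[name] = counts.get(name, 0) + count
    merged["name2count_train"] = counts
    return merged

class MergeCDBTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Expensive setup runs once per class: build the two CDBs and merge.
        cls.cdb1 = make_cdb(["fever", "cough"])
        cls.cdb2 = make_cdb(["fever", "rash"])
        cls.merged = merge_cdb(cls.cdb1, cls.cdb2)

    def test_counts_are_summed(self):
        self.assertEqual(self.merged["name2count_train"]["fever"], 2)

    def test_unique_names_are_kept(self):
        self.assertEqual(self.merged["name2count_train"]["rash"], 1)
```

A failure then reports the specific method name (e.g. test_counts_are_summed) rather than one monolithic test.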
I was personally unhappy with the test method being where it is, and with it not being very elegant.
I think if we move the method to its own utils class elsewhere in medcat, then splitting these out would be ideal.
Unless you see a reason to use this outside of tests, I don't think it fits in the medcat package itself. Though there is a tests.helper module you could move some boilerplate to.
On the other hand, if this piece of code is only relevant to testing CDB merging, creating a new module for testing the merge capabilities and having the helper method there would probably make more sense.
tests/test_cdb.py (Outdated)
maker2 = CDBMaker(config)  # second maker is required as it will otherwise point to same object
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker1.prepare_csvs(csv_paths=[path])
cdb2 = maker2.prepare_csvs(csv_paths=[path])
Perhaps we can get away with one CDBMaker just making two identical CDBs? That seems to be the way this class works anyway (the CDBMaker is created once upon class setup and a separate CDB is made in the setup for each individual test).
That is how I assumed it would work, but making two CDBs from the same maker seems to return the same object, i.e.

maker = CDBMaker(config)
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker.prepare_csvs(csv_paths=[path])
cdb2 = maker.prepare_csvs(csv_paths=[path])
print(cdb1 == cdb2)

prints True.
I'm happy to change this if I'm mistaken or have overlooked something in the building. The same config seems to be fine for CDBMaker.
I think you're right - it just modifies the same instance each time the prepare_csvs method is called.
Though merging two identical CDBs to return a third identical one could also be an edge case to test.
But what this means is that the way the entire class works is prone to side effects, since in each setUp it looks like a new CDB is generated, whereas in reality the same instance is modified and returned.
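The aliasing being discussed can be reproduced with a toy builder (Maker below is a stand-in for illustration, not CDBMaker itself): a build method that mutates and returns one internal object makes every "new" result the same instance, and copy.deepcopy is one way to decouple them.

```python
from copy import deepcopy

class Maker:
    """Toy builder that mutates and returns one internal object on
    every call - the side-effect-prone pattern described above."""
    def __init__(self):
        self._obj = {"names": set()}

    def prepare(self, names):
        self._obj["names"].update(names)
        return self._obj

maker = Maker()
a = maker.prepare(["fever"])
b = maker.prepare(["cough"])
assert a is b                             # both calls returned the same instance
assert a["names"] == {"fever", "cough"}   # "a" was silently mutated too

# deepcopy produces a genuinely independent object:
c = deepcopy(maker.prepare([]))
c["names"].add("rash")
assert "rash" not in a["names"]
```

This is why a second CDBMaker (or a deepcopy of the first result) is needed in the test: otherwise cdb1 and cdb2 are one object and cdb1 == cdb2 trivially holds.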
Changes made to:
Added testing to:
These tests might not be extensive; if there are any tests you'd like to see, let me know and I'll make sure to add them.
Regarding a cdb generated from a full dataset: we agreed that this test isn't suitable for unit testing, and it proved quite difficult to collect suitable results on very small training datasets. I managed to test with and without negative sampling and found similar results between them regardless, due to the small size of the training document. For a fairer comparison I tested the vector merging method separately.
Potential changes to be done:
The only reason I'd be inclined to not put the method in
I think it might be worth trying to set up tests for the different
lgtm - a couple of things though:
- Can we move this to a medcat.utils.cdb module? The rationale for moving it into this new module is that there is a cdb filter method somewhere that should also go here.
- A tutorial on how to use this functionality would be nice. Perhaps in a specialised tutorial called something like "CDB management"? https://github.com/CogStack/MedCATtutorials
- Responding to the thread re tests: side effects in tests are a recipe for future frustration. Please fix with setup local to the class in the test, etc.
I agree, but I think it's outside the scope of this PR since it's the pre-existing
A few more changes which I think finalise this:
Before I merge the request, I just want to check if there are any further changes I've missed from the previous suggestions. Regarding the move of this work to another module, should that be another CU also?
That all looks great!
I think the tests could still benefit from being separated into multiple tests, especially since you added more tests to the same test method.
By separating the tests into multiple properly named ones, we'll have more granularity about what breaks (if it does) in the future: instead of just "test_cdb_merge failed", it'd say something like "test_cdb_merge_name_in_cui2names failed".
If Tom said he'd like it in
I've moved everything over, and split out the tests. I'll merge when these changes get a thumbs up.
Once the two comments I've made are resolved, I'd be happy for a merge.
Looks all good to me now. Wait for GHA to do its magic and we'll be good to merge on my end.
@adam-sutton-1992
Hi, sorry - I will do in the future. Thanks.
This is the early stages of this enhancement; I'll list what I've done, questions I have, and a TODO list.

What I've done:
- Deepcopy cdb1 and its config. I consider this cdb to be the base.
- For the name2x dicts I check if they exist in both cdbs and try to merge them in the most suitable way (i.e. name2count_train will add the number of training samples together).
- If a name key only exists in cdb2, I add that key and value to the new cdb.
- Merge snames using a set union.
- For cui2x I do as I did for names, and merge them as suitable. Where CUIs exist in both, I perform a weighted sum, where the weights are sourced from the cui2count_train dict. I check for each window length accordingly.
- vocab sums the occurrences for each word from both cdbs.
- addl_info prioritises adding cdb1 entries over cdb2 if keys exist in both cdbs.

Questions:
- What is cui2info used for? There is no documentation for it, and it seems redundant compared to addl_info.
- What should happen with addl_info entries that have the same keys?
- What about the overwrite and vector_import params?

TODO:
- Handle the case where cuis exist for context_vectors but don't for count_train.
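The merge strategy described above can be sketched on plain dicts. Everything below is illustrative: merge_cdbs is a hypothetical helper and the dict layout only mirrors the attribute names mentioned in the PR (name2count_train, snames, addl_info), not MedCAT's actual CDB class.

```python
from copy import deepcopy

def merge_cdbs(cdb1, cdb2):
    """Sketch of the merge steps listed above, on plain dicts."""
    merged = deepcopy(cdb1)  # cdb1 is the base; deepcopy avoids mutating it
    # name2count_train: sum training counts where names overlap,
    # and names only in cdb2 are added to the new cdb.
    for name, count in cdb2["name2count_train"].items():
        merged["name2count_train"][name] = (
            merged["name2count_train"].get(name, 0) + count
        )
    # snames: set union.
    merged["snames"] = cdb1["snames"] | cdb2["snames"]
    # addl_info: cdb1 entries win on key clashes (later unpack overrides).
    merged["addl_info"] = {**cdb2["addl_info"], **cdb1["addl_info"]}
    return merged

cdb1 = {"name2count_train": {"fever": 2}, "snames": {"fev"},
        "addl_info": {"src": "a"}}
cdb2 = {"name2count_train": {"fever": 1, "rash": 3}, "snames": {"ras"},
        "addl_info": {"src": "b", "extra": 1}}
m = merge_cdbs(cdb1, cdb2)
# m["name2count_train"] == {"fever": 3, "rash": 3}
```

The {**cdb2, **cdb1} unpacking order is the simplest way to express "cdb1 wins on clashes while keeping cdb2-only keys".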