Relation extraction #173

Merged
merged 119 commits into from
May 1, 2024
b20e7c8
Added files.
vladd-bit Aug 24, 2021
eec6c59
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Aug 31, 2021
56220aa
More additions to rel extraction.
vladd-bit Sep 1, 2021
7ad88f5
Rel base.
vladd-bit Sep 3, 2021
233ce36
Update.
vladd-bit Sep 6, 2021
85a7015
Updates.
vladd-bit Sep 10, 2021
5003548
Dependency parsing.
vladd-bit Oct 1, 2021
541b47d
Updates.
vladd-bit Oct 13, 2021
c042b0d
Added pre-training steps.
vladd-bit Oct 15, 2021
87d0c0c
Added training & model utils.
vladd-bit Oct 18, 2021
4f42696
Cleanup & fixes.
vladd-bit Oct 19, 2021
018d811
Update.
vladd-bit Oct 21, 2021
f3d3f44
Evaluation updates for pretraining.
vladd-bit Oct 27, 2021
e5f354e
Removed duplicate relation storage.
vladd-bit Nov 9, 2021
c69de67
Merged master.
vladd-bit Nov 9, 2021
031d256
Moved RE model file location.
vladd-bit Nov 12, 2021
2259a6b
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Nov 16, 2021
1c469e9
Structure revisions.
vladd-bit Nov 22, 2021
423b4e1
Added custom config for RE.
vladd-bit Dec 13, 2021
8ae9abb
Implemented custom dataset loader for RE.
vladd-bit Dec 13, 2021
186416c
More changes.
vladd-bit Dec 13, 2021
451e33f
Small fix.
vladd-bit Dec 13, 2021
8b36413
Latest additions to RelCAT (pipe + predictions)
vladd-bit Jan 19, 2022
2fb8fc9
Setup.py fix.
vladd-bit Jan 19, 2022
930dd11
RE utils update.
vladd-bit Jan 19, 2022
24b2841
rel model update.
vladd-bit Jan 19, 2022
193ecb1
rel dataset + tokenizer improvements.
vladd-bit Jan 19, 2022
03111a7
RelCAT updates.
vladd-bit Jan 19, 2022
7ab60f4
RelCAT saving/loading improvements.
vladd-bit Jan 21, 2022
40875f3
RelCAT saving/loading improvements.
vladd-bit Jan 21, 2022
810d1dc
RelCAT model fixes.
vladd-bit Jan 21, 2022
11dcb32
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Jan 21, 2022
72187f6
Attempted gpu learning fix. Dataset label generation fixes.
vladd-bit Jan 24, 2022
5f67a4c
Minor train dataset gen fix.
vladd-bit Jan 24, 2022
cfc0e91
Minor train dataset gen fix No.2.
vladd-bit Jan 24, 2022
9f4b220
Config updates.
vladd-bit Jan 25, 2022
19afa81
Gpu support fixes. Added label stats.
vladd-bit Jan 25, 2022
8eb1665
Evaluation stat fixes.
vladd-bit Jan 26, 2022
6e86fa2
Cleaned stat output mode during training.
vladd-bit Jan 26, 2022
5cee8cf
Build fix.
vladd-bit Jan 26, 2022
223ac9a
removed unused dependencies and fixed code formatting
vladd-bit Jan 26, 2022
ea7d68c
Mypy compliance.
vladd-bit Jan 26, 2022
1ea9738
Fixed linting.
vladd-bit Jan 27, 2022
9f6609e
More Gpu mode train fixes.
vladd-bit Jan 28, 2022
1782c0b
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Jan 28, 2022
fb86869
Fixed model saving/loading issues when using other baes models.
vladd-bit Jan 31, 2022
df21543
More fixes to stat evaluation. Added proper CAT integration of RelCAT.
vladd-bit Feb 3, 2022
92a5e08
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Feb 3, 2022
87d1a9c
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Mar 11, 2022
ced1627
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Mar 14, 2022
7b69710
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Mar 28, 2022
37fd212
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Apr 4, 2022
f0eda2b
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Apr 8, 2022
10269b9
Setup.py typo fix.
vladd-bit Apr 8, 2022
b8a45b2
Merge branch 'relation_extraction' of https://github.com/CogStack/Med…
vladd-bit Apr 8, 2022
20203ac
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit May 10, 2022
f057139
RelCAT loading fix.
vladd-bit May 10, 2022
197a27a
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Jul 21, 2022
86fd509
RelCAT Config changes.
vladd-bit Aug 1, 2022
79dc069
Type fix. Minor additions to RelCAT model.
vladd-bit Aug 1, 2022
323c895
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Aug 1, 2022
f1c56bf
Type fixes.
vladd-bit Aug 1, 2022
a78ff86
Type corrections.
vladd-bit Aug 2, 2022
f09ceb2
RelCAT update.
vladd-bit Mar 21, 2023
32574f2
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Mar 21, 2023
c081c3e
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit May 22, 2023
e2e48b5
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Dec 11, 2023
4ce5ba3
Type fixes.
vladd-bit Dec 12, 2023
21c09ff
Merge branch 'relation_extraction' of https://github.com/CogStack/Med…
vladd-bit Dec 13, 2023
8123689
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Dec 13, 2023
57ab0c5
Fixed type issue.
vladd-bit Dec 13, 2023
9da5aa6
RelCATConfig: added seed param.
vladd-bit Dec 13, 2023
009e832
Adaptations to the new codebase + type fixes..
vladd-bit Dec 15, 2023
1a7d130
Doc/type fixes.
vladd-bit Dec 19, 2023
53dba6a
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Dec 20, 2023
92613ed
Fixed input size issue for model.
vladd-bit Jan 8, 2024
a49a44a
Fixed issue(s) with model size and config.
vladd-bit Jan 16, 2024
6456e6e
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Jan 16, 2024
5aac9ab
RelCAT: updated configs to new style.
vladd-bit Jan 19, 2024
9c50b30
RelCAT: removed old refs to logging.
vladd-bit Jan 19, 2024
b071607
Merge branches 'relation_extraction' and 'master' of https://github.c…
vladd-bit Jan 29, 2024
89d9128
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Feb 7, 2024
e6e99cb
Fixed GPU training + added extra stat print for train set.
vladd-bit Feb 7, 2024
307d194
Type fixes.
vladd-bit Feb 7, 2024
fb7efe3
Updated dev requirements.
vladd-bit Feb 7, 2024
c235daf
Linting.
vladd-bit Feb 7, 2024
fcdf2e3
Merge branches 'relation_extraction' and 'master' of https://github.c…
vladd-bit Feb 9, 2024
aad0a73
Fixed pin_memory issue when training on CPU.
vladd-bit Feb 9, 2024
8a9026b
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Mar 8, 2024
f94e349
Updated RelCAT dataset get + default config.
vladd-bit Mar 21, 2024
0770356
Updated RelDS generator + default config
vladd-bit Mar 25, 2024
bdf20f5
Linting.
vladd-bit Mar 25, 2024
f7b5aaf
Updated RelDatset + config.
vladd-bit Apr 3, 2024
3e827cf
Merge branch 'relation_extraction' of https://github.com/CogStack/Med…
vladd-bit Apr 3, 2024
aaf6533
Pushing updates to model
shubham-s-agarwal Apr 8, 2024
18f9bb8
Fixing formatting
shubham-s-agarwal Apr 8, 2024
503513c
Update rel_dataset.py
shubham-s-agarwal Apr 8, 2024
040821b
Update rel_dataset.py
shubham-s-agarwal Apr 8, 2024
ed7c8d5
Update rel_dataset.py
shubham-s-agarwal Apr 8, 2024
8d0bfe4
RelCAT: added test resource files.
vladd-bit Apr 9, 2024
3f3a780
RelCAT: Fixed model load/checkpointing.
vladd-bit Apr 10, 2024
3f56824
RelCAT: updated to pipe spacy doc call.
vladd-bit Apr 12, 2024
b7a4987
RelCAT: added tests.
vladd-bit Apr 12, 2024
77d27b0
Merge branch 'relation_extraction' of https://github.com/CogStack/Med…
vladd-bit Apr 12, 2024
a9258a2
Fixed lint/type issues & added rel tag to test DS.
vladd-bit Apr 15, 2024
0ed70fb
Fixed ann id to token issue.
vladd-bit Apr 15, 2024
8db2e76
RelCAT: updated test dataset + tests.
vladd-bit Apr 18, 2024
6eea6b7
RelCAT: updates to requested changes + dataset improvements.
vladd-bit Apr 18, 2024
6972310
RelCAT: updated docs/logs according to commends.
vladd-bit Apr 18, 2024
d03316c
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Apr 18, 2024
8cb12a4
RelCAT: type fix.
vladd-bit Apr 18, 2024
d10318a
RelCAT: mct export dataset updates.
vladd-bit Apr 19, 2024
12acaeb
RelCAT: test updates + requested changes p2.
vladd-bit Apr 19, 2024
4c14a3a
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Apr 19, 2024
382cefc
RelCAT: log for MCT export train.
vladd-bit Apr 19, 2024
35b0913
Updated docs + split train_test & dataset for benchmarks.
vladd-bit Apr 26, 2024
d48bc41
type fixes.
vladd-bit Apr 26, 2024
3068516
Merge branch 'master' of https://github.com/CogStack/MedCAT into rela…
vladd-bit Apr 26, 2024
72643fc
Merge branch 'master' into relation_extraction
mart-r Apr 29, 2024
3 changes: 3 additions & 0 deletions .gitignore
@@ -18,6 +18,9 @@ venv
db.sqlite3
.ipynb_checkpoints

# vscode
.vscode

#tmp and similar files
.nfs*
*.log
31 changes: 27 additions & 4 deletions medcat/cat.py
@@ -33,6 +33,7 @@
from medcat.linking.context_based_linker import Linker
from medcat.preprocessing.cleaners import prepare_name
from medcat.meta_cat import MetaCAT
from medcat.rel_cat import RelCAT
from medcat.utils.meta_cat.data_utils import json_to_fake_spacy
from medcat.config import Config
from medcat.vocab import Vocab
@@ -64,6 +65,8 @@ class CAT(object):
meta_cats (list of medcat.meta_cat.MetaCAT, optional):
A list of models that will be applied sequentially on each
detected annotation.
rel_cats (list of medcat.rel_cat.RelCAT, optional)
List of models applied sequentially on all detected annotations.

Attributes (limited):
cdb (medcat.cdb.CDB):
@@ -89,6 +92,7 @@ def __init__(self,
vocab: Union[Vocab, None] = None,
config: Optional[Config] = None,
meta_cats: List[MetaCAT] = [],
rel_cats: List[RelCAT] = [],
addl_ner: Union[TransformersNER, List[TransformersNER]] = []) -> None:
self.cdb = cdb
self.vocab = vocab
@@ -100,6 +104,7 @@ def __init__(self,
self.config = config
self.cdb.config = config
self._meta_cats = meta_cats
self._rel_cats = rel_cats
self._addl_ner = addl_ner if isinstance(addl_ner, list) else [addl_ner]
self._create_pipeline(self.config)

@@ -133,6 +138,9 @@ def _create_pipeline(self, config: Config):
for meta_cat in self._meta_cats:
self.pipe.add_meta_cat(meta_cat, meta_cat.config.general.category_name)

for rel_cat in self._rel_cats:
self.pipe.add_rel_cat(rel_cat, "_".join(list(rel_cat.config.general["labels2idx"].keys())))

# Set max document length
self.pipe.spacy_nlp.max_length = config.preprocessing.max_document_length
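
The `add_rel_cat` call in this hunk names each pipeline component by joining the keys of `labels2idx` with underscores. A minimal sketch of that derivation follows; the label map is made up for illustration and is not taken from this PR.

```python
# Hypothetical label map; in the PR the real one comes from
# rel_cat.config.general["labels2idx"].
labels2idx = {"Other": 0, "Treats": 1, "Causes": 2}

# Same expression as in _create_pipeline: join the label names
# in dict insertion order.
component_name = "_".join(list(labels2idx.keys()))

print(component_name)
```

Since the name depends only on the label set, two RelCAT models with identical labels would get the same derived name.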

@@ -297,6 +305,10 @@ def create_model_pack(self, save_dir_path: str, model_pack_name: str = DEFAULT_M
name = comp[0]
meta_path = os.path.join(save_dir_path, "meta_" + name)
comp[1].save(meta_path)
if isinstance(comp[1], RelCAT):
name = comp[0]
rel_path = os.path.join(save_dir_path, "rel_" + name)
comp[1].save(rel_path)

# Add a model card also, why not
model_card_path = os.path.join(save_dir_path, "model_card.json")
@@ -341,7 +353,8 @@ def load_model_pack(cls,
meta_cat_config_dict: Optional[Dict] = None,
ner_config_dict: Optional[Dict] = None,
load_meta_models: bool = True,
- load_addl_ner: bool = True) -> "CAT":
+ load_addl_ner: bool = True,
+ load_rel_models: bool = True) -> "CAT":
"""Load everything within the 'model pack', i.e. the CDB, config, vocab and any MetaCAT models
(if present)

@@ -360,13 +373,16 @@ def load_model_pack(cls,
Whether to load MetaCAT models if present (Default value True).
load_addl_ner (bool):
Whether to load additional NER models if present (Default value True).
load_rel_models (bool):
Whether to load RelCAT models if present (Default value True).

Returns:
CAT: The resulting CAT object.
"""
from medcat.cdb import CDB
from medcat.vocab import Vocab
from medcat.meta_cat import MetaCAT
from medcat.rel_cat import RelCAT

model_pack_path = cls.attempt_unpack(zip_path)

@@ -409,8 +425,15 @@ def load_model_pack(cls,
meta_cats.append(MetaCAT.load(save_dir_path=meta_path,
config_dict=meta_cat_config_dict))

- cat = cls(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=meta_cats, addl_ner=addl_ner)
+ # Find Rel models in model_pack
+ rel_paths = [os.path.join(model_pack_path, path) for path in os.listdir(model_pack_path) if path.startswith('rel_')] if load_rel_models else []
+ rel_cats = []
+ for rel_path in rel_paths:
+ rel_cats.append(RelCAT.load(load_path=rel_path))
+
+ cat = cls(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=meta_cats, addl_ner=addl_ner, rel_cats=rel_cats)
logger.info(cat.get_model_card()) # Print the model card

return cat

def __call__(self, text: Optional[str], do_train: bool = False) -> Optional[Doc]:
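
The model-pack changes above save each RelCAT component under a `rel_` prefix and later find it again by scanning the unpacked directory for that prefix. A minimal standard-library sketch of the discovery step is given below. `find_rel_paths` is a hypothetical helper written for illustration, the entry names are made up, and the `sorted` call is an addition here for a deterministic order (the PR uses the raw `os.listdir` order).

```python
import os
import tempfile


def find_rel_paths(model_pack_path: str, load_rel_models: bool = True) -> list:
    """Hypothetical helper: list model-pack entries whose names start with 'rel_'."""
    if not load_rel_models:
        # Mirrors the `... if load_rel_models else []` guard in load_model_pack.
        return []
    return [os.path.join(model_pack_path, path)
            for path in sorted(os.listdir(model_pack_path))  # sorted: illustration only
            if path.startswith("rel_")]


with tempfile.TemporaryDirectory() as pack:
    # Fake model pack: one plain file and two component directories (made-up names).
    open(os.path.join(pack, "cdb.dat"), "w").close()
    os.makedirs(os.path.join(pack, "meta_Status"))
    os.makedirs(os.path.join(pack, "rel_Other_Treats"))

    found = [os.path.basename(p) for p in find_rel_paths(pack)]
    skipped = find_rel_paths(pack, load_rel_models=False)

print(found)
print(skipped)
```

The prefix check does not distinguish files from directories, so any entry named `rel_*` in the pack is treated as a RelCAT save path.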
@@ -1092,8 +1115,8 @@ def get_entities_multi_texts(self,
elif out[i].get('text', '') != text:
out.insert(i, self._doc_to_out(None, only_cui, addl_info)) # type: ignore

- cnf_annotation_output = self.config.annotation_output
- if not cnf_annotation_output.include_text_in_output:
+ cnf_annotation_output = getattr(self.config, 'annotation_output', {})
+ if not (cnf_annotation_output.get('include_text_in_output', False)):
for o in out:
if o is not None:
o.pop('text', None)
98 changes: 98 additions & 0 deletions medcat/config_rel_cat.py
@@ -0,0 +1,98 @@
import logging
from typing import Dict, Any, List
from medcat.config import MixingConfig, BaseModel, Optional, Extra


class General(MixingConfig, BaseModel):
"""The General part of the RelCAT config"""
device: str = "cpu"
relation_type_filter_pairs: List = []
"""Map from category values to ID, if empty it will be autocalculated during training"""
vocab_size: Optional[int] = None
lowercase: bool = True
"""If true all input text will be lowercased"""
cntx_left: int = 15
"""Number of tokens to take from the left of the concept"""
cntx_right: int = 15
"""Number of tokens to take from the right of the concept"""
window_size: int = 300
"""Max acceptable distance between entities (in characters); take care when using this, as it can produce sentences longer than 512 tokens (the limit is set by the tokenizer)"""

mct_export_max_non_rel_sample_size:int = 200
"""Limit the number of 'Other' samples selected for training/test. This is applied per encountered medcat project, sample_size/num_projects. """
mct_export_create_addl_rels: bool = False
"""When processing relations from a MedCAT export, relations labeled as 'Other' are created from all the annotation pairs available"""

tokenizer_name: str = "bert"
model_name: str = "bert-base-uncased"
log_level: int = logging.INFO
max_seq_length: int = 512
tokenizer_special_tokens: bool = False
annotation_schema_tag_ids: List = []
"""If a foreign non-MCAT trainer dataset is used, you can insert your own Rel entity token delimiters into the tokenizer, \
copy those token IDs here, and also resize your tokenizer embeddings and adjust the hidden_size of the model, this will depend on the number of tokens you introduce"""
labels2idx: Dict = {}
idx2labels: Dict = {}
pin_memory: bool = True
seed: int = 13
task: str = "train"


class Model(MixingConfig, BaseModel):
"""The model part of the RelCAT config"""
input_size: int = 300
hidden_size: int = 768
hidden_layers: int = 3
""" hidden_size * 5, 5 being the number of tokens, default (s1,s2,e1,e2+CLS)"""
model_size: int = 5120
dropout: float = 0.2
num_directions: int = 2
"""2 - bidirectional model, 1 - unidirectional"""

padding_idx: int = -1
emb_grad: bool = True
"""If True the embeddings will also be trained"""
ignore_cpos: bool = False
"""If set to True center positions will be ignored when calculating representation"""

class Config:
extra = Extra.allow
validate_assignment = True


class Train(MixingConfig, BaseModel):
"""The train part of the RelCAT config"""
nclasses: int = 2
"""Number of classes that this model will output"""
batch_size: int = 25
nepochs: int = 1
lr: float = 1e-4
adam_epsilon: float = 1e-4
test_size: float = 0.2
gradient_acc_steps: int = 1
multistep_milestones: List[int] = [
2, 4, 6, 8, 12, 15, 18, 20, 22, 24, 26, 30]
multistep_lr_gamma: float = 0.8
max_grad_norm: float = 1.0
shuffle_data: bool = True
"""Used only during training, if set the dataset will be shuffled before train/test split"""
class_weights: Optional[Any] = None
score_average: str = "weighted"
"""What to use for averaging F1/P/R across labels"""
auto_save_model: bool = True
"""Should the model be saved during training for best results"""

class Config:
extra = Extra.allow
validate_assignment = True


class ConfigRelCAT(MixingConfig, BaseModel):
"""The RelCAT part of the config"""
general: General = General()
model: Model = Model()
train: Train = Train()

class Config:
extra = Extra.allow
validate_assignment = True
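
The `multistep_milestones` and `multistep_lr_gamma` fields in `Train` look like parameters for a step-wise learning-rate decay. The training loop is not part of this diff, so the sketch below is only an assumption: it applies the same rule as PyTorch's `MultiStepLR` (multiply the rate by gamma each time a milestone epoch is reached) to the default values from the config, in plain Python.

```python
from bisect import bisect_right

# Defaults copied from the Train config in this diff.
lr = 1e-4
milestones = [2, 4, 6, 8, 12, 15, 18, 20, 22, 24, 26, 30]
gamma = 0.8


def lr_at_epoch(epoch: int) -> float:
    """Learning rate at a given epoch, assuming MultiStepLR-style decay."""
    # Number of milestones that are <= epoch, i.e. already reached.
    passed = bisect_right(milestones, epoch)
    return lr * gamma ** passed


for epoch in (0, 2, 5, 30):
    print(epoch, lr_at_epoch(epoch))
```

Under this assumption, the default `nepochs = 1` never reaches the first milestone, so the rate would stay at `lr` for the whole run; the schedule only matters for longer training.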
9 changes: 9 additions & 0 deletions medcat/pipe.py
@@ -13,6 +13,7 @@
from medcat.linking.context_based_linker import Linker
from medcat.meta_cat import MetaCAT
from medcat.ner.vocab_based_ner import NER
from medcat.rel_cat import RelCAT
from medcat.utils.normalizers import TokenNormalizer, BasicSpellChecker
from medcat.config import Config
from medcat.pipeline.pipe_runner import PipeRunner
@@ -161,6 +162,13 @@ def add_meta_cat(self, meta_cat: MetaCAT, name: Optional[str] = None) -> None:
# Used for sharing pre-processed data/tokens
Doc.set_extension('share_tokens', default=None, force=True)

def add_rel_cat(self, rel_cat: RelCAT, name: Optional[str] = None) -> None:
component_name = spacy.util.get_object_name(rel_cat)
name = name if name is not None else component_name
Language.component(name=component_name, func=rel_cat)
self._nlp.add_pipe(component_name, name=name, last=True)
# dictionary containing relations of the form {}
Doc.set_extension("relations", default=[], force=True)

def add_addl_ner(self, addl_ner: TransformersNER, name: Optional[str] = None) -> None:
component_name = spacy.util.get_object_name(addl_ner)
@@ -169,6 +177,7 @@ def add_addl_ner(self, addl_ner: TransformersNER, name: Optional[str] = None) ->
self._nlp.add_pipe(component_name, name=name, last=True)

Doc.set_extension('ents', default=[], force=True)
Doc.set_extension('relations', default=[], force=True)
Span.set_extension('confidence', default=-1, force=True)
Span.set_extension('id', default=0, force=True)
Span.set_extension('cui', default=-1, force=True)