v1.14.0 release PR #506

Merged — 19 commits, merged on Nov 19, 2024
Changes from all commits (19 commits)
b8bb4e3
Production/master sync (#483)
mart-r Aug 30, 2024
b28fa05
CU-8695jwnjk: Fix description of an argument for --help to work in CL…
mart-r Sep 2, 2024
6127f77
Pushing bug fix for metacat (#487)
shubham-s-agarwal Sep 5, 2024
2588670
CU-8695q21f6: Replace rosalind links with S3 ones in docs (#489)
mart-r Sep 16, 2024
56a2856
CU-8695uhe5n: Update docs dependency pins (#491)
mart-r Sep 16, 2024
394e17b
CU-8695m5q4x: Fix issues detecting 1-token concepts (#485)
mart-r Sep 17, 2024
eb912d6
CU-8695pvhfe fix usage monitoring for multiprocessing (#488)
mart-r Sep 17, 2024
cbae5b3
CU-8695knfbg Add name edits to regression suite (#486)
mart-r Sep 30, 2024
a9544f7
CU-869574kvp update snomed preprocessing naming (#469)
mart-r Sep 30, 2024
b433195
CU-8695vu71q: Make report identical run to run in identical cases (#492)
mart-r Sep 30, 2024
44db08b
CU-8695ucw9b deid transformers fix (#490)
mart-r Oct 7, 2024
909cfad
CU-869637yfx: Pin spacy dependency to lower than 3.8 (#494)
mart-r Oct 9, 2024
3e01747
MetaCAT fixes and upgrades (#495)
shubham-s-agarwal Oct 14, 2024
976adc2
CU-869671bn4: Update requirements to fix workflow issue due to mypy …
mart-r Oct 28, 2024
04efda5
CU-86967nnra Drop python 3.8 support (EoL) (#498)
mart-r Oct 28, 2024
e924798
CU-86964zm4d fix preprocessing (#496)
mart-r Nov 1, 2024
c0082ef
CU-8695hghww backwards compatibility workflow (#478)
mart-r Nov 1, 2024
df3df66
CU-8696n7w95: Remove commented code to fix DeID (oversight in PR 490)…
mart-r Nov 15, 2024
37a8a63
CU-8696m1mch: Remove versioning utility since all its parts were depr…
mart-r Nov 19, 2024
4 changes: 3 additions & 1 deletion .github/workflows/main.yml
@@ -12,7 +12,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [ '3.8', '3.9', '3.10', '3.11' ]
python-version: [ '3.9', '3.10', '3.11' ]
max-parallel: 4

steps:
@@ -42,6 +42,8 @@ jobs:
timeout 25m python -m unittest ${second_half_nl[@]}
- name: Regression
run: source tests/resources/regression/run_regression.sh
- name: Model backwards compatibility
run: source tests/resources/model_compatibility/check_backwards_compatibility.sh
- name: Get the latest release version
id: get_latest_release
uses: actions/github-script@v6
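The new "Model backwards compatibility" step sources a shell script in CI. A minimal Python sketch of the kind of smoke test such a script could run (the model-pack path and test sentence are assumptions, not taken from this PR):

# Hedged sketch of a backwards-compatibility smoke test; the pack path is hypothetical.
from medcat.cat import CAT

# Load a model pack that was saved with an older MedCAT release.
cat = CAT.load_model_pack("tests/resources/model_compatibility/old_model_pack.zip")

# The old pack should still load and annotate without raising.
result = cat.get_entities("The patient has chronic kidney disease.")
assert isinstance(result, dict) and "entities" in result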
8 changes: 4 additions & 4 deletions docs/main.md
@@ -122,12 +122,12 @@ If you have access to UMLS or SNOMED-CT, you can download the pre-built CDB and
A basic trained model is made public. It contains ~ 35K concepts available in `MedMentions`. This was compiled from MedMentions and does not include any data from [NLM](https://www.nlm.nih.gov/research/umls/) as that data is not publicly available.

Model packs:
- MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) [Download](https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip)
- MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/medmen_wstatus_2021_oct.zip)

Separate models:
- Vocabulary [Download](https://medcat.rosalind.kcl.ac.uk/media/vocab.dat) - Built from MedMentions
- CDB [Download](https://medcat.rosalind.kcl.ac.uk/media/cdb-medmen-v1_2.dat) - Built from MedMentions
- MetaCAT Status [Download](https://medcat.rosalind.kcl.ac.uk/media/mc_status.zip) - Built from a sample from MIMIC-III, detects whether an annotation is Affirmed (Positive) or Other (Negated or Hypothetical)
- Vocabulary [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/vocab.dat) - Built from MedMentions
- CDB [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/cdb-medmen-v1.dat) - Built from MedMentions
- MetaCAT Status [Download](https://cogstack-medcat-example-models.s3.eu-west-2.amazonaws.com/medcat-example-models/mc_status.zip) - Built from a sample from MIMIC-III, detects whether an annotation is Affirmed (Positive) or Other (Negated or Hypothetical)

## Acknowledgements
Entity extraction was trained on [MedMentions](https://github.com/chanzuckerberg/MedMentions). In total it has ~35K entities from UMLS.
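For reference, a minimal sketch of loading the separate models listed above with the MedCAT 1.x API (file names as downloaded; the example sentence and output handling are illustrative):

# Sketch: load the MedMentions vocabulary and CDB downloaded above and annotate a sentence.
from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab

vocab = Vocab.load("vocab.dat")
cdb = CDB.load("cdb-medmen-v1.dat")
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

doc = cat("Patient presents with chest pain and shortness of breath.")
print([(ent._.cui, ent.text) for ent in doc.ents])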
148 changes: 75 additions & 73 deletions docs/requirements.txt
@@ -2,103 +2,105 @@ sphinx==6.2.1
sphinx-rtd-theme~=1.0
myst-parser~=0.17
sphinx-autoapi~=3.0.0
MarkupSafe==2.1.3
accelerate==0.23.0
aiofiles==23.2.1
aiohttp==3.8.5
MarkupSafe==2.1.5
accelerate==0.34.2
aiofiles==24.1.0
aiohttp==3.10.5
aiosignal==1.3.1
asttokens==2.4.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.1.0
attrs==24.2.0
backcall==0.2.0
blis==0.7.11
catalogue==2.0.10
certifi==2023.7.22
charset-normalizer==3.3.0
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
comm==0.1.4
confection==0.1.3
comm==0.2.2
confection==0.1.5
cymem==2.0.8
datasets==2.14.5
darglint==1.8.1
datasets==2.21.0
decorator==5.1.1
dill==0.3.7
exceptiongroup==1.1.3
executing==2.0.0
filelock==3.12.4
flake8==4.0.1
frozenlist==1.4.0
fsspec==2023.6.0
gensim==4.3.2
huggingface-hub==0.17.3
idna==3.4
ipython==8.16.1
ipywidgets==8.1.1
dill==0.3.8
exceptiongroup==1.2.2
executing==2.1.0
filelock==3.16.0
flake8==7.0.0
frozenlist==1.4.1
fsspec==2024.6.1
gensim==4.3.3
huggingface-hub==0.24.7
idna==3.10
ipython==8.27.0
ipywidgets==8.1.5
jedi==0.19.1
jinja2==3.1.2
joblib==1.3.2
jsonpickle==3.0.2
jupyterlab-widgets==3.0.9
langcodes==3.3.0
matplotlib-inline==0.1.6
mccabe==0.6.1
jinja2==3.1.4
joblib==1.4.2
jsonpickle==3.3.0
jupyterlab-widgets==3.0.13
langcodes==3.4.0
matplotlib-inline==0.1.7
mccabe==0.7.0
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
multidict==6.1.0
multiprocess==0.70.16
murmurhash==1.0.10
mypy==1.0.0
mypy-extensions==0.4.3
networkx==3.1
mypy==1.11.2
mypy-extensions==1.0.0
networkx==3.3
numpy==1.25.2
packaging==23.2
pandas==2.1.1
parso==0.8.3
pathy==0.10.2
pexpect==4.8.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
pathy==0.11.0
peft==0.12.0
pexpect==4.9.0
pickleshare==0.7.5
preshed==3.0.9
prompt-toolkit==3.0.39
psutil==5.9.5
prompt-toolkit==3.0.47
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==13.0.0
pycodestyle==2.8.0
pydantic==1.10.13
pyflakes==2.4.0
pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
pyyaml==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.0
scikit-learn==1.3.1
pure-eval==0.2.3
pyarrow==17.0.0
pycodestyle==2.11.1
pydantic==1.10.18
pyflakes==3.2.0
pygments==2.18.0
python-dateutil==2.9.0
pytz==2024.2
pyyaml==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.9.3
six==1.16.0
smart-open==6.4.0
spacy==3.4.4
spacy==3.6.1
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
stack-data==0.6.3
sympy==1.12
sympy==1.13.2
thinc==8.1.12
threadpoolctl==3.2.0
tokenizers==0.14.1
threadpoolctl==3.5.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.1.0
tqdm==4.66.1
traitlets==5.11.2
transformers==4.34.0
triton==2.1.0
typer==0.7.0
torch==2.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.2
triton==3.0.0
typer==0.9.4
types-PyYAML==6.0.3
types-aiofiles==0.8.3
types-setuptools==57.4.10
typing-extensions==4.8.0
tzdata==2023.3
urllib3==2.0.6
wasabi==0.10.1
wcwidth==0.2.8
widgetsnbextension==4.0.9
xxhash==3.4.1
yarl==1.9.2
typing-extensions==4.12.2
tzdata==2024.1
urllib3==2.2.3
wasabi==1.1.3
wcwidth==0.2.13
widgetsnbextension==4.0.13
xxhash==3.5.0
yarl==1.11.1
Binary file removed examples/cdb_new.dat
Binary file not shown.
4 changes: 2 additions & 2 deletions install_requires.txt
@@ -1,7 +1,7 @@
'numpy>=1.22.0,<1.26.0' # 1.22.0 is first to support python 3.11; post 1.26.0 there's issues with scipy
'pandas>=1.4.2' # first to support 3.11
'gensim>=4.3.0,<5.0.0' # 5.3.0 is first to support 3.11; avoid major version bump
'spacy>=3.6.0,<4.0.0' # Some later model packs (e.g HPO) are made with 3.6.0 spacy model; avoid major version bump
'spacy>=3.6.0,<3.8.0' # 3.8 only supports numpy2 which we can't use due to other dependencies
'scipy~=1.9.2' # 1.9.2 is first to support 3.11
'transformers>=4.34.0,<5.0.0' # avoid major version bump
'accelerate>=0.23.0' # required by Trainer class in de-id
@@ -21,4 +21,4 @@
'click>=8.0.4' # allow later versions, tested with 8.1.3
'pydantic>=1.10.0,<2.0' # for spacy compatibility; avoid 2.0 due to breaking changes
"humanfriendly~=10.0" # for human readable file / RAM sizes
"peft>=0.8.2"
"peft>=0.8.2"
23 changes: 22 additions & 1 deletion medcat/cat.py
@@ -1127,11 +1127,29 @@ def get_entities_multi_texts(self,
self.pipe.set_error_handler(self._pipe_error_handler)
try:
texts_ = self._get_trimmed_texts(texts)
if self.config.general.usage_monitor.enabled:
input_lengths: List[Tuple[int, int]] = []
for orig_text, trimmed_text in zip(texts, texts_):
if orig_text is None or trimmed_text is None:
l1, l2 = 0, 0
else:
l1 = len(orig_text)
l2 = len(trimmed_text)
input_lengths.append((l1, l2))
docs = self.pipe.batch_multi_process(texts_, n_process, batch_size)

for doc in tqdm(docs, total=len(texts_)):
for doc_nr, doc in tqdm(enumerate(docs), total=len(texts_)):
doc = None if doc.text.strip() == '' else doc
out.append(self._doc_to_out(doc, only_cui, addl_info, out_with_text=True))
if self.config.general.usage_monitor.enabled:
l1, l2 = input_lengths[doc_nr]
if doc is None:
nents = 0
elif self.config.general.show_nested_entities:
nents = len(doc._.ents) # type: ignore
else:
nents = len(doc.ents) # type: ignore
self.usage_monitor.log_inference(l1, l2, nents)

# Currently spaCy cannot mark which pieces of text failed within the pipe, so this workaround is used,
# which also assumes the texts are different from each other.
@@ -1637,6 +1655,9 @@ def _mp_cons(self, in_q: Queue, out_list: List, min_free_memory: float,
logger.warning("PID: %s failed one document in _mp_cons, running will continue normally. \n" +
"Document length in chars: %s, and ID: %s", pid, len(str(text)), i_text)
logger.warning(str(e))
if self.config.general.usage_monitor.enabled:
# NOTE: This is in another process, so need to explicitly flush
self.usage_monitor._flush_logs()
sleep(2)

def _add_nested_ent(self, doc: Doc, _ents: List[Span], _ent: Union[Dict, Span]) -> None:
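With this change, usage monitoring records per-document input lengths and entity counts on the multiprocessing path as well, and worker processes flush their logs explicitly. A hedged usage sketch (model-pack path and texts are hypothetical; keyword names follow the code above):

# Sketch: enable usage monitoring and run the multiprocessing entry point.
from medcat.cat import CAT

cat = CAT.load_model_pack("path/to/model_pack.zip")  # hypothetical path
cat.config.general.usage_monitor.enabled = True      # per-document lengths and entity counts get logged

texts = ["Patient denies chest pain.", "", "History of type 2 diabetes."]
results = cat.get_entities_multi_texts(texts, n_process=2, batch_size=2)
for res in results:
    print(len(res.get("entities", {})))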
7 changes: 3 additions & 4 deletions medcat/meta_cat.py
@@ -257,20 +257,19 @@ def train_raw(self, data_loaded: Dict, save_dir_path: Optional[str] = None, data
category_value2id = g_config['category_value2id']
if not category_value2id:
# Encode the category values
data_undersampled, full_data, category_value2id = encode_category_values(data,
full_data, data_undersampled, category_value2id = encode_category_values(data,
category_undersample=self.config.model.category_undersample)
g_config['category_value2id'] = category_value2id
else:
# We already have everything, just get the data
data_undersampled, full_data, category_value2id = encode_category_values(data,
full_data, data_undersampled, category_value2id = encode_category_values(data,
existing_category_value2id=category_value2id,
category_undersample=self.config.model.category_undersample)
g_config['category_value2id'] = category_value2id
# Make sure the config number of classes is the same as the one found in the data
if len(category_value2id) != self.config.model['nclasses']:
logger.warning(
"The number of classes set in the config is not the same as the one found in the data: {} vs {}".format(
self.config.model['nclasses'], len(category_value2id)))
"The number of classes set in the config is not the same as the one found in the data: %d vs %d",self.config.model['nclasses'], len(category_value2id))
logger.warning("Auto-setting the nclasses value in config and rebuilding the model.")
self.config.model['nclasses'] = len(category_value2id)

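Besides swapping the unpacking so that full_data is returned first from encode_category_values, the warning above now passes lazy %-style arguments to the logger. A small self-contained sketch of that logging idiom (values are illustrative):

# Sketch: %-style logger arguments defer string formatting until the record is emitted.
import logging

logger = logging.getLogger(__name__)
nclasses_config, nclasses_data = 2, 3  # illustrative values
logger.warning(
    "The number of classes set in the config is not the same as the one found in the data: %d vs %d",
    nclasses_config, nclasses_data)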
63 changes: 34 additions & 29 deletions medcat/ner/transformers_ner.py
@@ -4,7 +4,7 @@
import datasets
from spacy.tokens import Doc
from datetime import datetime
from typing import Iterable, Iterator, Optional, Dict, List, cast, Union, Tuple, Callable
from typing import Iterable, Iterator, Optional, Dict, List, cast, Union, Tuple, Callable, Type
from spacy.tokens import Span
import inspect
from functools import partial
@@ -87,7 +87,13 @@ def create_eval_pipeline(self):
# NOTE: this will fix the DeID model(s) created before medcat 1.9.3
# though this fix may very well be unstable
self.ner_pipe.tokenizer._in_target_context_manager = False
if not hasattr(self.ner_pipe.tokenizer, 'split_special_tokens'):
# NOTE: this will fix the DeID model(s) created with transformers before 4.42
# and allow them to run with later transformers
self.ner_pipe.tokenizer.split_special_tokens = False
self.ner_pipe.device = self.model.device
self._consecutive_identical_failures = 0
self._last_exception: Optional[Tuple[str, Type[Exception]]] = None

def get_hash(self) -> str:
"""A partial hash trying to catch differences between models.
@@ -390,34 +396,33 @@ def _process(self,
#all_text_processed = self.tokenizer.encode_eval(all_text)
# For now we will process the documents one by one, should be improved in the future to use batching
for doc in docs:
try:
res = self.ner_pipe(doc.text, aggregation_strategy=self.config.general['ner_aggregation_strategy'])
doc.ents = [] # type: ignore
for r in res:
inds = []
for ind, word in enumerate(doc):
end_char = word.idx + len(word.text)
if end_char <= r['end'] and end_char > r['start']:
inds.append(ind)
# To not loop through everything
if end_char > r['end']:
break
if inds:
entity = Span(doc, min(inds), max(inds) + 1, label=r['entity_group'])
entity._.cui = r['entity_group']
entity._.context_similarity = r['score']
entity._.detected_name = r['word']
entity._.id = len(doc._.ents)
entity._.confidence = r['score']

doc._.ents.append(entity)
create_main_ann(self.cdb, doc)
if self.cdb.config.general['make_pretty_labels'] is not None:
make_pretty_labels(self.cdb, doc, LabelStyle[self.cdb.config.general['make_pretty_labels']])
if self.cdb.config.general['map_cui_to_group'] is not None and self.cdb.addl_info.get('cui2group', {}):
map_ents_to_groups(self.cdb, doc)
except Exception as e:
logger.warning(e, exc_info=True)
res = self.ner_pipe(doc.text, aggregation_strategy=self.config.general['ner_aggregation_strategy'])
doc.ents = [] # type: ignore
for r in res:
inds = []
for ind, word in enumerate(doc):
end_char = word.idx + len(word.text)
if end_char <= r['end'] and end_char > r['start']:
inds.append(ind)
# To not loop through everything
if end_char > r['end']:
break
if inds:
entity = Span(doc, min(inds), max(inds) + 1, label=r['entity_group'])
entity._.cui = r['entity_group']
entity._.context_similarity = r['score']
entity._.detected_name = r['word']
entity._.id = len(doc._.ents)
entity._.confidence = r['score']

doc._.ents.append(entity)
create_main_ann(self.cdb, doc)
if self.cdb.config.general['make_pretty_labels'] is not None:
make_pretty_labels(self.cdb, doc, LabelStyle[self.cdb.config.general['make_pretty_labels']])
if self.cdb.config.general['map_cui_to_group'] is not None and self.cdb.addl_info.get('cui2group', {}):
map_ents_to_groups(self.cdb, doc)
self._consecutive_identical_failures = 0 # success
self._last_exception = None
yield from docs

# Override
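The new attribute check keeps older DeID tokenizers working under newer transformers releases. A standalone sketch of the same defensive patch (the helper name is hypothetical; the attribute names follow the diff):

# Sketch: defensive patches for tokenizers saved with older medcat/transformers versions.
def patch_legacy_tokenizer(tokenizer) -> None:
    # Models created before medcat 1.9.3 may carry a stale context-manager flag.
    if getattr(tokenizer, "_in_target_context_manager", False):
        tokenizer._in_target_context_manager = False
    # Tokenizers saved with transformers < 4.42 lack this attribute expected by newer releases.
    if not hasattr(tokenizer, "split_special_tokens"):
        tokenizer.split_special_tokens = False

# Usage (illustrative): patch_legacy_tokenizer(ner_pipe.tokenizer)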