Integration of MedCAT #101

Zethson · 2021-09-22T13:31:57Z

To extract keywords from free text notes we will be integrating MedCAT.

The goals are as follows:

Allow for the training of own models with own vocabularies
Allow for the extraction of keywords
Ensure that keywords are associated with attributes like quantitative or quality information
Extract a word2vec embedding and save it somewhere in the AnnData object
Ensure that we have an easily runnable default vocabulary etc.
n_proc of multiprocessing should respect n_jobs of our settings object
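
A minimal sketch of the last goal, assuming the settings object exposes an integer `n_jobs` and that non-positive values mean "use all available cores". The `Settings` class and `resolve_n_proc` helper are hypothetical names for illustration and not part of any existing API; the returned value would then be handed to whatever MedCAT multiprocessing call is used.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for our settings object."""

    n_jobs: int = 1


def resolve_n_proc(settings: Settings) -> int:
    """Translate n_jobs into a positive process count.

    Assumed convention: n_jobs <= 0 means "use every available core".
    """
    available = os.cpu_count() or 1
    if settings.n_jobs <= 0:
        return available
    # Never request more processes than the machine has cores.
    return min(settings.n_jobs, available)


n_proc = resolve_n_proc(Settings(n_jobs=2))
```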

Tasks in somewhat reasonable order

Get an example dataset. We could use the one that MedCAT uses in its tutorials?
Work through all MedCAT tutorials
Integrate whatever we can take from https://colab.research.google.com/drive/1nQ3H7plYoOyC6MzqxECbm02oxoY6F3ZL into our quality control functions.
Implement functions to build a concept database and vocabulary. These functions should validate the structure of the input files quickly.
Implement a prediction function which extracts the keywords with all information.
Implement a function to map the CUI to the disease name and vice versa (already part of MedCAT).
Implement a function to run unsupervised learning to generate a new concept database (CDB).
Implement a function to filter CDB and update CDB (part of MedCAT)
Implement a function to generate summary statistics from all predictions. It should look somewhat like:

cui | nsubjects | tui | name | perc_subjects
-- | -- | -- | -- | --

Easy way to plot what was found, such as the top x diseases and how many subjects had which disease, both as absolute counts and as percentages.
We should ensure that there is an easy way to map back from the results to the patient IDs and vice versa. This may not need any further improvements.
MetaAnnotations: Implement a way to train a Hugging Face tokenizer. Implement a way to generate word2vec embeddings. Implement generation of the embedding matrix. This one should likely be saved in the AnnData object? Can we somehow combine it with everything else that we are generating?
Implement functions to run supervised training
Implement full pipeline with MetaAnnotations (maybe nothing is now missing and this is obsolete)
Ensure that this works well with translations (DeepL translations #94). Should not need any fixes, but we will see.
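
The summary-statistics task above could be sketched with pandas as follows. This assumes the predictions have already been flattened into one row per extracted entity with the columns `subject_id`, `cui`, `tui` and `name`; that layout, the function name `summarize_predictions` and the identifiers in the example are made up for illustration and are not MedCAT output.

```python
import pandas as pd


def summarize_predictions(annotations: pd.DataFrame, n_subjects_total: int) -> pd.DataFrame:
    """Build one row per concept: cui | nsubjects | tui | name | perc_subjects."""
    # Count each subject at most once per concept, however often it is mentioned.
    unique_hits = annotations.drop_duplicates(subset=["subject_id", "cui"])
    summary = unique_hits.groupby("cui", as_index=False).agg(
        nsubjects=("subject_id", "nunique"),
        tui=("tui", "first"),
        name=("name", "first"),
    )
    summary["perc_subjects"] = 100 * summary["nsubjects"] / n_subjects_total
    summary = summary.sort_values("nsubjects", ascending=False, kind="stable", ignore_index=True)
    return summary[["cui", "nsubjects", "tui", "name", "perc_subjects"]]


# Made-up example: four subjects, two placeholder concepts.
example = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 2, 3, 4],
        "cui": ["C0000001", "C0000001", "C0000001", "C0000002", "C0000002", "C0000002"],
        "tui": ["T000"] * 6,
        "name": ["concept A", "concept A", "concept A", "concept B", "concept B", "concept B"],
    }
)
print(summarize_predictions(example, n_subjects_total=4))
```

Because the result is sorted by `nsubjects`, calling `.head(x)` on it gives the data for a "top x diseases" plot, and keeping `subject_id` in the input is what allows mapping back to patient IDs.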


Zethson · 2021-10-09T16:15:39Z

CogStack/MedCAT#145

Zethson · 2021-10-09T20:58:21Z

Was merged, but no release yet.

Add GitHub tag with python-poetry/poetry#313

Zethson · 2021-10-26T08:14:40Z

I think that free text should not be part of X in its raw form. I'd add the free text to obs only and allow for an embedding column/matrix to be appended to X after MedCAT was applied.

Imipenem · 2021-10-26T11:15:25Z

> I think that free text should not be part of X in its raw form. I'd add the free text to obs only and allow for an embedding column/matrix to be appended to X after MedCAT was applied.

Yes, I will take care of implementing an "autodetect" feature for this, so users are not forced to pass every free text column for obs only when creating the MuData/AnnData object.
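
One possible shape for such an "autodetect" step, as a rough sketch only: treat a column as free text when all of its non-missing values are strings and they contain several whitespace-separated tokens on average. The function name `detect_free_text_columns` and the `min_mean_tokens` threshold are assumptions for illustration, not the heuristic that was actually implemented.

```python
from typing import List

import pandas as pd


def detect_free_text_columns(df: pd.DataFrame, min_mean_tokens: float = 3.0) -> List[str]:
    """Guess which columns hold free text (illustrative heuristic only)."""
    free_text = []
    for column in df.columns:
        values = df[column].dropna()
        # Only columns whose non-missing values are all strings can be free text.
        if values.empty or not values.map(lambda v: isinstance(v, str)).all():
            continue
        # Short categorical labels have about one token; notes have many more.
        mean_tokens = values.str.split().str.len().mean()
        if mean_tokens >= min_mean_tokens:
            free_text.append(column)
    return free_text


records = pd.DataFrame(
    {
        "age": [70, 54],
        "sex": ["m", "f"],
        "notes": ["patient reports chest pain since monday", "no known allergies, denies fever"],
    }
)
print(detect_free_text_columns(records))
```

The detected columns could then be routed to obs only, while the remaining columns go into X.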

Zethson · 2021-12-15T22:19:54Z

1.2.6 was released and we should be ready to implement this now.

Zethson · 2022-01-14T09:55:31Z

Rewrote our current implementation to work with the latest MedCAT. Think this still requires a redesign.

Imipenem · 2022-04-08T13:55:02Z

As discussed: Keep a "main" MedCat object, so we do not lose any results.

Add a function to nicely display such an object, for easier navigation by the user.

Add pp functions to filter the object for specific values (like tui, cui, type of disease, symptoms, etc)

Add a function that can return a binary column based on user filtering (e.g. which row contains for example pulmonary diseases). So the actual values (which might be multiple values in one row) never need to be actually stored in the AnnData object, only indicators when they are needed.

Add a decorator or overwrite plotting functions like umap, pca etc, for example when coloring by pulmonary disease (y/n).
This column might not be present in X or obs, so add it using the way described right above.
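
A minimal sketch of the binary-column idea, assuming the main results are available as a table with one row per extracted entity and an `obs_name` column that points back to the row of the AnnData object. The table layout, the function name `indicator_column` and the example values are hypothetical.

```python
import pandas as pd


def indicator_column(
    annotations: pd.DataFrame, obs_names: pd.Index, column: str, values: set
) -> pd.Series:
    """Return True for every observation with at least one matching annotation."""
    # Select the observations whose annotations match the user filter.
    matching_obs = annotations.loc[annotations[column].isin(list(values)), "obs_name"]
    return pd.Series(obs_names.isin(matching_obs), index=obs_names, name=f"{column}_match")


# Made-up results: patient p1 has two annotations, p2 has none, p3 has one.
results = pd.DataFrame(
    {
        "obs_name": ["p1", "p1", "p3"],
        "tui": ["T000", "T001", "T000"],
    }
)
obs_names = pd.Index(["p1", "p2", "p3"])
has_t001 = indicator_column(results, obs_names, column="tui", values={"T001"})
```

Only the boolean Series would need to be attached to obs (for example just before a plotting call), so the possibly multi-valued annotations themselves stay in the main results object.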

Imipenem · 2022-04-08T13:56:56Z

Did I miss something @Zethson ?

Zethson · 2022-04-08T14:48:28Z

No, sounds great. Feel free to show early drafts so that we can evaluate our approach before doubling down.

Thank you!

- refactored existing code
- removed most unused functions
- MedCat object now only serves for cdb and vocab (not for actual processing functions)
- processing functions are now part of the public API rather than static object methods
- downstream analysis WIP (prepare annotation results etc, see issue)

MedCat [#101]: extract biomedical concepts/entities from (free) text and analyse them with ehrapy

Zethson added the enhancement New feature or request label Sep 22, 2021

Zethson added this to the 0.1 milestone Sep 22, 2021

Zethson assigned Zethson and Imipenem Sep 22, 2021

Zethson modified the milestones: 0.1, 0.2 Dec 29, 2021

Zethson linked a pull request Apr 25, 2022 that will close this issue

MedCat [#101]: extract biomedical concepts/entities from (free) text #367

Merged

4 tasks

Imipenem closed this as completed in #367 May 11, 2022

Imipenem added a commit that referenced this issue May 11, 2022

Merge pull request #367 from theislab/feature/medcat

3841d5b

MedCat [#101]: extract biomedical concepts/entities from (free) text and analyse them with ehrapy
