Integration of MedCAT #101

Closed
4 of 15 tasks
Zethson opened this issue Sep 22, 2021 · 9 comments · Fixed by #367
Labels: enhancement (New feature or request)

Comments

@Zethson
Member

Zethson commented Sep 22, 2021

To extract keywords from free text notes we will be integrating MedCAT.

The goals are as follows:

  1. Allow for the training of own models with own vocabularies
  2. Allow for the extraction of keywords
  3. Ensure that keywords are associated with attributes like quantitative or quality information
  4. Extract a word2vec embedding and save it somewhere in the AnnData object
  5. Ensure that we have an easily runnable default vocabulary, etc.
  6. n_proc of multiprocessing should respect n_jobs of our settings object

Tasks in somewhat reasonable order

  • Get an example dataset. We could use the one that MedCAT uses in its tutorials?
  • Work through all MedCAT tutorials
  • Integrate whatever we can take from https://colab.research.google.com/drive/1nQ3H7plYoOyC6MzqxECbm02oxoY6F3ZL into our quality control functions.
  • Implement functions to build a concept database and vocabulary. These functions should validate the structure of the input files quickly.
  • Implement a prediction function which extracts the keywords with all information.
  • Implement a function to map the CUI to the disease name and vice versa (already part of MedCAT).
  • Implement a function to run unsupervised learning to generate a new concept database (CDB).
  • Implement a function to filter CDB and update CDB (part of MedCAT)
  • Implement a function to generate summary statistics from all predictions. It should look somewhat like:
cui | nsubjects | tui | name | perc_subjects
-- | -- | -- | -- | --
  • Provide an easy way to plot what was found, such as the top x diseases or how many subjects had which disease, in absolute numbers and as percentages.
  • We should ensure that there is an easy way to map back from the results to the patient IDs and vice versa. This may not need any further improvements.
  • MetaAnnotations: Implement a way to train a Hugging Face tokenizer. Implement a way to generate word2vec embeddings. Implement generation of the embedding matrix. This one should likely be saved in the AnnData object? Can we somehow combine it with everything else that we are generating?
  • Implement functions to run supervised training
  • Implement full pipeline with MetaAnnotations (maybe nothing is now missing and this is obsolete)
  • Ensure that this works well with translations (DeepL translations #94). Should not need any fixes, but we will see.
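
The summary-statistics task above could be sketched roughly as follows. This is a minimal, library-free illustration and not the ehrapy or MedCAT API: the function name `summarize_annotations`, the flat annotation dictionaries, and the toy CUI/TUI values are all assumptions made for the example. It counts each subject once per concept and reports the columns named in the task list.

```python
from collections import defaultdict


def summarize_annotations(annotations, n_subjects_total):
    """Aggregate per-mention annotations into one row per CUI.

    `annotations` is assumed to be an iterable of dicts with the keys
    "subject_id", "cui", "tui" and "name" (a hypothetical flat format).
    """
    if n_subjects_total <= 0:
        raise ValueError("n_subjects_total must be positive")

    subjects_per_cui = defaultdict(set)
    meta = {}
    for ann in annotations:
        # A set ensures each subject is counted once per concept.
        subjects_per_cui[ann["cui"]].add(ann["subject_id"])
        meta.setdefault(ann["cui"], (ann["tui"], ann["name"]))

    rows = []
    for cui, subjects in subjects_per_cui.items():
        tui, name = meta[cui]
        rows.append(
            {
                "cui": cui,
                "nsubjects": len(subjects),
                "tui": tui,
                "name": name,
                "perc_subjects": 100.0 * len(subjects) / n_subjects_total,
            }
        )
    # Most frequent concepts first; ties broken by CUI for stable output.
    rows.sort(key=lambda r: (-r["nsubjects"], r["cui"]))
    return rows


# Toy data with made-up identifiers (not real UMLS codes).
toy_annotations = [
    {"subject_id": "p1", "cui": "C001", "tui": "T047", "name": "diabetes"},
    {"subject_id": "p1", "cui": "C001", "tui": "T047", "name": "diabetes"},
    {"subject_id": "p2", "cui": "C001", "tui": "T047", "name": "diabetes"},
    {"subject_id": "p3", "cui": "C002", "tui": "T184", "name": "cough"},
]
summary = summarize_annotations(toy_annotations, n_subjects_total=4)
for row in summary:
    print(row)
```

Keeping the subject IDs in a set per concept also leaves a direct path back from a result row to the patients behind it, which is the mapping the task list asks for.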
@Zethson Zethson added the enhancement New feature or request label Sep 22, 2021
@Zethson Zethson added this to the 0.1 milestone Sep 22, 2021
@Zethson
Member Author

Zethson commented Oct 9, 2021

CogStack/MedCAT#145

@Zethson
Member Author

Zethson commented Oct 9, 2021

Was merged, but no release yet.

Add GitHub tag with python-poetry/poetry#313

@Zethson
Member Author

Zethson commented Oct 26, 2021

I think that free text should not be part of X in its raw form. I'd add the free text to obs only and allow for an embedding column/matrix to be appended to X after MedCAT was applied.

@Imipenem
Collaborator

> I think that free text should not be part of X in its raw form. I'd add the free text to obs only and allow for an embedding column/matrix to be appended to X after MedCAT was applied.

Yes, I will take care of implementing an "autodetect" feature for this, so users are not forced to pass every free text column for obs only when creating the MuData/AnnData object.
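
One way such an "autodetect" step might look is sketched below. It is only an illustration under stated assumptions: the function name `detect_free_text_columns`, the dict-of-lists table format, and both thresholds are hypothetical, and the heuristic (multi-word, mostly unique string values) is a guess at what "free text" means here, not the approach ehrapy uses.

```python
def detect_free_text_columns(table, min_mean_tokens=3.0, min_unique_ratio=0.5):
    """Return names of columns that look like free text.

    `table` is assumed to map column names to lists of values.
    A column qualifies if all non-missing values are strings, the values
    average at least `min_mean_tokens` whitespace-separated tokens, and
    at least `min_unique_ratio` of them are distinct.
    """
    free_text = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        if not present or not all(isinstance(v, str) for v in present):
            continue  # numeric or mixed columns are never free text
        mean_tokens = sum(len(v.split()) for v in present) / len(present)
        unique_ratio = len(set(present)) / len(present)
        if mean_tokens >= min_mean_tokens and unique_ratio >= min_unique_ratio:
            free_text.append(name)
    return free_text


# Toy table: only "notes" should be routed to obs.
toy_table = {
    "age": [54, 61, 47],
    "sex": ["f", "m", "f"],
    "notes": [
        "patient reports chest pain on exertion",
        "no acute distress, follow up in two weeks",
        "history of asthma and seasonal allergies",
    ],
}
obs_only = detect_free_text_columns(toy_table)
x_columns = [name for name in toy_table if name not in obs_only]
print(obs_only)
print(x_columns)
```

Short categorical strings such as `sex` fail the token threshold and stay out of the free-text list, so users would only need to name columns explicitly when the heuristic gets one wrong.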

@Zethson
Member Author

Zethson commented Dec 15, 2021

1.2.6 was released and we should be ready to implement this now.

@Zethson Zethson modified the milestones: 0.1, 0.2 Dec 29, 2021
@Zethson
Member Author

Zethson commented Jan 14, 2022

Rewrote our current implementation to work with the latest MedCAT. Think this still requires a redesign.

@Imipenem
Collaborator

Imipenem commented Apr 8, 2022

As discussed: Keep a "main" MedCat object, so we do not lose any results.

Add a function to nicely display such an object, for easier navigation by the user.

Add pp functions to filter the object for specific values (like tui, cui, type of disease, symptoms, etc)

Add a function that returns a binary column based on user filtering (e.g. which rows contain pulmonary diseases). That way the actual values (of which there may be several per row) never need to be stored in the AnnData object, only indicators when they are needed.

Add a decorator or overwrite plotting functions like umap, pca etc, for example when coloring by pulmonary disease (y/n).
This column might not be present in X or obs, so add it using the way described right above.
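
The binary-column idea above might be sketched like this. The names `indicator_column` and `annotations_by_row`, the annotation dictionaries, and the toy concept values are assumptions for illustration only and do not reflect the actual ehrapy implementation; the point is that the full results stay in a separate object and only a boolean per row is derived on demand.

```python
def indicator_column(annotations_by_row, n_rows, predicate):
    """Return one boolean per row: True if any annotation matches.

    `annotations_by_row` is assumed to map a row index to a list of
    annotation dicts; rows without an entry have no annotations.
    `predicate` is any user-supplied filter on a single annotation.
    """
    return [
        any(predicate(ann) for ann in annotations_by_row.get(row, []))
        for row in range(n_rows)
    ]


# Toy results kept outside the data matrix (made-up identifiers).
toy_results = {
    0: [{"cui": "C010", "tui": "T047", "name": "asthma"}],
    1: [],
    2: [
        {"cui": "C020", "tui": "T037", "name": "fracture"},
        {"cui": "C011", "tui": "T047", "name": "copd"},
    ],
}
pulmonary = {"asthma", "copd"}
has_pulmonary = indicator_column(
    toy_results, n_rows=4, predicate=lambda ann: ann["name"] in pulmonary
)
print(has_pulmonary)
```

A plotting wrapper could compute such a column just before coloring by it and discard it afterwards, so nothing beyond the indicator ever has to live in X or obs.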

@Imipenem
Collaborator

Imipenem commented Apr 8, 2022

Did I miss something @Zethson ?

@Zethson
Member Author

Zethson commented Apr 8, 2022

No, sounds great. Feel free to show early drafts so that we can evaluate our approach before doubling down.

Thank you!

Imipenem added a commit that referenced this issue Apr 9, 2022
- refactored existing code

- removed most unused functions

- MedCat object now only serves for cdb and vocab (not for actual processing functions)

- processing functions are now part of the public API rather than static object methods

- downstream analysis WIP (prepare annotation results etc, see issue)
@Zethson Zethson linked a pull request Apr 25, 2022 that will close this issue
4 tasks
Imipenem added a commit that referenced this issue May 11, 2022
MedCat [#101]: extract biomedical concepts/entities from (free) text and analyse them with ehrapy

2 participants