MedCat [#101]: extract biomedical concepts/entities from (free) text #367
- refactored existing code
- removed most unused functions
- MedCat object now only serves for cdb and vocab (not for actual processing functions)
- processing functions are now part of the public API rather than static object methods
- downstream analysis WIP (prepare annotation results etc., see issue)
- flatten the annotated results dict to prepare it for creation of a pd.DataFrame so this could be used in further analysis
- the dataframe contains all extracted and annotated information from the input data in adata.obs
- this will be the base for further analysis
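The flattening step described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the shape of the `annotated` dict, the `row_nr` column name, the CUI values and the helper `flatten_annotations` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical annotation results: one entry per row of adata.obs, each holding
# the entities found in that row's free text. CUIs are placeholders.
annotated = {
    0: {"entities": {
        0: {"cui": "C0000001", "pretty_name": "Diabetes", "acc": 0.98},
        1: {"cui": "C0000002", "pretty_name": "Hypertension", "acc": 0.91},
    }},
    1: {"entities": {
        0: {"cui": "C0000001", "pretty_name": "Diabetes", "acc": 0.87},
    }},
    2: {"entities": {}},  # no entity found in this row
}


def flatten_annotations(results):
    """Turn the nested results dict into one flat record per (row, entity)."""
    records = []
    for row_nr, result in results.items():
        for entity in result["entities"].values():
            # Keep the originating row number next to the entity fields.
            records.append({"row_nr": row_nr, **entity})
    return pd.DataFrame.from_records(records)


flat = flatten_annotations(annotated)
```

Rows without any entity simply contribute no records, so the resulting frame has one line per extracted entity rather than one per observation.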
- started overview plotting (simple dataframe or data table), bit harder than expected ;)
- bar plot top entities found (customizable)
- MedCAT object now stores annotation results in attribute rather than returning a bare dataframe
- no MultiIndex DataFrame anymore, just a simple single level one
- added an API function to show basic stats of the annotated results in a nice rich table
- duplicated rows are now removed from the annotated results (duplicates are entities from the same row with the same meta_anns and cui)
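As a rough sketch of the duplicate rule stated above (same row, same meta_anns, same cui), assuming the flattened results live in a DataFrame with the hypothetical columns `row_nr`, `cui` and `meta_anns`, and that `meta_anns` has already been reduced to a hashable value such as a string:

```python
import pandas as pd

# Hypothetical flattened annotation results; CUIs are placeholders.
flat = pd.DataFrame(
    {
        "row_nr": [0, 0, 0, 1],
        "cui": ["C0000001", "C0000001", "C0000002", "C0000001"],
        "meta_anns": ["Affirmed", "Affirmed", "Affirmed", "Other"],
    }
)

# Two entries count as duplicates only if row, cui and meta_anns all match.
deduped = flat.drop_duplicates(subset=["row_nr", "cui", "meta_anns"]).reset_index(drop=True)
```

If `meta_anns` were stored as a dict per entity, it would need to be converted to something hashable first, since `drop_duplicates` cannot compare dict values.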
…xt data
- added a function to indicate whether a specific entity has been found in that row or not; this is useful for downstream plotting such as coloring by this entity in a umap for example
- extracted freetext dataframe is now sorted by extracted row number per default
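The per-row indicator from this commit could look roughly like the sketch below. The helper name `entity_indicator`, the column names and the "yes"/"no" labels are assumptions for illustration; `obs` is a plain DataFrame standing in for `adata.obs`.

```python
import pandas as pd

obs = pd.DataFrame(index=["0", "1", "2"])  # stand-in for adata.obs

# Hypothetical flattened annotation results, deliberately unsorted.
flat = pd.DataFrame(
    {
        "row_nr": [1, 0, 0],
        "pretty_name": ["Diabetes", "Diabetes", "Hypertension"],
    }
)

# Sort by the row the entity was extracted from, keeping ties in input order.
flat = flat.sort_values("row_nr", kind="stable").reset_index(drop=True)


def entity_indicator(flat, n_rows, entity):
    """Return a categorical yes/no flag per row: was `entity` found there?"""
    rows_with_entity = set(flat.loc[flat["pretty_name"] == entity, "row_nr"])
    flags = ["yes" if i in rows_with_entity else "no" for i in range(n_rows)]
    return pd.Categorical(flags, categories=["no", "yes"])


obs["Diabetes"] = entity_indicator(flat, len(obs), "Diabetes")
```

A categorical column like this can then be passed as the color key of an embedding plot, which is the use case the commit message mentions.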
I'll have to see it in action to make a better assessment.
Maybe we need a quick call. As mentioned, I would like to see all of these functions scoped into a class. IMO it makes sense to call these functions as ep.tl.mc.MEDCAT FUNCTION
The plots as well. ep.pl.mc.SOMETHING
- because medcat is an extra dependency, it might not be installed locally on the user's machine/env
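One common way to handle an optional extra like this is to check for the package before using it and fail with a helpful message otherwise. The sketch below uses only the standard library; the helper names `is_installed` and `require_optional`, the wording of the error and the `ehrapy[medcat]` extra name are assumptions, not the code from this PR.

```python
from importlib.util import find_spec


def is_installed(package):
    """Check whether a top-level package can be imported, without importing it."""
    return find_spec(package) is not None


def require_optional(package, extra):
    """Raise a descriptive ImportError if an optional dependency is missing."""
    if not is_installed(package):
        raise ImportError(
            f"'{package}' is required for this feature but is not installed. "
            f"Try: pip install 'ehrapy[{extra}]'"  # extra name is hypothetical
        )
```

Calling `require_optional("medcat", extra="medcat")` at the start of a MedCAT-specific function defers the failure to the point of use, so the rest of the package stays importable without the extra.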
- updated draw_graph function to color by extracted medcat ents
- refactored plotting function calls into partial callables so most of the args are pre-initialized
- fixed a bug that caused columns to be deleted from obs when they were colored by
- fixed a bug that caused ehrapy to crash when trying to plot a column in var_names with a MedCat object
- fixed a docs rendering bug introduced earlier on in this PR
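The "partial callables" idea mentioned above can be illustrated with `functools.partial`. The `scatter` function here is a hypothetical stand-in that only returns the arguments it would draw with; it is not the plotting function used in this PR.

```python
from functools import partial


def scatter(data, x, y, color=None, size=10, show=True):
    """Stand-in for a plotting function; returns the arguments it would use."""
    return {"n_points": len(data), "x": x, "y": y, "color": color, "size": size, "show": show}


# Pre-initialize most arguments once; only the data and the color key remain open.
umap_plot = partial(scatter, x="UMAP1", y="UMAP2", size=20, show=False)

call = umap_plot([(0.1, 0.2), (0.3, 0.4)], color="Diabetes")
```

The wrapper can then decide late which column to color by (for example an extracted entity) without repeating the shared arguments at every call site.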
@Zethson I had this in mind as well, but I'm not sure whether it's working, because the rich import will be sorted by our pre-commit CI AFTER the medcat import, so I'm not sure whether rich is always available or not. Might just have to try. Edit: Does not work, unfortunately.
…w of dtype cat
- most plotting functions got updated to use medcat entities if needed (clustermap, dendrogram, violin, stacked violin, ...)
- extracted entities from medcat in .obs are now of dtype categorical (was numerical)
- still missing: embedding, embedding_density and spatial
Getting closer. I'll really need to see the adapted tutorial.
👍
PR Checklist
docs is updated
Description of changes
MedCat object now only serves for cdb and vocab (not for actual processing functions)
processing functions are now part of the public API rather than static object methods
text annotation returns a DataFrame as a base for further analysis
basic overview table exposed via API (with optional csv save)
prep for adding to anndata object
updated tsne, umap, pca and scatter plots to color by extracted entities
paga, draw_graph, ... WIP