The memr
(Multisource Embeddings for Medical Records) package in R allows for creating embeddings, i.e. vector representations, of medical free-text records written by doctors. It also provides a wide spectrum of tools to data visualization and medical visits' segmentation. These tools aim to develop computer-supported medicine by facilitating medical data analysis and iterpretation. The package can be exploited for many applications like the recommendation prediction, patients' clustering etc. that can aid doctors in their practice.
memr
is written in R and is based on the following packages:
- dplyr
- ggplot2
- ggrepel
- Rtsne
- text2vec
To install memr
, simply type in an R console (after having installed the devtools
package, e.g. install.package('devtools')
):
devtools::install_git("https://github.com/MI2DataLab/memr")
We show the usage of the package on the example datasets. They are completely artificial, but their structure reflects a structure of the real data collected from Polish health centers. The results of the research on the real data are described in the paper Dobrakowski et al. (2019).
For every visit we can have some information about ICD-10 code of diagnosed disease, ID and specialty of the doctor:
knitr::kable(visits)
visit_id | icd10 | doctor_id | specialties |
---|---|---|---|
101 | J32 | 24 | endocrinologist, internist |
102 | Y52 | 26 | endocrinologist |
103 | X12 | 24 | endocrinologist, internist |
104 | Q29 | 24 | endocrinologist, internist |
105 | U46 | 26 | endocrinologist |
106 | U50 | 26 | endocrinologist |
107 | I58 | 25 | cardiologist, internist |
108 | C82 | 26 | endocrinologist |
109 | P73 | 24 | endocrinologist, internist |
110 | P66 | 26 | endocrinologist |
111 | U53 | 23 | ophthalmologist, endocrinologist |
For the visits we have also the descriptions of interview with the extracted medical terms:
knitr::kable(interviews)
x | |
---|---|
101 | fever, eye |
102 | cough, thyroid |
103 | fever, thyroid |
104 | fever, eye |
105 | cough, thyroid |
106 | cough, thyroid |
107 | cough, thyroid |
108 | rhinitis, eye |
109 | rhinitis, eye |
110 | rhinitis, thyroid |
Descriptions of examinations of patients:
knitr::kable(examinations)
x | |
---|---|
102 | mother, father, cough, eye |
103 | woman, father, rhinitis, thyroid |
104 | woman, father, fever, thyroid |
105 | mother, patient, rhinitis, eye |
106 | man, father, cough, heart |
107 | woman, father, fever, eye |
108 | woman, patient, rhinitis, thyroid |
109 | woman, patient, rhinitis, heart |
110 | woman, father, cough, heart |
111 | man, father, rhinitis, eye |
And descriptions of recommendations prescribed by doctors to the patients:
knitr::kable(recommendations)
x | |
---|---|
103 | hospital, endocrinologist |
104 | hospital, sleep, internist |
105 | hospital, internist |
106 | sleep, ophthalmologist |
107 | hospital, treatment, ophthalmologist |
108 | treatment, ophthalmologist |
109 | treatment, hospital, endocrinologist |
110 | hospital, sleep, cardiologist |
Each medical term has one or more categories:
knitr::kable(terms_categories)
term | category |
---|---|
man | person |
woman | person |
mother | person |
father | person |
patient | person |
cough | disease |
rhinitis | disease |
fever | disease |
sleep | recommendation |
healthy eating | recommendation |
hospital | recommendation |
treatment | recommendation |
internist | specialty |
cardiologist | specialty |
ophthalmologist | specialty |
endocrinologist | specialty |
heart | anatomic |
eye | anatomic |
thyroid | anatomic |
Firstly we can compute embeddings:
embedding_size <- 5
interview_term_vectors <- embed_terms(merged_terms = interviews, embedding_size = embedding_size,
term_count_min = 1L)
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 1, expected cost 0.1305
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 2, expected cost 0.0852
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 3, expected cost 0.0592
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 4, expected cost 0.0426
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 5, expected cost 0.0314
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 6, expected cost 0.0235
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 7, expected cost 0.0180
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 8, expected cost 0.0139
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 9, expected cost 0.0109
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 10, expected cost 0.0087
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 11, expected cost 0.0070
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 12, expected cost 0.0057
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 13, expected cost 0.0047
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 14, expected cost 0.0039
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 15, expected cost 0.0033
examination_term_vectors <- embed_terms(merged_terms = examinations, embedding_size = embedding_size,
term_count_min = 1L)
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 1, expected cost 0.1059
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 2, expected cost 0.0510
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 3, expected cost 0.0315
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 4, expected cost 0.0225
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 5, expected cost 0.0178
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 6, expected cost 0.0149
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 7, expected cost 0.0131
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 8, expected cost 0.0117
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 9, expected cost 0.0107
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 10, expected cost 0.0099
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 11, expected cost 0.0093
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 12, expected cost 0.0087
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 13, expected cost 0.0082
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 14, expected cost 0.0078
#> INFO [2020-03-18 11:43:54] 2020-03-18 11:43:54 - epoch 15, expected cost 0.0074
knitr::kable(interview_term_vectors[1:5, ])
rhinitis | -0.4704351 | 0.0315635 | 0.5707754 | 0.2680295 | 0.2400996 |
fever | -0.3948241 | 0.3853350 | 0.5458298 | 0.2350670 | 0.4494087 |
eye | 0.4959916 | 0.0223423 | -0.3431036 | -0.2673065 | -0.2371423 |
cough | -0.0089209 | 0.3737412 | 0.0054367 | -0.6043935 | 0.4064513 |
thyroid | -0.1455389 | -0.2107478 | -0.2871478 | -0.3121039 | -0.2287994 |
Terms from the chosen category can be visualized:
visualize_term_embeddings(terms_categories, interview_term_vectors, c("anatomic"), method = "PCA")
To validate the quality of embeddings we can perform the term analogy task (see more by ?analogy_task). The package delivers the analogy test set.
knitr::kable(evaluate_term_embeddings(examination_term_vectors, n = 5, terms_pairs_test))
1 | 2 | 3 | 4 | 5 | MEAN | |
---|---|---|---|---|---|---|
person | 0 | 0 | 0 | 0 | 0.5000000 | 0.1000000 |
spec | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 |
synonym | 0 | 0 | 0 | 0 | 0.5000000 | 0.1000000 |
MEAN | 0 | 0 | 0 | 0 | 0.3333333 | 0.0666667 |
For each type of analogy we compute the mean accuracy.
Analogies can be plotted to see if the connection lines are parallel:
visualize_analogies(examination_term_vectors, terms_pairs_test$person, find_analogies = TRUE, n = 10)
Having the embeddings of terms, we can compute embeddings of visits:
visits_vectors <- embed_list_visits(interviews, examinations, interview_term_vectors, examination_term_vectors)
knitr::kable(visits_vectors[1:5, ])
101 | 0.0505837 | 0.2038387 | 0.1013631 | -0.0161198 | 0.1061332 | NA | NA | NA | NA | NA |
102 | -0.0772299 | 0.0814967 | -0.1408555 | -0.4582487 | 0.0888259 | 0.1106313 | -0.1679365 | -0.0600053 | 0.1011591 | 0.0932502 |
103 | -0.2701815 | 0.0872936 | 0.1293410 | -0.0385184 | 0.1103047 | 0.1915759 | 0.1393210 | -0.0021371 | 0.0500974 | 0.1605039 |
104 | 0.0505837 | 0.2038387 | 0.1013631 | -0.0161198 | 0.1061332 | 0.1695910 | 0.1284853 | 0.0011363 | -0.0126280 | -0.0033302 |
105 | -0.0772299 | 0.0814967 | -0.1408555 | -0.4582487 | 0.0888259 | 0.0037491 | -0.0883031 | -0.0408950 | 0.1075936 | 0.0625818 |
And now we can visualize the visits on the plot and color by the doctors' IDs:
visualize_visit_embeddings(visits_vectors, visits, color_by = "doctor",
spec = "internist")
or by ICD-10 code:
visualize_visit_embeddings(visits_vectors, visits, color_by = "icd10",
spec = "internist")
On the visits' embeddings we can run the k-means algorithm:
clusters <- cluster_visits(visits_vectors, visits, spec = "internist", cluster_number = 2)
and plot the clusters:
visualize_visit_embeddings(visits_vectors, visits, color_by = "cluster",
spec = "internist", clusters = clusters)
For every cluster we can see the most frequent recommendations from chosen categories:
rec_tables <- get_cluster_recommendations(recommendations, clusters,
category = "recommendation",
recom_table = terms_categories)
rec_tables
#> [[1]]
#> # A tibble: 3 x 3
#> recommendation count frequency
#> <fct> <int> <dbl>
#> 1 hospital 3 0.75
#> 2 sleep 1 0.25
#> 3 treatment 1 0.25
#>
#> [[2]]
#> # A tibble: 2 x 3
#> recommendation count frequency
#> <fct> <int> <dbl>
#> 1 hospital 1 1
#> 2 treatment 1 1
or from all categories:
rec_tables <- get_cluster_recommendations(recommendations, clusters, category = "all")
rec_tables
#> [[1]]
#> # A tibble: 5 x 3
#> recommendation count frequency
#> <fct> <int> <dbl>
#> 1 hospital 3 0.75
#> 2 endocrinologist 2 0.5
#> 3 internist 1 0.25
#> 4 sleep 1 0.25
#> 5 treatment 1 0.25
#>
#> [[2]]
#> # A tibble: 3 x 3
#> recommendation count frequency
#> <fct> <int> <dbl>
#> 1 hospital 1 1
#> 2 ophthalmologist 1 1
#> 3 treatment 1 1
If we have a new visit, we can assign it to the most appropriate cluster:
inter_descr <- paste("cough", sep = ", ")
exam_descr <- paste("fever", sep = ", ")
visit_description <- c(inter_descr, exam_descr)
names(visit_description) <- c("inter", "exam")
cl <- assign_visit_to_cluster(visit_description, clusters, interview_term_vectors, examination_term_vectors)
cl
#> $cl
#> 1
#> 1
#>
#> $distances
#> 1 2
#> 0.07533538 0.24136356
As the last nice thing we can see the embeddings of ICD-10 codes:
visualize_icd10(visits_vectors, visits)
The package was created during the research financially supported by the Polish Centre for Research and Development (Grant POIR.01.01.01-00-0328/17).
Dobrakowski, Adam, Agnieszka Mykowiecka, Małgorzata Marciniak, Wojciech Jaworski, and Przemysław Biecek. 2019. “Interpretable Segmentation of Medical Free-Text Records Based on Word Embeddings.” arXiv Preprint arXiv:1907.04152. https://arxiv.org/abs/1907.04152.