add greenelab#342 to EHR section

enricoferrero · Apr 25, 2017 · e5a579a
1 parent 053df86
commit e5a579a
Showing 1 changed file with 30 additions and 17 deletions.
diff --git a/sections/03_categorize.md b/sections/03_categorize.md
@@ -142,21 +142,34 @@ This indicates a potential strength of deep methods. It may be possible to
 repurpose features from task to task, improving overall predictions as the field
 tackles new challenges.
 
-Several authors have created reusable feature sets for medical terminologies using
-neural embeddings, as popularized by word2Vec [@tag:Word2Vec]. This approach
-was first used on free text medical notes by De Vine et al.
-[@doi:10.1145/2661829.2661974] with results at or better than traditional methods.
-Y. Choi et al.[@doi:10.1145/2567948.2577348] built embeddings of standardized
-terminologies, such as ICD and NDC, used in widely available administrative
-claims data. By learning terminologies for different entities in the same
-vector space, they can potentially find relationships between different
-domains (e.g. drugs and the diseases they treat). Medical claims data does not
-have the natural document structure of clinical notes, and this issue was
-addressed by E. Choi et al. [@doi:10.1145/2939672.2939823], who built
-embeddings using a multi-layer network architecture which mimics the structure
-of claims data. While promising, difficulties in evaluating the quality of
-these kinds of features and variations in clinical coding practices remain as
-challenges to using them in practice.
+Several authors have created reusable feature sets for medical terminologies
+using natural language processing (NLP) and neural embedding models, as
+popularized by Word2vec [@tag:Word2Vec]. This approach was first used on free
+text medical notes by De Vine et al. [@doi:10.1145/2661829.2661974]
+with results at or better than traditional methods. Y. Choi et al.
+[@doi:10.1145/2567948.2577348] built embeddings of standardized terminologies,
+such as ICD and NDC, used in widely available administrative claims data. By
+learning terminologies for different entities in the same vector space, they can
+potentially find relationships between different domains (e.g. drugs and the
+diseases they treat). Medical claims data does not have the natural document
+structure of clinical notes, and this issue was addressed by E. Choi et al.
+[@doi:10.1145/2939672.2939823], who built embeddings using a multi-layer network
+architecture which mimics the structure of claims data. While promising,
+difficulties in evaluating the quality of these kinds of features and variations
+in clinical coding practices remain as challenges to using them in practice.
+
+NLP has also been applied directly to EHRs to predict disease comorbidities and
+phenotypes as well as novel gene-disease associations. Gligorijevic et al.
+[@doi:10.1038/srep32404] built neural embedding models and mined more than 35
+million records collected over a decade. The learned disease representations
+significantly outperformed state-of-the-art methods, achieving close to 86%
+accuracy. In a second model, the authors supplemented patient discharge records
+with genetic disease associations from genome-wide association studies (GWASs)
+to learn from both disease and gene vectors at the same time, with similar
+performance gains. On a held-out set, the model recovered all 185 genes known to
+be associated with congestive heart failure and predicted 3 novel genes for
+chronic airway obstruction, showing that mining of EHR data can potentially
+impact the discovery of therapeutic targets.
 
 Identifying consistent subgroups of individuals and individual health
 trajectories from clinical tests is also an active area of research. Approaches
@@ -202,8 +215,8 @@ family distributions [@arxiv:1411.2581v1]. The result was a deep survival
 analysis model capable of overcoming challenges posed by missing data and
 heterogeneous data types, while uncovering nonlinear relationships between
 covariates and failure time. They showed their model more accurately
-stratified patients as a function of disease risk score compared the current
-clinical implementation.
+stratified patients as a function of disease risk score compared to the current
+clinical implementation.
 
 There is a computational cost for these methods, however, when compared to
 traditional, non-network approaches. For the exponential family models,