add greenelab#342 to EHR section
enricoferrero committed Apr 25, 2017
1 parent 053df86 commit e5a579a
Showing 1 changed file with 30 additions and 17 deletions.
47 changes: 30 additions & 17 deletions sections/03_categorize.md
@@ -142,21 +142,34 @@ This indicates a potential strength of deep methods. It may be possible to
repurpose features from task to task, improving overall predictions as the field
tackles new challenges.

Several authors have created reusable feature sets for medical terminologies using
neural embeddings, as popularized by word2Vec [@tag:Word2Vec]. This approach
was first used on free text medical notes by De Vine et al.
[@doi:10.1145/2661829.2661974] with results at or better than traditional methods.
Y. Choi et al.[@doi:10.1145/2567948.2577348] built embeddings of standardized
terminologies, such as ICD and NDC, used in widely available administrative
claims data. By learning terminologies for different entities in the same
vector space, they can potentially find relationships between different
domains (e.g. drugs and the diseases they treat). Medical claims data does not
have the natural document structure of clinical notes, and this issue was
addressed by E. Choi et al. [@doi:10.1145/2939672.2939823], who built
embeddings using a multi-layer network architecture which mimics the structure
of claims data. While promising, difficulties in evaluating the quality of
these kinds of features and variations in clinical coding practices remain as
challenges to using them in practice.
Several authors have created reusable feature sets for medical terminologies
using natural language processing (NLP) and neural embedding models, as
popularized by Word2vec [@tag:Word2Vec]. This approach was first used on free
text medical notes by De Vine et al. [@doi:10.1145/2661829.2661974]
with results at or better than traditional methods. Y. Choi et al.
[@doi:10.1145/2567948.2577348] built embeddings of standardized terminologies,
such as ICD and NDC, used in widely available administrative claims data. By
learning terminologies for different entities in the same vector space, they can
potentially find relationships between different domains (e.g. drugs and the
diseases they treat). Medical claims data does not have the natural document
structure of clinical notes, and this issue was addressed by E. Choi et al.
[@doi:10.1145/2939672.2939823], who built embeddings using a multi-layer network
architecture which mimics the structure of claims data. While promising,
difficulties in evaluating the quality of these kinds of features and variations
in clinical coding practices remain as challenges to using them in practice.

NLP has also been applied directly to EHRs to predict disease comorbidities and
phenotypes as well as novel gene-disease associations. Gligorijevic et al.
[@doi:10.1038/srep32404] built neural embedding models and mined more than 35
million records collected over a decade. The learned disease representations
significantly outperformed state-of-the-art methods, achieving close to 86%
accuracy. In a second model, the authors supplemented patient discharge records
with genetic disease associations from genome-wide association studies (GWASs)
to learn from both disease and gene vectors at the same time, with similar
performance gains. On a held-out set, the model recovered all 185 genes known to
be associated with congestive heart failure and predicted 3 novel genes for
chronic airway obstruction, showing that mining of EHR data can potentially
impact the discovery of therapeutic targets.

Identifying consistent subgroups of individuals and individual health
trajectories from clinical tests is also an active area of research. Approaches
@@ -202,8 +215,8 @@ family distributions [@arxiv:1411.2581v1]. The result was a deep survival
analysis model capable of overcoming challenges posed by missing data and
heterogeneous data types, while uncovering nonlinear relationships between
covariates and failure time. They showed their model more accurately
stratified patients as a function of disease risk score compared the current
clinical implementation.
stratified patients as a function of disease risk score compared to the current
clinical implementation.

There is a computational cost for these methods, however, when compared to
traditional, non-network approaches. For the exponential family models,