From 16a275e2f2f8f9f2259062ec45dc7c001c2d8103 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:18:22 -0400 Subject: [PATCH 1/7] add intro stub --- content/02.introduction.md | 142 +++++++++++++++++++++++++++++++++++++ 1 file changed, 142 insertions(+) create mode 100644 content/02.introduction.md diff --git a/content/02.introduction.md b/content/02.introduction.md new file mode 100644 index 0000000..2ba82e2 --- /dev/null +++ b/content/02.introduction.md @@ -0,0 +1,142 @@ +## Introduction +Personalization aims to solve the problem of __parameter heterogeneity__, where model parameters are __sample-specific__. +$$X_i \sim P(X; \theta_i)$$ +From $N$ observations, personalized modeling methods aim to recover $N$ parameter estimates $\widehat{\theta}_1, ..., \widehat{\theta}_N$. +Without further assumptions this problem is ill-defined, and the estimators have far too much variance to be useful. +We can begin to make this problem tractable by imposing assumptions on the topology of $\theta$, or the relationship between $\theta$ and exogenous (often causal) variables. + + + + + \ No newline at end of file From bdf5ad0ce4a6b6f2edd1c19c08aaabc6e82ac592 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:19:15 -0400 Subject: [PATCH 2/7] test gh actions --- content/02.introduction.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/02.introduction.md b/content/02.introduction.md index 2ba82e2..44b8f60 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -3,7 +3,7 @@ Personalization aims to solve the problem of __parameter heterogeneity__, where $$X_i \sim P(X; \theta_i)$$ From $N$ observations, personalized modeling methods aim to recover $N$ parameter estimates $\widehat{\theta}_1, ..., \widehat{\theta}_N$. Without further assumptions this problem is ill-defined, and the estimators have far too much variance to be useful. 
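To make the variance problem concrete, the following sketch (synthetic data, all names illustrative) simulates sample-specific parameters and shows that the assumption-free per-sample estimator inherits the full observation noise:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Each sample i gets its own parameter theta_i (here, the mean of a Gaussian),
# generated from a hidden variable to induce parameter heterogeneity.
hidden = rng.uniform(0, 1, size=N)
theta = np.sin(2 * np.pi * hidden)        # true sample-specific parameters
x = rng.normal(loc=theta, scale=1.0)      # one observation per parameter

# Without further assumptions, the MLE of theta_i from a single observation
# is theta_hat_i = x_i, so its error variance equals the full noise variance.
naive_error_var = np.var(x - theta)
print(naive_error_var)                    # close to the noise variance of 1.0
```

With one observation per parameter, every bit of observation noise passes straight into the estimate, which is what the assumptions discussed next are meant to suppress.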
-We can begin to make this problem tractable by imposing assumptions on the topology of $\theta$, or the relationship between $\theta$ and exogenous (often causal) variables. +We can begin to make this problem tractable by imposing assumptions on the topology of $\theta$, or the relationship between $\theta$ and contextual variables. From cb259d3701ce8959586d2927d2ce8f0affdc0da5 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:28:09 -0400 Subject: [PATCH 3/7] partial md formatting, test gh actions again --- content/02.introduction.md | 40 +++++++++++++++++--------------------- 1 file changed, 18 insertions(+), 22 deletions(-) diff --git a/content/02.introduction.md b/content/02.introduction.md index 44b8f60..b3f3788 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -1,38 +1,34 @@ ## Introduction -Personalization aims to solve the problem of __parameter heterogeneity__, where model parameters are __sample-specific__. +Personalization aims to solve the problem of _parameter heterogeneity_, where model parameters are _sample-specific_. $$X_i \sim P(X; \theta_i)$$ From $N$ observations, personalized modeling methods aim to recover $N$ parameter estimates $\widehat{\theta}_1, ..., \widehat{\theta}_N$. Without further assumptions this problem is ill-defined, and the estimators have far too much variance to be useful. We can begin to make this problem tractable by imposing assumptions on the topology of $\theta$, or the relationship between $\theta$ and contextual variables. - - - - \ No newline at end of file +\cite{kim_tree-guided_2012} show that the smoothing proximal gradient method is an efficient solver for the tree lasso model.
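The proximal gradient machinery behind such solvers can be illustrated on the plain lasso, whose proximal operator is simple soft-thresholding; the tree-structured penalty and its smoothing are omitted here, and all data and names in this sketch are synthetic:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)   # gradient step, then prox
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_b = np.zeros(10)
true_b[:3] = [2.0, -1.0, 0.5]
y = X @ true_b + 0.1 * rng.normal(size=100)
b_hat = ista_lasso(X, y, lam=5.0)
print(b_hat)   # large coefficients on the first three features, the rest near 0
```

Structured penalties like the tree lasso replace the soft-thresholding step with a more involved (smoothed) proximal map, but the gradient-then-prox loop is the same.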
--> --> \ No newline at end of file From c69d958fe02ee84a163e33d83f4fe97c1e0b9430 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:32:49 -0400 Subject: [PATCH 4/7] partial md formatting, test gh actions again --- content/02.introduction.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/content/02.introduction.md b/content/02.introduction.md index b3f3788..de500ce 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -12,8 +12,7 @@ To account for parameter heterogeneity and create more realistic models we must Additionally, many traditional models may produce a seemingly acceptable fit to their data, even when the underlying model is heterogeneous. Here, we explore the consequences of applying homogeneous modeling approaches to heterogeneous data, and discuss how subtle but meaningful effects are often lost to the strength of the identically distributed assumption. -#### Failure Modes: -Failure modes can be identified by their error distributions. +Failure modes of population models can be identified by their error distributions. __Mode collapse__: If one population is much larger than another, the smaller population will be underrepresented in the model. @@ -28,7 +27,7 @@ __Lemma:__ A traditional OLS linear model will be the average of heterogeneous m ### Context-informed models -### Conditional and Cluster Models +#### Conditional and Cluster Models While conditional and cluster models are not truly personalized models, the spirit is the same. These models make the assumption that models in a single conditional or cluster group are homogeneous. More commonly this is written as a group of observations being generated by a single model. @@ -39,13 +38,13 @@ where $\ell(X; \theta)$ is the log-likelihood of $\theta$ on $X$ and $c$ specifi Notably, this method produces fewer than $N$ distinct models for $N$ samples and will fail to recover per-sample parameter variation.
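A minimal sketch of this grouped estimator on synthetic data (group labels and names are illustrative): samples are pooled by a discrete context and one maximum-likelihood estimate is fit per group, so the $N$ samples share only a handful of distinct parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete context c_i in {0, 1, 2}; each group has its own true mean.
true_means = {0: -2.0, 1: 0.0, 2: 3.0}
c = rng.integers(0, 3, size=300)
x = np.array([rng.normal(true_means[ci]) for ci in c])

# Cluster estimator: one MLE per context group (here, the group sample mean).
theta_hat = {g: x[c == g].mean() for g in np.unique(c)}

# Every sample in a group receives the same parameter estimate.
per_sample = np.array([theta_hat[ci] for ci in c])
n_distinct = len(set(per_sample))
print(n_distinct)   # 3 distinct models, far fewer than N = 300
```

The within-group variation in the true parameters (none here by construction, but present in real heterogeneous data) is exactly what this estimator cannot recover.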
-#### Distance-regularized Models +##### Distance-regularized Models Distance-regularized models assume that models with similar covariates have similar parameters and encode this assumption as a regularization term. $$ \widehat{\theta}_0, ..., \widehat{\theta}_N = \arg\max_{\theta_0, ..., \theta_N} \sum_i \left[ \ell(x_i; \theta_i) \right] - \sum_{i, j} \frac{\| \theta_i - \theta_j \|}{D(c_i, c_j)} $$ The second term is a regularizer that penalizes divergence of $\theta$'s with similar $c$. -#### Parametric Varying-coefficient models +##### Parametric Varying-coefficient models Original paper (based on a smoothing spline function): @doi:10.1111/j.2517-6161.1993.tb01939.x Markov networks: @doi:10.1080/01621459.2021.2000866 Linear varying-coefficient models assume that parameters vary linearly with covariates, a much stronger assumption than the classic varying-coefficient model, but one that gives an explicit form for the relationship between the parameters and covariates. From 943f4a11fba4ea22252e73d47c3e640e02a9f93c Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:47:52 -0400 Subject: [PATCH 5/7] first draft of introduction, all formatted --- content/02.introduction.md | 70 +++++++++----------------------------- 1 file changed, 17 insertions(+), 53 deletions(-) diff --git a/content/02.introduction.md b/content/02.introduction.md index de500ce..c3a7c82 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -27,7 +27,7 @@ __Lemma:__ A traditional OLS linear model will be the average of heterogeneous m ### Context-informed models -#### Conditional and Cluster Models +##### Conditional and Cluster Models While conditional and cluster models are not truly personalized models, the spirit is the same. These models make the assumption that models in a single conditional or cluster group are homogeneous. More commonly this is written as a group of observations being generated by a single model.
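A sketch of the linear varying-coefficient assumption on synthetic data (all names illustrative): with $\theta_i = A c_i$, the model $y_i = x_i^\top A c_i$ is linear in $\mathrm{vec}(A)$, so $A$ can be recovered by ordinary least squares on Kronecker-product features:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k = 1000, 3, 2

# True parameter map: theta_i = A c_i varies linearly with the context c_i.
A = rng.normal(size=(p, k))
C = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # contexts, with intercept
X = rng.normal(size=(N, p))
theta = C @ A.T                                 # (N, p) sample-specific parameters
y = np.sum(X * theta, axis=1) + 0.1 * rng.normal(size=N)

# y_i = x_i^T A c_i is linear in vec(A), so stack the pairwise products
# C[i, k'] * X[i, p'] as features and solve one least-squares problem.
Z = np.einsum('ik,ip->ikp', C, X).reshape(N, k * p)
vecA_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
A_hat = vecA_hat.reshape(k, p).T
print(np.max(np.abs(A_hat - A)))                # small recovery error
```

A single shared matrix $A$ turns $N$ ill-posed per-sample problems into one well-posed regression, at the cost of the linearity assumption.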
@@ -52,11 +52,9 @@ $$\widehat{\theta}_0, ..., \widehat{\theta}_N = \widehat{A} C^T$$ $$ \widehat{A} = \arg\max_A \sum_i \ell(x_i; A c_i) $$ - --> \ No newline at end of file +Key idea: negative information sharing. Different models should be pushed apart. +$$ \widehat{\theta}_0, ..., \widehat{\theta}_N = \arg\max_{\theta_0, ..., \theta_N, D} \sum_{i=0}^{N} \prod_{\substack{j = 0 \\ D(c_i, c_j) < d}}^{N} P(x_j; \theta_i) P(\theta_i; \theta_j) $$ From 44dee2a664d98d2a1c62de7a04fe8a5a8f6fdc49 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 10:54:06 -0400 Subject: [PATCH 6/7] add transfer learning header --- content/02.introduction.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/02.introduction.md b/content/02.introduction.md index c3a7c82..9df3e15 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -84,7 +84,7 @@ $$\text{TV}(\theta_i, \theta_{i - 1}) = |\theta_i - \theta_{i-1}|$$ This still fails to recover a unique parameter estimate for each sample, but gets closer to the spirit of personalized modeling by putting the model likelihood and partition regularizer in competition to find the optimal partitions.
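The competition between likelihood and partition penalty can be sketched with an exact dynamic program for the one-dimensional piecewise-constant case (synthetic data, illustrative names): each contiguous segment gets one mean parameter, and each additional segment costs a fixed penalty:

```python
import numpy as np

def optimal_partition(x, penalty):
    """Minimize within-segment squared error plus one penalty per segment
    (equivalent, up to a constant, to penalizing each breakpoint)."""
    n = len(x)
    cs = np.concatenate([[0.0], np.cumsum(x)])
    cs2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def seg_cost(a, b):   # SSE of fitting one mean to x[a:b]
        s = cs[b] - cs[a]
        return (cs2[b] - cs2[a]) - s * s / (b - a)

    best = np.zeros(n + 1)              # best[b]: optimal cost of x[:b]
    back = np.zeros(n + 1, dtype=int)
    for b in range(1, n + 1):
        best[b], back[b] = min(
            (best[a] + seg_cost(a, b) + penalty, a) for a in range(b)
        )
    cuts, b = [], n                     # trace back the segment boundaries
    while b > 0:
        cuts.append(back[b])
        b = back[b]
    return sorted(c for c in cuts if c > 0)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
breaks = optimal_partition(x, penalty=5.0)
print(breaks)   # a single breakpoint near index 50, the true change
```

A larger penalty yields fewer, coarser partitions; as the penalty shrinks, the solution approaches one parameter per sample, mirroring the trade-off described above.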
-### Fine-tuned Models +### Fine-tuned Models and Transfer Learning Review: @doi:10.48550/arXiv.2206.02058 Noted in foundational literature for linear varying coefficient models @doi:10.1214/aos/1017939139 From 32656fc5384ab2f2f8d7e265a07b3b724b3a9672 Mon Sep 17 00:00:00 2001 From: cnellington Date: Fri, 16 Aug 2024 11:02:32 -0400 Subject: [PATCH 7/7] add author --- content/metadata.yaml | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/content/metadata.yaml b/content/metadata.yaml index 3abd386..d88d461 100644 --- a/content/metadata.yaml +++ b/content/metadata.yaml @@ -19,12 +19,15 @@ authors: - Department of Statistics, University of Wisconsin-Madison funders: - - - github: janeroe - name: Jane Roe - initials: JR - orcid: XXXX-XXXX-XXXX-XXXX - email: jane.roe@whatever.edu + - github: cnellington + name: Caleb N. Ellington + initials: CE + orcid: 0000-0001-7029-8023 + twitter: probablybots + mastodon: + mastodon-server: + email: cellingt@cs.cmu.edu affiliations: - - Department of Something, University of Whatever - - Department of Whatever, University of Something - corresponding: true + - Computational Biology Department, Carnegie Mellon University + funders: + - \ No newline at end of file