3-probability_practice.tex

\chapter{Probability in Practice: \\Probabilistic Computation}

The immediate problem with the abstract definitions introduced in the
previous chapter is that they do not provide an explicit means of either
specifying a probability distribution, at least without considering every 
single event, or computing expectations in practical applications.  When 
the sample space is structured, however, that structure can be leveraged 
to provide the explicit implementations we need to apply probability 
theory in practice.  This is particularly evident when the sample space 
ordered, discrete, or a subset of real numbers.

\section{Cumulative Distribution Functions}

When the sample space is ordered we can completely specify a probability
distribution with only the probabilities assigned to \emph{intervals}, 
$\mathcal{I} \! \left( \Theta \right) \subset \mathcal{E} \! \left( \Theta \right)$.
Intervals are events spanning all points less than or equal to some 
distinguished point, $\theta$,
%
\begin{equation*}
I \! \left( \theta \right) = \left\{ \theta' \in \Theta \mid \theta' \leq \theta \right\}.
\end{equation*}
%
The function that assigns these probabilities,
%
\begin{align*}
P 
&: \Theta \rightarrow \mathcal{I} \! 
\left( \Theta \right) \rightarrow \left[0, 1 \right]
\\
&\quad \theta \mapsto 
I \! \left( \theta \right) \;\; \mapsto 
\PP \! \left[ I \! \left( \theta \right) \right].
\end{align*}
%
is called the \emph{cumulative distribution function}.  

Cumulative distribution functions naturally pushforward from one 
sample space into another.  Given a measureable map 
$s : \Theta \rightarrow \Omega$, the cumulative distribution function
on $\Omega$ is defined by
%
\begin{equation*}
P_{\Omega} \! \left( I_{\omega} \right) 
\equiv 
P_{\Theta} \! \left( s^{-1} \! \left( I_{\omega} \right) \right).
\end{equation*}

\section{Implementations of Probability Distributions \\ 
over Discrete Sample Spaces}

Every discrete sample space features a natural measure which we 
can use to not only specify all other measures, and hence probability 
distributions, but also convert expectations into summations.

\subsection{The Counting Measure and Mass Functions}

The \emph{counting measure} on a discrete space assigns unit
measure to the \emph{point events},
%
\begin{equation*}
\chi \! \left( \theta \right) = 1, \, \forall \theta \in \Theta.
\end{equation*}
%
In order to compute the measure of any other event we 
simply count the elements in that event,
%
\begin{equation*}
\chi \! \left( E \right)
=
\sum_{\theta \in E}  1 = \left| E \right|,
\end{equation*}
%
hence the name.  Similarly, Lebesgue integrals reduce to a 
summation over the sample space weighted by the function 
values,
%
\begin{equation*}
\mathcal{I}_{\chi} \! \left[ f \right]
=
\sum_{\theta \in \Theta} 
f \! \left( \theta \right) \chi \! \left( \theta \right)
=
\sum_{\theta \in \Theta} f \! \left( \theta \right),
\end{equation*}
%
for any $f \in \mathcal{F} \! \left( \Theta \right)$.

Because the counting measure assigns zero measure only to the
null event, all discrete measures are absolutely continuous with
respect to it.  Consequently, any discrete measure can be defined
by its Radon-Nikodym derivative with respect to the counting measure.
In this case the Radon-Nikodym derivative is denoted a 
\emph{mass function}
%
\begin{equation*}
m : \Theta \rightarrow \RR^{+}.
\end{equation*}
%
Measures and Lebesgue integrals then reduce to discrete summations
weighted by the mass function,
%
\begin{align*}
\mu \! \left( E \right)
&=
\sum_{\theta \in E} m \! \left( \theta \right)
\\
\mathcal{I}_{\mu} \! \left[ f \right]
&=
\sum_{\theta \in \Theta} f \! \left( \theta \right) m \! \left( \theta \right).
\end{align*}

Moreover, the Radon-Nikodym derivative between different discrete 
measures reduces to the simple ratio of their mass functions,
%
\begin{equation*}
\frac{ \dd \mu_{2} }{ \dd \mu_{1} } \! \left( \theta \right)
=
\frac{ m_{2} \! \left( \theta \right) }{ m_{1} \! \left( \theta \right) }.
\end{equation*}

\subsection{Transforming Counting Measures and Mass Functions}

The counting measure, and hence mass functions, also have the
convenient property that they transform naturally with changes to
the underlying sample space.  Given a measurable map 
$s : \Theta \rightarrow \Omega$, a mass function on $\Theta$
pushes forward to a mass function on $\Omega$,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right) 
\equiv 
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right).
\end{equation*}

\subsection{Probability Mass Functions}

Mass functions corresponding to probability distributions reduce
to \emph{probability mass functions}, which assigns probability to the
point events,
%
\begin{equation*}
p : \Theta \rightarrow \left[0, 1\right].
\end{equation*}
%
such that
%
\begin{equation*}
1 = \pi \! \left[ \Theta \right] = \sum_{\theta \in \Theta} p \! \left( \theta \right).
\end{equation*}

\subsection{Conditional Probability Mass Functions}

Probability mass functions immediately generalize to specify conditional 
probability distributions by simply adding a conditioning variable,
%
\begin{equation*}
p_{\Theta \mid \Phi} : \Theta \times \Phi \rightarrow \left[0, 1\right].
\end{equation*}
%
Conditional probabilities and conditional expectations are computed
as above,
%
\begin{align*}
\PP_{\Theta \mid \Phi}  \! \left[ E \mid \phi \right]
&=
\sum_{\theta \in E} p_{\Theta \mid \Phi}  \! \left( \theta \mid \phi \right),
\\
\mathbb{E}_{\PP_{\Theta \mid \Phi} } \! \left[ f \mid \phi \right]
&=
\sum_{\theta \in \Theta} f \! \left( \theta \right) 
p_{\Theta \mid \Phi}  \! \left( \theta \mid \phi \right),
\end{align*}

A huge advantage of this representation is that it drastically simplifies the
construction of joint and marginal probability distributions.  Instead of 
implicitly defining an abstract joint distribution with probability assignments
or expectations, for example, we can construct an explicit joint probability 
mass function as the product of a conditional mass function and a marginal 
mass function,
%
\begin{equation*}
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) = 
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation*}
%
which readily gives joint probabilities and joint expectations.

Marginalization proceeds similarly -- the marginal probability mass function
is given by simply summing the joint probability mass function over the
nuisance components,
%
\begin{align*}
p_{\Theta} \! \left( \theta \right)
&= 
\sum_{\phi \in \Phi}  p_{\Theta \times \Phi} \! \left( \theta, \phi \right) \\
&=
\sum_{\phi \in \Phi}
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right).
\end{align*}

Bayes' Theorem implies that any decomposition of a joint probability
mass function is equal,
%
\begin{equation} \label{eqn:discrete_bayes}
p_{\Phi \mid \Theta} \! \left( \phi | \theta \right) p_{\Theta} \! \left( \theta \right)
=
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) 
= 
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation}
%
and from this one equality we can then recover the various implications of
Bayes' Theorem.  For example,
%
\begin{equation*}
\pi_{\Theta \mid \Phi} =
\frac{ \dd \pi_{\Phi \mid \Theta} }{ \dd \pi_{\Phi} }
\pi_{\Theta}
\end{equation*}
%
becomes
%
\begin{equation*}
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) =
\frac{ p_{\Phi \mid \Theta} \! \left( \phi \mid \theta \right) }
{ p_{\Phi} \! \left( \phi \right) }
p_{\Theta} \! \left( \theta \right),
\end{equation*}
%
which is an immediate consequence of dividing both sides of the joint
equality \eqref{eqn:discrete_bayes} by $p_{\Phi} \! \left( \phi \right)$.

\subsection{Relating Probability Mass Functions and Cumulative 
Distribution Functions}

Because probability mass functions and cumulative distribution functions 
both specify the same probability distribution, one can always be used to 
construct the other.  Given a probability mass function, for example, we 
can construct the cumulative distribution function as
%
\begin{equation*}
P \! \left( \theta \right)
= \PP \! \left[ I \! \left( \theta \right) \right]
= \sum_{\theta' \in I ( \theta )} p \! \left( \theta' \right).
\end{equation*}
%
Similarly, we can construct a probability mass function from a cumulative
distribution function as
%
\begin{equation*}
p \! \left( \theta \right) = 
P \! \left( \theta \right)
- P \! \left( \theta_{-} \right),
\end{equation*}
%
where $\theta_{-}$ is the largest element of $\Theta$ less that $\theta$,
%
\begin{equation*}
\theta_{-} = \max \left\{ \theta' \in \Theta \mid \theta' < \theta \right\}.
\end{equation*}

\section{Implementations of Probability Distributions over Real
Numbers}

When the sample space are $D$-dimensional real numbers, or
a subset thereof, there are always an uncountably infinite number 
of point events and we can't assign a finite measure to each without
the measure of any dense event becoming infinite.  Consequently,
the counting measure becomes pathological, assigning either infinite 
measure or zero measure to every dense event.  Fortunately, the real 
numbers admit their own natural measure on which we can implement 
probability distributions in practice.

\subsection{The Lebesgue Measure and Density Functions}

\textbf{Lebesgue measure on R assigns measure based on interval
length, then generalize as a product of Lebesgue measures that
assigns measure based on volume.}

The Lebesgue measure on the one-dimensional real numbers
assigns to each interval,
%
\begin{equation*}
I = \left[a, b \right] \subset \RR,
\end{equation*}
%
a measure proportional to its length,
%
\begin{equation*}
\lambda \! \left( I \right) = b - a.  
\end{equation*}

This generalizes to the general Lebesgue measure, $\lambda$, 
which assigns to each rectangular neighborhood
%
\begin{equation*}
I_{D} = \prod_{d=1}^{D} I_{d} = \prod_{d=1}^{D} \left[a_{d}, b_{d} \right]
\subset \RR^{D},
\end{equation*}
%
a measure proportional to its volume,
%
\begin{equation*}
\lambda \! \left( I_{d} \right) = 
\prod_{d = 1}^{D} \left(b_{d} - a_{d} \right).  
\end{equation*}
%
Although not immediately obvious, this condition implies that the 
Lebesgue measure of an arbitrary event is also given by the
enclosed volume specified by the Riemann integral over that event
%
\begin{equation*}
\lambda \! \left( E \right)= \int_{E} \dd^{D} \theta.
\end{equation*}
%
Because of this equivalence, it is common shorthand to write the 
Lebesgue measure on $\Theta = R^{D}$ as 
%
\begin{equation*}
\lambda = \dd^{D} \theta = \prod_{d = 1}^{D} \dd \theta_{d}.
\end{equation*}

Similarly, Lebesgue integrals with respect to the Lebesgue measure
(mathematicians have sick sense of humor) also reduce to Riemann
integrals,
%
\begin{equation*}
\mathcal{I}_{\lambda} \! \left( f \right)
= \int_{\Theta} f \! \left( \theta \right) \dd \theta.
\end{equation*}
%
An immediate consequence of this result is that all point events
have zero Lebesgue measure.

Provided that a measure does not assign non-zero measure to any
point events, it will be absolutely continuous with respect to the
Lebesgue measure and hence can be completely specified with
the corresponding Radon-Nikodym derivative.  In this case the
Radon-Nikodym derivative is denoted a \emph{density function},
%
\begin{equation*}
m : \Theta \rightarrow \RR^{+}.
\end{equation*}
%
As above, any measure or Lebesgue integral becomes a standard
Riemann integral that can be readily computed with the usual
calculus techniques,
%
\begin{align*}
\mu \! \left( E \right)
&=
\mathcal{I}_{\lambda} \! \left[ m \cdot I_{E} \right]
=
\int_{E} m \! \left( \theta \right) \dd \theta
\\
\mathcal{I}_{\mu} \! \left[ f \right]
&=
\mathcal{I}_{\lambda} \! \left[ f \cdot m \right]
=
\int_{\Theta} f \! \left( \theta \right) m \! \left( \theta \right) \dd \theta.
\end{align*}

The Radon-Nikodym derivative between two different real measures
behaves exactly as in the discrete case.  If $\mu_{2}$ is
absolutely continuous with respect to $\mu_{1}$ then the density
$m_{2} \! \left( \theta \right)$ can vanish only when the density
$m_{1} \! \left( \theta \right)$ vanishes, and the corresponding 
Radon-Nikodym derivative is given by
%
\begin{equation*}
\frac{ \dd \mu_{2} }{ \dd \mu_{1} } \! \left( \theta \right)
=
\frac{ m_{2} \! \left( \theta \right) }{ m_{1} \! \left( \theta \right) }.
\end{equation*}
%
When both densities vanish we have to be careful take an
appropriate limit in order to evaluate the Radon-Nikodym derivative.

\subsection{Transforming Lebesgue Measures and Density Functions}

One of the most subtle properties of Lebesgue measures, and
hence density functions, is that they do not transform as naturally
as their discrete counterparts.  The difficulty is that nonlinear
transformations warp the real numbers, distorting volumes and,
consequently, the Lebesgue measure.  

Consider a nonlinear, measurable map $s : \Theta \rightarrow \Omega$,
and let $\tilde{\lambda}_{\Omega}$ be the corresponding pushforward 
of the Lebesgue measure on $\Theta$ onto $\Omega$,
%
\begin{equation*}
\tilde{\lambda}_{\Omega} \! \left( E_{\Omega} \right)
\equiv
s_{*} \lambda_{\Theta} \! \left( E_{\Omega} \right)
=
\mu^{L}_{\Theta} \! \left[ s^{-1} \! \left( E_{\Omega} \right) \right].
\end{equation*}
%
The pushforward $\tilde{\lambda}_{\Omega}$ is not the Lebesgue 
measure on $\Omega$, rather the two are related by the Radon-Nikodym
derivative
%
\begin{equation*}
\frac{ \dd \lambda_{\Omega} }
{\dd \tilde{\lambda}_{\Omega} }
=
\left| \mathbf{J} \right|,
\end{equation*}
%
where the matrix
%
\begin{equation*}
J_{ij} 
= 
\frac{\partial \omega_{i} }{ \partial \theta_{j} }
\equiv 
\frac{ \partial s_{i} }{ \partial \theta_{j} }
\end{equation*} 
%
is called the \emph{Jacobian} of the transformation.  The determinant
of the Jacobian matrix quantifies how the nonlinear transformation 
warps volumes and hence the Lebesgue measure.

This nontrivial relationship between Lebesgue measures and their
pushforwards has an important consequence for density functions.  
Density functions do not transform as we might naively expect,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right) 
\ne 
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right),
\end{equation*}
%
rather they too are modified by the determinant of the Jacobian matrix,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right) 
= 
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) | \mathbf{J} |^{-1}.
\end{equation*}
%
We can also motivate this relationship by considering the invariance
of Lebesgue integrals.  Equivalent sample spaces feature distinct
Lebesgue measure and density functions, but they must have the
\emph{same Lebesgue integrals} (Figure \ref{fig:pdf_transform}).
The Lebesgue integrals are invariant only if the Lebesgue measures
and densities transform oppositely to each other.

\begin{figure*}
\centering
%
\subfigure[]{
%\begin{tikzpicture}[scale=0.3, thick, show background rectangle]
\begin{tikzpicture}[scale=0.3, thick]
  \node[] at (0,0) {\includegraphics[width=5cm]{density.eps}};
  \foreach \i in {-10.5, -10, ..., 10.5} {
    \draw[-, color=gray80] (-11, \i) to (11, \i);
    \draw[-, color=gray80] (\i, -11) to (\i, 11);
  }  
  \node[] at (0,0) {\includegraphics[width=5cm]{area.eps}};
  \node[] at (0, -13) { $\mu \! \left( E_{\Theta} \right) = \int_{E_{\Theta}} 
                                   m_{\Theta} (\theta_{1}, \theta_{2}) 
                                   \dd \theta_{1} \dd \theta_{2} $ };
\end{tikzpicture}
}
%
\subfigure[]{
%\begin{tikzpicture}[scale=0.3, thick, show background rectangle]
\begin{tikzpicture}[scale=0.3, thick]
  \node[] at (0,0) {\includegraphics[width=5cm]{transformed_density.eps}};
  \foreach \i in {-10.9969, -10.9428, -10.8857, -10.8251, -10.7605, -10.6915, -10.6174, 
                        -10.5374, -10.4504, -10.355, -10.2497, -10.1319, -9.99835, -9.84419, 
                        -9.66186, -9.4387, -9.15099, -8.74547, -8.05217, 0., 8.05217, 
                        8.74547, 9.15099, 9.4387, 9.66186, 9.84419, 9.99835, 10.1319, 
                        10.2497, 10.355, 10.4504, 10.5374, 10.6174, 10.6915, 10.7605,
                        10.8251, 10.8857, 10.9428, 10.9969} {
    \draw[-, color=gray80] (-11, \i) to (11, \i);
    \draw[-, color=gray80] (\i, -11) to (\i, 11);
  }  
  \node[] at (0,0) {\includegraphics[width=5cm]{transformed_area.eps}};
  \node[] at (0, -13) { $\mu \! \left( s \! \left( E_{\Theta} \right) \right)
                                   = \int_{s(E_{\Theta})} 
                                   m_{\Omega} (\omega_{1}, \omega_{2}) 
                                   \dd \omega_{1} \dd \omega_{2}$ };
\end{tikzpicture}
}
\caption{When sample spaces are real, each sample space has
its own density functions (red), events (green), and differential volumes 
(grey).  Here the sample space in (b) is related to the sample space in 
(a) by the compatible mapping $\left(\omega_{1}, \omega_{2} \right) 
= s \!\left( \theta_{1}, \theta_{2} \right)$. All of these differences, however, 
exactly compensate to ensure that integrals always yield the same 
measures, $\mu \! \left( E_{\Theta} \right) = 
\mu \! \left( s \! \left( E_{\Theta} \right) \right) $.
}
\label{fig:pdf_transform}
\end{figure*}

Consequently, density functions depend not just on the abstract
system that we're trying to describe but also on the particular 
representation that we're using.  The only application of density
functions that are not sensitive to this dependence is integration,
and we have to be careful not to take densities too seriously when
they are not explicitly under an integral sign.

\subsubsection{Univariate Examples of Density Transformations}

A common motivation for transforming the sample space, and
hence any probability density functions, is when the nominal sample
space is contained.  Mapping to an unconstrained sample space
removes the unconstraint, potentially simplifying calculations.

For example, let $p_{\Theta} \! \left( \theta \right)$ be a probability 
distribution function defined on $\Theta = \RR^{+}$.  We can 
unconstrained the positivity constraint by applying a logarithmic 
transformation,
%
\begin{equation*}
\omega = s \! \left( \theta \right) = \log \theta \in \RR.
\end{equation*}
%
which results in the Jacobian
%
\begin{equation*}
J 
= \frac{ \partial s }{ \partial \theta} \! \left( \omega \right) 
= \frac{1}{\exp \! \left( \omega \right)}
\end{equation*}
%
and the transformed probability density function on $\Omega = \RR$,
%
\begin{align*}
p_{\Omega} \! \left( \omega \right)
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) \left|J\right|^{-1}
\\
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right)
\exp \! \left( \omega \right).
\end{align*}

Similarly, we can unconstrain a sample space representing probabilities
themselves, $\Theta = \left[0, 1 \right]$ by applying a logistic transform,
%
\begin{equation*}
\omega = s \! \left( \theta \right) = \frac{1}{1 + \exp( - \theta ) } \in \RR.
\end{equation*}
%
This results in the Jacobian
%
\begin{equation*}
J 
= \frac{ \partial s }{ \partial \theta} \! \left( \omega \right) 
= \frac{1}{\omega \! \left( 1 - \omega \right) }
\end{equation*}
%
and the transformed probability density function on $\Omega = \RR$,
%
\begin{align*}
p_{\Omega} \! \left( \omega \right)
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) \left|J\right|^{-1}
\\
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right)
\omega \! \left( 1 - \omega \right).
\end{align*}

\subsubsection{A Multivariate Example of Density Transformations}

To see how much a probability density function can be warped, consider 
a two-dimensional sample space with real components 
$\left( \theta_{1}, \theta_{2} \right)$ and the probability density function
%
\begin{equation*}
p_{\Theta} \! \left( \theta_{1}, \theta_{2} \right)
\propto
\exp \! \left( - \frac{\theta_{1}^{2} + \theta_{2}^{2}}{2} \right).
\end{equation*}
%
and the measurable map
%
\begin{equation*}
\left( \omega_{1}, \omega_{2} \right) 
= 
r \! \left( \theta_{1}, \theta_{2} \right) 
= 
\left( s \! \left( \theta_{1} \right), s \! \left( \theta_{2} \right) \right),
\end{equation*}
%
where the component maps are given by
%
\begin{equation*}
s \! \left( \theta \right)
=
\log \! \left(
\frac{ \pi + 2 \, \mathrm{atan} \! \left( \alpha \, \theta \right) }
{ \pi - 2 \, \mathrm{atan} \! \left( \alpha \, \theta \right) }
\right)
\end{equation*}
%
with the inverse
%
\begin{equation*}
s^{-1} \! \left( \omega \right)
=
\frac{1}{\alpha} 
\tan \! \left(
\frac{\pi}{2} \, \frac{e^{\omega} - 1}{e^{\omega} + 1} \right)
\end{equation*}
%
and Jacobian
%
\begin{equation*}
J \! \left( \omega \right)
=
\frac{ \partial s }{ \partial \theta } \! \left( \omega \right)
=
\frac{\alpha}{\pi} \frac{ \left( 1 + e^{\omega} \right)^{2} }{ e^{\omega} }
\sin^{2} \! \left( \frac{ \pi }{ 1 + e^{\omega} } \right).
\end{equation*}
%
The Jacobian of the complete map is then given by
%
\begin{equation*}
\mathbf{J} = 
\begin{bmatrix}
J & 0 \\
0 & J
\end{bmatrix},
\end{equation*}
%
with the determinant $| \mathbf{J} | = J^{2}$.  Hence the transformed
probability density function and differential volume are given by
%
\begin{equation*}
p_{\Omega} \! \left( \omega_{1}, \omega_{2} \right)
=
p_{\Theta} \! \left( 
s^{-1} \! \left( \omega_{1} \right), 
s^{-1} \! \left( \omega_{2} \right) \right)
J^{-2} \! \left( \omega \right)
\end{equation*}
%
and
%
\begin{equation*}
\dd \omega_{1} \dd \omega_{2}
=
J^{2} \dd \theta_{1} \dd \theta_{2},
\end{equation*}
%
respectively.  The extreme differences between these two representations
are evident graphically in Figure \ref{fig:pdf_transform}.

\subsection{Probability Density Functions}

Density functions corresponding to probability distributions reduce
to \emph{probability density function} which assign a positive value 
to each point in the sample space,
%
\begin{equation*}
p : \Theta \rightarrow \mathbb{R}^{+}.
\end{equation*}
such that
%
\begin{equation*}
1 = \pi \! \left( \Theta \right) = \int_{\Theta} p \! \left( \theta \right) \dd \theta.
\end{equation*}
%
These values, known as \emph{probability densities}, have no particular 
meaning of their own and instead exist only to be integrated to give 
probabilities,
%
\begin{equation*}
\pi \! \left( E \right)
=
\int_{E} p \! \left( \theta \right) \dd \theta,
\end{equation*}
%
and expectations,
%
\begin{equation*}
\mathbb{E}_{\pi} \! \left[ f \right]
=
\int_{\Theta} f \! \left( \theta \right) p \! \left( \theta \right) \dd \theta.
\end{equation*}

This is an important point that is worth repeating -- probability densities 
are meaningless until they have been integrated over some event.  To 
analogize with physics, the event over which we integrate corresponds 
to a \emph{volume} and the probability given by integrating the density 
over such a volume corresponds to a \emph{mass}.  When we want to 
be careful to differentiate between probabilities and probability densities 
we'll use \emph{probability mass} to refer to the former.

\subsection{Conditional Probability Density Functions}

Just as in the discrete case, probability density functions can be 
immediately extended to implement conditional probability distributions 
by simply adding a conditioning variable,
%
\begin{equation*}
p_{\Theta \mid \Phi} : \Theta \times \Phi \rightarrow \mathbb{R}^{+},
\end{equation*}
%
with conditional probabilities and conditional expectations reducing
to integrals,
%
\begin{align*}
\pi_{\Theta \mid \Phi}  \! \left( E \mid \phi \right)
&=
\int_{E} p_{\Theta \mid \Phi}  \! \left( \theta \mid \phi \right) \dd \theta,
\\
\mathbb{E}_{\pi_{\Theta \mid \Phi} } \! \left[ f \mid \phi \right]
&=
\int_{\Theta} f \! \left( \theta \right) 
p_{\Theta \mid \Phi}  \! \left( \theta \mid \phi \right) \dd \theta,
\end{align*}

Likewise, probability density functions representing joint and marginal 
probability distributions are easy to construct for conditional probability 
density functions.  Joint probability density functions are given by a simple
multiplication,
%
\begin{equation*}
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) = 
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation*}
%
and marginal probability density functions are given by integrating out the
nuisance components,
%
\begin{align*}
p_{\Theta} \! \left( \theta \right)
&= 
\int_{\Phi} p_{\Theta \times \Phi} \! \left( \theta, \phi \right) \dd \phi \\
&=
\int_{\Phi}
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) 
p_{\Phi} \! \left( \phi \right) \dd \phi.
\end{align*}

Bayes' Theorem also implies that any decomposition of a joint probability
density function is equal,
%
\begin{equation} \label{eqn:real_bayes}
p_{\Phi \mid \Theta} \! \left( \phi | \theta \right) p_{\Theta} \! \left( \theta \right)
=
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) 
= 
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation}
%
and from this one equality we can then recover the various implications of
Bayes' Theorem.  For example,
%
\begin{equation*}
\pi_{\Theta \mid \Phi} =
\frac{ \dd \pi_{\Phi \mid \Theta} }{ \dd \pi_{\Phi} }
\pi_{\Theta}
\end{equation*}
%
becomes
%
\begin{equation*}
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) =
\frac{ p_{\Phi \mid \Theta} \! \left( \phi \mid \theta \right) }
{ p_{\Phi} \! \left( \phi \right) }
p_{\Theta} \! \left( \theta \right),
\end{equation*}
%
which is an immediate consequence of dividing both sides of the joint
equality \eqref{eqn:real_bayes} by $p_{\Phi} \! \left( \phi \right)$.

\subsection{Relating Probability Density Functions and Cumulative 
Distribution Functions}

As with their discrete counterparts, probability density functions 
and cumulative distribution functions can always be derived from
each other because they specify the same probability distribution.  
Cumulative distribution functions, for example, are given by 
integrating over probability density functions,
%
\begin{equation*}
P \! \left( \theta \right)
= \pi \! \left[ I \! \left( \theta \right) \right]
= \int_{\theta_{\min}}^{\theta} p \! \left( \theta' \right) \dd \theta.
\end{equation*}

Probability density functions, on the other hand, are given by 
differentiating cumulative distribution functions,
%
\begin{equation*}
p \! \left( \theta \right) = 
\frac{\partial P \! \left( \theta \right) }{ \partial \theta}.
\end{equation*}
%
Note that if we map to an equivalent measurable space with 
$s : \Theta \rightarrow \Omega$, then the derivative
acquires a factor of the inverse Jacobian so that the
corresponding probability density function transforms
as necessary,
%
\begin{equation*}
p \! \left( \omega \right) 
= 
\frac{\partial P \! \left( \omega \right) }{ \partial \omega}
=
\frac{\partial P \! \left( s^{-1} \! \left( \omega \right) \right) }
{ \partial \theta}
\frac{ \partial \theta }{ \partial \omega }
= 
p \! \left( s^{-1} \! \left( \omega \right) \right) \left| \mathbf{J}^{-1} \right|.
\end{equation*}

\subsection{Implementations of Mixed Probability Distributions}

Distributions over samples spaces that have both a discrete, $\Phi$,
and a real $\Psi$, component can be implemented by leveraging 
discrete and continuous representations of conditional distributions.  

For example, a distribution over $\Theta = \Phi \times \Psi$ can be
specified by conditioning the discrete component with the real
component, $\pi_{\Phi \mid \Psi}$, and providing a marginal distribution
over the real component, $\pi_{\Psi}$.  Using a conditional
probability mass function for the former and a probability density function
for the latter, the probability of any event is given by
%
\begin{align*}
\pi_{\Theta} \! \left( E \right)
&=
\pi_{\Theta} \! \left( E_{\Phi} \times E_{\Psi} \right)
\\
&=
\EE_{\pi_{\Psi}} \! \left[
\pi_{\Phi \mid \Psi} \! \left(  E_{\Phi} \mid \psi \right)
\cdot
\mathbb{I}_{E_{\Psi}} \! \left( \psi \right)
\right] 
\\
&= 
\int_{E_{\Psi} }
\sum_{\phi \in E_{\Phi} } p_{\Phi \mid \Psi} \! \left( \phi \mid \psi \right)
p_{\Psi} \! \left( \psi \right) \dd \psi,
\end{align*}
%
with expectations given similarly by
%
\begin{align*}
\EE_{\pi_{\Theta}} \! \left[ f \right]
&=
\EE_{\pi_{\Psi}} \! \left[
\EE_{\pi_{\Phi \mid \Psi}} \! \left[  f \mid \psi \right]
\right] 
\\
&= 
\int_{E_{\Psi} } \sum_{\phi \in E_{\Phi} } f \! \left( \phi, \psi \right) 
p_{\Phi \mid \Psi} \! \left( \phi \mid \psi \right)
p_{\Psi} \! \left( \psi \right) \dd \psi.
\end{align*}

Equivalently, we could also condition the real component on the
discrete component, $\pi_{\Psi \mid \Phi}$ and provide a marginal 
distribution over the discrete component, $\pi_{\Phi}$.  We could 
then represent the distribution with a conditional probability density 
function for the former and a probability mass function for the latter.  
Likewise, the probability of any event is given by
%
\begin{align*}
\pi_{\Theta} \! \left( E \right)
&=
\pi_{\Theta} \! \left( E_{\Phi} \times E_{\Psi} \right)
\\
&=
\EE_{\pi_{\Phi}} \! \left[
\pi_{\Psi \mid \Phi} \! \left(  E_{\Psi} \mid \phi \right)
\cdot
\mathbb{I}_{E_{\Phi}} \! \left( \phi \right)
\right] 
\\
&= 
\sum_{\phi \in E_{\Phi} } \int_{E_{\Psi} }
p_{\Psi \mid \Phi} \! \left( \psi \mid \phi \right) 
p_{\Phi} \! \left( \phi \right) \dd \psi,
\end{align*}
%
with expectations given similarly by
%
\begin{align*}
\EE_{\PP_{\Theta}} \! \left[ f \right]
&=
\EE_{\PP_{\Phi}} \! \left[
\EE_{\PP_{\Psi \mid \Phi}} \! \left[  f \mid \phi \right]
\right] 
\\
&= 
\sum_{\phi \in E_{\Phi} } \int_{E_{\Psi} }
f \! \left( \phi, \psi \right) 
p_{\Psi \mid \Phi} \! \left( \psi \mid \phi \right)
p_{\Phi} \! \left( \phi \right) \dd \psi.
\end{align*}

Regardless of how we choose to decompose the mixed distribution,
combining probability density functions and probability mass functions
makes specifying and manipulating these distributions straightforward in
practice.

\section{Stochastic Implementations of Probability Distributions}

We can also specify any probability distributions and compute its
expectations \emph{stochastically}.  A \emph{stochastic process} is 
any mechanism that generates a sequence of states, or \emph{samples},
from a given sample space, 
$\left\{ \theta_{1}, \ldots, \theta_{N} \right\} \subset \Theta$.  In general,
the generation of the $n$-th state in the sequence, $\theta_{n}$, can
depend on all previous states in the sequence, 
$\left\{ \theta_{1}, \ldots, \theta_{n - 1} \right\}$.

A stochastic process implements a given probability distribution if 
the samples themselves can be used to recover all expectations as 
the size of the sequence becomes infinitely large. More formally, if
%
\begin{equation*}
\lim_{N \rightarrow \infty} \frac{1}{N} 
\sum_{n = 1}^{N} f \! \left( \theta_{n} \right)
= \EE_{\pi} \! \left[ f \right],
\end{equation*}
%
for all well-behaved functions, $f : \Theta \rightarrow \RR$, then the 
stochastic process implements all computations with respect to the 
probability distribution $\pi$.

This procedure, and hence the resulting samples, are \emph{exact} 
if every state in the sequence is generated independently of any previous 
states.  In other words, samples are exact when the stochastic process 
generates samples one at a time, with no dependence on the preceding 
or following samples.  If samples are not exact we refer to them as 
\emph{correlated}.

Exact stochastic processes are, by construction, perfectly
\emph{random} processes: there is no way to predict any element 
of the sampling sequence given the state of other samples in
that sequence.  Unfortunately such randomness is impossible to 
achieve in practice as computers are fundamentally deterministic.
Instead we rely on \emph{pseudorandom} processes which
generate sequences in a convoluted but ultimately deterministic
fashion.  When we are ignorant of the precise configuration of a
pseudorandom process the resulting samples \emph{appear} to be
random for all intents and purposes.

Pseudorandom processes that target specific probability distributions, 
or \emph{pseudorandom number generators}, can be devised for 
many simple probability distributions.  Unfortunately they are typically 
impossible to construct for the more complex distributions that are
often of practical interest.