Sam Foreman 2024-04-03

MLMC: Machine Learning Monte Carlo for Lattice Gauge Theory

Sam Foreman, Xiao-Yong Jin, James C. Osborn

saforem2/{lattice23, l2hmc-qcd}

2023-07-31 @ Lattice 2023
[!NOTE]
- Generate independent samples $\{x_{i}\}$, such that[^1] $$\{x_{i}\} \sim p(x) \propto e^{-S(x)}$$ where $S(x)$ is the action (or potential energy)
- Want to calculate observables $\mathcal{O}$: $$\left\langle \mathcal{O}\right\rangle \propto \int \left[\mathcal{D}x\right]\hspace{4pt} \mathcal{O}(x)\, p(x)$$
- If these were independent, we could approximate $\left\langle\mathcal{O}\right\rangle \simeq \frac{1}{N}\sum_{n=1}^{N}\mathcal{O}(x_{n})$
- Instead, nearby configurations are correlated, and we incur a factor of $\textcolor{#FF5252}{\tau^{\mathcal{O}}_{\mathrm{int}}}$, the integrated autocorrelation time: $$\sigma_{\mathcal{O}}^{2} = \frac{\textcolor{#FF5252}{\tau^{\mathcal{O}}_{\mathrm{int}}}}{N}\mathrm{Var}\left[\mathcal{O}(x)\right]$$ (a minimal estimator is sketched below)
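For concreteness, a minimal numpy sketch (not from the talk) for estimating $\tau_{\mathrm{int}}$ from a chain of measurements, using the convention $\tau_{\mathrm{int}} = 1 + 2\sum_{t}\rho(t)$ so that it matches the error formula above; the fixed summation `window` is an assumption here, where a production analysis would use automatic windowing:

```python
import numpy as np

def tau_int(obs: np.ndarray, window: int = 100) -> float:
    """Estimate the integrated autocorrelation time of a 1D chain."""
    x = obs - obs.mean()
    n = len(x)
    # Normalized autocorrelation function rho(t), t = 0, 1, ..., n - 1
    rho = np.correlate(x, x, mode='full')[n - 1:] / (n * x.var())
    # tau_int = 1 + 2 * sum_t rho(t), truncated at a fixed window
    return 1.0 + 2.0 * rho[1:window].sum()
```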
- Want to (sequentially) construct a chain of states: $$x_{0} \rightarrow x_{1} \rightarrow \cdots \rightarrow x_{N}$$ such that, as $N \rightarrow \infty$: $$\left\{x_{i}, x_{i+1}, x_{i+2}, \cdots, x_{N}\right\} \xrightarrow[]{N\rightarrow\infty} p(x) \propto e^{-S(x)}$$
[!TIP]
- Introduce fictitious momentum $v \sim \mathcal{N}(0, \mathbb{1})$, normally distributed and independent of $x$, i.e. $$\begin{align*} p(x, v) &\textcolor{#02b875}{=} p(x)\, p(v) \propto e^{-S(x)}\, e^{-\frac{1}{2} v^{T}v} = e^{-\left[S(x) + \frac{1}{2} v^{T}v\right]} \textcolor{#02b875}{=} e^{-H(x, v)} \end{align*}$$
- Idea: evolve the $(\dot{x}, \dot{v})$ system to get new states $\{x_{i}\}$
- Write the joint distribution $p(x, v)$: $$p(x, v) \propto e^{-S[x]}\, e^{-\frac{1}{2}v^{T} v} = e^{-H(x, v)}$$
[!TIP]
Hamilton's equations: $\left(\dot{x}, \dot{v}\right) = \left(\partial_{v} H, -\partial_{x} H\right)$
[!NOTE]
Leapfrog step: input $\left(x, v\right) \rightarrow \left(x', v'\right)$ output $$\begin{align*} \tilde{v} &:= \textcolor{#F06292}{\Gamma}(x, v)\hspace{2.2pt} = v - \frac{\varepsilon}{2} \partial_{x} S(x) \\ x' &:= \textcolor{#FD971F}{\Lambda}(x, \tilde{v})\, = x + \varepsilon\, \tilde{v} \\ v' &:= \textcolor{#F06292}{\Gamma}(x', \tilde{v}) = \tilde{v} - \frac{\varepsilon}{2} \partial_{x} S(x') \end{align*}$$
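In code, one leapfrog step is a direct transcription of these three maps; a minimal sketch, where `grad_S` computes $\partial_{x} S(x)$:

```python
def leapfrog(x, v, grad_S, eps):
    """One leapfrog step: half-kick (Gamma), full drift (Lambda), half-kick (Gamma)."""
    v_tilde = v - 0.5 * eps * grad_S(x)              # Gamma(x, v)
    x_prime = x + eps * v_tilde                      # Lambda(x, v_tilde)
    v_prime = v_tilde - 0.5 * eps * grad_S(x_prime)  # Gamma(x', v_tilde)
    return x_prime, v_prime
```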
[!WARNING]
- Resample $v_{0} \sim \mathcal{N}(0, \mathbb{1})$ at the beginning of each trajectory
Note:
- We build a trajectory of $N_{\mathrm{LF}}$ leapfrog steps[^3] $$(x_{0}, v_{0}) \rightarrow (x_{1}, v_{1}) \rightarrow \cdots \rightarrow (x', v')$$
- And propose $x'$ as the next state in our chain
- We then accept / reject $x'$ using the Metropolis-Hastings criterion (sketched below): $$A(x'|x) = \min\left\{1, \frac{p(x')}{p(x)}\left|\frac{\partial x'}{\partial x}\right|\right\}$$
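A minimal sketch of the accept / reject step, evaluated in log space for numerical stability (names are illustrative):

```python
import numpy as np

def metropolis_hastings(x, x_prime, log_p, log_det_jac=0.0, rng=None):
    """A(x'|x) = min{1, p(x')/p(x) |dx'/dx|}; for vanilla HMC the Jacobian
    term is zero, since the leapfrog update is volume-preserving."""
    rng = rng or np.random.default_rng()
    log_A = min(0.0, log_p(x_prime) - log_p(x) + log_det_jac)
    return x_prime if np.log(rng.random()) < log_A else x
```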
- What do we want in a good sampler?
  - Fast mixing (small autocorrelations)
  - Fast burn-in (quick convergence)
- Problems with HMC:
  - Energy levels selected randomly $\rightarrow$ slow mixing
  - Cannot easily traverse low-density zones $\rightarrow$ slow convergence
Topological Charge:
Note:
[!IMPORTANT]
$Q$ gets stuck!
- as $\beta \longrightarrow \infty$: $Q \longrightarrow \text{const.}$ and $\delta Q = \left(Q^{\ast} - Q\right) \rightarrow 0$ $\textcolor{#FF5252}{\Longrightarrow}$
- the # of configs required to estimate errors grows exponentially: $\tau_{\mathrm{int}}^{Q} \longrightarrow \infty$ (a sketch for measuring $Q$ follows)
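For the 2D $U(1)$ model used later in the talk (link angles $x$ with shape `[Nb, 2, Nt, Nx]`), the integer-valued charge can be measured with a short sketch like this (my own illustration, not the l2hmc-qcd implementation):

```python
import math
import torch

def topological_charge(x: torch.Tensor) -> torch.Tensor:
    """Q = (1/2pi) * sum_P arg(x_P) for 2D U(1); x.shape = [Nb, 2, Nt, Nx]."""
    x0, x1 = x[:, 0], x[:, 1]
    # Plaquette angle: x_P = x_0(n) + x_1(n + mu_0) - x_0(n + mu_1) - x_1(n)
    xP = x0 + torch.roll(x1, -1, dims=1) - torch.roll(x0, -1, dims=2) - x1
    # Wrap into [-pi, pi), then count windings
    xP = torch.remainder(xP + math.pi, 2 * math.pi) - math.pi
    return (xP.sum(dim=(1, 2)) / (2 * math.pi)).round()
```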
- Introduce two (invertible) networks, vNet and xNet:
  - vNet: $(x, F) \longrightarrow \left(s_{v},\, t_{v},\, q_{v}\right)$
  - xNet: $(x, v) \longrightarrow \left(s_{x},\, t_{x},\, q_{x}\right)$
- Use these $(s, t, q)$ in the generalized MD update:
  - $\Gamma_{\theta}^{\pm}: ({x}, \textcolor{#07B875}{v}) \xrightarrow[]{\textcolor{#F06292}{s_{v}, t_{v}, q_{v}}} (x, \textcolor{#07B875}{v'})$
  - $\Lambda_{\theta}^{\pm}: (\textcolor{#AE81FF}{x}, v) \xrightarrow[]{\textcolor{#FD971F}{s_{x}, t_{x}, q_{x}}} (\textcolor{#AE81FF}{x'}, v)$
[!NONE]
- Introduce $d \sim \mathcal{U}(\pm)$ to determine the direction of our update:
  - $\textcolor{#07B875}{v'} = \Gamma^{\pm}({x}, \textcolor{#07B875}{v})$ $\hspace{46pt}$ update $v$
  - $\textcolor{#AE81FF}{x'} = x_{B}\, +\, \Lambda^{\pm}(x_{A}, {v'})$ $\hspace{10pt}$ update first half: $x_{A}$
  - $\textcolor{#AE81FF}{x''} = x'_{A}\, +\, \Lambda^{\pm}(x'_{B}, {v'})$ $\hspace{8pt}$ update other half: $x_{B}$
  - $\textcolor{#07B875}{v''} = \Gamma^{\pm}({x''}, \textcolor{#07B875}{v'})$ $\hspace{36pt}$ update $v$
[!NONE]
- Resample both $v \sim \mathcal{N}(0, 1)$ and $d \sim \mathcal{U}(\pm)$ at the beginning of each trajectory
- To ensure ergodicity + reversibility, we split the $x$ update into sequential (complementary) updates
- Introduce directional variable $d \sim \mathcal{U}(\pm)$, resampled at the beginning of each trajectory
- Note that $\left(\Gamma^{+}\right)^{-1} = \Gamma^{-}$, i.e. $$\Gamma^{+}\left[\Gamma^{-}(x, v)\right] = \Gamma^{-}\left[\Gamma^{+}(x, v)\right] = (x, v)$$ (one full layer is sketched below)
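Putting these pieces together, a sketch of one full forward-direction leapfrog layer. The `x_net` / `v_net` callables returning $(s, t, q)$ and the binary `mask` splitting $x$ into halves are hypothetical stand-ins, and the exponential-scaling form of $\Lambda$ is taken from the L2HMC construction rather than spelled out on this slide:

```python
import torch

def leapfrog_layer(x, v, eps, x_net, v_net, force, mask):
    """One generalized (d = +) leapfrog layer; mask selects x_A, (1 - mask) selects x_B."""
    mb = 1.0 - mask
    # v' = Gamma^+(x, v), conditioned on (x, F)
    s, t, q = v_net(x, force(x))
    v = v * torch.exp(0.5 * eps * s) - 0.5 * eps * (force(x) * torch.exp(eps * q) + t)
    # x' = x_B + Lambda^+(x_A, v'): update the masked half, freeze the rest
    s, t, q = x_net(mb * x, v)
    x = mb * x + mask * (x * torch.exp(eps * s) + eps * (v * torch.exp(eps * q) + t))
    # x'' = x'_A + Lambda^+(x'_B, v'): update the complementary half
    s, t, q = x_net(mask * x, v)
    x = mask * x + mb * (x * torch.exp(eps * s) + eps * (v * torch.exp(eps * q) + t))
    # v'' = Gamma^+(x'', v')
    s, t, q = v_net(x, force(x))
    v = v * torch.exp(0.5 * eps * s) - 0.5 * eps * (force(x) * torch.exp(eps * q) + t)
    return x, v
```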
[!NONE]
input: $x$
- Resample: $\textcolor{#07B875}{v} \sim \mathcal{N}(0, \mathbb{1})$; $\,d \sim \mathcal{U}(\pm)$
- Construct initial state: $\textcolor{#939393}{\xi} = (\textcolor{#AE81FF}{x}, \textcolor{#07B875}{v}, {\pm})$

forward: Generate proposal $\xi'$ by passing the initial $\xi$ through $N_{\mathrm{LF}}$ leapfrog layers: $$\textcolor{#939393}{\xi} \hspace{1pt}\xrightarrow[]{\tiny{\mathrm{LF} \text{ layer}}} \xi_{1} \longrightarrow \cdots \longrightarrow \xi_{N_{\mathrm{LF}}} = \textcolor{#f8f8f8}{\xi'} := (\textcolor{#AE81FF}{x''}, \textcolor{#07B875}{v''})$$
- Accept / Reject: $$A({\textcolor{#f8f8f8}{\xi'}}|{\textcolor{#939393}{\xi}}) = \min\left\{1, \frac{\pi(\textcolor{#f8f8f8}{\xi'})}{\pi(\textcolor{#939393}{\xi})} \left|\mathcal{J}\left(\textcolor{#f8f8f8}{\xi'}, \textcolor{#939393}{\xi}\right)\right|\right\}$$

backward (if training):
- Evaluate the loss function[^5] $\mathcal{L} \gets \mathcal{L}_{\theta}(\textcolor{#f8f8f8}{\xi'}, \textcolor{#939393}{\xi})$ and backprop

return: $\textcolor{#AE81FF}{x}_{i+1}$
- Evaluate the MH criterion above and return the accepted config: $$\textcolor{#AE81FF}{{x}_{i+1}} \gets \begin{cases} \textcolor{#AE81FF}{x''} & \text{with prob. } A(\textcolor{#f8f8f8}{\xi'}|\textcolor{#939393}{\xi}) \hspace{10pt} ✅ \\ \textcolor{#AE81FF}{x} & \text{with prob. } 1 - A(\textcolor{#f8f8f8}{\xi'}|\textcolor{#939393}{\xi}) \hspace{10pt} 🚫 \end{cases}$$
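A sketch of how these steps compose into one training iteration; `dynamics`, `log_prob`, and `loss_fn` are hypothetical stand-ins (not the exact l2hmc-qcd API), and $x$ is assumed flattened to shape `[Nb, D]`:

```python
import torch

def train_step(x, dynamics, log_prob, loss_fn, optimizer):
    """One l2hmc training iteration: propose, accept/reject, backprop."""
    v = torch.randn_like(x)                     # resample momentum
    d = 1 if torch.rand(()) < 0.5 else -1       # resample direction d ~ U(+/-)
    x_prop, v_prop, logdet = dynamics(x, v, d)  # N_LF leapfrog layers
    # A(xi'|xi) = min{1, pi(xi')/pi(xi) |J|}, per chain, in log space
    log_A = (log_prob(x_prop, v_prop) - log_prob(x, v) + logdet).clamp(max=0.0)
    A = log_A.exp()
    loss = loss_fn(x, x_prop, A)                # e.g. -E[dQ^2 * A]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Metropolis-Hastings: accept each chain with probability A
    accept = (torch.rand_like(A) < A).to(x.dtype).unsqueeze(-1)
    return (accept * x_prop + (1.0 - accept) * x).detach(), loss.item()
```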
[!NOTE]
Write link variables $U_{\mu}(x) \in SU(3)$: $$\begin{align*} U_{\mu}(x) &= \exp\left[{i\, \textcolor{#AE81FF}{\omega^{k}_{\mu}(x)}\, \lambda^{k}}\right] \\ &= e^{i \textcolor{#AE81FF}{Q}}, \quad \text{with} \quad \textcolor{#AE81FF}{Q} \in \mathfrak{su}(3) \end{align*}$$
where $\omega^{k}_{\mu}(x) \in \mathbb{R}$, and $\lambda^{k}$ are the generators of $SU(3)$
[!TIP]
- Introduce the momentum $P_{\mu}(x) = P^{k}_{\mu}(x)\, \lambda^{k}$, conjugate to $\omega^{k}_{\mu}(x)$
[!IMPORTANT]
$$S_{G} = -\frac{\beta}{6} \sum \mathrm{Tr}\left[U_{\mu\nu}(x) + U^{\dagger}_{\mu\nu}(x)\right]$$
where $U_{\mu\nu}(x) = U_{\mu}(x)\, U_{\nu}(x+\hat{\mu})\, U^{\dagger}_{\mu}(x+\hat{\nu})\, U^{\dagger}_{\nu}(x)$ is the plaquette
Hamiltonian: $H[P, U] = \frac{1}{2} P^{2} + S[U]$ $\Longrightarrow$
[!NONE]
$U$ update: $\frac{d\omega^{k}}{dt} = \frac{\partial H}{\partial P^{k}}$ $$\frac{d\omega^{k}}{dt}\lambda^{k} = P^{k}\lambda^{k} \Longrightarrow \frac{dQ}{dt} = P$$ $$\begin{align*} Q(\textcolor{#FFEE58}{\varepsilon}) &= Q(0) + \textcolor{#FFEE58}{\varepsilon} P(0) \Longrightarrow \\ -i\, \log U(\textcolor{#FFEE58}{\varepsilon}) &= -i\, \log U(0) + \textcolor{#FFEE58}{\varepsilon} P(0) \\ U(\textcolor{#FFEE58}{\varepsilon}) &= e^{i\,\textcolor{#FFEE58}{\varepsilon} P(0)}\, U(0) \Longrightarrow \\ \textcolor{#FD971F}{\Lambda}:\,\, U \longrightarrow U' &:= e^{i\varepsilon P'} U \end{align*}$$
[!NONE]
$P$ update: $\frac{dP^{k}}{dt} = -\frac{\partial H}{\partial \omega^{k}}$ $$\frac{dP^{k}}{dt} = -\frac{\partial H}{\partial \omega^{k}} = -\frac{\partial H}{\partial Q} = -\frac{dS}{dQ} \Longrightarrow$$ $$\begin{align*} P(\textcolor{#FFEE58}{\varepsilon}) &= P(0) - \textcolor{#FFEE58}{\varepsilon} \left.\frac{dS}{dQ}\right|_{t=0} \\ &= P(0) - \textcolor{#FFEE58}{\varepsilon}\, \textcolor{#E599F7}{F[U]} \\ \textcolor{#F06292}{\Gamma}:\,\, P \longrightarrow P' &:= P - \frac{\varepsilon}{2} F[U] \end{align*}$$
- Momentum update: $$\textcolor{#F06292}{\Gamma}: P \longrightarrow P' := P - \frac{\varepsilon}{2} F[U]$$
- Link update: $$\textcolor{#FD971F}{\Lambda}: U \longrightarrow U' := e^{i\varepsilon P'} U$$
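Both updates are essentially one-liners in PyTorch; a minimal sketch, where the batched matrix exponential acts on the trailing $3 \times 3$ axes:

```python
import torch

def update_P(P, F, eps):
    """Momentum update Gamma: P' = P - (eps / 2) * F[U]."""
    return P - 0.5 * eps * F

def update_U(U, P, eps):
    """Link update Lambda: U' = exp(i * eps * P) @ U."""
    return torch.matrix_exp(1j * eps * P) @ U
```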
- We maintain a batch of Nb lattices, all updated in parallel:
  - $U$.dtype = complex128
  - $U$.shape = [Nb, 4, Nt, Nx, Ny, Nz, 3, 3]
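For example, allocating such a batch (cold start, all links set to the identity; sizes are illustrative):

```python
import torch

Nb, Nt, Nx, Ny, Nz = 8, 8, 8, 8, 8
# One SU(3) matrix per direction per site: [Nb, 4, Nt, Nx, Ny, Nz, 3, 3]
U = torch.eye(3, dtype=torch.complex128).expand(Nb, 4, Nt, Nx, Ny, Nz, 3, 3).clone()
assert U.dtype == torch.complex128
```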
- Two networks, UNet and PNet, produce the $(s, t, q)$ for the link ($U$) and momentum ($P$) updates, respectively; let's look at the PNet:
- input[^6]: $\hspace{7pt}\left(U, F\right) := (e^{i Q}, F)$ $$\begin{align*} h_{0} &= \sigma\left(w_{Q} Q + w_{F} F + b\right) \\ h_{1} &= \sigma\left(w_{1} h_{0} + b_{1}\right) \\ &\vdots \\ h_{n} &= \sigma\left(w_{n-1} h_{n-2} + b_{n}\right) \\ \textcolor{#FF5252}{z} &:= \sigma\left(w_{n} h_{n-1} + b_{n}\right) \longrightarrow \end{align*}$$
- output[^7]: $\hspace{7pt} (s_{P}, t_{P}, q_{P})$, where $$s_{P} = \lambda_{s} \tanh(w_s \textcolor{#FF5252}{z} + b_s), \quad t_{P} = w_{t} \textcolor{#FF5252}{z} + b_{t}, \quad q_{P} = \lambda_{q} \tanh(w_{q} \textcolor{#FF5252}{z} + b_{q})$$
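A minimal PyTorch sketch of such a network, assuming $Q$ and $F$ have already been flattened to real feature vectors; layer names and sizes are illustrative, not the actual l2hmc-qcd module:

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """(Q, F) -> (s_P, t_P, q_P), following the layer structure above."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.w_Q = nn.Linear(dim, hidden)     # w_Q Q + b
        self.w_F = nn.Linear(dim, hidden)     # w_F F
        self.w_1 = nn.Linear(hidden, hidden)  # hidden layer
        self.w_s = nn.Linear(hidden, dim)
        self.w_t = nn.Linear(hidden, dim)
        self.w_q = nn.Linear(hidden, dim)
        self.lambda_s = nn.Parameter(torch.ones(dim))  # trainable lambda_s
        self.lambda_q = nn.Parameter(torch.ones(dim))  # trainable lambda_q

    def forward(self, Q, F):
        h = torch.relu(self.w_Q(Q) + self.w_F(F))  # sigma(w_Q Q + w_F F + b)
        z = torch.relu(self.w_1(h))
        s_P = self.lambda_s * torch.tanh(self.w_s(z))
        t_P = self.w_t(z)
        q_P = self.lambda_q * torch.tanh(self.w_q(z))
        return s_P, t_P, q_P
```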
- Use $(s_{P}, t_{P}, q_{P})$ to update $\Gamma^{\pm}: (U, P) \rightarrow \left(U, P_{\pm}\right)$[^8]:
  - forward $(d = \textcolor{#FF5252}{+})$: $$\Gamma^{\textcolor{#FF5252}{+}}(U, P) := P_{\textcolor{#FF5252}{+}} = P \cdot e^{\frac{\varepsilon}{2} s_{P}} - \frac{\varepsilon}{2}\left[F \cdot e^{\varepsilon q_{P}} + t_{P}\right]$$
  - backward $(d = \textcolor{#1A8FFF}{-})$: $$\Gamma^{\textcolor{#1A8FFF}{-}}(U, P) := P_{\textcolor{#1A8FFF}{-}} = e^{-\frac{\varepsilon}{2} s_{P}} \left\{P + \frac{\varepsilon}{2}\left[F \cdot e^{\varepsilon q_{P}} + t_{P}\right]\right\}$$
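These two maps are exact inverses of each other by construction, since $(s_{P}, t_{P}, q_{P})$ depend only on $(U, F)$ and not on $P$; a direct transcription:

```python
import torch

def gamma_plus(P, F, s, t, q, eps):
    """Forward (d = +) momentum update."""
    return P * torch.exp(0.5 * eps * s) - 0.5 * eps * (F * torch.exp(eps * q) + t)

def gamma_minus(P, F, s, t, q, eps):
    """Backward (d = -) update: gamma_minus(gamma_plus(P)) == P."""
    return torch.exp(-0.5 * eps * s) * (P + 0.5 * eps * (F * torch.exp(eps * q) + t))
```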
- Figure panels: "Deviation in …", "Topological charge mixing", "Artificial influx of energy"
- Distribution of $\log|\mathcal{J}|$ over all chains, at each leapfrog step $N_{\mathrm{LF}}$ ($= 0, 1, \ldots, 8$) during training
- Further code development
- Continue to use / test different network architectures
  - Gauge equivariant NNs for the $U_{\mu}(x)$ update
- Continue to test different loss functions for training
- Scaling:
  - Lattice volume
  - Network size
  - Batch size
  - # of GPUs
Note
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
- Links:
  - 📊 slides (GitHub: saforem2/lattice23)
- References:
  - (Foreman et al. 2022; Foreman, Jin, and Osborn 2022, 2021)
  - (Boyda et al. 2022; Shanahan et al. 2022)
- Huge thank you to:
- Yannick Meurice
- Norman Christ
- Akio Tomiya
- Nobuyuki Matsumoto
- Richard Brower
- Luchang Jin
- Chulwoo Jung
- Peter Boyle
- Taku Izubuchi
- Denis Boyda
- Dan Hackett
- ECP-CSD group
- ALCF Staff + Datascience Group
Boyda, Denis et al. 2022. “Applications of Machine Learning to Lattice Quantum Field Theory.” In Snowmass 2021. https://arxiv.org/abs/2202.05838.
Foreman, Sam, Taku Izubuchi, Luchang Jin, Xiao-Yong Jin, James C. Osborn, and Akio Tomiya. 2022. “HMC with Normalizing Flows.” PoS LATTICE2021: 073. https://doi.org/10.22323/1.396.0073.
Foreman, Sam, Xiao-Yong Jin, and James C. Osborn. 2021. “Deep Learning Hamiltonian Monte Carlo.” In 9th International Conference on Learning Representations. https://arxiv.org/abs/2105.03418.
———. 2022. “LeapfrogLayers: A Trainable Framework for Effective Topological Sampling.” PoS LATTICE2021 (May): 508. https://doi.org/10.22323/1.396.0508.
Shanahan, Phiala et al. 2022. “Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning,” September. https://arxiv.org/abs/2209.07559.
Figure: "Deviation from Average"
- Want to maximize the expected squared charge difference[^9]: $$\mathcal{L}_{\theta}\left(\xi^{\ast}, \xi\right) = \mathbb{E}_{p(\xi)}\big[-\textcolor{#FA5252}{\delta Q}^{2}\left(\xi^{\ast}, \xi\right) \cdot A(\xi^{\ast}|\xi)\big]$$
- Where:
  - $\delta Q$ is the tunneling rate: $$\textcolor{#FA5252}{\delta Q}(\xi^{\ast}, \xi) = \left|Q^{\ast} - Q\right|$$
  - $A(\xi^{\ast}|\xi)$ is the probability[^10] of accepting the proposal $\xi^{\ast}$: $$A(\xi^{\ast}|\xi) = \min\left(1, \frac{p(\xi^{\ast})}{p(\xi)}\left|\frac{\partial \xi^{\ast}}{\partial \xi^{T}}\right|\right)$$ (a one-line implementation is sketched below)
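Given per-chain tensors for $\delta Q$ and $A$, the loss itself is one line; a minimal sketch:

```python
import torch

def loss_fn(dQ: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """L_theta = E[-dQ^2 * A]; minimizing it maximizes the expected squared
    charge difference, weighted by the acceptance probability."""
    return -(dQ ** 2 * A).mean()
```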
- Stack gauge links as shape$\left(U_{\mu}\right)$ = [Nb, 2, Nt, Nx] $\in \mathbb{C}$: $$x_{\mu}(n) := \left[\cos(x), \sin(x)\right]$$ with shape$\left(x_{\mu}\right)$ = [Nb, 2, Nt, Nx, 2] $\in \mathbb{R}$ (see the sketch after this list)
- $x$-Network: $\psi_{\theta}: (x, v) \longrightarrow \left(s_{x},\, t_{x},\, q_{x}\right)$
- $v$-Network: $\varphi_{\theta}: (x, v) \longrightarrow \left(s_{v},\, t_{v},\, q_{v}\right)$ $\hspace{2pt}\longleftarrow$ let's look at this
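A short sketch of the stacking described above (sizes are illustrative):

```python
import math
import torch

Nb, Nt, Nx = 64, 16, 16
x = 2 * math.pi * torch.rand(Nb, 2, Nt, Nx) - math.pi  # link angles in [-pi, pi)
# Real-valued network input: stack [cos(x), sin(x)] along a trailing axis
x_stack = torch.stack([torch.cos(x), torch.sin(x)], dim=-1)
assert x_stack.shape == (Nb, 2, Nt, Nx, 2)
```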
- $v$-Update[^11]:
  - forward $(d = \textcolor{#FF5252}{+})$:
  - backward $(d = \textcolor{#1A8FFF}{-})$:
- $x$-Update:
  - forward $(d = \textcolor{#FF5252}{+})$:
  - backward $(d = \textcolor{#1A8FFF}{-})$:
[!NOTE]
$$U_{\mu}(n) = e^{i x_{\mu}(n)} \in \mathbb{C}, \quad \text{where} \quad x_{\mu}(n) \in [-\pi, \pi)$$
[!IMPORTANT]
$$S_{\beta}(x) = \beta\sum_{P} \cos \textcolor{#00CCFF}{x_{P}},$$ $$\textcolor{#00CCFF}{x_{P}} = \left[x_{\mu}(n) + x_{\nu}(n+\hat{\mu}) - x_{\mu}(n+\hat{\nu}) - x_{\nu}(n)\right]$$
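A vectorized sketch of this action for link angles of shape `[Nb, 2, Nt, Nx]`, using `torch.roll` for the shifted links (my own transcription, following the sign convention on this slide):

```python
import torch

def wilson_action(x: torch.Tensor, beta: float) -> torch.Tensor:
    """S_beta(x) = beta * sum_P cos(x_P); x.shape = [Nb, 2, Nt, Nx]."""
    x0, x1 = x[:, 0], x[:, 1]
    # x_P = x_0(n) + x_1(n + mu_0) - x_0(n + mu_1) - x_1(n)
    xP = x0 + torch.roll(x1, -1, dims=1) - torch.roll(x0, -1, dims=2) - x1
    return beta * torch.cos(xP).sum(dim=(1, 2))
```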
Note:
- Introduce an annealing schedule during the training phase: $$\left\{\gamma_{t}\right\}_{t=0}^{N} = \left\{\gamma_{0}, \gamma_{1}, \ldots, \gamma_{N-1}, \gamma_{N}\right\}$$ where $\gamma_{0} < \gamma_{1} < \cdots < \gamma_{N} \equiv 1$, and $\left|\gamma_{t+1} - \gamma_{t}\right| \ll 1$
- Note:
  - for $\left|\gamma_{t}\right| < 1$, this rescaling helps to reduce the height of the energy barriers $\Longrightarrow$ it is easier for our sampler to explore previously inaccessible regions of phase space (one simple realization is sketched below)
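One simple realization of such a schedule, training against the rescaled action $\gamma_{t}\, S(x)$; the endpoints, step count, and target coupling are illustrative:

```python
import numpy as np

beta_final = 4.0                        # illustrative target coupling
# gamma_0 < gamma_1 < ... < gamma_N = 1, with |gamma_{t+1} - gamma_t| << 1
gammas = np.linspace(0.1, 1.0, num=1000)
for gamma_t in gammas:
    beta_t = gamma_t * beta_final       # run one training step at S_{beta_t}
```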
- To estimate physical quantities, we:
  - Calculate physical observables at increasing spatial resolution
  - Perform an extrapolation to the continuum limit
Footnotes

[^1]: Here, $\sim$ means "is distributed according to".

[^3]: We always start by resampling the momentum, $v_{0} \sim \mathcal{N}(0, \mathbb{1})$.

[^5]: For a simple $\mathbf{x} \in \mathbb{R}^{2}$ example, $\mathcal{L}_{\theta} = A(\xi^{\ast}|\xi) \cdot \left(\mathbf{x}^{\ast} - \mathbf{x}\right)^{2}$.

[^6]: $\sigma(\cdot)$ denotes an activation function.

[^7]: $\lambda_{s},\, \lambda_{q} \in \mathbb{R}$ are trainable parameters.

[^8]: Note that $\left(\Gamma^{+}\right)^{-1} = \Gamma^{-}$, i.e. $\Gamma^{+}\left[\Gamma^{-}(U, P)\right] = \Gamma^{-}\left[\Gamma^{+}(U, P)\right] = (U, P)$.

[^9]: Where $\xi^{\ast}$ is the proposed configuration (prior to Accept / Reject).

[^10]: And $\left|\frac{\partial \xi^{\ast}}{\partial \xi^{T}}\right|$ is the Jacobian of the transformation from $\xi \rightarrow \xi^{\ast}$.

[^11]: Note that $\left(\Gamma^{+}\right)^{-1} = \Gamma^{-}$, i.e. $\Gamma^{+}\left[\Gamma^{-}(x, v)\right] = \Gamma^{-}\left[\Gamma^{+}(x, v)\right] = (x, v)$.