Understanding MaskedMLP #30
-
Hi everyone, I am trying to understand the code of Zuko, but `zuko.nn.MaskedMLP` confuses me a lot. Is there any published work related to this class? Or is there a mathematical proof of why it produces a Jacobian with exactly the structure of the given adjacency matrix?
-
Hey @zqy767, the class is used for the construction of masked autoregressive flows. Hope that helps you :) Cheers
-
Hello @zqy767, this is a very good question! I'll start by explaining why `MaskedMLP` is necessary in autoregressive transformations and then how it is implemented in Zuko. I also invite you to take a look at discussion #16, whose subject is related.

**Why**

Let $x$ be a vector in $\mathbb{R}^n$. An autoregressive transformation is a mapping $y = f(x) \in \mathbb{R}^n$ such that the $i$-th element of $y$ is a bijective univariate transformation of the $i$-th element of $x$, conditioned on the preceding elements. That is, $y_i = f_i(x_i \mid h_i(x_{1:i-1}))$, where $x_{1:i} = (x_1, x_2, \dots, x_i)$ and $h_i$ returns the parameters of the univariate transformation $f_i$. $h_i$ is typically a neural network. By construction, the $i$-th output depends only on the first $i$ inputs.
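For instance, with $n = 3$, this dependency structure forces the Jacobian of $f$ to be lower triangular,

$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & 0 & 0 \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & 0 \\ \frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3} \end{pmatrix},$$

so its determinant is simply the product of the diagonal terms $\prod_i \frac{\partial y_i}{\partial x_i}$, which is what makes autoregressive transformations attractive for normalizing flows.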
However, evaluating $n$ separate networks $h_i$ is expensive. It is therefore preferable to merge the conditioners into a single network $h(x)$ that returns the parameters of all univariate transformations at once, under the constraint that its $i$-th output only depends on $x_{1:i-1}$. Assuming that this constraint is enforced exactly, the transformation remains autoregressive. Such a dependency structure between inputs and outputs is precisely what the adjacency matrix given to `MaskedMLP` encodes: $A_{ij} = 1$ if and only if the $i$-th output is allowed to depend on the $j$-th input.

**How**

The typical way to impose adjacency in MLPs is with binary masks. Let $W^{(1)}, W^{(2)}, \dots, W^{(L)}$ be the weight matrices of the $L$ linear layers of an MLP. Each $W^{(l)}$ is multiplied element-wise with a binary mask $M^{(l)}$, which removes some of its connections. The goal is to find a factorization of masks $M^{(1)}, \dots, M^{(L)}$ such that the product $M^{(L)} \cdots M^{(2)} M^{(1)}$ is non-zero only where $A$ is non-zero. This guarantees that the $i$-th output cannot depend on the $j$-th input whenever $A_{ij} = 0$, and hence that the Jacobian of the network has at most the sparsity of $A$. Trivial factorizations respecting this property exist, but they heavily reduce the expressiveness of the network. Hence, what is a good factorization and how to find one?

**Implementation**

In short, the factorization algorithm implemented in `MaskedMLP` allocates each hidden neuron to a row $A_i$ of the adjacency matrix, meaning that the neuron is only allowed to depend, directly or indirectly, on the inputs $\{ j : A_{ij} = 1 \}$. Formally, let's say that a connection from a neuron allocated to $A_j$ towards a neuron allocated to $A_i$ is permitted if and only if $A_j$ is a subset of $A_i$, that is $A_{jk} \leq A_{ik}$ for all $k$. Let's assume that all the neurons of the input layer are allocated to the rows of the identity matrix (input $k$ depends only on itself) and that all the neurons of the output layer are allocated to the corresponding rows of $A$. Then, by induction over the layers, the actual dependency set of each neuron is a subset of the row it is allocated to, for all allocations of the hidden neurons. Hence, given the allocations of layers $l$ and $l + 1$, the mask $M^{(l+1)}$ is fully determined by the subset rule, and the resulting network is guaranteed to respect the adjacency $A$.
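To see the subset rule in action, here is a minimal, self-contained sketch. It is a simplification, not Zuko's actual implementation: the `MaskedLinear` helper and the cyclic allocation of hidden neurons to rows of $A$ are illustrative choices. It builds the masks for a strictly lower-triangular adjacency and then checks with `torch.autograd.functional.jacobian` that the Jacobian of the network vanishes wherever $A$ is zero, which is exactly the property asked about above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class MaskedLinear(nn.Linear):
    """Linear layer whose weight is multiplied element-wise by a fixed binary mask."""

    def __init__(self, mask: torch.Tensor):
        super().__init__(mask.shape[1], mask.shape[0])
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Strictly lower-triangular adjacency: output i may only depend on inputs j < i.
n = 5
A = torch.ones(n, n).tril(diagonal=-1).bool()

# Allocate each hidden neuron to a row of A (here, cyclically). A connection
# from a source allocated to A_j towards a target allocated to A_i is permitted
# iff A_j is a subset of A_i. Inputs act as neurons allocated to the rows of
# the identity matrix, i.e. input j depends only on itself.
sizes = [16, 16, n]
modules = []
prev = torch.eye(n, dtype=torch.bool)  # allocation of the input layer
for k, width in enumerate(sizes):
    last = k == len(sizes) - 1
    rows = A if last else A[torch.arange(width) % n]
    # mask[i, j] = 1 iff prev[j] is a subset of rows[i]
    mask = ~(prev[None, :, :] & ~rows[:, None, :]).any(dim=-1)
    modules.append(MaskedLinear(mask.float()))
    if not last:
        modules.append(nn.Tanh())
    prev = rows
net = nn.Sequential(*modules)

# The Jacobian must vanish wherever the adjacency forbids a dependency.
x = torch.randn(n)
J = torch.autograd.functional.jacobian(net, x)
assert (J[~A] == 0).all()
print(J.abs() > 0)  # non-zero entries appear only strictly below the diagonal
```

The same check can be run against `zuko.nn.MaskedMLP` itself by constructing it with the same adjacency matrix.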