
Faster advice-provider-to-memory data un-hashing #1600

Open
Al-Kindi-0 opened this issue Dec 13, 2024 · 1 comment

Al-Kindi-0 commented Dec 13, 2024

Feature description

We usually use the following pattern for loading data from the advice provider into the VM's memory:

push.DESTINATION_POINTER
padw padw padw
repeat.x
    adv_pipe hperm
end
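To make the data flow concrete, here is a toy Python model of what the loop does: each iteration moves a double word (8 field elements) from the advice tape into memory and absorbs it into the rate portion of a sponge state. The permutation below is a stand-in, not the real RPO permutation, and the element-addressed memory layout is a simplification.

```python
# Toy model of the `adv_pipe hperm` loop. The permutation is a
# placeholder (any fixed function works for modeling the data flow);
# memory is element-addressed here for simplicity.

P = 2**64 - 2**32 + 1  # Miden's base field modulus

def toy_perm(state):
    # stand-in for the RPO permutation
    return [(3 * s + i) % P for i, s in enumerate(state)]

def unhash_to_memory(advice, dst_ptr, memory):
    state = [0] * 12              # capacity (4 elements) + rate (8 elements)
    ptr = dst_ptr
    for chunk_start in range(0, len(advice), 8):
        chunk = advice[chunk_start:chunk_start + 8]
        memory[ptr:ptr + 8] = chunk   # adv_pipe: write double word to memory
        state[4:12] = chunk           # ...and load it into the rate
        state = toy_perm(state)       # hperm
        ptr += 8
    return state[4:8], ptr            # digest word, updated pointer

advice = list(range(16))              # two double words of advice data
memory = [0] * 32
digest, end = unhash_to_memory(advice, 0, memory)
assert memory[:16] == advice and end == 16
```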

This pattern is so fundamental that it might make sense to introduce a native instruction for it, implemented at the chiplet level. The following is meant to start a discussion on how such an instruction could be built.

Modifying the hasher chiplet

From the stack side, the instruction could take the following stack elements as input:

  1. Digest, the claimed commitment to the data about to be un-hashed, i.e., the hash of that data.
  2. dst_ptr, a pointer to the start of the memory region where the data will be stored.
  3. end_ptr, a pointer to the last word of that region.

The stack will then have to send a request to the hasher chiplet in order to initiate a linear hash of the elements. The request could be
(op_label, clk, ctx, dst_ptr) and (op_label, clk, ctx, end_ptr, Digest).

The hasher will then respond with (op_label, clk, ctx, dst_ptr) to signal the start of the requested operation, and at the last step with (op_label, clk, ctx, end_ptr, Digest) to signal its end. Note that there appears to be no way around adding at least three extra columns to the hasher chiplet in order to keep track of clk, ctx and ptr, where clk and ctx remain fixed throughout the operation and ptr starts at dst_ptr and is incremented by 2 at each 8-row cycle until it reaches end_ptr. We might also need one additional column for the new internal hasher instruction, but we might get away without it by redesigning things.
Thus, in total, we would need to increase the number of columns in the hasher chiplet by at least 3.
We now describe what happens between the two hasher responses above. Upon initiating the linear hash operation, the hasher chiplet will send a request to the memory chiplet at each 8-row cycle, of the form (op_label, clk, ctx, addr, Data[..4]) and (op_label, clk, ctx, addr + 1, Data[4..]), where Data[0..8] is the current double word occupying the rate portion of the hasher state, filled in non-deterministically by the prover.
Note that with the current maximal degree of the flags in the hasher chiplet, which is $4$, the chiplets' bus can accommodate the above response and double request even if we have to increase the degrees of the current internal flags.
On the memory chiplet side nothing changes, since we use a single bus to connect all of the VM components. This would change if, for some reason, we decided to isolate memory-related messages onto a separate bus.
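The request/response pairs above can be modeled as multiset bookkeeping on the chiplets' bus: the bus balances exactly when every request is met by a matching response. The sketch below illustrates this accounting; message labels, tuple layouts, and concrete values are made up for illustration, not the real encoding.

```python
# Sketch of chiplets'-bus bookkeeping for the proposed instruction,
# modeled as a multiset of messages: requests add, responses cancel,
# and a valid execution leaves the bus balanced (all counts zero).
from collections import Counter

bus = Counter()

def request(msg): bus[msg] += 1
def respond(msg): bus[msg] -= 1

clk, ctx = 100, 0
dst_ptr, end_ptr = 40, 46          # word-addressed region, 4 double words
digest = (1, 2, 3, 4)              # claimed commitment (illustrative)

# stack side: initiate the linear hash
request(("HASH_INIT", clk, ctx, dst_ptr))
request(("HASH_DONE", clk, ctx, end_ptr, digest))

# hasher side: acknowledge start and end of the operation
respond(("HASH_INIT", clk, ctx, dst_ptr))
respond(("HASH_DONE", clk, ctx, end_ptr, digest))

# hasher -> memory: one double-word write per 8-row cycle,
# with ptr advancing by 2 words until it reaches end_ptr
for ptr in range(dst_ptr, end_ptr + 2, 2):
    data = tuple(range(8))         # rate contents, supplied by the prover
    for addr, word in ((ptr, data[:4]), (ptr + 1, data[4:])):
        request(("MEM_WRITE", clk, ctx, addr, word))
        respond(("MEM_WRITE", clk, ctx, addr, word))  # memory chiplet answers

assert all(v == 0 for v in bus.values())  # bus balances
```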

The old approach and computing inner products

Most of the time, the hashed-to-memory data will at some point need to be involved in a random linear combination computation. In most situations, this amounts to an inner product between two vectors stored in memory, where at least one of the two vectors is loaded from the advice provider, i.e., un-hashed. To get a feel for this, I will describe it using the example of computing the DEEP queries.

Fix an $x$ in the LDE domain and let

$$ \mathsf{S} := \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)}. $$

Then

$$\begin{aligned} \mathsf{S} &= \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} \right)} + \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \sum_{i=0}^k {\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z)}{x - z} \right)} + \sum_{i=0}^k{\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \frac{1}{x - z}\sum_{i=0}^k \left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z) \right) +\frac{1}{x - z \cdot g} \sum_{i=0}^k\left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g) \right) \\\ &= \frac{1}{x - z} \left(\mathsf{ip}_x - \mathsf{ip}_z \right) +\frac{1}{x - z \cdot g} \left(\mathsf{ip}_x - \mathsf{ip}_{gz} \right) \end{aligned}$$

where

$$\mathsf{ip}_x := \langle\alpha_{\cdot}, T_{\cdot}(x)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(x),$$

$$\mathsf{ip}_z := \langle\alpha_{\cdot}, T_{\cdot}(z)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z),$$

$$\mathsf{ip}_{gz} := \langle\alpha_{\cdot}, T_{\cdot}(z \cdot g)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z \cdot g).$$

Notice the following:

  1. $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
  2. The only terms dependent on $x$ are $\mathsf{ip}_x$, $\frac{1}{x - z}$ and $\frac{1}{x - gz}$.
  3. Computing $\mathsf{ip}_x$, $\mathsf{ip}_z$ or $\mathsf{ip}_{gz}$ amounts to the same as computing an inner product of two vectors.
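The decomposition of $\mathsf{S}$ above can be checked numerically. Here is a sketch over a prime field (the modulus is Miden's base field, standing in for the extension field; all concrete values are arbitrary):

```python
# Numerical check: computing S directly equals computing it via the
# three inner products ip_x, ip_z, ip_gz, as in the derivation above.
P = 2**64 - 2**32 + 1

def inv(a):
    return pow(a, P - 2, P)  # field inverse via Fermat's little theorem

k = 7
T_x  = [(i * 31 + 5) % P for i in range(k + 1)]   # T_i(x)
T_z  = [(i * 17 + 9) % P for i in range(k + 1)]   # T_i(z)
T_gz = [(i * 13 + 2) % P for i in range(k + 1)]   # T_i(z * g)
alpha = [(i * 7 + 3) % P for i in range(k + 1)]
x, z, gz = 12345, 678, 910                        # arbitrary distinct points

# direct evaluation of S
S = sum(a * ((tx - tz) * inv(x - z) + (tx - tgz) * inv(x - gz))
        for a, tx, tz, tgz in zip(alpha, T_x, T_z, T_gz)) % P

# via the three inner products
ip = lambda u, v: sum(a * b for a, b in zip(u, v)) % P
ip_x, ip_z, ip_gz = ip(alpha, T_x), ip(alpha, T_z), ip(alpha, T_gz)
S2 = (inv(x - z) * (ip_x - ip_z) + inv(x - gz) * (ip_x - ip_gz)) % P

assert S == S2
```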

All in all, the most critical part of computing DEEP queries is the computation of quantities of the form $\sum_{i=0}^{n-1} a_i \cdot b_i$, where one of the two vectors may be over either the base or the extension field while the other is always over the extension field. Hence, we will need separate operations to handle the two situations. Note that one of these instructions is similar to, but more general than, the rcomb_base instruction.

Assume the degree of our extension field is $2$. Given the above, we can introduce a new instruction, call it inner_product_base, which helps us compute the inner product as a sum of terms of the form $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3,$$ where the $a_i$ are base field elements sitting on the operand stack and the $b_i$ are extension field elements stored in memory.

The main change needed to make the following proposal work is extending the size of the helper registers from $6$ to $8$ field elements. Assuming this, the effect of inner_product_base can be illustrated with the following:

[figure: first_step_of_lin_comb_base]

Note that $acc = (acc_0, acc_1)$ is an accumulator holding the value of the inner-product sum so far, and $acc'$ is the updated sum that includes $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3.$$

Note also that the pointer to $b$ is updated and that the two upper words of the stack are permuted. Hence, we can compute the next term by immediately calling inner_product_base a second time, with the following resulting effect:

[figure: second_step_of_lin_comb_base]
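A toy Python model of the proposed stack effect may help: the top word holds four base field coefficients $a_i$, memory at b_ptr holds four degree-2 extension elements $b_i$ as interleaved pairs, and the accumulator gathers $\sum a_i \cdot b_i$. The exact stack layout, pointer stride, and word permutation here are assumptions for illustration.

```python
# Toy model of `inner_product_base`: accumulate a0*b0 + ... + a3*b3
# (a_i base field, b_i degree-2 extension field), advance the pointer,
# and swap the two top stack words so a second call consumes a4..a7.
P = 2**64 - 2**32 + 1

def inner_product_base(stack, memory):
    # assumed stack layout: [a0..a3, a4..a7, b_ptr, acc0, acc1, ...]
    a = stack[0:4]
    b_ptr, acc0, acc1 = stack[8], stack[9], stack[10]
    for i in range(4):
        b0, b1 = memory[b_ptr + 2 * i], memory[b_ptr + 2 * i + 1]
        acc0 = (acc0 + a[i] * b0) % P   # first extension coordinate
        acc1 = (acc1 + a[i] * b1) % P   # second extension coordinate
    # permute the two top words and advance past the 8 elements read
    return stack[4:8] + stack[0:4] + [b_ptr + 8, acc0, acc1] + stack[11:]

# two consecutive calls consume a full double word a0..a7
memory = {i: i + 1 for i in range(16)}  # b values as interleaved pairs
stack = [1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
stack = inner_product_base(stack, memory)
stack = inner_product_base(stack, memory)
assert stack[8] == 16                   # pointer advanced twice
assert (stack[9], stack[10]) == (372, 408)
```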

Example

Assume we have a very wide (main) trace, say $2400$ columns wide as in the Keccak case. Then for each of the DEEP queries we will need to compute an inner product between a base field vector and an extension field vector, each of length $2400$. This means that for each query we will have to run the following block:

repeat.300
    adv_pipe hperm
    inner_product_base
    inner_product_base
end

This means that we will only need $1200$ cycles to do most of the work needed to compute the DEEP queries. If we use an estimate of $100$ cycles for the remaining work, then we are looking at $1300$ cycles per query in total. This means that if the number of queries is $27$, then we would need around $35000$ cycles. Note that the two main components of recursive proof verification whose work grows linearly with the width of the trace are constraint evaluation and the computation of the DEEP queries. Hence, given our current implementation of the recursive verifier, the above suggests that we would be able to verify a recursive proof for Keccak in under $64000$ cycles.
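The cycle estimates above follow from simple arithmetic, which can be checked directly (the $4$-instructions-per-iteration count and the $100$-cycle remainder are taken from the text):

```python
# Sanity check of the cycle estimates for the Keccak example.
trace_width = 2400
iters = trace_width // 8           # repeat.300: 8 elements per iteration
assert iters == 300

cycles_per_query = iters * 4       # 4 instructions per iteration
assert cycles_per_query == 1200

total_per_query = cycles_per_query + 100  # plus remaining work
queries = 27
total = queries * total_per_query
assert total == 35100              # "around 35000 cycles"
```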

For the recursive verifier for the Miden VM, substituting the current rcomb_base and rcomb_ext with inner_product_base and inner_product_ext should lead to a slightly faster implementation. If the above is interesting, then it should be relatively easy to have a POC implementation of DEEP queries computation using the new candidate instructions in order to get more concrete numbers.

Why is this feature needed?

This will improve recursive verification in Miden VM.

@Al-Kindi-0 Al-Kindi-0 added the enhancement New feature or request label Dec 13, 2024
@Al-Kindi-0 (Collaborator, Author):

In relation to the use of the above proposal in Falcon, we can use the inner_product_base op in order to evaluate the polynomials $h$, $s_2$, $\pi_1$ and $\pi_2$ at a random point $\tau$.

The way to do so is, again, to compute the powers of $\tau$ and store them in memory. The pointer for this plays the same role as a_ptr above. The difference is that now the $\tau^i$ will be laid out in the more compact form [a_i, b_i, c_i, d_i], where $(a_i, b_i) := \tau^i$ and $(c_i, d_i) := \tau^{i+1}$, instead of [a_i, b_i, 0, 0].
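The compact layout can be sketched as follows. The quadratic extension below uses an arbitrary modulus $u^2 - 7$ as a stand-in; the real extension of the Miden field uses a specific irreducible polynomial.

```python
# Sketch of the compact powers-of-tau layout: each memory word packs
# two consecutive powers of tau (each a degree-2 extension element) as
# [a_i, b_i, c_i, d_i], halving the word count vs [a_i, b_i, 0, 0].
P = 2**64 - 2**32 + 1
R = 7  # assumed non-residue for this toy extension F_p[u]/(u^2 - R)

def ext_mul(x, y):
    (a, b), (c, d) = x, y
    return ((a * c + R * b * d) % P, (a * d + b * c) % P)

def tau_powers_compact(tau, n):
    """Lay out tau^0 .. tau^(n-1), two powers per word."""
    words, cur = [], (1, 0)          # cur = tau^0
    for _ in range(n // 2):
        nxt = ext_mul(cur, tau)
        words.append([cur[0], cur[1], nxt[0], nxt[1]])
        cur = ext_mul(nxt, tau)
    return words

words = tau_powers_compact((3, 1), 8)
assert len(words) == 4               # half as many words as powers
assert words[0] == [1, 0, 3, 1]      # tau^0 = 1, tau^1 = (3, 1)
assert words[1][:2] == [16, 6]       # tau^2 = (3,1)^2 = (9 + 7, 6)
```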

As an example of the changes in the code, the following block

    # Compute h(tau)
    repeat.64
        mem_stream
        repeat.8
            rcomb_base
        end
    end

becomes

    # Compute h(tau)
    repeat.64
        mem_stream
        inner_product_base
        inner_product_base
    end

This would mean an improvement in the cycle count from $576$ to $192$. Since we are evaluating 4 polynomials this means that we can shave off $1500$ cycles from the probabilistic_product procedure.
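The quoted counts follow from the loop structure (one cycle per instruction assumed, as in the estimates above):

```python
# Arithmetic behind the quoted cycle counts for evaluating h(tau).
old = 64 * (1 + 8)   # mem_stream + 8 x rcomb_base per iteration
new = 64 * (1 + 2)   # mem_stream + 2 x inner_product_base per iteration
assert (old, new) == (576, 192)

# four polynomials evaluated: h, s_2, pi_1, pi_2
savings = 4 * (old - new)
assert savings == 1536   # "shave off ~1500 cycles"
```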
