
Faster advice-provider-to-memory data un-hashing #1600

Open
Al-Kindi-0 opened this issue Dec 13, 2024 · 1 comment

Al-Kindi-0 commented Dec 13, 2024

Feature description

We usually use the following pattern for loading data from the advice provider into the VM's memory:

push.DESTINATION_POINTER
padw padw padw
repeat.x
    adv_pipe hperm
end
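To make the data flow concrete, here is a toy Python model of what the loop does: each iteration moves a double word (8 field elements) from the advice tape into memory and absorbs it into the rate portion of a sponge state. The permutation below is a stand-in, not the real RPO permutation, and the element-addressed memory layout is a simplification.

```python
# Toy model of the `adv_pipe hperm` loop. The permutation is a
# placeholder (any fixed function works for modeling the data flow);
# memory is element-addressed here for simplicity.

P = 2**64 - 2**32 + 1  # Miden's base field modulus

def toy_perm(state):
    # stand-in for the RPO permutation
    return [(3 * s + i) % P for i, s in enumerate(state)]

def unhash_to_memory(advice, dst_ptr, memory):
    state = [0] * 12              # capacity (4 elements) + rate (8 elements)
    ptr = dst_ptr
    for chunk_start in range(0, len(advice), 8):
        chunk = advice[chunk_start:chunk_start + 8]
        memory[ptr:ptr + 8] = chunk   # adv_pipe: write double word to memory
        state[4:12] = chunk           # ...and load it into the rate
        state = toy_perm(state)       # hperm
        ptr += 8
    return state[4:8], ptr            # digest word, updated pointer

advice = list(range(16))              # two double words of advice data
memory = [0] * 32
digest, end = unhash_to_memory(advice, 0, memory)
assert memory[:16] == advice and end == 16
```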

This pattern is so fundamental that it might make sense to introduce a native instruction for it, implemented at the chiplet level. The following is meant to start a discussion on how such an instruction could be built.

Modifying the hasher chiplet

From the stack side, the instruction could take the following stack elements as input:

  1. Digest, the claimed commitment to the data about to be un-hashed, i.e., the hash of that data.
  2. dst_ptr, a pointer to the start of the memory region where the data will be stored.
  3. end_ptr, a pointer to the last word of that region.

The stack will then have to send a request to the hasher chiplet in order to initiate a linear hash of the elements. The request could be
(op_label, clk, ctx, dst_ptr) and (op_label, clk, ctx, end_ptr, Digest).

The hasher will then respond with (op_label, clk, ctx, dst_ptr) to signal the start of the requested operation, and at the last step with (op_label, clk, ctx, end_ptr, Digest) to signal its end. Note that there appears to be no way around adding at least three extra columns to the hasher chiplet in order to keep track of clk, ctx and ptr, where clk and ctx remain fixed throughout the operation and ptr starts at dst_ptr and is incremented by 2 at each 8-row cycle until it reaches end_ptr. We might also need one additional column for the new internal hasher instruction, but we might get away without it by redesigning things.
Thus, in total, we would need to increase the number of columns in the hasher chiplet by at least 3.
We now describe what happens between the two hasher responses above. Upon initiating the linear hash operation, the hasher chiplet will send a request to the memory chiplet at each 8-row cycle, of the form (op_label, clk, ctx, addr, Data[..4]) and (op_label, clk, ctx, addr + 1, Data[4..]), where Data[0..8] is the current double word occupying the rate portion of the hasher state, filled in non-deterministically by the prover.
Note that with the current maximal degree of the flags in the hasher chiplet, which is $4$, the chiplets' bus can accommodate the above response and double request even if we have to increase the degrees of the current internal flags.
On the memory chiplet side nothing changes, since we use a single bus to connect all of the VM components. This would change if, for some reason, we decided to isolate memory-related messages onto a separate bus.
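The request/response pairs above can be modeled as multiset bookkeeping on the chiplets' bus: the bus balances exactly when every request is met by a matching response. The sketch below illustrates this accounting; message labels, tuple layouts, and concrete values are made up for illustration, not the real encoding.

```python
# Sketch of chiplets'-bus bookkeeping for the proposed instruction,
# modeled as a multiset of messages: requests add, responses cancel,
# and a valid execution leaves the bus balanced (all counts zero).
from collections import Counter

bus = Counter()

def request(msg): bus[msg] += 1
def respond(msg): bus[msg] -= 1

clk, ctx = 100, 0
dst_ptr, end_ptr = 40, 46          # word-addressed region, 4 double words
digest = (1, 2, 3, 4)              # claimed commitment (illustrative)

# stack side: initiate the linear hash
request(("HASH_INIT", clk, ctx, dst_ptr))
request(("HASH_DONE", clk, ctx, end_ptr, digest))

# hasher side: acknowledge start and end of the operation
respond(("HASH_INIT", clk, ctx, dst_ptr))
respond(("HASH_DONE", clk, ctx, end_ptr, digest))

# hasher -> memory: one double-word write per 8-row cycle,
# with ptr advancing by 2 words until it reaches end_ptr
for ptr in range(dst_ptr, end_ptr + 2, 2):
    data = tuple(range(8))         # rate contents, supplied by the prover
    for addr, word in ((ptr, data[:4]), (ptr + 1, data[4:])):
        request(("MEM_WRITE", clk, ctx, addr, word))
        respond(("MEM_WRITE", clk, ctx, addr, word))  # memory chiplet answers

assert all(v == 0 for v in bus.values())  # bus balances
```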

The old approach and computing inner products

Most of the time, the hashed-to-memory data will at some point need to be involved in a random linear combination computation. In most situations, this amounts to an inner product between two vectors stored in memory, where at least one of the two vectors is loaded from the advice provider, i.e., un-hashed. To get a feel for this, I will describe it using the example of computing the DEEP queries.

Fix an $x$ in the LDE domain and let

$$ \mathsf{S} := \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)}. $$

Then

$$\begin{aligned} \mathsf{S} &= \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} \right)} + \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \sum_{i=0}^k {\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z)}{x - z} \right)} + \sum_{i=0}^k{\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g)}{x - z \cdot g} \right)} \\\ &= \frac{1}{x - z}\sum_{i=0}^k \left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z) \right) +\frac{1}{x - z \cdot g} \sum_{i=0}^k\left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g) \right) \\\ &= \frac{1}{x - z} \left(\mathsf{ip}_x - \mathsf{ip}_z \right) +\frac{1}{x - z \cdot g} \left(\mathsf{ip}_x - \mathsf{ip}_{gz} \right) \end{aligned}$$

where

$$\mathsf{ip}_x := \langle\alpha_{\cdot}, T_{\cdot}(x)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(x),$$

$$\mathsf{ip}_z := \langle\alpha_{\cdot}, T_{\cdot}(z)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z),$$

$$\mathsf{ip}_{gz} := \langle\alpha_{\cdot}, T_{\cdot}(z \cdot g)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z \cdot g).$$

Notice the following:

  1. $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
  2. The only terms dependent on $x$ are $\mathsf{ip}_x$, $\frac{1}{x - z}$ and $\frac{1}{x - gz}$.
  3. Computing $\mathsf{ip}_x$, $\mathsf{ip}_z$ or $\mathsf{ip}_{gz}$ amounts to the same as computing an inner product of two vectors.
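The decomposition of $\mathsf{S}$ above can be checked numerically. Here is a sketch over a prime field (the modulus is Miden's base field, standing in for the extension field; all concrete values are arbitrary):

```python
# Numerical check: computing S directly equals computing it via the
# three inner products ip_x, ip_z, ip_gz, as in the derivation above.
P = 2**64 - 2**32 + 1

def inv(a):
    return pow(a, P - 2, P)  # field inverse via Fermat's little theorem

k = 7
T_x  = [(i * 31 + 5) % P for i in range(k + 1)]   # T_i(x)
T_z  = [(i * 17 + 9) % P for i in range(k + 1)]   # T_i(z)
T_gz = [(i * 13 + 2) % P for i in range(k + 1)]   # T_i(z * g)
alpha = [(i * 7 + 3) % P for i in range(k + 1)]
x, z, gz = 12345, 678, 910                        # arbitrary distinct points

# direct evaluation of S
S = sum(a * ((tx - tz) * inv(x - z) + (tx - tgz) * inv(x - gz))
        for a, tx, tz, tgz in zip(alpha, T_x, T_z, T_gz)) % P

# via the three inner products
ip = lambda u, v: sum(a * b for a, b in zip(u, v)) % P
ip_x, ip_z, ip_gz = ip(alpha, T_x), ip(alpha, T_z), ip(alpha, T_gz)
S2 = (inv(x - z) * (ip_x - ip_z) + inv(x - gz) * (ip_x - ip_gz)) % P

assert S == S2
```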

All in all, the most critical part of computing DEEP queries is the computation of quantities of the form $\sum_{i=0}^{n-1} a_i \cdot b_i$, where one of the two vectors may be over either the base or the extension field while the other is always over the extension field. Hence, we will need separate operations to handle the two situations. Note that one of these instructions is similar to, but more general than, the rcomb_base instruction.

Assume the degree of our extension field is $2$. Given the above, we can introduce a new instruction, call it inner_product_base, which helps us compute the inner product as a sum of terms of the form $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3,$$ where the $a_i$ are base field elements sitting on the operand stack and the $b_i$ are extension field elements stored in memory.

The main change needed to make the following proposal work is extending the size of the helper registers from $6$ to $8$ field elements. Assuming this, the effect of inner_product_base can be illustrated with the following:

[figure: first_step_of_lin_comb_base]

Note that $acc = (acc_0, acc_1)$ is an accumulator holding the value of the inner-product sum so far, and $acc'$ is the updated sum that includes $$a_0 \cdot b_0 + \cdots + a_3 \cdot b_3.$$

Note also that the pointer to $b$ is updated and that the two upper words of the stack are permuted. Hence, we can compute the next term by immediately calling inner_product_base a second time, with the following resulting effect:

[figure: second_step_of_lin_comb_base]
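A toy Python model of the proposed stack effect may help: the top word holds four base field coefficients $a_i$, memory at b_ptr holds four degree-2 extension elements $b_i$ as interleaved pairs, and the accumulator gathers $\sum a_i \cdot b_i$. The exact stack layout, pointer stride, and word permutation here are assumptions for illustration.

```python
# Toy model of `inner_product_base`: accumulate a0*b0 + ... + a3*b3
# (a_i base field, b_i degree-2 extension field), advance the pointer,
# and swap the two top stack words so a second call consumes a4..a7.
P = 2**64 - 2**32 + 1

def inner_product_base(stack, memory):
    # assumed stack layout: [a0..a3, a4..a7, b_ptr, acc0, acc1, ...]
    a = stack[0:4]
    b_ptr, acc0, acc1 = stack[8], stack[9], stack[10]
    for i in range(4):
        b0, b1 = memory[b_ptr + 2 * i], memory[b_ptr + 2 * i + 1]
        acc0 = (acc0 + a[i] * b0) % P   # first extension coordinate
        acc1 = (acc1 + a[i] * b1) % P   # second extension coordinate
    # permute the two top words and advance past the 8 elements read
    return stack[4:8] + stack[0:4] + [b_ptr + 8, acc0, acc1] + stack[11:]

# two consecutive calls consume a full double word a0..a7
memory = {i: i + 1 for i in range(16)}  # b values as interleaved pairs
stack = [1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
stack = inner_product_base(stack, memory)
stack = inner_product_base(stack, memory)
assert stack[8] == 16                   # pointer advanced twice
assert (stack[9], stack[10]) == (372, 408)
```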

Example

Assume we have a very wide (main) trace, say $2400$ columns wide as in the Keccak case. Then for each of the DEEP queries we will need to compute an inner product between a base field vector and an extension field vector, each of length $2400$. This means that for each query we will have to run the following block:

repeat.300
    adv_pipe hperm
    inner_product_base
    inner_product_base
end

This means that we will only need $1200$ cycles to do most of the work needed to compute the DEEP queries. If we use an estimate of $100$ cycles for the remaining work, then we are looking at $1300$ cycles per query in total. This means that if the number of queries is $27$, then we would need around $35000$ cycles. Note that the two main components of recursive proof verification whose work grows linearly with the width of the trace are constraint evaluation and the computation of the DEEP queries. Hence, given our current implementation of the recursive verifier, the above suggests that we would be able to verify a recursive proof for Keccak in under $64000$ cycles.
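The cycle estimates above follow from simple arithmetic, which can be checked directly (the $4$-instructions-per-iteration count and the $100$-cycle remainder are taken from the text):

```python
# Sanity check of the cycle estimates for the Keccak example.
trace_width = 2400
iters = trace_width // 8           # repeat.300: 8 elements per iteration
assert iters == 300

cycles_per_query = iters * 4       # 4 instructions per iteration
assert cycles_per_query == 1200

total_per_query = cycles_per_query + 100  # plus remaining work
queries = 27
total = queries * total_per_query
assert total == 35100              # "around 35000 cycles"
```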

For the recursive verifier for the Miden VM, substituting the current rcomb_base and rcomb_ext with inner_product_base and inner_product_ext should lead to a slightly faster implementation. If the above is interesting, then it should be relatively easy to have a POC implementation of DEEP queries computation using the new candidate instructions in order to get more concrete numbers.

Why is this feature needed?

This will improve recursive verification in Miden VM.

@Al-Kindi-0 Al-Kindi-0 added the enhancement New feature or request label Dec 13, 2024
@Al-Kindi-0 (Collaborator, Author):

In relation to the use of the above proposal in Falcon, we can use the inner_product_base op in order to evaluate the polynomials $h$, $s_2$, $\pi_1$ and $\pi_2$ at a random point $\tau$.

The way to do so is, again, to compute the powers of $\tau$ and store them in memory. The pointer for this plays the same role as a_ptr above. The difference is that now the $\tau^i$ will be laid out in the more compact form [a_i, b_i, c_i, d_i], where $(a_i, b_i) := \tau^i$ and $(c_i, d_i) := \tau^{i+1}$, instead of [a_i, b_i, 0, 0].
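The compact layout can be sketched as follows. The quadratic extension below uses an arbitrary modulus $u^2 - 7$ as a stand-in; the real extension of the Miden field uses a specific irreducible polynomial.

```python
# Sketch of the compact powers-of-tau layout: each memory word packs
# two consecutive powers of tau (each a degree-2 extension element) as
# [a_i, b_i, c_i, d_i], halving the word count vs [a_i, b_i, 0, 0].
P = 2**64 - 2**32 + 1
R = 7  # assumed non-residue for this toy extension F_p[u]/(u^2 - R)

def ext_mul(x, y):
    (a, b), (c, d) = x, y
    return ((a * c + R * b * d) % P, (a * d + b * c) % P)

def tau_powers_compact(tau, n):
    """Lay out tau^0 .. tau^(n-1), two powers per word."""
    words, cur = [], (1, 0)          # cur = tau^0
    for _ in range(n // 2):
        nxt = ext_mul(cur, tau)
        words.append([cur[0], cur[1], nxt[0], nxt[1]])
        cur = ext_mul(nxt, tau)
    return words

words = tau_powers_compact((3, 1), 8)
assert len(words) == 4               # half as many words as powers
assert words[0] == [1, 0, 3, 1]      # tau^0 = 1, tau^1 = (3, 1)
assert words[1][:2] == [16, 6]       # tau^2 = (3,1)^2 = (9 + 7, 6)
```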

As an example of the changes in the code, the following block

    # Compute h(tau)
    repeat.64
        mem_stream
        repeat.8
            rcomb_base
        end
    end

becomes

    # Compute h(tau)
    repeat.64
        mem_stream
        inner_product_base
        inner_product_base
    end

This would mean an improvement in the cycle count from $576$ to $192$. Since we are evaluating 4 polynomials this means that we can shave off $1500$ cycles from the probabilistic_product procedure.
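The quoted counts follow from the loop structure (one cycle per instruction assumed, as in the estimates above):

```python
# Arithmetic behind the quoted cycle counts for evaluating h(tau).
old = 64 * (1 + 8)   # mem_stream + 8 x rcomb_base per iteration
new = 64 * (1 + 2)   # mem_stream + 2 x inner_product_base per iteration
assert (old, new) == (576, 192)

# four polynomials evaluated: h, s_2, pi_1, pi_2
savings = 4 * (old - new)
assert savings == 1536   # "shave off ~1500 cycles"
```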
