Mamba2 torch_forward reduction dimension possibly incorrect? #34817

HanGuo97 opened this issue Nov 19, 2024 · 5 comments · May be fixed by #34901

@HanGuo97

System Info

NA

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

NA

Expected behavior

In the torch_forward path of Mamba2, it seems like the reduction dimension should be dim=3 instead of dim=2:

result = (decay_chunk[..., None, None] * states_permuted[:, :, None, ...]).sum(dim=2)

With dim=3, the output more or less matches that of Mamba-2's ssd_minimal implementation; with dim=2 it does not.
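A minimal, self-contained sketch of the proposed fix (the sizes and shape layout here are assumptions for illustration, mirroring the variable names above):

```python
import torch

# Illustrative sizes; z/c are the numbers of target/source chunks.
b, h, z, c, p, n = 2, 4, 3, 3, 8, 16
decay_chunk = torch.randn(b, h, z, c)         # (batch, heads, target_chunks, source_chunks)
states_permuted = torch.randn(b, h, c, p, n)  # (batch, heads, source_chunks, head_dim, state_dim)

# The broadcasted product has shape (b, h, z, c, p, n); the contraction should
# run over the source-chunk axis c, which sits at dim=3 -- not over z at dim=2.
result = (decay_chunk[..., None, None] * states_permuted[:, :, None, ...]).sum(dim=3)
```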

HanGuo97 added the bug label Nov 19, 2024
@vasqu
Contributor

vasqu commented Nov 19, 2024

Yes, that seems correct. Good spot! I have a fairly extended ramble on this below (ignore it if it's too much :) cc @molbap

We can also see it from the einsum notation in the ssd_minimal script:
bhzc,bchpn->bzhpn

Since we don't use einsum notation here, I'll write out the dimension labels before the sum, marking broadcast (via None) dimensions as a simple 1:
decay chunk: bhzc11
permuted states: bh1cpn
The multiplication before the sum therefore gives bhzcpn, and since we want shape bzhpn we need to sum along c (on dim=3) and permute afterwards.
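A minimal numerical check of this reasoning (shapes are assumptions; z != c is chosen deliberately so a wrong reduction axis cannot hide):

```python
import torch

b, h, z, c, p, n = 2, 4, 3, 5, 8, 16
decay_chunk = torch.randn(b, h, z, c)
states = torch.randn(b, c, h, p, n)

# Reference contraction from ssd_minimal.
ref = torch.einsum("bhzc,bchpn->bzhpn", decay_chunk, states)

# Broadcast-multiply to bhzcpn, sum over c (dim=3), then permute back to bzhpn.
states_permuted = states.permute(0, 2, 1, 3, 4)  # (b, h, c, p, n)
out = (decay_chunk[..., None, None] * states_permuted[:, :, None, ...]).sum(dim=3)
out = out.permute(0, 2, 1, 3, 4)                 # (b, z, h, p, n)

torch.testing.assert_close(ref, out)
```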

Just a quick idea: I'm not sure we even have to permute twice; we could instead permute only the decay chunks (not checked):
states: bc1hpn
permuted decay chunks: bczh11
This results in bczhpn and finally bzhpn (after summing on dim=1), so we avoid the double permutation and only do it once.

@vasqu
Contributor

vasqu commented Nov 19, 2024

I'm a bit surprised that the operations following that one don't fail. Have you tested your fixed version on a forward pass?

@HanGuo97
Author

As far as I remember, the following operations won't fail because the reduction was over the number of (source) chunks even though it should be over the number of (target) chunks. During training, these two dimensions have the same size.
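To illustrate why the shapes alone don't catch this (sizes assumed): when z == c, summing the broadcasted product over dim=2 or dim=3 yields the same output shape, just different values:

```python
import torch

b, h, z, c, p, n = 2, 4, 3, 3, 8, 16  # z == c, as during training
prod = torch.randn(b, h, z, c, p, n)  # stand-in for the broadcasted product

wrong = prod.sum(dim=2)  # reduces over target chunks z
right = prod.sum(dim=3)  # reduces over source chunks c
print(wrong.shape, right.shape)      # both torch.Size([2, 4, 3, 8, 16])
print(torch.allclose(wrong, right))  # False: same shape, different values
```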

@vasqu
Contributor

vasqu commented Nov 20, 2024

It's been a while, but yeah, that makes sense. Thanks for clarifying!

@vasqu
Contributor

vasqu commented Nov 20, 2024

A tad late, but I've now verified it myself, based on my test and by modifying the respective local ssd_minimal:

  • Either we need to sum on dim=3
  • Or use transposed decays, as I suggested above (one permutation less):
decay_chunk = decay_chunk.transpose(1, 3)
new_states = (decay_chunk[..., None, None] * states[:, :, None, ...]).sum(dim=1)
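
For completeness, a quick sketch (shapes assumed) checking that both fixes agree:

```python
import torch

b, h, z, c, p, n = 2, 4, 3, 5, 8, 16
decay_chunk = torch.randn(b, h, z, c)
states = torch.randn(b, c, h, p, n)

# Fix 1: keep the original permutation and sum over dim=3 (the c axis).
states_permuted = states.permute(0, 2, 1, 3, 4)
fix1 = (decay_chunk[..., None, None] * states_permuted[:, :, None, ...]).sum(dim=3)
fix1 = fix1.permute(0, 2, 1, 3, 4)  # (b, z, h, p, n)

# Fix 2: transpose the decays instead and sum over dim=1 (again the c axis).
decay_t = decay_chunk.transpose(1, 3)  # (b, c, z, h)
fix2 = (decay_t[..., None, None] * states[:, :, None, ...]).sum(dim=1)

torch.testing.assert_close(fix1, fix2)
```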
