Flash Attention vs Triton Flash Attention #180

germanjke · 2023-05-20T10:19:25Z

Hi, i want to know about choice in your MPT model

Yes, Triton version supports alibi and fast forward, but it's have some disadvantages:

slow backward;
slower forward + bacward;
no dropout;
no different batch seqs;
implementation is really experimental: works with only some on num_heads, another bugs (i know we can using some custom implementation but i'm afraid it's have some bugs as well)

Do you think alibi choice is so important in this case?

It's looks like some trade off

vchiley · 2023-05-20T14:57:16Z

Without delving into the implementation details, using ALiBi as the networks position embedding is beneficial regardless of how exactly it is supported.

Relative position embedding

The default positional embedding is learned positional embeddings. The issue with learned positional embeddings is that inference max seq len is limited to the training max seq len.
ALiBi, being a relative position embedding, allows the user to extend the seq len for inference.

Source: ALiBi paper

Convergence

In my experience, ALiBi also has faster convergence to other position embedding schemes we tried.

Implementation details ie Triton

This is the flash attn implementation we use if attn_config: attn_impl: triton (note: the custom implementation you cite is the bases of the version we use).
Although the comments are copied from here, within a network, I've found that the triton version has

"~~slower~~ faster forward + backward"
it doesn't matter if it doesn't have dropout because we don't use it
no different batch seqs is fine since we concatenate sequences
the cited limitations on n_heads is a limitation on the CUDA version of FlashAttn. The triton implementation we use seems not to have any limitation (besides head_dim <= 128; they do note that it hasn't been thoroughly tested at all dim sizes)

Given that triton has faster forward + backward, I'd advocate for using the triton version. The fact that it supports ALiBi is a bonus (a very welcome bonus); I wouldn't necessarily call it a tradeoff.

germanjke · 2023-05-21T14:38:41Z

great stuff, if it's faster it's cool, but here they told about slower computation, I am not wrong?

germanjke · 2023-05-21T14:39:17Z

by the way can you tell me about setpu.py please, why you use torch 1.13.1 there, if we want to use torch 2?

germanjke · 2023-05-21T14:46:08Z

why you turn off this from setup.py 'triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python', it's was on your torch2 version, but now master without this: PyPI don't have direct packages for real
i mean this pull

but in main branch i guess its ok and its triton pre mlir

germanjke · 2023-05-21T22:19:33Z

upd: everything is fine with current main version

here is versions (just to information when branch will be updates):

torch==1.13.1+cu117 triton-pre-mlir @ git+https://github.com/vchiley/triton.git@2dd3b957698a39bbca615c02a447a98482c144a3#subdirectory=python`
flash-attn==v1.0.3.post0

installed everything from setup.py(gpu setup) with this docker mosaicml/pytorch:latest

vchiley · 2023-05-22T15:48:33Z

#181
undid #178
which we'll try to redo soon.

(the PR was fine, but Huggingface didn't like it so we're working on a workaround for HF stuff)

vchiley · 2023-05-26T17:24:16Z

torch2 reintegrated

vchiley self-assigned this May 20, 2023

vchiley closed this as completed May 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flash Attention vs Triton Flash Attention #180

Flash Attention vs Triton Flash Attention #180

germanjke commented May 20, 2023 •

edited

Loading

vchiley commented May 20, 2023

germanjke commented May 21, 2023

germanjke commented May 21, 2023

germanjke commented May 21, 2023 •

edited

Loading

germanjke commented May 21, 2023

vchiley commented May 22, 2023

vchiley commented May 26, 2023

Flash Attention vs Triton Flash Attention #180

Flash Attention vs Triton Flash Attention #180

Comments

germanjke commented May 20, 2023 • edited Loading

vchiley commented May 20, 2023

Relative position embedding

Convergence

Implementation details ie Triton

germanjke commented May 21, 2023

germanjke commented May 21, 2023

germanjke commented May 21, 2023 • edited Loading

germanjke commented May 21, 2023

vchiley commented May 22, 2023

vchiley commented May 26, 2023

germanjke commented May 20, 2023 •

edited

Loading

germanjke commented May 21, 2023 •

edited

Loading