PhiForCausalLM does not support Flash Attention 2.0 #28381

Closed
gmittal opened this issue Jan 8, 2024 · 13 comments

Labels: Feature request (Request for a new feature)

@gmittal commented Jan 8, 2024

import torch
from transformers import AutoModelForCausalLM, AutoModel

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

Throws:

ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new
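For context, this ValueError comes from a per-model-class support flag that Transformers checks before enabling Flash Attention 2; a minimal way to inspect it (a sketch, assuming a 4.36/4.37-era version that already ships the native Phi implementation, and relying on the private _supports_flash_attn_2 attribute) is:

# Aside, not from the original report: with trust_remote_code=True the hub's own
# PhiForCausalLM class is instantiated instead of the in-library one, and the check
# below is evaluated against whichever class actually gets loaded.
from transformers.models.phi.modeling_phi import PhiForCausalLM

# True once native Flash Attention 2 support for Phi has landed in the installed version.
print(PhiForCausalLM._supports_flash_attn_2)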
ArthurZucker added the Feature request label on Jan 8, 2024
@rootonchair (Contributor)

Hi, I would like to work on this issue

@NielsRogge (Contributor)

Support for Phi-2 is still a work in progress; you can follow the progress here: #28163

@susnato (Contributor) commented Jan 8, 2024

Hi @gmittal, Flash Attention is already implemented for Phi in the library (see the PR).

It seems that you are using the hub version of phi-2. Please load it through the library implementation to properly enable Flash Attention.
For now, microsoft/phi-2 does not store the weights in the order expected by the library model, so please use susnato/phi-2 instead.

First, update to the latest transformers version:

pip install -U transformers

then run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "susnato/phi-2",
    use_flash_attention_2=True,
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("susnato/phi-2")

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Let me know if this works or not.

@nakranivaibhav (Contributor)

I would like to work on this issue

@NicolasMejiaPetit

Using the HF alignment notebook, the DPO script gives me this error regardless of the transformers version (I already force-updated with pip). When I remove Flash Attention from the YAML it works (after a bit of code adjustment). The strange part is that I am able to fine-tune with one of my SFT scripts that uses Flash Attention.

@gugarosa (Contributor)

Hello everyone!

This should be fixed in transformers 4.37.0.dev. If you are not using that version, please make sure that trust_remote_code=True when loading the model, and it should work out of the box with Flash Attention 2.
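In code, that suggestion amounts to one of the following two loading paths (a sketch; the dtype is just the one used earlier in the thread):

import torch
from transformers import AutoModelForCausalLM

# On transformers >= 4.37.0.dev, the native Phi implementation is used and supports FA2:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    attn_implementation="flash_attention_2",  # newer equivalent of use_flash_attention_2=True
    torch_dtype=torch.bfloat16,
)

# On older versions, keep trust_remote_code=True so the hub implementation is used instead:
# model = AutoModelForCausalLM.from_pretrained(
#     "microsoft/phi-2",
#     trust_remote_code=True,
#     torch_dtype=torch.bfloat16,
# )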

@NielsRogge (Contributor)

Thanks! Closing as this was fixed in #28163

@NicolasMejiaPetit

I installed from source, so I am now on transformers 4.37.0.dev0, and I am still getting the incompatibility error, even with trust_remote_code set to True.

C:\Users\PC\Documents\Code-Trainer\FineTune>py FINETUNERphiFP16.py --model_name_or_path C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 --data_path MiniCoderW.json --output_dir C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 1000 --save_total_limit 10 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 10 --lr_scheduler_type "cosine" --report_to "tensorboard" --bf16 False --dataloader_num_workers 12 --optim paged_adamw_8bit
WARNING:tensorflow:From C:\Python311\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

====================================================================================================
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=12,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi\runs\Jan12_23-36-31_Nicolas,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=1024,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.PAGED_ADAMW_8BIT,
optim_args=None,
output_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=10,
weight_decay=0.0,
)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
PAD Token: <|endoftext|> 50256
BOS Token <|endoftext|> 50256
EOS Token <|im_end|> 50295
Load tokenizer from C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 over.
Traceback (most recent call last):
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in
train()
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3497, in from_pretrained
config = cls._autoset_attn_implementation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1340, in _autoset_attn_implementation
cls._check_and_enable_flash_attn_2(
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1420, in _check_and_enable_flash_attn_2
raise ValueError(
ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Here is the script I am using:

import copy
import random
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence

import torch
import transformers
from transformers import Trainer
from datasets import load_dataset

IGNORE_INDEX = -100
EOT_TOKEN = "<|EOT|>"

def build_instruction_prompt(instruction: str):
    return '''
You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
{}
### Response:
'''.format(instruction.strip()).lstrip()

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="deepseek-ai/deepseek-coder-6.7b-instruct")

@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )

def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]

    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]

    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]

    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)

@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = [torch.tensor(x) for x in input_ids]
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = [torch.tensor(x) for x in labels]
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)

        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

def train_tokenize_function(examples, tokenizer):
    sources = [
        build_instruction_prompt(instruction)
        for instruction in examples['instruction']
    ]
    targets = [f"{output}\n{EOT_TOKEN}" for output in examples['output']]
    data_dict = preprocess(sources, targets, tokenizer)
    return data_dict

def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if training_args.local_rank == 0:
        print('=' * 100)
        print(training_args)

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=True,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    print("PAD Token:", tokenizer.pad_token, tokenizer.pad_token_id)
    print("BOS Token", tokenizer.bos_token, tokenizer.bos_token_id)
    print("EOS Token", tokenizer.eos_token, tokenizer.eos_token_id)

    if training_args.local_rank == 0:
        print("Load tokenizer from {} over.".format(model_args.model_name_or_path))

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
    )

    if training_args.local_rank == 0:
        print("Load model from {} over.".format(model_args.model_name_or_path))

    raw_train_datasets = load_dataset(
        'json',
        data_files=data_args.data_path,
        split="train",
        cache_dir=training_args.cache_dir
    )

    train_dataset = raw_train_datasets.map(
        train_tokenize_function,
        batched=True,
        batch_size=3000,
        num_proc=32,
        remove_columns=raw_train_datasets.column_names,
        load_from_cache_file=True,  # not args.overwrite_cache
        desc="Running Encoding",
        fn_kwargs={"tokenizer": tokenizer}
    )

    if training_args.local_rank == 0:
        print("Training dataset samples:", len(train_dataset))
        for index in random.sample(range(len(train_dataset)), 3):
            print(f"Sample {index} of the training set: {train_dataset[index]['input_ids']}, {train_dataset[index]['labels']}.")
            print(f"Sample {index} of the training set: {tokenizer.decode(list(train_dataset[index]['input_ids']))}.")

    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    data_module = dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    trainer.train()
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)


if __name__ == "__main__":
    train()

@NielsRogge (Contributor) commented Jan 13, 2024

Hi @NickWithBotronics, if you set trust_remote_code=True, then the code from the hub is used (in the case of microsoft/phi-2, that's defined here), rather than the modeling_phi.py defined natively in the Transformers library.

Hence it's recommended to convert the weights from the microsoft/phi-2 repo to the native format, which will work with Flash Attention 2. One can leverage the conversion script for that.

@ArthurZucker should we host the converted phi-2 weights as part of the Microsoft organization? Because currently one will get a lot of mismatched keys when doing the following:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
)

due to the model in Transformers using a single matrix for queries, keys, and values, whereas the code on the hub uses separate matrices.
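For illustration (a sketch only; the shapes use the phi-2 hidden size but the variable names are hypothetical, not actual checkpoint keys): whichever layout a checkpoint uses, converting between a fused QKV projection and separate q/k/v projections is just a split or a concatenation along the output dimension, and loading one layout into the other without converting is what produces the mismatched keys.

import torch

hidden_size = 2560  # phi-2 hidden size

# Hypothetical fused projection weight, stacked as [Wq; Wk; Wv].
w_qkv = torch.randn(3 * hidden_size, hidden_size)

# Fused -> separate: split along the output dimension.
w_q, w_k, w_v = torch.split(w_qkv, hidden_size, dim=0)

# Separate -> fused: concatenate back in the same order.
w_qkv_roundtrip = torch.cat([w_q, w_k, w_v], dim=0)
assert torch.equal(w_qkv, w_qkv_roundtrip)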

@NicolasMejiaPetit commented Jan 14, 2024

Thank you <3!!!! That fixed that error (using the new modeling.py and the converted HF format); now onto a new error that I think is due to my script. :(
C:\Python311\Lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Traceback (most recent call last):
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in <module>
train()
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3503, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 967, in __init__
self.model = PhiModel(config)
^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in __init__
[PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in <listcomp>
[PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 629, in __init__
self.self_attn = PHI_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 412, in __init__
super().__init__(*args, **kwargs)
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 245, in __init__
self.attention_dropout = config.attention_dropout
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\configuration_utils.py", line 265, in __getattribute__
return super().__getattribute__(key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PhiConfig' object has no attribute 'attention_dropout'

C:\Users\PC\Documents\Code-Trainer\FineTune>

Edit: fixed it by downloading the latest generation_config.json, config.json, configuration_phi.py, and modeling_phi.py.

@NicolasMejiaPetit commented Jan 14, 2024

While I got it working, the training loss was very off. It started at 6 and went to 2 (after 3 epochs), but when I used the old config without Flash Attention it went from 0.6 to ~0.29 (also 3 epochs) with the same dataset, same setup, and same model; just different config files and Flash Attention. I saw someone else report the same thing on Twitter.
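One way to narrow this kind of discrepancy down (a sketch, not from the thread; the checkpoint, dtype, and prompt are placeholders) is to load the same checkpoint twice with different attention implementations and compare the loss on a single identical batch before any training:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "microsoft/phi-2"  # placeholder; substitute the locally converted checkpoint used above
tok = AutoTokenizer.from_pretrained(ckpt)
batch = tok(["def add(a, b):\n    return a + b"], return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

for impl in ("eager", "flash_attention_2"):
    # Flash Attention 2 requires a CUDA device and fp16/bf16 weights.
    model = AutoModelForCausalLM.from_pretrained(
        ckpt, torch_dtype=torch.float16, attn_implementation=impl
    ).to("cuda")
    model.eval()
    with torch.no_grad():
        loss = model(**{k: v.to("cuda") for k, v in batch.items()}).loss
    print(impl, loss.item())

If the two losses already differ on an untouched batch, the gap usually points at masking or padding handling rather than the optimizer or the data.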

@ArthurZucker (Collaborator)

Can you open a separate issue for this, with a reproducible snippet?

@NicolasMejiaPetit

Gotcha, I’ll move to this ticket #28488
