PhiForCausalLM does not support Flash Attention 2.0 #28381

Closed
gmittal opened this issue Jan 8, 2024 · 13 comments

Labels: Feature request (Request for a new feature)

@gmittal commented Jan 8, 2024

import torch
from transformers import AutoModelForCausalLM, AutoModel

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

Throws:

ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new
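For context, this ValueError comes from a per-model-class support flag that Transformers checks before enabling Flash Attention 2; a minimal way to inspect it (a sketch, assuming a 4.36/4.37-era version that already ships the native Phi implementation, and relying on the private _supports_flash_attn_2 attribute) is:

# Aside, not from the original report: with trust_remote_code=True the hub's own
# PhiForCausalLM class is instantiated instead of the in-library one, and the check
# below is evaluated against whichever class actually gets loaded.
from transformers.models.phi.modeling_phi import PhiForCausalLM

# True once native Flash Attention 2 support for Phi has landed in the installed version.
print(PhiForCausalLM._supports_flash_attn_2)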
ArthurZucker added the Feature request label on Jan 8, 2024
@rootonchair (Contributor)

Hi, I would like to work on this issue

@NielsRogge (Contributor)

Support for Phi-2 is still a work in progress; you can follow the progress here: #28163

@susnato (Contributor) commented Jan 8, 2024

Hi @gmittal, Flash Attention is already implemented for Phi in the library (see the PR).

It seems that you are using the hub version of phi-2. Please load it through the library implementation to properly enable Flash Attention.
For now, microsoft/phi-2 does not store the weights in the order expected by the library model, so please use susnato/phi-2 instead.

First, update to the latest transformers version:

pip install -U transformers

then run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "susnato/phi-2",
    use_flash_attention_2=True,
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("susnato/phi-2")

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Let me know if this works or not.

@nakranivaibhav (Contributor)

I would like to work on this issue

@NicolasMejiaPetit

Using the HF alignment notebook, the DPO script gives me this error regardless of the transformers version (I already force-updated with pip). When I remove Flash Attention from the YAML it works (after a bit of code adjustment). The strange part is that I am able to fine-tune with one of my SFT scripts that uses Flash Attention.

@gugarosa (Contributor)

Hello everyone!

This should be fixed in transformers 4.37.0.dev. If you are not using that version, please make sure that trust_remote_code=True when loading the model, and it should work out of the box with Flash Attention 2.
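In code, that suggestion amounts to one of the following two loading paths (a sketch; the dtype is just the one used earlier in the thread):

import torch
from transformers import AutoModelForCausalLM

# On transformers >= 4.37.0.dev, the native Phi implementation is used and supports FA2:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    attn_implementation="flash_attention_2",  # newer equivalent of use_flash_attention_2=True
    torch_dtype=torch.bfloat16,
)

# On older versions, keep trust_remote_code=True so the hub implementation is used instead:
# model = AutoModelForCausalLM.from_pretrained(
#     "microsoft/phi-2",
#     trust_remote_code=True,
#     torch_dtype=torch.bfloat16,
# )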

@NielsRogge (Contributor)

Thanks! Closing as this was fixed in #28163

@NicolasMejiaPetit

I installed from source, so I am now on transformers 4.37.0.dev0, and I am still getting the incompatibility error, even with trust_remote_code set to True.

C:\Users\PC\Documents\Code-Trainer\FineTune>py FINETUNERphiFP16.py --model_name_or_path C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 --data_path MiniCoderW.json --output_dir C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 1000 --save_total_limit 10 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 10 --lr_scheduler_type "cosine" --report_to "tensorboard" --bf16 False --dataloader_num_workers 12 --optim paged_adamw_8bit
WARNING:tensorflow:From C:\Python311\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

====================================================================================================
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=12,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi\runs\Jan12_23-36-31_Nicolas,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=1024,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.PAGED_ADAMW_8BIT,
optim_args=None,
output_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=10,
weight_decay=0.0,
)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
PAD Token: <|endoftext|> 50256
BOS Token <|endoftext|> 50256
EOS Token <|im_end|> 50295
Load tokenizer from C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 over.
Traceback (most recent call last):
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in
train()
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3497, in from_pretrained
config = cls._autoset_attn_implementation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1340, in _autoset_attn_implementation
cls._check_and_enable_flash_attn_2(
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1420, in _check_and_enable_flash_attn_2
raise ValueError(
ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Here is the script I am using:

import copy
import random
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence

import torch
import transformers
from transformers import Trainer
from datasets import load_dataset

IGNORE_INDEX = -100
EOT_TOKEN = "<|EOT|>"

def build_instruction_prompt(instruction: str):
    return '''
You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
{}
### Response:
'''.format(instruction.strip()).lstrip()

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="deepseek-ai/deepseek-coder-6.7b-instruct")

@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )

def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]

    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]

    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]

    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)

@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = [torch.tensor(x) for x in input_ids]
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = [torch.tensor(x) for x in labels]
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)

        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

def train_tokenize_function(examples, tokenizer):
    sources = [
        build_instruction_prompt(instruction)
        for instruction in examples['instruction']
    ]
    targets = [f"{output}\n{EOT_TOKEN}" for output in examples['output']]
    data_dict = preprocess(sources, targets, tokenizer)
    return data_dict

def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if training_args.local_rank == 0:
        print('=' * 100)
        print(training_args)

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=True,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    print("PAD Token:", tokenizer.pad_token, tokenizer.pad_token_id)
    print("BOS Token", tokenizer.bos_token, tokenizer.bos_token_id)
    print("EOS Token", tokenizer.eos_token, tokenizer.eos_token_id)

    if training_args.local_rank == 0:
        print("Load tokenizer from {} over.".format(model_args.model_name_or_path))

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
    )

    if training_args.local_rank == 0:
        print("Load model from {} over.".format(model_args.model_name_or_path))

    raw_train_datasets = load_dataset(
        'json',
        data_files=data_args.data_path,
        split="train",
        cache_dir=training_args.cache_dir
    )

    train_dataset = raw_train_datasets.map(
        train_tokenize_function,
        batched=True,
        batch_size=3000,
        num_proc=32,
        remove_columns=raw_train_datasets.column_names,
        load_from_cache_file=True,  # not args.overwrite_cache
        desc="Running Encoding",
        fn_kwargs={"tokenizer": tokenizer}
    )

    if training_args.local_rank == 0:
        print("Training dataset samples:", len(train_dataset))
        for index in random.sample(range(len(train_dataset)), 3):
            print(f"Sample {index} of the training set: {train_dataset[index]['input_ids']}, {train_dataset[index]['labels']}.")
            print(f"Sample {index} of the training set: {tokenizer.decode(list(train_dataset[index]['input_ids']))}.")

    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    data_module = dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    trainer.train()
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)


if __name__ == "__main__":
    train()

@NielsRogge (Contributor) commented Jan 13, 2024

Hi @NickWithBotronics, if you set trust_remote_code=True, then the code from the hub is used (in the case of microsoft/phi-2, that's defined here), rather than the modeling_phi.py defined natively in the Transformers library.

Hence it's recommended to convert the weights from the microsoft/phi-2 repo to the native format, which will work with Flash Attention 2. One can leverage the conversion script for that.

@ArthurZucker should we host the converted phi-2 weights as part of the Microsoft organization? Because currently one will get a lot of mismatched keys when doing the following:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
)

due to the model in Transformers using a single matrix for queries, keys, and values, whereas the code on the hub uses separate matrices.
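For illustration (a sketch only; the shapes use the phi-2 hidden size but the variable names are hypothetical, not actual checkpoint keys): whichever layout a checkpoint uses, converting between a fused QKV projection and separate q/k/v projections is just a split or a concatenation along the output dimension, and loading one layout into the other without converting is what produces the mismatched keys.

import torch

hidden_size = 2560  # phi-2 hidden size

# Hypothetical fused projection weight, stacked as [Wq; Wk; Wv].
w_qkv = torch.randn(3 * hidden_size, hidden_size)

# Fused -> separate: split along the output dimension.
w_q, w_k, w_v = torch.split(w_qkv, hidden_size, dim=0)

# Separate -> fused: concatenate back in the same order.
w_qkv_roundtrip = torch.cat([w_q, w_k, w_v], dim=0)
assert torch.equal(w_qkv, w_qkv_roundtrip)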

@NicolasMejiaPetit commented Jan 14, 2024

Thank you <3!!!! That fixed that error (using the new modeling.py and the converted HF format); now onto a new error that I think is due to my script. :(
C:\Python311\Lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Traceback (most recent call last):
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in <module>
train()
File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3503, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 967, in __init__
self.model = PhiModel(config)
^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in __init__
[PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in <listcomp>
[PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 629, in __init__
self.self_attn = PHI_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 412, in __init__
super().__init__(*args, **kwargs)
File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 245, in __init__
self.attention_dropout = config.attention_dropout
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\site-packages\transformers\configuration_utils.py", line 265, in __getattribute__
return super().__getattribute__(key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PhiConfig' object has no attribute 'attention_dropout'

C:\Users\PC\Documents\Code-Trainer\FineTune>

Edit: fixed it by downloading the latest generation_config.json, config.json, configuration_phi.py, and modeling_phi.py.

@NicolasMejiaPetit commented Jan 14, 2024

While I got it working, the training loss was very off. It started at 6 and went to 2 (after 3 epochs), but when I used the old config without Flash Attention it went from 0.6 to ~0.29 (also 3 epochs) with the same dataset, same setup, and same model; just different config files and Flash Attention. I saw someone else report the same thing on Twitter.
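One way to narrow this kind of discrepancy down (a sketch, not from the thread; the checkpoint, dtype, and prompt are placeholders) is to load the same checkpoint twice with different attention implementations and compare the loss on a single identical batch before any training:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "microsoft/phi-2"  # placeholder; substitute the locally converted checkpoint used above
tok = AutoTokenizer.from_pretrained(ckpt)
batch = tok(["def add(a, b):\n    return a + b"], return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

for impl in ("eager", "flash_attention_2"):
    # Flash Attention 2 requires a CUDA device and fp16/bf16 weights.
    model = AutoModelForCausalLM.from_pretrained(
        ckpt, torch_dtype=torch.float16, attn_implementation=impl
    ).to("cuda")
    model.eval()
    with torch.no_grad():
        loss = model(**{k: v.to("cuda") for k, v in batch.items()}).loss
    print(impl, loss.item())

If the two losses already differ on an untouched batch, the gap usually points at masking or padding handling rather than the optimizer or the data.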

@ArthurZucker (Collaborator)

Can you open a separate issue for this, with a reproducible snippet?

@NicolasMejiaPetit

Gotcha, I’ll move to this ticket #28488
