
Granite language models #31502

Merged · 87 commits · Aug 27, 2024
750ca7f
first commit
mayank31398 Jun 19, 2024
3b26730
drop tokenizer
mayank31398 Jun 19, 2024
9c017b0
drop tokenizer
mayank31398 Jun 19, 2024
876f4b5
drop tokenizer
mayank31398 Jun 19, 2024
0f716ec
Merge branch 'main' into granite
mayank31398 Jun 28, 2024
e3cdcaf
drop convert
mayank31398 Jun 28, 2024
3e4391e
granite
mayank31398 Jun 28, 2024
6f0cf35
drop tokenization test
mayank31398 Jun 28, 2024
2d1a58c
mup
mayank31398 Jun 30, 2024
ac560ae
fix
mayank31398 Jun 30, 2024
78c81a0
reformat
mayank31398 Jun 30, 2024
3b6c755
reformat
mayank31398 Jun 30, 2024
f46bf82
reformat
mayank31398 Jun 30, 2024
272af5c
fix docs
mayank31398 Jun 30, 2024
c9b2288
stop checking for checkpoint
mayank31398 Jun 30, 2024
19ec830
update support
mayank31398 Jun 30, 2024
a9dba03
attention multiplier
mayank31398 Jun 30, 2024
df90fbd
update model
mayank31398 Jul 1, 2024
c3369a0
tiny drop
mayank31398 Jul 1, 2024
6a7c814
saibo drop
mayank31398 Jul 1, 2024
dad1e4a
skip test
mayank31398 Jul 1, 2024
5cba841
fix test
mayank31398 Jul 1, 2024
e8f5886
fix test
mayank31398 Jul 1, 2024
1678792
drop
mayank31398 Jul 1, 2024
9498556
drop useless imports
mayank31398 Jul 1, 2024
039b377
update docs
mayank31398 Jul 1, 2024
1bea763
Merge branch 'main' into granite
mayank31398 Jul 2, 2024
2a9d734
Merge branch 'main' into granite
mayank31398 Jul 11, 2024
2442492
drop flash function
mayank31398 Jul 11, 2024
2efe0a6
copied from
mayank31398 Jul 11, 2024
8da50b5
drop pretraining tp
mayank31398 Jul 11, 2024
73d4f2d
drop pretraining tp
mayank31398 Jul 11, 2024
5f02075
drop pretraining tp
mayank31398 Jul 11, 2024
de33d60
drop unused import
mayank31398 Jul 11, 2024
42035c4
drop code path
mayank31398 Jul 12, 2024
f833ca6
change name
mayank31398 Jul 12, 2024
5ca5b08
softmax scale
mayank31398 Jul 12, 2024
abb359d
head dim
mayank31398 Jul 12, 2024
cfa8210
drop legacy cache
mayank31398 Jul 12, 2024
b1dad99
rename params
mayank31398 Jul 12, 2024
79bdf6b
cleanup
mayank31398 Jul 22, 2024
91a0253
Merge branch 'main' into granite
mayank31398 Jul 22, 2024
7df943f
fix copies
mayank31398 Jul 22, 2024
18d577d
comments
mayank31398 Jul 22, 2024
90c7906
add back legacy cache
mayank31398 Jul 22, 2024
a765b89
multipliers
mayank31398 Jul 22, 2024
ce070ad
multipliers
mayank31398 Jul 22, 2024
8b1b7e0
multipliers
mayank31398 Jul 22, 2024
fcc7bf7
text fix
mayank31398 Jul 22, 2024
bd6bee6
Merge branch 'main' into granite
mayank31398 Jul 23, 2024
37eb40f
fix copies
mayank31398 Jul 23, 2024
d743ff7
Merge branch 'main' into granite
mayank31398 Jul 23, 2024
1142dbb
merge
mayank31398 Jul 23, 2024
c3185de
multipliers
mayank31398 Jul 23, 2024
6ccf5b5
attention multiplier
mayank31398 Jul 23, 2024
52440ad
drop unused imports
mayank31398 Jul 24, 2024
b39cb7d
Merge branch 'main' into granite
mayank31398 Jul 26, 2024
6fa1774
Merge branch 'main' into granite
mayank31398 Jul 29, 2024
46524c7
Merge branch 'main' into granite
mayank31398 Jul 30, 2024
71c2cde
fix
mayank31398 Jul 30, 2024
fe64841
fix
mayank31398 Jul 30, 2024
559204d
fix
mayank31398 Jul 30, 2024
b64c16d
move rope?
mayank31398 Jul 30, 2024
cd9a911
Update src/transformers/models/granite/configuration_granite.py
mayank31398 Aug 1, 2024
124065c
fix
mayank31398 Aug 1, 2024
02c1073
Update src/transformers/models/granite/modeling_granite.py
mayank31398 Aug 1, 2024
8c9112f
fix
mayank31398 Aug 1, 2024
9493a86
fix
mayank31398 Aug 1, 2024
4932097
fix
mayank31398 Aug 1, 2024
b7fe0d3
fix
mayank31398 Aug 1, 2024
fa5550a
Merge branch 'main' into granite
mayank31398 Aug 11, 2024
1eeab96
fix-copies
mayank31398 Aug 11, 2024
6a13285
Merge branch 'main' into granite
mayank31398 Aug 14, 2024
7d68382
torch rmsnorm
mayank31398 Aug 14, 2024
4a4a581
add authors
mayank31398 Aug 19, 2024
57a5b9e
Merge branch 'main' into granite
mayank31398 Aug 24, 2024
c03e395
change model path
mayank31398 Aug 26, 2024
ffed3d1
fix
mayank31398 Aug 26, 2024
a7fbd30
test
mayank31398 Aug 27, 2024
8f00c1b
drop static cache test
mayank31398 Aug 27, 2024
586fdf1
uupdate readme
mayank31398 Aug 27, 2024
dc9faaa
drop non-causal
mayank31398 Aug 27, 2024
545449c
readme
mayank31398 Aug 27, 2024
5e5cad9
drop useless imports
mayank31398 Aug 27, 2024
24029b2
Update docs/source/en/model_doc/granite.md
mayank31398 Aug 27, 2024
eaeff2a
Update docs/source/en/model_doc/granite.md
mayank31398 Aug 27, 2024
ee9c0f6
Update docs/source/en/model_doc/granite.md
mayank31398 Aug 27, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
Original file line number Diff line number Diff line change
@@ -412,6 +412,8 @@
title: GPTSAN Japanese
- local: model_doc/gpt-sw3
title: GPTSw3
- local: model_doc/granite
title: Granite
- local: model_doc/herbert
title: HerBERT
- local: model_doc/ibert
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -158,6 +158,7 @@ Flax), PyTorch, and/or TensorFlow.
| [GPT-Sw3](model_doc/gpt-sw3) | ✅ | ✅ | ✅ |
| [GPTBigCode](model_doc/gpt_bigcode) | ✅ | ❌ | ❌ |
| [GPTSAN-japanese](model_doc/gptsan-japanese) | ✅ | ❌ | ❌ |
| [Granite](model_doc/granite) | ✅ | ❌ | ❌ |
| [Graphormer](model_doc/graphormer) | ✅ | ❌ | ❌ |
| [Grounding DINO](model_doc/grounding-dino) | ✅ | ❌ | ❌ |
| [GroupViT](model_doc/groupvit) | ✅ | ✅ | ❌ |
78 changes: 78 additions & 0 deletions docs/source/en/model_doc/granite.md
@@ -0,0 +1,78 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Granite

## Overview

The Granite model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B-parameter state-of-the-art small language model trained with the Power learning-rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B shows promising results compared to other models of similar size across various benchmarks, including natural-language multiple-choice tasks, code generation, and math reasoning.

The abstract from the paper is the following:

*Finding the optimal learning rate for language model pretraining is a challenging task.
This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored.
In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (μP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models.
We [open source](https://huggingface.co/collections/ibm/power-lm-66be64ae647ddf11b9808000) these pretrained models.*
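The abstract above describes a power-law relationship between the optimal learning rate and training scale. As a purely illustrative sketch of a power-law schedule (the functional form and all constants below are assumptions for illustration, not the paper's exact Power scheduler):

```python
def power_lr(step, coeff=0.1, exponent=0.5, lr_max=0.02):
    """Toy power-law learning-rate schedule: decay as coeff * step**(-exponent),
    capped at lr_max. Constants are illustrative, not the paper's scheduler."""
    if step < 1:
        return lr_max
    return min(lr_max, coeff * step ** (-exponent))

# early steps sit at the cap; later steps decay as a power law
schedule = [power_lr(s) for s in (1, 100, 10000)]
```

The cap keeps the warm phase flat while the power-law tail handles long runs, which is the rough shape such schedulers aim for.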

Tips:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # or "cpu"
model_path = "ibm/PowerLM-3b"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text
input_tokens = tokenizer(prompt, return_tensors="pt")
# transfer tokenized inputs to the device
for i in input_tokens:
    input_tokens[i] = input_tokens[i].to(device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
```
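The `model.generate(..., max_new_tokens=100)` call in the snippet above runs autoregressive decoding under the hood. A toy sketch of the greedy variant, with a hypothetical `toy_logits` standing in for a real model's forward pass:

```python
def greedy_generate(logits_fn, input_ids, max_new_tokens, eos_id=None):
    """Greedily append the argmax token until max_new_tokens or eos_id is hit.
    logits_fn maps the current token list to one logit per vocabulary entry."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_logits(ids):
    # hypothetical stand-in: prefer token 2 until 5 tokens exist, then token 0
    return [3.0, 0.0, 1.0] if len(ids) >= 5 else [0.0, 1.0, 2.0]
```

Real `generate` adds sampling, caching, and batching on top, but the stop conditions (token budget and end-of-sequence) are the same.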

This model was contributed by [mayank-mishra](https://huggingface.co/mayank-mishra).


## GraniteConfig

[[autodoc]] GraniteConfig

## GraniteModel

[[autodoc]] GraniteModel
- forward

## GraniteForCausalLM

[[autodoc]] GraniteForCausalLM
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -51,6 +51,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [GPTNeo](https://huggingface.co/docs/transformers/model_doc/gpt_neo#transformers.GPTNeoModel)
* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
* [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel)
* [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
* [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
@@ -214,6 +215,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
* [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel)
* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
* [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
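The diff above registers Granite in the FlashAttention-2 and SDPA support lists. Per the commit messages ("softmax scale", "attention multiplier"), Granite can replace the conventional 1/sqrt(head_dim) softmax scale with a configured multiplier. A minimal pure-Python sketch of that scaling choice, simplified to a single query vector (an illustration of the idea, not the modeling code):

```python
import math

def attention_weights(query, keys, attention_multiplier=None):
    """Softmax attention weights for one query over a list of key vectors.
    Uses attention_multiplier as the softmax scale when given, otherwise
    falls back to the conventional 1/sqrt(head_dim)."""
    head_dim = len(query)
    scale = attention_multiplier if attention_multiplier is not None else 1.0 / math.sqrt(head_dim)
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A smaller multiplier flattens the attention distribution, a larger one sharpens it, which is why the scale interacts with μP-style width transfer.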
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -461,6 +461,7 @@
"models.gpt_neox_japanese": ["GPTNeoXJapaneseConfig"],
"models.gpt_sw3": [],
"models.gptj": ["GPTJConfig"],
"models.granite": ["GraniteConfig"],
"models.grounding_dino": [
"GroundingDinoConfig",
"GroundingDinoProcessor",
@@ -2317,6 +2318,13 @@
"GPTJPreTrainedModel",
]
)
_import_structure["models.granite"].extend(
[
"GraniteForCausalLM",
"GraniteModel",
"GranitePreTrainedModel",
]
)
_import_structure["models.grounding_dino"].extend(
[
"GroundingDinoForObjectDetection",
@@ -5201,6 +5209,7 @@
GPTNeoXJapaneseConfig,
)
from .models.gptj import GPTJConfig
from .models.granite import GraniteConfig
from .models.grounding_dino import (
GroundingDinoConfig,
GroundingDinoProcessor,
@@ -6915,6 +6924,11 @@
GPTJModel,
GPTJPreTrainedModel,
)
from .models.granite import (
GraniteForCausalLM,
GraniteModel,
GranitePreTrainedModel,
)
from .models.grounding_dino import (
GroundingDinoForObjectDetection,
GroundingDinoModel,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -105,6 +105,7 @@
gpt_neox_japanese,
gpt_sw3,
gptj,
granite,
grounding_dino,
groupvit,
herbert,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -122,6 +122,7 @@
("gpt_neox_japanese", "GPTNeoXJapaneseConfig"),
("gptj", "GPTJConfig"),
("gptsan-japanese", "GPTSanJapaneseConfig"),
("granite", "GraniteConfig"),
("graphormer", "GraphormerConfig"),
("grounding-dino", "GroundingDinoConfig"),
("groupvit", "GroupViTConfig"),
@@ -410,6 +411,7 @@
("gpt_neox_japanese", "GPT NeoX Japanese"),
("gptj", "GPT-J"),
("gptsan-japanese", "GPTSAN-japanese"),
("granite", "Granite"),
("graphormer", "Graphormer"),
("grounding-dino", "Grounding DINO"),
("groupvit", "GroupViT"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -119,6 +119,7 @@
("gpt_neox_japanese", "GPTNeoXJapaneseModel"),
("gptj", "GPTJModel"),
("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"),
("granite", "GraniteModel"),
("graphormer", "GraphormerModel"),
("grounding-dino", "GroundingDinoModel"),
("groupvit", "GroupViTModel"),
@@ -478,6 +479,7 @@
("gpt_neox", "GPTNeoXForCausalLM"),
("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"),
("gptj", "GPTJForCausalLM"),
("granite", "GraniteForCausalLM"),
("jamba", "JambaForCausalLM"),
("jetmoe", "JetMoeForCausalLM"),
("llama", "LlamaForCausalLM"),
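The two mapping entries above are what let `AutoModel` and `AutoModelForCausalLM` resolve `config.model_type == "granite"` to the concrete Granite classes. A simplified sketch of that registry lookup (the dict and resolver below are illustrative, not transformers' actual implementation):

```python
# illustrative subset of a model_type -> class-name registry
MODEL_FOR_CAUSAL_LM_NAMES = {
    "gptj": "GPTJForCausalLM",
    "granite": "GraniteForCausalLM",
    "llama": "LlamaForCausalLM",
}

def resolve_causal_lm_class(model_type):
    """Return the class name registered for model_type, mirroring how the
    auto classes map a config's model_type to a concrete model class."""
    try:
        return MODEL_FOR_CAUSAL_LM_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}")
```

Registering the pair of strings in the PR is all the auto machinery needs; everything else is resolved lazily by name.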
57 changes: 57 additions & 0 deletions src/transformers/models/granite/__init__.py
@@ -0,0 +1,57 @@
# Copyright 2024 EleutherAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {
    "configuration_granite": ["GraniteConfig"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_granite"] = [
        "GraniteForCausalLM",
        "GraniteModel",
        "GranitePreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_granite import GraniteConfig

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_granite import (
            GraniteForCausalLM,
            GraniteModel,
            GranitePreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
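The `__init__.py` above wires the package through `_LazyModule`, so torch-dependent modeling classes are imported only on first attribute access. A stripped-down, stdlib-only sketch of that pattern (the class below is illustrative and skips the `OptionalDependencyNotAvailable` handling; it is not transformers' real `_LazyModule`):

```python
import importlib
import types

class LazyModule(types.ModuleType):
    """Minimal lazy-import sketch: defer submodule imports until a symbol
    from the import structure is actually accessed."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        self._import_structure = import_structure
        # map each exported symbol back to the module that provides it
        self._symbol_to_module = {
            sym: mod for mod, syms in import_structure.items() for sym in syms
        }

    def __getattr__(self, name):
        # only called for attributes not yet set on the module
        if name not in self._symbol_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        module = importlib.import_module(self._symbol_to_module[name])
        value = getattr(module, name)
        setattr(self, name, value)  # cache: later lookups skip __getattr__
        return value

# usage: 'json' is imported only when lazy.dumps is first touched
lazy = LazyModule("demo", {"json": ["dumps", "loads"]})
```

This keeps `import transformers` cheap even as the number of model files grows, since only the symbols a user touches trigger real imports.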