Command-R-Plus, Context Window Limitations #660

jeanromainroy · 2024-04-08T04:26:24Z

Cohere's new Command-R-Plus model reportedly features a 128k context window. However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "<PAD><PAD>...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file. The config also lists a "rope_theta" value, suggesting its role in achieving the large context window. Is "rope" supported in MLX?

awni · 2024-04-08T13:44:29Z

However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file

🤔 not sure what would cause that. Do you have a prompt that should work in the MLX version that doesn't? Also if you are able to provide some expected output that would also be helpful.

The config also lists a "rope_theta" value, suggesting its role in achieving the large context window. Is "rope" supported in MLX?

MLX has RoPE and it should be used correctly already.

fblissjr · 2024-04-08T15:48:02Z

I'm getting random Cyrillic in my responses when using tokenizer.apply_tool_use_template. Anyone else? Seems to only be when using that tool template from the tokenizer.

Example output:

Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the directly-answer tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:

[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Action: ```json
[
    {
        "tool некоторыми": {},
        "tool_name": "internet_search"
    } forniscono]
```<EOS_TOKEN>

fblissjr · 2024-04-08T16:22:36Z

ignore, I was calling the tokenizer twice. fixed it in my code here for anyone who wants to test tool use (apologies in advance if there are bugs still lurking): https://github.com/fblissjr/mlx-funbox

fblissjr · 2024-04-08T16:42:05Z

looks like there's still random switching to multilingual and random Cyrillic (using a simple generate + apply tool template). has anyone tested on CUDA to see if similar?

fblissjr · 2024-04-08T17:32:34Z

Looks like it's the tokenizer.json that is not converting correctly. See the tokenizer.json from the Cohere HF model repo:

Compared to a freshly converted mlx_lm.convert -q (no other params) I just did from that same repo 20 minutes ago, which also matches the tokenizer.json from the mlx-community quant uploaded earlier (mlx-community/c4ai-command-r-plus-4bit):

fblissjr · 2024-04-08T17:35:22Z

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.

awni · 2024-04-08T17:45:36Z

That is very odd. The tokenizer copying is very simple in MLX LM. We basically load with Hugging Face and then save it with Hugging Face. There is no MLX code involved. https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L619

I wonder if we are somehow using the API incorrectly or maybe there is a bug in the way it's saved with Transformers.

awni · 2024-04-08T17:49:27Z

@fblissjr you can reproduce the behavior with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")

I feel that should not break the tokenizer.. so it might be worth filing an issue with the Cohere HF repo or the Transformer repos? Wdyt?

fblissjr · 2024-04-08T17:50:28Z

@awni my guess is the latter. looks more like it's saved incorrectly (and oddly just by looking at it) in the hf repo. Haven't seen a tokenizer.json like this before. here's a quick sample of 1 page on more tokenizer.json from https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json

{"version": "1.0", "truncation": null, "padding": null, "added_tokens": [{"id": 0, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 1, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 2, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 3, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 4, "content": "<MASK_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 5, "content": "<BOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 6, "content": "<EOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 7, "content": "<EOP_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 255000, "special": false, "content": "<|START_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255001, "special": false, "content": "<|END_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255002, "special": false, "content": "<|YES_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255003, "special": false, "content": "<|NO_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255004, "special": false, "content": "<|GOOD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255005, "special": false, "content": "<|BAD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255006, "special": false, "content": "<|USER_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255007, "special": false, "content": "<|CHATBOT_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255008, "special": false, "content": "<|SYSTEM_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255009, "special": false, "content": "<|USER_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255010, "special": false, "content": "<|USER_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255011, "special": false, "content": "<|USER_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255012, "special": false, "content": "<|USER_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255013, "special": false, "content": "<|USER_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255014, "special": false, "content": "<|USER_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255015, "special": false, "content": "<|USER_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255016, "special": false, "content": "<|USER_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255017, "special": false, "content": "<|USER_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255018, "special": false, "content": "<|USER_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255019, "special": false, "content": "<|EXTRA_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255020, "special": false, "content": "<|EXTRA_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255021, "special": false, "content": "<|EXTRA_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255022, "special": false, "content": "<|EXTRA_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255023, "special": false, "content": "<|EXTRA_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255024, "special": false, "content": "<|EXTRA_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255025, "special": false, "content": "<|EXTRA_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255026, "special": false, "content": "<|EXTRA_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255027, "special": false, "content": "<|EXTRA_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255028, "special": false, "content": "<|EXTRA_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}], "normalizer": {"type": "NFC"}, "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "Digits", "individual_digits": true}, {"type": "ByteLevel", "add_prefix_space": false, "trim_offsets": true, "use_regex": true}]}, "post_processor": {"add_prefix_space": true, "trim_offsets": false, "use_regex": true, "type": "TemplateProcessing", "single": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "pair": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"Sequence": {"id": "B", "type_id": 1}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "special_tokens": {"<BOS_TOKEN>": {"id": "<BOS_TOKEN>", "ids": [5], "tokens": ["<BOS_TOKEN>"]}, "<EOS_TOKEN>": {"id": "<EOS_TOKEN>", "ids": [6], "tokens": ["<EOS_TOKEN>"]}, "<|END_OF_TURN_TOKEN|>": {"id": "<|END_OF_TURN_TOKEN|>", "ids": [255001], "tokens": ["<|END_OF_TURN_TOKEN|>"]}}}, "decoder": {"type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true}, "model": {"type": "BPE", "dropout": null, "unk_token": null, "continuing_subword_prefix": null, "end_of_word_suffix": null, "fuse_unk": false, "byte_fallback": false, "vocab": {"": 0, "": 1, "": 2, "": 3, "<MASK_TOKEN>": 4, "<BOS_TOKEN>": 5, "<EOS_TOKEN>": 6, "<EOP_TOKEN>": 7, "!": 8, """: 9, "#": 10, "$": 11, "%": 12, "&": 13, "'": 14, "(": 15, ")": 16, "*": 17, "+": 18, ",": 19, "-": 20, ".": 21, "/": 22, "0": 23, "1": 24, "2": 25, "3": 26, "4": 27, "5": 28, "6": 29, "7": 30, "8": 31, "9": 32, ":": 33, ";": 34, "<": 35, "=": 36, ">": 37, "?": 38, "@": 39, "A": 40, "B": 41, "C": 42, "D": 43, "E": 44, "F": 45, "G": 46, "H": 47, "I": 48, "J": 49, "K": 50, "L": 51, "M": 52, "N": 53, "O": 54, "P": 55, "Q": 56, "R": 57, "S": 58, "T": 59, "U": 60, "V": 61, "W": 62, "X": 63, "Y": 64, "Z": 65, "[": 66, "\": 67, "]": 68, "^": 69, "_": 70, "`": 71, "a": 72, "b": 73, "c": 74, "d": 75, "e": 76, "f": 77, "g": 78, "h": 79, "i": 80, "j": 81, "k": 82, "l": 83, "m": 84, "n": 85, "o": 86, "p": 87, "q": 88, "r": 89, "s": 90, "t": 91, "u": 92, "v": 93, "w": 94, "x": 95, "y": 96, "z": 97, "{": 98, "|": 99, "}": 100, "~": 101, "\u00a1": 102, "\u00a2": 103, "\u00a3": 104, "\u00a4": 105, "\u00a5": 106, "\u00a6": 107, "\u00a7": 108, "\u00a8": 109, "\u00a9": 110, "\u00aa": 111, "\u00ab": 112, "\u00ac": 113, "\u00ae": 114, "\u00af": 115, "\u00b0": 116, "\u00b1": 117, "\u00b2": 118, "\u00b3": 119, "\u00b4": 120, "\u00b5": 121, "\u00b6": 122, "\u00b7": 123, "\u00b8": 124, "\u00b9": 125, "\u00ba": 126, "\u00bb": 127, "\u00bc": 128, "\u00bd": 129, "\u00be": 130, "\u00bf": 131, "\u00c0": 132, "\u00c1": 133, "\u00c2": 134, "\u00c3": 135, "\u00c4": 136, "\u00c5": 137, "\u00c6": 138, "\u00c7": 139, "\u00c8": 140, "\u00c9": 141, "\u00ca": 142, "\u00cb": 143, "\u00cc": 144, "\u00cd": 145, "\u00ce": 146, "\u00cf": 147, "\u00d0": 148, "\u00d1": 149, "\u00d2": 150, "\u00d3": 151, "\u00d4": 152, "\u00d5": 153, "\u00d6": 154, "\u00d7": 155, "\u00d8": 156, "\u00d9": 157, "\u00da": 158, "\u00db": 159, "\u00dc": 160, "\u00dd": 161, "\u00de": 162, "\u00df": 163, "\u00e0": 164, "\u00e1": 165, "\u00e2": 166, "\u00e3": 167, "\u00e4": 168, "\u00e5": 169, "\u00e6": 170, "\u00e7": 171, "\u00e8": 172, "\u00e9": 173, "\u00ea": 174, "\u00eb": 175, "\u00ec": 176, "\u00ed": 177, "\u00ee": 178, "\u00ef": 179, "\u00f0": 180, "\u00f1": 181, "\u00f2": 182, "\u00f3": 183, "\u00f4": 184, "\u00f5": 185, "\u00f6": 186, "\u00f7": 187, "\u00f8": 188, "\u00f9": 189, "\u00fa": 190, "\u00fb": 191, "\u00fc": 192, "\u00fd": 193, "\u00fe": 194, "\u00ff": 195, "\u0100": 196, "\u0101": 197, "\u0102": 198, "\u0103": 199, "\u0104": 200, "\u0105": 201, "\u0106": 202, "\u0107": 203, "\u0108": 204, "\u0109": 205, "\u010a": 206, "\u010b": 207, "\u010c": 208, "\u010d": 209, "\u010e": 210, "\u010f": 211, "\u0110": 212, "\u0111": 213, "\u0112": 214, "\u0113": 215, "\u0114": 216, "\u0115": 217, "\u0116": 218, "\u0117": 219, "\u0118": 220, "\u0119": 221, "\u011a": 222, "\u011b": 223, "\u011c": 224, "\u011d": 225, "\u011e": 226, "\u011f": 227, "\u0120": 228, "\u0121": 229, "\u0122": 230, "\u0123": 231, "\u0124": 232, "\u0125": 233, "\u0126": 234, "\u0127": 235, "\u0128": 236, "\u0129": 237, "\u012a": 238, "\u012b": 239, "\u012c": 240, "\u012d": 241, "\u012e": 242, "\u012f": 243, "\u0130": 244, "\u0131": 245, "\u0132": 246, "\u0133": 247, "\u0134": 248, "\u0135": 249, "\u0136": 250, "\u0137": 251, "\u0138": 252, "\u0139": 253, "\u013a": 254, "\u013b": 255, "\u013c": 256, "\u013d": 257, "\u013e": 258, "\u013f": 259, "\u0140": 260, "\u0141": 261, "\u0142": 262, "\u0143": 263, "\u200d": 264, "\u203c": 265, "\u2049": 266, "\u20e3": 267, "\u2122": 268, "\u2139": 269, "\u2194": 270, "\u2195": 271, "\u2196": 272, "\u2197": 273, "\u2198": 274, "\u2199": 275, "\u21a9": 276, "\u21aa": 277, "\u231a": 278, "\u231b": 279, "\u2328": 280, "\u23cf": 281, "\u23e9": 282, "\u23ea": 283, "\u23eb": 284, "\u23ec": 285, "\u23ed": 286, "\u23ee": 287, "\u23ef": 288, "\u23f0": 289, "\u23f1": 290, "\u23f2": 291, "\u23f3": 292, "\u23f8": 293, "\u23f9": 294, "\u23fa": 295, "\u24c2": 296, "\u25aa": 297, "\u25ab": 298, "\u25b6": 299, "\u25c0": 300, "\u25fb": 301, "\u25fc": 302, "\u25fd": 303, "\u25fe": 304, "\u2600": 305, "\u2601": 306, "\u2602": 307, "\u2603": 308, "\u2604": 309, "\u260e": 310, "\u2611": 311, "\u2614": 312, "\u2615": 313, "\u2618": 314, "\u261d": 315, "\u2620": 316, "\u2622": 317, "\u2623": 318, "\u2626": 319, "\u262a": 320, "\u262e": 321, "\u262f": 322, "\u2638": 323, "\u2639": 324, "\u263a": 325, "\u2640": 326, "\u2642": 327, "\u2648": 328, "\u2649": 329, "\u264a": 330, "\u264b": 331, "\u264c": 332, "\u264d": 333, "\u264e": 334, "\u264f": 335, "\u2650": 336, "\u2651": 337, "\u2652": 338, "\u2653": 339, "\u265f": 340, "\u2660": 341, "\u2663": 342, "\u2665": 343, "\u2666": 344, "\u2668": 345, "\u267b": 346, "\u267e": 347, "\u267f": 348, "\u2692": 349, "\u2693": 350, "\u2694": 351, "\u2695": 352, "\u2696": 353, "\u2697": 354, "\u2699": 355, "\u269b": 356, "\u269c": 357, "\u26a0": 358, "\u26a1": 359, "\u26a7": 360, "\u26aa": 361, "\u26ab": 362, "\u26b0": 363, "\u26b1": 364, "\u26bd": 365, "\u26be": 366, "\u26c4": 367, "\u26c5": 368, "\u26c8": 369, "\u26ce": 370, "\u26cf": 371, "\u26d1": 372, "\u26d3": 373, "\u26d4": 374, "\u26e9": 375, "\u26ea": 376, "\u26f0": 377, "\u26f1": 378, "\u26f2": 379, "\u26f3": 380, "\u26f4": 381, "\u26f5": 382, "\u26f7": 383, "\u26f8": 384, "\u26f9": 385, "\u26fa": 386, "\u26fd": 387, "\u2702": 388, "\u2705": 389, "\u2708": 390, "\u2709": 391, "\u270a": 392, "\u270b": 393, "\u270c": 394, "\u270d": 395, "\u270f": 396, "\u2712": 397, "\u2714": 398, "\u2716": 399, "\u271d": 400, "\u2721": 401, "\u2728": 402, "\u2733": 403, "\u2734": 404, "\u2744": 405, "\u2747": 406, "\u274c": 407, "\u274e": 408, "\u2753": 409, "\u2754": 410, "\u2755": 411, "\u2757": 412, "\u2763": 413, "\u276

fblissjr · 2024-04-08T18:11:12Z

@fblissjr you can reproduce the behavior with:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")
I feel that should not break the tokenizer.. so it might be worth filing an issue with the Cohere HF repo or the Transformer repos? Wdyt?

agreed. i made a community post on hf here: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/15

and here: huggingface/transformers#30027

fblissjr · 2024-04-08T20:09:28Z

so this is interesting - the tokenizer.json on the bitsandbytes repo linked from the main cohere repo is a different size, and looks nothing like the original. https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit/blob/main/tokenizer.json

fblissjr · 2024-04-08T20:29:17Z

another interesting difference between the 4 bit bnb tokenizer and the original - in the original one, token id token id 255001 <|END_OF_TURN_TOKEN|>, special is set to False. In the 4bit bnb one, it's True.

fblissjr · 2024-04-08T23:28:39Z

Per comments on the hugging face repo, the differences between the two tokenizers.json files are unicode differences. I'll assume I've got something bugging on my end unless anyone else sees the same.

jeanromainroy · 2024-04-09T14:38:24Z

# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path


# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"


# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))


# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...


# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]


# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)


# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)

This is what I have been using – I removed the texts which are just some random wikipedia page. The output is good until I try the 8500 tokens text which just outputs <PAD><PAD><PAD><PAD><PAD>...

fblissjr · 2024-04-09T15:16:43Z

# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path


# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"


# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))


# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...


# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]


# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)


# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)

This is what I have been using – I removed the texts which are just some random wikipedia page. The output is good until I try the 8500 tokens text which just outputs ...

Have you tried with apply_tool_template by chance? Curious if you see any of the oddities I see when using it.

Blaizzy · 2024-04-09T16:25:51Z

Hey guys @awni, @fblissjr and @jeanromainroy,

The cohere team limited the context to 8k for all Command-R variants on purpose. If you check the config file for both r-v01 and r+, the max_position_embeddings is set to 8192.

It's a limit to avoid users experiencing OOM.

You can read more here:
https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/12

jeanromainroy · 2024-04-09T16:41:10Z

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

Blaizzy · 2024-04-09T17:15:52Z

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed the tokenizer created from the conversion is slightly smaller ~2MB than the original.

I updated as you suggested. Can you check it?

Blaizzy · 2024-04-09T17:21:44Z

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

@jeanromainroy can you try again with the change in this branch, if it works I will make a PR.

pip install -U git+https://github.com/Blaizzy/mlx-examples.git@pc/commandR#subdirectory=llms --use-pep517

Link: https://github.com/Blaizzy/mlx-examples/tree/pc/commandR

Blaizzy · 2024-04-09T17:23:23Z

You can also try to increase the default max_position_embeddings and let me know if it works.

fblissjr · 2024-04-09T17:24:31Z

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)
My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.
edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed the tokenizer created from the conversion is slightly smaller ~2MB than the original.

I updated as you suggested. Can you check it?

Actually did this myself yesterday with my own quant, and output was better and faster - no idea why. And now unsure if I just had a bug somewhere on my end or if it actually made a difference.

I'm planning to test out a larger CUDA machine later today or tomorrow to see how it works natively.

Blaizzy · 2024-04-09T17:29:00Z

Let me know how it goes, but for now according to your report the issue should be fixed.

jeanromainroy · 2024-04-09T18:00:23Z

Hey @Blaizzy , I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.

Blaizzy · 2024-04-09T18:15:45Z

Hey @Blaizzy , I tried your fork and the model is still outputting ... when I provide a long prompt.

I have made a new change, can you try it again please :)

Blaizzy · 2024-04-09T18:29:08Z

Wait, I think I got it!

Give me 30 min :)

Blaizzy · 2024-04-09T18:44:45Z

@jeanromainroy can you try this branch, the previous one had a git issue:

https://github.com/Blaizzy/mlx-examples/tree/pc/command-R

jeanromainroy · 2024-04-09T19:52:35Z

Still outputting <PAD><PAD><PAD>... :(

Blaizzy · 2024-04-09T19:58:33Z

Only PAD ? Can you share the whole output?

jeanromainroy · 2024-04-09T20:00:53Z

It's outputting <PAD> for as long as I let it. In other words, max_tokens=256, results in 256 x <PAD>

Blaizzy · 2024-04-09T21:09:57Z

Got it!

@awni the cohere team added model_max_length set to 128K on both command-r models.

Is there a way of setting using this number with the nn.Rope? Are there any deep changes needed? If so, please point them, I can work on it.

awni · 2024-04-09T21:46:43Z

I'm not sure I fiollow your question. The nn.RoPE layer does not make any assumptions about the maximum sequence length.

jeanromainroy · 2024-04-09T21:48:24Z

He's talking about the Command-R+'s config:

awni · 2024-04-10T20:05:12Z

@jeanromainroy regarding:

I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

My understanding is llama.cpp uses a fixed size context (the -c flag) so while it works for long context it does not really "work" in the sense that it does not use any context before a certain limit. It simply truncates everything before the -c tokens. Default value for -c is 512:

  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)

Maybe we could provide something similar.. but I think the default behavior is a little misleading.

awni · 2024-04-10T20:05:49Z

@fblissjr could you share a command using mlx_lm.generate that produces strange responses? I would like to debug that issue if you are still experiencing it.

fblissjr · 2024-04-10T20:12:19Z

@fblissjr could you share a command using mlx_lm.generate that produces strange responses? I would like to debug that issue if you are still experiencing it.

I can't with mlx_lm.generate because it only happens when I run it with apply_tool_template in the tokenizer. Not at home right now and haven't tested since the day it happened. I think you can mock up something like this:

tools = (a json object similar to the cohere example on HF)
if tools, tokenizer.apply_tool_use_template (applied to prompt)

Basically you want to get the apply_tool_use_template to show up, which is a big page-ish long output that looks like this (copying and pasting from HF repo under tool use output example):

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

System Preamble

Basic Rules

You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

User Preamble

Task and Context

You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

Style Guide

Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.

Available Tools

Here is a list of tools that you have available to you:

Blaizzy · 2024-04-10T20:31:35Z

@awni you can use the example in the MLX model card to replicate @fblissjr example:

https://huggingface.co/mlx-community/c4ai-command-r-plus-4bit

Blaizzy · 2024-04-10T20:34:21Z

The nn.RoPE layer does not make any assumptions about the maximum sequence length.
My understanding is llama.cpp uses a fixed size context (the -c flag) so while it works for long context it does not really "work" in the sense that it does not use any context before a certain limit. It simply truncates everything before the -c tokens. Default value for -c is 512:
  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)
Maybe we could provide something similar.. but I think the default behavior is a little misleading.

I see now.

I thought because the PyTorch implementation takes context window size into account we were missing something.

Something like this:

class RotaryPositionalEmbeddings(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.device=device
        self.scaling_factor = scaling_factor
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
        t = t / self.scaling_factor
        freqs = torch.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    @torch.no_grad()
    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )

Blaizzy · 2024-04-10T20:35:12Z

The nn.RoPE layer does not make any assumptions about the maximum sequence length.

Btw does that mean that we can use context size of any length with nn.RoPE? If not, what are the limitations?

M-I · 2024-05-06T09:25:41Z

Hi folks, still no consensus on what settings to set, what json files to use?

Using the mlx-community 4bit version, I have random japanese characters. What surprised me was that without even mentionning it in my prompt, at some point the model acknowledged and apoligized for their randomness and said it would try to avoid them.

Blaizzy · 2024-05-06T10:10:29Z

@M-I could you elaborate on what you mean?

M-I · 2024-05-06T15:39:16Z

starting with

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/c4ai-command-r-plus-4bit")

at some point, generate(model, tokenizer, prompt=conversation, tokenize=False, add_generation_prompt=True), verbose=True, max_tokens=1000) generated a bit where there was:
"Let's start with defining the specific requirements and breaking down the project into manageable tasks. Feel free to share your thoughts and questions along the way! I'm excited to see your project come to life! 😊👍

P.S. Apologies for the random Japanese words (e.g., "グリーニング") that appeared in my response. It seems there might be an issue with my language model. I'll try to avoid this in future responses. 😅👍. I'm always ready to assist you with your project! 😊".

And it will go on and on, as is the end_token is never generated or acknowledged. I just assumed it was the price to pay for 4bit quantization, so I never mentioned the fact that there was Japanese, or even anything weird or out of place in it's response, but it just self reflected on its own.
I'll try to find a prompt that I can share that creates similar responses. But others 1 2 in the discussion tab of the 4bit mlx model on hugginface noticed some weirdness.

Blaizzy · 2024-05-06T16:55:30Z

I see, thank you for explaining :)

I think this should be a new issue.

As it's not related with this thread.

Command-R-Plus, Context Window Limitations #660

Command-R-Plus, Context Window Limitations #660

Comments

jeanromainroy commented Apr 8, 2024 • edited Loading

awni commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024 • edited Loading

awni commented Apr 8, 2024

awni commented Apr 8, 2024

fblissjr commented Apr 8, 2024 • edited Loading

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

fblissjr commented Apr 8, 2024

jeanromainroy commented Apr 9, 2024

fblissjr commented Apr 9, 2024

Blaizzy commented Apr 9, 2024 • edited Loading

jeanromainroy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024 • edited Loading

Blaizzy commented Apr 9, 2024

fblissjr commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

jeanromainroy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

jeanromainroy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

jeanromainroy commented Apr 9, 2024

Blaizzy commented Apr 9, 2024

awni commented Apr 9, 2024

jeanromainroy commented Apr 9, 2024

awni commented Apr 10, 2024

awni commented Apr 10, 2024

fblissjr commented Apr 10, 2024

System Preamble

Basic Rules

User Preamble

Task and Context

Style Guide

Available Tools

Blaizzy commented Apr 10, 2024

Blaizzy commented Apr 10, 2024 • edited Loading

Blaizzy commented Apr 10, 2024

M-I commented May 6, 2024

Blaizzy commented May 6, 2024

M-I commented May 6, 2024

Blaizzy commented May 6, 2024

jeanromainroy commented Apr 8, 2024 •

edited

Loading

fblissjr commented Apr 8, 2024 •

edited

Loading

fblissjr commented Apr 8, 2024 •

edited

Loading

Blaizzy commented Apr 9, 2024 •

edited

Loading

Blaizzy commented Apr 9, 2024 •

edited

Loading

Blaizzy commented Apr 10, 2024 •

edited

Loading