
[Bug] cohere model(aya) doesn't seem to produce the correct output #3073

Open
jhlee525 opened this issue Dec 21, 2024 · 0 comments
Labels: bug (Confirmed bugs)
🐛 Bug

I'm trying to use mlc-llm to run Cohere's Aya 8B models.

The model compiles and runs normally, but it generates strange answers: (1) the output is contextually incoherent, and (2) the model never produces the EOS token (or produces it immediately during prefill).

This occurs with both the quantized model (q4f16_1) and the unquantized model (q0f16).

I checked aya-23 8B and aya-expanse 8B; neither produces output of the quality I get from the same models under transformers/PyTorch.

I also checked both Metal (my MacBook) and CUDA (Linux); the behavior is the same.

In my opinion, the Cohere models are not properly ported in mlc-llm.
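One quick check for the missing-EOS symptom is to compare the stop token ids that mlc_llm gen_config wrote into mlc-chat-config.json against the end-of-turn token id from the HF tokenizer. A minimal sketch, not part of the original report; the stop_token_ids field layout is an assumption based on what recent mlc_llm versions emit, adjust if your config differs:

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
# Aya/Command-R chat turns end with <|END_OF_TURN_TOKEN|>, not a generic </s>.
print("tokenizer eos_token:", tok.eos_token, tok.eos_token_id)
print("<|END_OF_TURN_TOKEN|> id:", tok.convert_tokens_to_ids("<|END_OF_TURN_TOKEN|>"))

with open("dist/CohereForAI--aya-expanse-8b-q0f16-MLC/mlc-chat-config.json") as f:
    cfg = json.load(f)
# Assumed layout: conv_template is embedded as a dict carrying stop_token_ids.
print("mlc stop_token_ids:", cfg.get("conv_template", {}).get("stop_token_ids"))

If the ids disagree, the engine would keep decoding past the end of the turn, which matches the runaway-newline output below.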

To Reproduce

Steps to reproduce the behavior:

  1. Compile the model from the CLI as described in the mlc-llm docs:
mlc_llm convert_weight ~/.cache/huggingface/hub/models--CohereForAI--aya-expanse-8b/snapshots/d10159f8405826732641ce11c6892459d447d48c/ --quantization q0f16 -o ./dist/CohereForAI--aya-expanse-8b-q0f16-MLC
mlc_llm gen_config ~/.cache/huggingface/hub/models--CohereForAI--aya-expanse-8b/snapshots/d10159f8405826732641ce11c6892459d447d48c/ --quantization q0f16 --conv-template aya-23 -o dist/CohereForAI--aya-expanse-8b-q0f16-MLC/
mlc_llm compile ./dist/CohereForAI--aya-expanse-8b-q0f16-MLC/mlc-chat-config.json --device cuda -o dist/CohereForAI--aya-expanse-8b-q0f16-MLC/lib.so
  2. Run the model:
from mlc_llm import MLCEngine

# Engine construction was omitted in the original snippet; paths match the compile step above.
engine = MLCEngine(model="./dist/CohereForAI--aya-expanse-8b-q0f16-MLC",
                   model_lib="./dist/CohereForAI--aya-expanse-8b-q0f16-MLC/lib.so")
engine.chat.completions.create(
  messages=[
    {"role":"user","content":"Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz"}
  ], max_tokens=100
)
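(Not part of the original repro, but to see what the engine is scoring at each step, the same call can request per-token logprobs; logprobs/top_logprobs are part of the OpenAI-compatible API that mlc-llm mirrors, as the logprobs=None field in the response below suggests:)

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz"}],
    max_tokens=20,
    logprobs=True,
    top_logprobs=5,  # inspect the top candidates behind the runaway "\n" tokens
)
print(response.choices[0].logprobs)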

The resulting completion:

ChatCompletionResponse(
  id='chatcmpl-6012330bdab54a3aaaa14ae8d9a606a2',
  choices=[
    ChatCompletionResponseChoice(
      finish_reason='length',
      index=0,
      message=ChatCompletionMessage(content='Sevgili [Alıcı Adı,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBenzeri\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', role='assistant', name=None, tool_calls=None, tool_call_id=None), 
      logprobs=None)
    ],
    created=1734774462,
    model=None,
    system_fingerprint='',
    object='chat.completion',
    usage=CompletionUsage(prompt_tokens=46, completion_tokens=100, total_tokens=146, extra={'prompt_tokens': 46, 'completion_tokens': 100, 'prefill_tokens': 16, 'decode_tokens': 99, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 363.73776380479154, 'decode_tokens_per_s': 40.11198637770363, 'end_to_end_latency_s': 2.512077922, 'ttft_s': 0.043987734, 'inter_token_latency_s': 0.02512077922})
)
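Note that the usage stats report prompt_tokens=46. A hedged sketch to cross-check that the aya-23 conv template tokenizes the prompt the same way as the HF chat template; a mismatch here would by itself explain incoherent output:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")
messages = [{"role": "user", "content": "Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz"}]
ids = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
# Compare this count and these ids against the prompt_tokens (46) mlc-llm reports above.
print(len(ids), ids)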

Expected behavior

According to Aya's official model card:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/aya-expanse-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format the message with the chat template
messages = [{"role": "user", "content": "Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids, 
    max_new_tokens=100, 
    do_sample=True, 
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

When I run this on my machine, it produces:

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Anneme onu ne kadar sevdiğimi anlatan bir mektup yaz<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Sevgili Annem,

Bu mektubu, kalbimdeki derin sevgiyi ve minnettarlığımı ifade etmek için yazıyorum. Senin için hissettiğim sevgi kelimelerle tam olarak anlatılamayacak kadar güçlü ve eşsiz. Hayatımın her anında bana verdiğin sevgi, destek ve rehberlik için sonsuz teşekkürlerimi sunuyorum.

Sen, benim için bir kahraman, bir öğretmen ve en iyi arkadaşım old

I don't know Turkish, but this looks like a normal answer. I also tried English and Korean; mlc-llm's output is still broken.
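If the port itself is the problem, the Cohere architecture has a few quirks that a Llama-style implementation would miss, e.g. a logit_scale multiplier on the output logits and tied input/output embeddings. A sketch to print the relevant HF config fields for comparison with mlc-llm's cohere model code; the field list is my assumption based on the Command R family config, hence the defensive getattr:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CohereForAI/aya-expanse-8b")
for field in ("logit_scale", "tie_word_embeddings", "use_qk_norm", "rope_theta"):
    # getattr with a default so this still runs if a field is absent in this config
    print(field, getattr(cfg, field, "<not set>"))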

Environment

  • How you installed MLC-LLM: Build from the source
  • How you installed TVM-Unity: Yes
  • CUDA/cuDNN version (if applicable): 12.2
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
CUDA_VERSION: 12.2
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_OPENCL_EXTN_QCOM: NOT-FOUND
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_THRUST: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM: OFF
USE_OPENCL_GTEST: /path/to/opencl/gtest
TVM_LOG_BEFORE_THROW: OFF
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_MSCCL: OFF
USE_NNAPI_RUNTIME: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: ON
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 7ed4584952546fa5d54366b72a6862f919c18daa
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-12-15 09:56:40 -0500
USE_HIPBLAS: OFF
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: ON
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 18.1.6
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
USE_NNAPI_CODEGEN: OFF
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: none
USE_BNNS: OFF
USE_FLASHINFER: OFF
USE_CUBLAS: ON
USE_METAL: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_NVSHMEM: OFF
USE_HEXAGON_RPC: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /usr/bin/c++
HIDE_PRIVATE_SYMBOLS: OFF

Additional context
