
Tokenizer issue with Vicuna V1.1, EOS, BOS tokens seem to be blank #408

Closed

SupreethRao99 opened this issue Apr 13, 2023 · 13 comments

@SupreethRao99
Hello,

When I try to get the BOS and EOS tokens from the tokenizer, I'm getting '' for both. I tried it with both AutoTokenizer and LlamaTokenizer.

>>> tokenizer.eos_token
''
>>> tokenizer.bos_token
''

The documentation on Hugging Face says that the EOS token is "</s>", but I suspect that is not the case here, since this is the special_tokens_map.json file:

{
  "bos_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
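
For reference, here is a minimal way to reproduce the check (the path is a placeholder for my converted weight folder):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("/path/to/vicuna-7b", use_fast=False)
>>> tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token
('', '', '')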

Could anyone tell me if they're experiencing the same issue, and whether it might be an error?

@christianwengert

I have the following from within the code (debugging):

tokenizer.eos_token
'</s>'
tokenizer.bos_token
'<s>'

But on my system, once I ask a question, the ASSISTANT goes on with the conversation forever on its own. So I believe there is something odd with those tokenizers.
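
For anyone else debugging this, a quick sanity check (a sketch; in LLaMA's SentencePiece vocabulary, <s> and </s> normally map to ids 1 and 2):

>>> tokenizer.bos_token_id, tokenizer.eos_token_id
(1, 2)
>>> tokenizer.convert_ids_to_tokens([1, 2])
['<s>', '</s>']

If these don't match, the tokenizer config is likely the culprit rather than the model weights.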

@merrymercy
Member

merrymercy commented Apr 13, 2023

Could you try the following steps?

  1. Update the Hugging Face transformers library to the latest main branch
  2. Redo the weight conversion following https://huggingface.co/docs/transformers/main/model_doc/llama
  3. Apply the delta with the latest FastChat

Hugging Face made some changes to the LLaMA tokenizer recently.
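
Concretely, those two conversion steps look roughly like this (a sketch with placeholder paths; the convert script ships in the transformers repo, and the apply_delta flags are the same ones used later in this thread):

# 1. Convert the raw LLaMA weights to the Hugging Face format
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/llama --model_size 7B --output_dir /path/to/llama-7b-hf

# 2. Apply the Vicuna delta with the latest FastChat
python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-7b-hf \
    --target /path/to/vicuna-7b \
    --delta /path/to/vicuna-7b-delta-v1.1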

@christianwengert

OK, I upgraded:

  • FastChat to the latest version
  • Hugging Face transformers to the latest version
  • reapplied the delta with the latest FastChat

And I still get the same problem, i.e. the assistant carries out the whole conversation between assistant and user on its own

:(

@merrymercy
Member

What weight version did you use? V0 or V1.1?
Please check their differences and fschat version compatibility here:
https://github.com/lm-sys/FastChat/blob/main/docs/weights_version.md

Could you share your chat history so we can know what happened?

@phnessu4

I have the same issue with Vicuna V1.1

@suc16

suc16 commented Apr 14, 2023

I have the same issue with fastchat 0.2.1. I have tried updating the Hugging Face transformers library and restarting the workers, but it still doesn't work.
Vicuna v0 and Vicuna v1.1 both have this issue.
Only when I changed the fschat version to 0.1.10 was the problem solved.
@merrymercy

@bash99

bash99 commented Apr 14, 2023

I'm fine with Vicuna v1.0 and fastchat 0.2.1, but my model was converted on 0.1.9.
And I have the same problem with v1.1, which was converted under 0.2.1.

@christianwengert

New models and v0.1.10 work for me

@suc16

suc16 commented Apr 14, 2023

> New models and v0.1.10 work for me

So:

  • fschat-v0.1.10 + vicuna-7b-v0 works
  • fschat-v0.1.10 + vicuna-7b-v1.1 works
  • fschat-v0.2.1 + vicuna-7b-v0 does not work
  • fschat-v0.2.1 + vicuna-7b-v1.1 does not work

@merrymercy
Member

I guess the blank EOS/BOS is not only related to FastChat or the Vicuna weights; it is also related to how you converted the base LLaMA model.

I suggest you use transformers>=4.28.0 and redo the weight conversion. For either v0 or v1.1, you should get a file named "special_tokens_map.json" in your converted weights, with the same content as this file: https://huggingface.co/lmsys/vicuna-13b-delta-v0/blob/main/special_tokens_map.json. If not, please copy special_tokens_map.json and tokenizer_config.json from https://huggingface.co/lmsys/vicuna-13b-delta-v0/tree/main into your converted weight folder (this works for both v0 and v1.1).
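
If you want to do that copy programmatically, a minimal sketch using huggingface_hub (the target folder is a placeholder for your converted weights):

import shutil
from huggingface_hub import hf_hub_download

target = "/path/to/vicuna-7b"  # placeholder: your converted weight folder
for name in ("special_tokens_map.json", "tokenizer_config.json"):
    # download the known-good file from the delta repo, then copy it in
    path = hf_hub_download(repo_id="lmsys/vicuna-13b-delta-v0", filename=name)
    shutil.copy(path, f"{target}/{name}")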

In terms of compatibility:
The v1.1 weights work best with fschat>=0.2.1, but also work with older fschat.
The v0 weights work best with fschat==0.1.10 and do not work with newer fschat unless you explicitly specify the conversation template:

python3 -m fastchat.serve.cli --model-path /path/to/vicuna-v0 --conv-template conv_one_shot

@phnessu4

Redownloading the models and redoing the conversion will fix this.

@suc16

suc16 commented Apr 17, 2023

Thanks.
My environment:
python 3.9
transformers 4.28.1
fschat 0.2.2

After applying the delta with the latest FastChat, I still got the blank EOS/BOS in special_tokens_map.json:
python3 -m fastchat.model.apply_delta --base /data/models/llama-7b-hf --target /data/models/vicuna-7b --delta /data/models/vicuna-7b-delta-v1.1

The problem was solved after copying special_tokens_map.json and tokenizer_config.json.

Package Version


accelerate 0.18.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 4.2.2
anyio 3.6.2
appdirs 1.4.4
async-timeout 4.0.2
attrs 22.2.0
certifi 2022.12.7
charset-normalizer 3.1.0
click 8.1.3
cmake 3.26.3
contourpy 1.0.7
cycler 0.11.0
docker-pycreds 0.4.0
entrypoints 0.4
fastapi 0.95.0
ffmpy 0.3.0
filelock 3.11.0
fonttools 4.39.3
frozenlist 1.3.3
fschat 0.2.2
fsspec 2023.4.0
gitdb 4.0.10
GitPython 3.1.31
gradio 3.23.0
h11 0.14.0
httpcore 0.17.0
httpx 0.24.0
huggingface-hub 0.13.4
idna 3.4
importlib-resources 5.12.0
Jinja2 3.1.2
jsonschema 4.17.3
kiwisolver 1.4.4
linkify-it-py 2.0.0
lit 16.0.1
markdown-it-py 2.2.0
markdown2 2.4.8
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
networkx 3.1
numpy 1.24.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
orjson 3.8.10
packaging 23.0
pandas 2.0.0
pathtools 0.1.2
Pillow 9.5.0
pip 23.0.1
prompt-toolkit 3.0.38
protobuf 4.22.1
psutil 5.9.4
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.0
pyparsing 3.0.9
pyrsistent 0.19.3
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
regex 2023.3.23
requests 2.28.2
rich 13.3.3
semantic-version 2.10.0
sentencepiece 0.1.97
sentry-sdk 1.19.1
setproctitle 1.3.2
setuptools 65.6.3
shortuuid 1.0.11
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
starlette 0.26.1
svgwrite 1.4.3
sympy 1.11.1
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.0
tqdm 4.65.0
transformers 4.28.1
triton 2.0.0
typing_extensions 4.5.0
tzdata 2023.3
uc-micro-py 1.0.1
urllib3 1.26.15
uvicorn 0.21.1
wandb 0.14.2
wavedrom 2.0.3.post3
wcwidth 0.2.6
websockets 11.0.1
wheel 0.38.4
yarl 1.8.2
zipp 3.15.0

@merrymercy

@SupreethRao99
Author

Thanks everyone, converting the LLaMA weights using the new converter from Hugging Face and then applying the Vicuna v1.1 delta worked out of the box.
