
TokenTextSplitter not loading up HF tokenizer from .from_huggingface_tokenizer(); using gpt2 instead #28056

bhavnicksm opened this issue Nov 12, 2024 · 4 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature


@bhavnicksm

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

tts = TokenTextSplitter(chunk_size=256, chunk_overlap=0).from_huggingface_tokenizer(tokenizer)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>

tts = TokenTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=256, chunk_overlap=0)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>

tts = tts.from_huggingface_tokenizer(tokenizer)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>
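
For what it's worth, the cause appears to be that from_huggingface_tokenizer() is inherited from the base TextSplitter and only installs the Hugging Face tokenizer as a length function, while TokenTextSplitter.__init__ always builds its own tiktoken encoding (default "gpt2"), which is what split_text() actually uses. A minimal sketch, assuming the private attributes are still named _length_function and _tokenizer:

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tts = TokenTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=256, chunk_overlap=0)

# The HF tokenizer only lands in the length function...
print(tts._length_function("hello world"))  # token count from the Llama tokenizer
# ...while splitting still goes through the internal tiktoken encoding:
print(tts._tokenizer)  # <Encoding 'gpt2'>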

Error Message and Stack Trace (if applicable)

No response

Description

  • tts._tokenizer should show the Hugging Face tokenizer, not <Encoding 'gpt2'>.
  • The splitter should actually use the Hugging Face tokenizer for splitting, instead of the GPT-2 tiktoken encoding (see the comparison sketch after this list).
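
For comparison, the character-based splitters do honor the Hugging Face tokenizer, at least for measuring chunk sizes, because they call the length function that from_huggingface_tokenizer() installs. A minimal sketch of that behavior (the separator value is just the default, made explicit here):

from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Chunk sizes are measured with the HF tokenizer here; split points are
# still chosen at the separator, not at token boundaries.
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=256, chunk_overlap=0, separator="\n\n"
)
chunks = splitter.split_text("first paragraph\n\nsecond paragraph")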

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
Python Version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]

Package Information

langchain_core: 0.3.16
langchain: 0.3.7
langchain_community: 0.3.6
langsmith: 0.1.139
langchain_experimental: 0.3.3
langchain_huggingface: 0.1.2
langchain_text_splitters: 0.3.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
huggingface-hub: 0.24.7
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
sentence-transformers: 3.2.1
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tokenizers: 0.19.1
transformers: 4.44.2
typing-extensions: 4.12.2

@suifengfengye (Contributor)

Could you show how you create your tokenizer?

@bhavnicksm (Author)

Normally, like it expects me to.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

@suifengfengye (Contributor) commented Nov 16, 2024

@bhavnicksm
Currently, encoding_name needs to be passed to TokenTextSplitter.from_huggingface_tokenizer(); otherwise it defaults to "gpt2". TokenTextSplitter splits with tiktoken, which doesn't support Llama tokenizers. As a workaround, you can count chunk lengths with a custom tokenizer function. Hope this helps.

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter
# replace with your token
access_token = "xxxx"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", token=access_token)

# custom tokenizer function
def custom_tokenizer_length(text):
    tokens = tokenizer.tokenize(text)
    return len(tokens)

# create TokenTextSplitter instance
tts = TokenTextSplitter(chunk_size=256, chunk_overlap=0, length_function=custom_tokenizer_length)
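
One hedged caveat on the sketch above: since TokenTextSplitter.split_text() appears to tokenize with its internal tiktoken encoding regardless of length_function, the custom counter is more naturally paired with a character-based splitter, which does consult the length function. A usage sketch reusing custom_tokenizer_length from above (RecursiveCharacterTextSplitter is my substitution, not part of the original suggestion):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes are measured in Llama tokens, while split points fall on
# natural text boundaries (paragraphs, sentences, words).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, chunk_overlap=0, length_function=custom_tokenizer_length
)
chunks = splitter.split_text("some long document text ...")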

@bhavnicksm (Author)


Thanks for the custom_tokenizer_length approach!

I still think that calling .from_huggingface_tokenizer() should mean the splitter doesn't use tiktoken but the tokenizers tokenizer. It doesn't make sense otherwise.
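
For anyone who wants chunk boundaries on actual Hugging Face token boundaries today, a minimal sketch, assuming the Tokenizer dataclass and split_text_on_tokens helper in langchain_text_splitters.base keep their current signatures:

from transformers import AutoTokenizer
from langchain_text_splitters.base import Tokenizer, split_text_on_tokens

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Wrap the HF tokenizer so chunks are cut every 256 Llama tokens,
# mirroring what TokenTextSplitter does internally with tiktoken.
llama_tokenizer = Tokenizer(
    chunk_overlap=0,
    tokens_per_chunk=256,
    decode=hf_tokenizer.decode,
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False),
)

chunks = split_text_on_tokens(text="some long document text ...", tokenizer=llama_tokenizer)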
