
TokenTextSplitter not loading up HF tokenizer from .from_huggingface_tokenizer(); using gpt2 instead #28056

bhavnicksm opened this issue Nov 12, 2024 · 4 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature


@bhavnicksm

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

tts = TokenTextSplitter(chunk_size=256, chunk_overlap=0).from_huggingface_tokenizer(tokenizer)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>

tts = TokenTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=256, chunk_overlap=0)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>

tts = tts.from_huggingface_tokenizer(tokenizer)
print(tts._tokenizer)
# output: <Encoding 'gpt2'>
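
For what it's worth, the cause appears to be that from_huggingface_tokenizer() is inherited from the base TextSplitter and only installs the Hugging Face tokenizer as a length function, while TokenTextSplitter.__init__ always builds its own tiktoken encoding (default "gpt2"), which is what split_text() actually uses. A minimal sketch, assuming the private attributes are still named _length_function and _tokenizer:

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tts = TokenTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=256, chunk_overlap=0)

# The HF tokenizer only lands in the length function...
print(tts._length_function("hello world"))  # token count from the Llama tokenizer
# ...while splitting still goes through the internal tiktoken encoding:
print(tts._tokenizer)  # <Encoding 'gpt2'>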

Error Message and Stack Trace (if applicable)

No response

Description

  • tts._tokenizer should show the Hugging Face tokenizer, not <Encoding 'gpt2'>.
  • The splitter should actually use the Hugging Face tokenizer for splitting, instead of the GPT-2 tiktoken encoding (see the comparison sketch after this list).
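
For comparison, the character-based splitters do honor the Hugging Face tokenizer, at least for measuring chunk sizes, because they call the length function that from_huggingface_tokenizer() installs. A minimal sketch of that behavior (the separator value is just the default, made explicit here):

from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Chunk sizes are measured with the HF tokenizer here; split points are
# still chosen at the separator, not at token boundaries.
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=256, chunk_overlap=0, separator="\n\n"
)
chunks = splitter.split_text("first paragraph\n\nsecond paragraph")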

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
Python Version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]

Package Information

langchain_core: 0.3.16
langchain: 0.3.7
langchain_community: 0.3.6
langsmith: 0.1.139
langchain_experimental: 0.3.3
langchain_huggingface: 0.1.2
langchain_text_splitters: 0.3.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
huggingface-hub: 0.24.7
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
sentence-transformers: 3.2.1
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tokenizers: 0.19.1
transformers: 4.44.2
typing-extensions: 4.12.2

@suifengfengye (Contributor)

Could you show how you create your tokenizer?

@bhavnicksm (Author)

Normally, like it expects me to.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

@suifengfengye (Contributor) commented Nov 16, 2024

@bhavnicksm
Currently, encoding_name needs to be passed to TokenTextSplitter.from_huggingface_tokenizer(); otherwise it defaults to "gpt2". TokenTextSplitter splits with tiktoken, which doesn't support Llama tokenizers. As a workaround, you can count chunk lengths with a custom tokenizer function. Hope this helps.

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter
# replace with your token
access_token = "xxxx"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", token=access_token)

# custom tokenizer function
def custom_tokenizer_length(text):
    tokens = tokenizer.tokenize(text)
    return len(tokens)

# create TokenTextSplitter instance
tts = TokenTextSplitter(chunk_size=256, chunk_overlap=0, length_function=custom_tokenizer_length)
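
One hedged caveat on the sketch above: since TokenTextSplitter.split_text() appears to tokenize with its internal tiktoken encoding regardless of length_function, the custom counter is more naturally paired with a character-based splitter, which does consult the length function. A usage sketch reusing custom_tokenizer_length from above (RecursiveCharacterTextSplitter is my substitution, not part of the original suggestion):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes are measured in Llama tokens, while split points fall on
# natural text boundaries (paragraphs, sentences, words).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, chunk_overlap=0, length_function=custom_tokenizer_length
)
chunks = splitter.split_text("some long document text ...")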

@bhavnicksm (Author)


Thanks for the custom_tokenizer_length approach!

I still think that calling .from_huggingface_tokenizer() should mean the splitter doesn't use tiktoken but the tokenizers tokenizer. It doesn't make sense otherwise.
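
For anyone who wants chunk boundaries on actual Hugging Face token boundaries today, a minimal sketch, assuming the Tokenizer dataclass and split_text_on_tokens helper in langchain_text_splitters.base keep their current signatures:

from transformers import AutoTokenizer
from langchain_text_splitters.base import Tokenizer, split_text_on_tokens

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Wrap the HF tokenizer so chunks are cut every 256 Llama tokens,
# mirroring what TokenTextSplitter does internally with tiktoken.
llama_tokenizer = Tokenizer(
    chunk_overlap=0,
    tokens_per_chunk=256,
    decode=hf_tokenizer.decode,
    encode=lambda text: hf_tokenizer.encode(text, add_special_tokens=False),
)

chunks = split_text_on_tokens(text="some long document text ...", tokenizer=llama_tokenizer)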
