
Add params to truncate documents to length when using LLMs #1539

Merged · 4 commits · Oct 11, 2023

Conversation

@MaartenGr (Owner) commented Sep 21, 2023

This PR truncates documents to a specific length when using any LLM to fine-tune topic representations (#1527). It introduces two parameters:

  • doc_length
    • The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed.
  • tokenizer
    • The tokenizer used to split the document into segments that are counted to determine the document's length.
      • If tokenizer is 'char', then the document is split into characters, which are counted to adhere to doc_length
      • If tokenizer is 'whitespace', then the document is split into words separated by whitespace. These words are counted and truncated depending on doc_length
      • If tokenizer is 'vectorizer', then the internal CountVectorizer is used to tokenize the document. These tokens are counted and truncated depending on doc_length
      • If tokenizer is a callable, then that callable is used to tokenize the document. These tokens are counted and truncated depending on doc_length

This means that the definition of doc_length changes depending on what constitutes a token in the tokenizer parameter. If a token is a character, then doc_length refers to the maximum length in characters; if a token is a word, it refers to the maximum length in words.
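As a quick illustration of that difference, the minimal sketch below mimics the 'char' and 'whitespace' behaviors described above (my own illustration, not the package's internal code):

# Minimal sketch mimicking the described truncation behaviors; not BERTopic's internal code
document = "BERTopic truncates documents before passing them to an LLM"
doc_length = 10

# tokenizer="char": doc_length counts characters
truncated_char = document[:doc_length]                       # "BERTopic t"

# tokenizer="whitespace": doc_length counts whitespace-separated words,
# so this 9-word document is short enough to be left untouched
truncated_whitespace = " ".join(document.split()[:doc_length])

print(truncated_char)
print(truncated_whitespace)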

The example below essentially states that documents cannot be longer than 100 tokens. Anything more than that will be truncated.

It uses tiktoken, which can be installed together with openai as follows:

pip install tiktoken openai bertopic

Example:

import openai
import tiktoken
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Tokenizer
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI(
    model="gpt-3.5-turbo", 
    delay_in_seconds=2, 
    chat=True,
    doc_length=100,
    tokenizer=tokenizer
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
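
From there the model is fit as usual, for example (assuming docs is a list of document strings):

# Fit as usual; `docs` is assumed to be a list of document strings
topics, probs = topic_model.fit_transform(docs)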

@zilch42 (Contributor) commented Sep 25, 2023

Hi Maarten,

Great stuff! Thanks for implementing all of the different split options. I have tested the TextGeneration representation model with llama2 with all of the tokenizer options and it all appears to be working.

The only thing I'm not sure about is the last tokenizer case:

elif isinstance(tokenizer, str):
    truncated_document = f"{tokenizer}".join(document.split(tokenizer)[:doc_length])

It isn't documented as an option, and there is a risk that if a user misspells one of the other string-based options (e.g. tokenizer = "victorizer"), they may get an unexpected result rather than a raised ValueError. Is there a specific use case it is solving? If so, it should probably be documented, and I wonder whether there is any way the user could be more explicit about providing a custom string to split on (or more explicit about the other string-based options). And if not, does it need to be allowed?

I appreciate the truncate_document function too. It makes testing the truncation a lot easier 😄
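
For anyone else wanting to test the truncation directly, here is a sketch of what that could look like, assuming the helper can be imported from bertopic.representation._utils and takes the topic model, doc_length, tokenizer, and document in that order (both the import path and the signature are assumptions on my part, so double-check against the PR):

from bertopic import BERTopic
from bertopic.representation._utils import truncate_document  # assumed import path

topic_model = BERTopic()
document = "a very long document " * 100

# Assumed call order: (topic_model, doc_length, tokenizer, document)
truncated = truncate_document(topic_model, 50, "whitespace", document)
print(len(truncated.split()))  # expect 50 whitespace-separated tokens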

@MaartenGr (Owner, Author) commented
Thanks for the feedback! It seems I forgot to add documentation for that. It was initially created for those who wanted to split tokens on a specific separator instead of whitespace. Having said that, I think I will just remove it. It might make more sense to give additional instructions on using the encode and decode options, which support that type of behavior.
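
For what it's worth, a hedged sketch of that encode/decode route for the custom-separator use case; the class below is my own illustration and not part of this PR:

# Illustrative only: a tiny tokenizer exposing encode/decode so truncation can
# count chunks split on an arbitrary separator instead of whitespace.
class SeparatorTokenizer:
    def __init__(self, separator: str):
        self.separator = separator

    def encode(self, document: str) -> list:
        # "Tokens" are the chunks between separators
        return document.split(self.separator)

    def decode(self, tokens: list) -> str:
        # Reassemble the kept chunks with the same separator
        return self.separator.join(tokens)

# e.g. tokenizer=SeparatorTokenizer("\n") with doc_length counting newline-separated chunks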

@zilch42 (Contributor) commented Sep 26, 2023

Before merging this, see #1545 for another potential use case for the split-on-character-sequence option.

@MaartenGr merged commit 62e97dd into master on Oct 11, 2023
@MaartenGr deleted the llm_doc_truncation branch on May 12, 2024