
Add params to truncate documents to length when using LLMs #1539

Merged · 4 commits · Oct 11, 2023

Conversation

@MaartenGr (Owner) commented Sep 21, 2023

This PR truncates documents to a specific length when using any LLM to fine-tune topic representations (#1527). It introduces two parameters:

  • doc_length
    • The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed.
  • tokenizer
    • The tokenizer used to split the document into segments that are counted to determine the document's length.
      • If tokenizer is 'char', then the document is split into characters, which are counted to adhere to doc_length
      • If tokenizer is 'whitespace', then the document is split into words separated by whitespace. These words are counted and truncated depending on doc_length
      • If tokenizer is 'vectorizer', then the internal CountVectorizer is used to tokenize the document. These tokens are counted and truncated depending on doc_length
      • If tokenizer is a callable, then that callable is used to tokenize the document. These tokens are counted and truncated depending on doc_length

This means that the definition of doc_length changes depending on what constitutes a token in the tokenizer parameter. If a token is a character, then doc_length refers to the maximum length in characters; if a token is a word, it refers to the maximum length in words.
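As a quick illustration of that difference, the minimal sketch below mimics the 'char' and 'whitespace' behaviors described above (my own illustration, not the package's internal code):

# Minimal sketch mimicking the described truncation behaviors; not BERTopic's internal code
document = "BERTopic truncates documents before passing them to an LLM"
doc_length = 10

# tokenizer="char": doc_length counts characters
truncated_char = document[:doc_length]                       # "BERTopic t"

# tokenizer="whitespace": doc_length counts whitespace-separated words,
# so this 9-word document is short enough to be left untouched
truncated_whitespace = " ".join(document.split()[:doc_length])

print(truncated_char)
print(truncated_whitespace)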

The example below essentially states that documents cannot be longer than 100 tokens. Anything more than that will be truncated.

It uses tiktoken, which can be installed together with openai as follows:

pip install tiktoken openai bertopic

Example:

import openai
import tiktoken
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Tokenizer
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI(
    model="gpt-3.5-turbo", 
    delay_in_seconds=2, 
    chat=True,
    doc_length=100,
    tokenizer=tokenizer
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
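
From there the model is fit as usual, for example (assuming docs is a list of document strings):

# Fit as usual; `docs` is assumed to be a list of document strings
topics, probs = topic_model.fit_transform(docs)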

@zilch42 (Contributor) commented Sep 25, 2023

Hi Maarten,

Great stuff! Thanks for implementing all of the different split options. I have tested the TextGeneration representation model with llama2 with all of the tokenizer options and it all appears to be working.

The only thing I'm not sure about is the last tokenizer case:

elif isinstance(tokenizer, str):
    truncated_document = f"{tokenizer}".join(document.split(tokenizer)[:doc_length])

It isn't documented as an option, and there is a risk that if a user misspells one of the other string-based options (e.g. tokenizer = "victorizer"), they may get an unexpected result rather than a raised ValueError. Is there a specific use case it is solving? If so, it should probably be documented, and I wonder whether there is any way the user could be more explicit about providing a custom string to split on (or more explicit about the other string-based options). And if not, does it need to be allowed?

I appreciate the truncate_document function too. It makes testing the truncation a lot easier 😄
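
For anyone else wanting to test the truncation directly, here is a sketch of what that could look like, assuming the helper can be imported from bertopic.representation._utils and takes the topic model, doc_length, tokenizer, and document in that order (both the import path and the signature are assumptions on my part, so double-check against the PR):

from bertopic import BERTopic
from bertopic.representation._utils import truncate_document  # assumed import path

topic_model = BERTopic()
document = "a very long document " * 100

# Assumed call order: (topic_model, doc_length, tokenizer, document)
truncated = truncate_document(topic_model, 50, "whitespace", document)
print(len(truncated.split()))  # expect 50 whitespace-separated tokens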

@MaartenGr (Owner, Author) commented
Thanks for the feedback! It seems I forgot to add documentation for that. It was initially created for those who wanted to split tokens on a specific separator instead of whitespace. Having said that, I think I will just remove it. It might make more sense to give additional instructions on using the encode and decode options, which support that type of behavior.
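
For what it's worth, a hedged sketch of that encode/decode route for the custom-separator use case; the class below is my own illustration and not part of this PR:

# Illustrative only: a tiny tokenizer exposing encode/decode so truncation can
# count chunks split on an arbitrary separator instead of whitespace.
class SeparatorTokenizer:
    def __init__(self, separator: str):
        self.separator = separator

    def encode(self, document: str) -> list:
        # "Tokens" are the chunks between separators
        return document.split(self.separator)

    def decode(self, tokens: list) -> str:
        # Reassemble the kept chunks with the same separator
        return self.separator.join(tokens)

# e.g. tokenizer=SeparatorTokenizer("\n") with doc_length counting newline-separated chunks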

@zilch42 (Contributor) commented Sep 26, 2023

Before merging this, see #1545 for another potential use case for the split-on-character-sequence option.

@MaartenGr merged commit 62e97dd into master on Oct 11, 2023
@MaartenGr deleted the llm_doc_truncation branch on May 12, 2024