Add params to truncate documents to length when using LLMs #1539
Truncate documents to a specific length when using any LLM for fine-tuning topic representations (#1527). It introduces two parameters:

* `doc_length` – The maximum length of a document. Anything longer than that will be truncated.
* `tokenizer` – The tokenizer used to count a document's length:
    * If `tokenizer` is `'char'`, the document is split up into characters, which are counted to adhere to `doc_length`.
    * If `tokenizer` is `'whitespace'`, the document is split up into words separated by whitespace. These words are counted and truncated depending on `doc_length`.
    * If `tokenizer` is `'vectorizer'`, the internal CountVectorizer is used to tokenize the document. These tokens are counted and truncated depending on `doc_length`.
This means that the definition of `doc_length` changes depending on what constitutes a token in the `tokenizer` parameter. If a token is a character, then `doc_length` refers to the maximum length in characters; if a token is a word, then `doc_length` refers to the maximum length in words. The example below essentially states that documents cannot be longer than 100 tokens; anything more than that will be truncated.
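As a simplified sketch of what the string options mean (not BERTopic's exact internals), the `'char'` and `'whitespace'` cases could look like this; `'vectorizer'` would instead tokenize with the tokenizer built by the internal CountVectorizer:

```python
def truncate(document, doc_length, tokenizer):
    """Sketch of doc_length semantics for the 'char' and 'whitespace' options."""
    if doc_length is None:
        return document  # no truncation requested
    if tokenizer == "char":
        # doc_length = maximum number of characters
        return document[:doc_length]
    if tokenizer == "whitespace":
        # doc_length = maximum number of whitespace-separated words
        return " ".join(document.split()[:doc_length])
    raise ValueError(f"Unknown tokenizer: {tokenizer!r}")

print(truncate("abcdef", 3, "char"))                    # → "abc"
print(truncate("one two three four", 2, "whitespace"))  # → "one two"
```

The same `doc_length` value thus truncates very differently depending on the chosen `tokenizer`.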
It uses tiktoken, which can be installed together with openai as follows:
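The install command itself is not shown in this excerpt; for both packages it would presumably be:

```shell
pip install tiktoken openai
```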
Example:
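The original example is not included in this excerpt. As a stand-in, the sketch below mimics only the encode/decode interface that a tiktoken encoder exposes (real tiktoken encodings require downloading an encoding file), to illustrate what "no longer than 100 tokens" means when a tokenizer object is supplied; the name `truncate_with_tokenizer` and the toy encoder are hypothetical, not BERTopic's API:

```python
class ToyEncoder:
    """Toy tokenizer: one token per whitespace-separated word.
    A real tiktoken encoder maps text to BPE token ids instead."""
    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)

def truncate_with_tokenizer(document, doc_length, tokenizer):
    # Token-based truncation: encode, keep at most doc_length tokens, decode back.
    tokens = tokenizer.encode(document)
    return tokenizer.decode(tokens[:doc_length])

doc = " ".join(f"w{i}" for i in range(150))          # a 150-token document
truncated = truncate_with_tokenizer(doc, 100, ToyEncoder())
print(len(ToyEncoder().encode(truncated)))           # → 100
```

With the real packages, the encoder passed in would come from tiktoken rather than a toy class, so `doc_length` would count model tokens instead of words.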