
Simple text chunking #188

Merged
merged 7 commits into main from initial-text-chunking on Jun 18, 2023

Conversation

@andreibondarev (Collaborator) commented on Jun 17, 2023

Simple text chunking using the baran gem.

Example usage:

irb(main):001:0> Langchain::Loader.new("spec/fixtures/loaders/sample.docx").load.chunks
=> [{:text=>"Lorem ipsum \n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.", :cursor=>0}, {:text=>"Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada 
... }]

or

irb(main):019:0* text    = Langchain::Loader.load('/Users/andrei/Downloads/agreement.pdf')
irb(main):020:0> chunker = Langchain::Chunker::Text.new(text.value)
irb(main):021:0> chunker.chunks
#=> [{ cursor: , text: }, { cursor: , text: }, ...]
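
Under the hood this delegates to the baran gem's text splitters. A minimal sketch of that usage, assuming baran's documented splitter API (the chunk_size and chunk_overlap values below are illustrative, not necessarily the PR's defaults):

require "baran"

# Split on paragraph, line, then word boundaries, keeping each chunk under
# ~1000 characters with a 200-character overlap between neighbors
# (illustrative values, not the library defaults).
splitter = Baran::RecursiveCharacterTextSplitter.new(
  chunk_size: 1000,
  chunk_overlap: 200,
  separators: ["\n\n", "\n", " "]
)

splitter.chunks("Lorem ipsum dolor sit amet, consectetur adipiscing elit...")
# => [{ cursor: 0, text: "Lorem ipsum dolor sit amet..." }, ...]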

@alchaplinsky (Contributor) left a comment


Nice 👍

@andreibondarev marked this pull request as ready for review on June 17, 2023 22:36
@andreibondarev (Collaborator, Author) commented:

@alchaplinsky We should figure out how the output of this chunks method plugs into add_data or add_texts when indexing data into vector search DBs.
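
One possible shape for that glue, as a sketch only (it assumes the vectorsearch clients' add_texts accepts a texts: array of plain strings; the mapping from chunk hashes to those strings is exactly the open question here):

# `store` is a previously configured Langchain::Vectorsearch client
# (e.g. Chroma); the document path is hypothetical.
text = Langchain::Loader.load("path/to/document.pdf")
chunks = Langchain::Chunker::Text.new(text.value).chunks

# Drop the :cursor metadata and index just the text of each chunk.
store.add_texts(texts: chunks.map { |chunk| chunk[:text] })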

@alchaplinsky (Contributor) commented:

@andreibondarev Do different vector search databases require different chunk sizes?

@andreibondarev (Collaborator, Author) replied:

> Do different vector search databases require different chunk sizes?

It's not a concern at the vector search DB level; it's a concern for LLMs.

For example, when data is added to Chroma (and almost all other vector search DBs work the same way), the text itself is stored along with its associated embedding.

If we generate a single embedding from the whole large text, then that whole text gets retrieved and passed to the LLM to synthesize the answer. The prompt literally looks like this:

Context:
{context} # The whole text here

Question: {question}

Answer:

The whole text might exceed the LLM's context window, so it needs to be split into smaller chunks first.
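
As a rough sketch of the retrieval flow that chunking enables (similarity_search is part of the existing vectorsearch interface, but the result shape shown here is an assumption, since it varies by DB):

# `store` is a previously configured Langchain::Vectorsearch client;
# the query string is a hypothetical example.
question = "What is the termination clause?"

# Retrieve only the most relevant chunks instead of the whole document,
# keeping the assembled prompt within the LLM's context window.
matches = store.similarity_search(query: question, k: 4)
context = matches.map { |match| match[:text] }.join("\n\n")

prompt = <<~PROMPT
  Context:
  #{context}

  Question: #{question}

  Answer:
PROMPT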

@andreibondarev merged commit 1a52a6a into main on Jun 18, 2023
@andreibondarev deleted the initial-text-chunking branch on June 18, 2023 17:39