
Simple text chunking #188

Merged
merged 7 commits into main from initial-text-chunking on Jun 18, 2023

Conversation

@andreibondarev (Collaborator) commented on Jun 17, 2023

Simple text chunking using the baran gem.

Example usage:

irb(main):001:0> Langchain::Loader.new("spec/fixtures/loaders/sample.docx").load.chunks
=> [{:text=>"Lorem ipsum \n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.", :cursor=>0}, {:text=>"Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada 
... }]

or

irb(main):019:0* text    = Langchain::Loader.load('/Users/andrei/Downloads/agreement.pdf')
irb(main):020:0> chunker = Langchain::Chunker::Text.new(text.value)
irb(main):021:0> chunker.chunks
#=> [{ cursor: , text: }, { cursor: , text: }, ...]
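
Under the hood this delegates to the baran gem's text splitters. A minimal sketch of that usage, assuming baran's documented splitter API (the chunk_size and chunk_overlap values below are illustrative, not necessarily the PR's defaults):

require "baran"

# Split on paragraph, line, then word boundaries, keeping each chunk under
# ~1000 characters with a 200-character overlap between neighbors
# (illustrative values, not the library defaults).
splitter = Baran::RecursiveCharacterTextSplitter.new(
  chunk_size: 1000,
  chunk_overlap: 200,
  separators: ["\n\n", "\n", " "]
)

splitter.chunks("Lorem ipsum dolor sit amet, consectetur adipiscing elit...")
# => [{ cursor: 0, text: "Lorem ipsum dolor sit amet..." }, ...]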

@alchaplinsky (Contributor) left a comment


Nice 👍

@andreibondarev marked this pull request as ready for review on June 17, 2023 22:36
@andreibondarev (Collaborator, Author) commented:

@alchaplinsky We should figure out how the output of this chunks method plugs into add_data or add_texts when indexing data into vector search DBs.
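
One possible shape for that glue, as a sketch only (it assumes the vectorsearch clients' add_texts accepts a texts: array of plain strings; the mapping from chunk hashes to those strings is exactly the open question here):

# `store` is a previously configured Langchain::Vectorsearch client
# (e.g. Chroma); the document path is hypothetical.
text = Langchain::Loader.load("path/to/document.pdf")
chunks = Langchain::Chunker::Text.new(text.value).chunks

# Drop the :cursor metadata and index just the text of each chunk.
store.add_texts(texts: chunks.map { |chunk| chunk[:text] })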

@alchaplinsky (Contributor) commented:

@andreibondarev Do different vector search databases require different chunk sizes?

@andreibondarev (Collaborator, Author) replied:

> Do different vector search databases require different chunk sizes?

It's not a concern at the vector search DB level; it's a concern for LLMs.

For example, when data is added to Chroma (and almost all other vector search DBs work the same way), the text itself is stored along with its associated embedding.

If we generate a single embedding from the whole large text, then that whole text gets retrieved and passed to the LLM to synthesize the answer. The prompt literally looks like this:

Context:
{context} # The whole text here

Question: {question}

Answer:

The whole text might exceed the LLM's context window, so it needs to be split into smaller chunks first.
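
As a rough sketch of the retrieval flow that chunking enables (similarity_search is part of the existing vectorsearch interface, but the result shape shown here is an assumption, since it varies by DB):

# `store` is a previously configured Langchain::Vectorsearch client;
# the query string is a hypothetical example.
question = "What is the termination clause?"

# Retrieve only the most relevant chunks instead of the whole document,
# keeping the assembled prompt within the LLM's context window.
matches = store.similarity_search(query: question, k: 4)
context = matches.map { |match| match[:text] }.join("\n\n")

prompt = <<~PROMPT
  Context:
  #{context}

  Question: #{question}

  Answer:
PROMPT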

@andreibondarev merged commit 1a52a6a into main on Jun 18, 2023
@andreibondarev deleted the initial-text-chunking branch on June 18, 2023 17:39