Merge pull request #27 from MadBomber/split-on-sentences
split-on-sentences
moekiorg authored Sep 23, 2024
2 parents 354167c + 570556a commit 15ae138
Showing 3 changed files with 56 additions and 0 deletions.
1 change: 1 addition & 0 deletions lib/baran.rb
@@ -5,6 +5,7 @@
require_relative "baran/markdown_splitter"
require_relative "baran/recursive_character_text_splitter"
require_relative "baran/character_text_splitter"
require_relative "baran/sentence_text_splitter"

module Baran
class Error < StandardError; end
14 changes: 14 additions & 0 deletions lib/baran/sentence_text_splitter.rb
@@ -0,0 +1,14 @@
# frozen_string_literal: true

module Baran
  class SentenceTextSplitter < TextSplitter
    def initialize(chunk_size: 1024, chunk_overlap: 64)
      super(chunk_size: chunk_size, chunk_overlap: chunk_overlap)
    end

    def splitted(text)
      # Scan for runs of text ending in one or more sentence terminators (. ! ?)
      # followed by whitespace or the end of the string, so that a final sentence
      # with no trailing whitespace is not dropped.
      text.scan(/[^.!?]+[.!?]+(?:\s+|\z)/).map(&:strip)
    end
  end
end
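
To see what the scan does in isolation, here is a minimal standalone sketch that does not depend on Baran. `SENTENCE_PATTERN` and `split_sentences` are hypothetical names used only for illustration. The pattern accepts end-of-string as well as whitespace after the terminator, which is an assumption about the desired behavior: it keeps a final sentence that has no trailing newline.

```ruby
# Hypothetical standalone helper; not part of the Baran gem.
# One or more non-terminator characters, then one or more terminators,
# then either whitespace or the end of the string.
SENTENCE_PATTERN = /[^.!?]+[.!?]+(?:\s+|\z)/

def split_sentences(text)
  # String#scan returns every non-overlapping match; strip removes the
  # trailing whitespace that the pattern consumed.
  text.scan(SENTENCE_PATTERN).map(&:strip)
end

sample = "The pail went flying! Was the water spilled?\nNo, it was not."
split_sentences(sample).each { |sentence| puts sentence }
```

Note that text containing no terminator at all produces no matches with this approach, so callers may want a fallback for such input.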
41 changes: 41 additions & 0 deletions test/test_sentence_text_spliter.rb
@@ -0,0 +1,41 @@
require 'minitest/autorun'
require 'baran'

class TestSentenceTextSplitter < Minitest::Test
  def setup
    @splitter = Baran::SentenceTextSplitter.new(chunk_size: 10, chunk_overlap: 5)
  end

  def test_chunks
    story = <<~TEXT
      Jack and Jill
      went up the hill to fetch
      a pail of water. Jack fell
      down and broke his crown and Jill
      came tumbling after.
      The pail went flying! Was the water spilled?
      No, the water was splashed on Bo Peep.
    TEXT

    chunks = @splitter.chunks(story)

    # Collapse line breaks inside each chunk so sentences compare as single lines.
    sentences = chunks.map { |chunk| chunk[:text].gsub(/\s+/, ' ').strip }

    expected = [
      "Jack and Jill went up the hill to fetch a pail of water.",
      "Jack fell down and broke his crown and Jill came tumbling after.",
      "The pail went flying!",
      "Was the water spilled?",
      "No, the water was splashed on Bo Peep."
    ]

    # Minitest's signature is assert_equal(expected, actual).
    assert_equal(expected, sentences)
  end
end
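
The test collapses whitespace before comparing because a sentence that spans several input lines keeps its internal line breaks after splitting. The following standalone sketch illustrates that step without loading Baran. `PIECE_PATTERN`, `sentence_pieces`, and `normalize_whitespace` are hypothetical names, and the pattern is modeled on the splitter's regex rather than taken from the gem's API.

```ruby
# Hypothetical helpers for illustration; not part of the Baran gem.
PIECE_PATTERN = /[^.!?]+[.!?]+(?:\s+|\z)/

def sentence_pieces(text)
  # Each match still contains any newlines that fall inside the sentence.
  text.scan(PIECE_PATTERN).map(&:strip)
end

def normalize_whitespace(sentence)
  # Replace every run of whitespace (including newlines) with one space.
  sentence.gsub(/\s+/, " ").strip
end

raw = "Jack fell\ndown and broke his crown.\nThe pail went flying!\n"

sentence_pieces(raw).each do |piece|
  p piece                       # inspect form shows the embedded "\n"
  p normalize_whitespace(piece) # single-line form used for comparison
end
```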
