Skip to content

Commit

Permalink
Experimental: Add other threshold types to SemanticChunker (#16807)
Browse files Browse the repository at this point in the history
**Description**
Adding different threshold types to the semantic chunker. I’ve had much
better and predictable performance when using standard deviations
instead of percentiles.


![image](https://github.com/langchain-ai/langchain/assets/44395485/066e84a8-460e-4da5-9fa1-4ff79a1941c5)

For all the documents I’ve tried, the distribution of distances look
similar to the above: positively skewed normal distribution. All skews
I’ve seen are less than 1 so that explains why standard deviations
perform well, but I’ve included IQR if anyone wants something more
robust.

Also, using the percentile method backwards, you can declare the number
of clusters and use semantic chunking to get an ‘optimal’ splitting.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
  • Loading branch information
matthaigh27 and hwchase17 authored Feb 26, 2024
1 parent ce682f5 commit a4896da
Show file tree
Hide file tree
Showing 2 changed files with 284 additions and 25 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 3,
"id": "613d4a3b",
"metadata": {},
"outputs": [],
Expand All @@ -95,7 +95,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 4,
"id": "295ec095",
"metadata": {},
"outputs": [
Expand All @@ -112,12 +112,195 @@
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "9aed73b2",
"metadata": {},
"source": [
"## Breakpoints\n",
"\n",
"This chunker works by determining when to \"break\" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.\n",
"\n",
"There are a few ways to determine what that threshold is.\n",
"\n",
"### Percentile\n",
"\n",
"The default way to split is based on percentile. In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"id": "a9a3b9cd",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = SemanticChunker(\n",
" OpenAIEmbeddings(), breakpoint_threshold_type=\"percentile\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f311e67e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.\n"
]
}
],
"source": [
"docs = text_splitter.create_documents([state_of_the_union])\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5f5930de",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"26\n"
]
}
],
"source": [
"print(len(docs))"
]
},
{
"cell_type": "markdown",
"id": "b6b51104",
"metadata": {},
"source": [
"### Standard Deviation\n",
"\n",
"In this method, any difference greater than X standard deviations is split."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ff5e005c",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = SemanticChunker(\n",
" OpenAIEmbeddings(), breakpoint_threshold_type=\"standard_deviation\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "01b8ffc0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving. And the costs and the threats to America and the world keep rising. That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. The United States is a member along with 29 other nations. It matters. American diplomacy matters. American resolve matters. Putin’s latest attack on Ukraine was premeditated and unprovoked. He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. We countered Russia’s lies with truth. And now that he has acted the free world is holding him accountable. Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. Together with our allies –we are right now enforcing powerful economic sanctions. We are cutting off Russia’s largest banks from the international financial system. Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless. We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come. Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more. The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. We are joining with our European allies to find and seize your yachts your luxury apartments your private jets. We are coming for your ill-begotten gains. And tonight I am announcing that we will join our allies in closing off American air space to all Russian flights – further isolating Russia – and adding an additional squeeze –on their economy. The Ruble has lost 30% of its value. The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. We are giving more than $1 Billion in direct assistance to Ukraine. And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering. Let me be clear, our forces are not engaged and will not engage in conflict with Russian forces in Ukraine. Our forces are not going to Europe to fight in Ukraine, but to defend our NATO Allies – in the event that Putin decides to keep moving west. For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania, and Estonia. As I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power. And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them. Putin has unleashed violence and chaos. But while he may make gains on the battlefield – he will pay a continuing high price over the long run. And a proud Ukrainian people, who have known 30 years of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards. To all Americans, I will be honest with you, as I’ve always promised. A Russian dictator, invading a foreign country, has costs around the world. And I’m taking robust action to make sure the pain of our sanctions is targeted at Russia’s economy. And I will use every tool at our disposal to protect American businesses and consumers. Tonight, I can announce that the United States has worked with 30 other countries to release 60 Million barrels of oil from reserves around the world. America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies. These steps will help blunt gas prices here at home. And I know the news about what’s happening can seem alarming.\n"
]
}
],
"source": [
"docs = text_splitter.create_documents([state_of_the_union])\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "8938a5e3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4\n"
]
}
],
"source": [
"print(len(docs))"
]
},
{
"cell_type": "markdown",
"id": "6897261f",
"metadata": {},
"source": [
"### Interquartile\n",
"\n",
"In this method, the interquartile distance is used to split chunks."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "8977355b",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = SemanticChunker(\n",
" OpenAIEmbeddings(), breakpoint_threshold_type=\"interquartile\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "59a40364",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.\n"
]
}
],
"source": [
"docs = text_splitter.create_documents([state_of_the_union])\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3a0db107",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"25\n"
]
}
],
"source": [
"print(len(docs))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1f65472",
"metadata": {},
"outputs": [],
"source": []
}
],
Expand Down
120 changes: 98 additions & 22 deletions libs/experimental/langchain_experimental/text_splitter.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
import copy
import re
from typing import Any, Iterable, List, Optional, Sequence, Tuple
from typing import Any, Dict, Iterable, List, Literal, Optional, Sequence, Tuple, cast

import numpy as np
from langchain_community.utils.math import (
Expand Down Expand Up @@ -83,6 +83,14 @@ def calculate_cosine_distances(sentences: List[dict]) -> Tuple[List[float], List
return distances, sentences


BreakpointThresholdType = Literal["percentile", "standard_deviation", "interquartile"]
BREAKPOINT_DEFAULTS: Dict[BreakpointThresholdType, float] = {
"percentile": 95,
"standard_deviation": 3,
"interquartile": 1.5,
}


class SemanticChunker(BaseDocumentTransformer):
"""Split the text based on semantic similarity.
Expand All @@ -95,42 +103,110 @@ class SemanticChunker(BaseDocumentTransformer):
sentences, and then merges one that are similar in the embedding space.
"""

def __init__(self, embeddings: Embeddings, add_start_index: bool = False):
def __init__(
self,
embeddings: Embeddings,
add_start_index: bool = False,
breakpoint_threshold_type: BreakpointThresholdType = "percentile",
breakpoint_threshold_amount: Optional[float] = None,
number_of_chunks: Optional[int] = None,
):
self._add_start_index = add_start_index
self.embeddings = embeddings

def split_text(self, text: str) -> List[str]:
self.breakpoint_threshold_type = breakpoint_threshold_type
self.number_of_chunks = number_of_chunks
if breakpoint_threshold_amount is None:
self.breakpoint_threshold_amount = BREAKPOINT_DEFAULTS[
breakpoint_threshold_type
]
else:
self.breakpoint_threshold_amount = breakpoint_threshold_amount

def _calculate_breakpoint_threshold(self, distances: List[float]) -> float:
if self.breakpoint_threshold_type == "percentile":
return cast(
float,
np.percentile(distances, self.breakpoint_threshold_amount),
)
elif self.breakpoint_threshold_type == "standard_deviation":
return cast(
float,
np.mean(distances)
+ self.breakpoint_threshold_amount * np.std(distances),
)
elif self.breakpoint_threshold_type == "interquartile":
q1, q3 = np.percentile(distances, [25, 75])
iqr = q3 - q1

return np.mean(distances) + self.breakpoint_threshold_amount * iqr
else:
raise ValueError(
f"Got unexpected `breakpoint_threshold_type`: "
f"{self.breakpoint_threshold_type}"
)

def _threshold_from_clusters(self, distances: List[float]) -> float:
"""
Calculate the threshold based on the number of chunks.
Inverse of percentile method.
"""
if self.number_of_chunks is None:
raise ValueError(
"This should never be called if `number_of_chunks` is None."
)
x1, y1 = len(distances), 0.0
x2, y2 = 1.0, 100.0

x = max(min(self.number_of_chunks, x1), x2)

# Linear interpolation formula
y = y1 + ((y2 - y1) / (x2 - x1)) * (x - x1)
y = min(max(y, 0), 100)

return cast(float, np.percentile(distances, y))

def _calculate_sentence_distances(
self, single_sentences_list: List[str]
) -> Tuple[List[float], List[dict]]:
"""Split text into multiple components."""
# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r"(?<=[.?!])\s+", text)

# having len(single_sentences_list) == 1 would cause the following
# np.percentile to fail.
if len(single_sentences_list) == 1:
return single_sentences_list

sentences = [
_sentences = [
{"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
]
sentences = combine_sentences(sentences)
sentences = combine_sentences(_sentences)
embeddings = self.embeddings.embed_documents(
[x["combined_sentence"] for x in sentences]
)
for i, sentence in enumerate(sentences):
sentence["combined_sentence_embedding"] = embeddings[i]
distances, sentences = calculate_cosine_distances(sentences)
start_index = 0

# Create a list to hold the grouped sentences
chunks = []
breakpoint_percentile_threshold = 95
breakpoint_distance_threshold = np.percentile(
distances, breakpoint_percentile_threshold
) # If you want more chunks, lower the percentile cutoff
return calculate_cosine_distances(sentences)

def split_text(
self,
text: str,
) -> List[str]:
# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r"(?<=[.?!])\s+", text)

# having len(single_sentences_list) == 1 would cause the following
# np.percentile to fail.
if len(single_sentences_list) == 1:
return single_sentences_list
distances, sentences = self._calculate_sentence_distances(single_sentences_list)
if self.number_of_chunks is not None:
breakpoint_distance_threshold = self._threshold_from_clusters(distances)
else:
breakpoint_distance_threshold = self._calculate_breakpoint_threshold(
distances
)

indices_above_thresh = [
i for i, x in enumerate(distances) if x > breakpoint_distance_threshold
] # The indices of those breakpoints on your list
]

chunks = []
start_index = 0

# Iterate through the breakpoints to slice the sentences
for index in indices_above_thresh:
Expand Down

0 comments on commit a4896da

Please sign in to comment.