Text splitter for Markdown files by header (langchain-ai#5860)
This creates a new kind of text splitter for markdown files.

The user can supply a set of headers that they want to split the file
on.

We define a new text splitter class, `MarkdownHeaderTextSplitter`, that
does a few things:

(1) For each line, it determines the associated set of user-specified
headers
(2) It groups lines with common headers into splits

See notebook for example usage and test cases.
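
The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code added in this commit: the function name `split_by_headers` is hypothetical, a header line is assumed to be a marker followed by a space, and merged lines are joined with `" \n"` to mirror the notebook output.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str,
    headers_to_split_on: List[Tuple[str, str]],
    return_each_line: bool = False,
) -> List[Dict]:
    """Illustrative sketch only; not the implementation from this commit."""
    # Try longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    level_of = {name: len(sep) for sep, name in headers_to_split_on}

    active: Dict[str, str] = {}  # header name -> title currently in effect
    lines: List[Dict] = []

    # Step 1: tag every non-empty content line with the headers in effect.
    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        for sep, name in markers:
            if line.startswith(sep + " "):
                # A new header closes every header at the same or a deeper level.
                depth = len(sep)
                active = {k: v for k, v in active.items() if level_of[k] < depth}
                active[name] = line[len(sep):].strip()
                break
        else:
            lines.append({"content": line, "metadata": dict(active)})

    if return_each_line:
        return lines

    # Step 2: merge consecutive lines that share identical header metadata.
    chunks: List[Dict] = []
    for line in lines:
        if chunks and chunks[-1]["metadata"] == line["metadata"]:
            chunks[-1]["content"] += " \n" + line["content"]
        else:
            chunks.append(line)
    return chunks


md = "# Foo\n\n ## Bar\n\nHi this is Jim \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly"
for chunk in split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk)
```

Passing `return_each_line=True` skips the merge in step 2 and returns one record per content line.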
rlancemartin authored and Undertone0809 committed Jun 19, 2023
1 parent 38e5e25 commit d7de7b8
Showing 2 changed files with 474 additions and 0 deletions.
@@ -0,0 +1,324 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "70e9b619",
"metadata": {},
"source": [
"# MarkdownHeaderTextSplitter\n",
"\n",
"The objective is to split a markdown file by a specified set of headers.\n",
" \n",
"**Given this example:**\n",
"\n",
"# Foo\n",
"\n",
"## Bar\n",
"\n",
"Hi this is Jim \n",
"Hi this is Joe\n",
"\n",
"## Baz\n",
"\n",
"Hi this is Molly\n",
" \n",
"**Written as:**\n",
"\n",
"```\n",
"md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
"```\n",
"\n",
"**If we want to split on specified headers:**\n",
"```\n",
"[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n",
"```\n",
"\n",
"**Then we expect:** \n",
"```\n",
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
"```\n",
"\n",
"**Options:**\n",
" \n",
"This also includes `return_each_line` in case a user want to perform other types of aggregation. \n",
"\n",
"If `return_each_line=True`, each line and associated header metadata are returned. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "19c044f0",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import MarkdownHeaderTextSplitter"
]
},
{
"cell_type": "markdown",
"id": "ec8d8053",
"metadata": {},
"source": [
"`Test case 1`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5cd0a66c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Hi this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
]
}
],
"source": [
"# Doc\n",
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
" \n",
"# Test case 1\n",
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
"]\n",
"\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=True)\n",
"\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "67d25a1c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
]
}
],
"source": [
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "f1f74dfa",
"metadata": {},
"source": [
"`Test case 2`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2183c96a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Text under H3.', 'metadata': {'Header 1': 'H1', 'Header 2': 'H2', 'Header 3': 'H3'}}\n",
"{'content': 'Text under H2_2.', 'metadata': {'Header 1': 'H1_2', 'Header 2': 'H2_2'}}\n"
]
}
],
"source": [
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
" (\"###\", \"Header 3\"),\n",
"]\n",
"markdown_document = '# H1\\n\\n## H2\\n\\n### H3\\n\\nText under H3.\\n\\n# H1_2\\n\\n## H2_2\\n\\nText under H2_2.'\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "add24254",
"metadata": {},
"source": [
"`Test case 3`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c3f4690f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
]
}
],
"source": [
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly' \n",
" \n",
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
" (\"###\", \"Header 3\"),\n",
"]\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "20907fb7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Hi this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
]
}
],
"source": [
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=True)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "9c448431",
"metadata": {},
"source": [
"`Test case 4`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9858ea51",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
"{'content': 'Hi this is John', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo', 'Header 4': 'Bim'}}\n",
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
]
}
],
"source": [
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n #### Bim \\n\\n Hi this is John \\n\\n ## Baz\\n\\n Hi this is Molly'\n",
" \n",
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
" (\"###\", \"Header 3\"),\n",
" (\"####\", \"Header 4\"),\n",
"]\n",
" \n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"id": "bba6eb9e",
"metadata": {},
"source": [
"`Test case 5`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8af8f9a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n",
"{'content': 'As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n",
"{'content': 'From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence', 'Header 4': 'Standardization'}}\n",
"{'content': 'Implementations of Markdown are available for over a dozen programming languages.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Implementations'}}\n"
]
}
],
"source": [
"markdown_document = '# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.'\n",
" \n",
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
" (\"###\", \"Header 3\"),\n",
" (\"####\", \"Header 4\"),\n",
"]\n",
" \n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on,return_each_line=False)\n",
"chunked_docs = markdown_splitter.split_text(markdown_document)\n",
"for chunk in chunked_docs:\n",
" print(chunk)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
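
The notebook mentions `return_each_line` as a hook for other kinds of aggregation. As one possible illustration, the sketch below regroups per-line records by a single header. The helper name `group_by_header` and the empty-string key for lines lacking that header are assumptions made for this example, and the input is a hand-copied list of dicts in the shape printed by the notebook, so the snippet runs without langchain.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_header(lines: List[Dict], header_name: str) -> Dict[str, str]:
    """Join per-line contents that share the same value for one header."""
    grouped = defaultdict(list)
    for line in lines:
        # Lines that lack this header are collected under "" (an assumption).
        key = line["metadata"].get(header_name, "")
        grouped[key].append(line["content"])
    return {key: "\n".join(parts) for key, parts in grouped.items()}


# Per-line records copied from the return_each_line=True output of test case 3.
lines = [
    {"content": "Hi this is Jim", "metadata": {"Header 1": "Foo", "Header 2": "Bar"}},
    {"content": "Hi this is Joe", "metadata": {"Header 1": "Foo", "Header 2": "Bar"}},
    {"content": "Hi this is Lance",
     "metadata": {"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"}},
    {"content": "Hi this is Molly", "metadata": {"Header 1": "Foo", "Header 2": "Baz"}},
]

sections = group_by_header(lines, "Header 2")
for title, body in sections.items():
    print(repr(title), repr(body))
```

Grouping on "Header 2" folds the "Boo" subsection back into "Bar", which the default header-based splits would keep separate.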