forked from langchain-ai/langchain
Text splitter for Markdown files by header (langchain-ai#5860)
This creates a new kind of text splitter for markdown files. The user can supply a set of headers that they want to split the file on. We define a new text splitter class, `MarkdownHeaderTextSplitter`, that does a few things:
(1) For each line, it determines the associated set of user-specified headers.
(2) It groups lines with common headers into splits.
See notebook for example usage and test cases.
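The two steps in the description can be sketched in plain Python. This is a minimal illustration written for this page, not the code added by the commit: `split_by_headers` is a hypothetical helper, and joining grouped lines with a plain newline is an assumption about how grouped content is assembled.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str,
    headers_to_split_on: List[Tuple[str, str]],
    return_each_line: bool = False,
) -> List[Dict]:
    """Hypothetical sketch of header-based splitting (not the library code)."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active: List[Tuple[int, str, str]] = []  # stack of (level, name, title)
    lines: List[Dict] = []

    # Step 1: attach the currently active headers to every content line.
    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        for marker, name in markers:
            if line.startswith(marker + " "):
                level = len(marker)
                # A new header closes all headers at the same or a deeper level.
                while active and active[-1][0] >= level:
                    active.pop()
                active.append((level, name, line[len(marker):].strip()))
                break
        else:
            # No marker matched, so this is a content line.
            metadata = {name: title for _, name, title in active}
            lines.append({"content": line, "metadata": metadata})

    if return_each_line:
        return lines

    # Step 2: merge consecutive lines that share identical header metadata.
    chunks: List[Dict] = []
    for entry in lines:
        if chunks and chunks[-1]["metadata"] == entry["metadata"]:
            chunks[-1]["content"] += "\n" + entry["content"]  # assumed separator
        else:
            chunks.append(
                {"content": entry["content"], "metadata": dict(entry["metadata"])}
            )
    return chunks


md = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ## Baz\n\n Hi this is Molly"
headers = [("#", "Header 1"), ("##", "Header 2")]
for chunk in split_by_headers(md, headers):
    print(chunk)
```

Keeping the active headers on a stack means a new `##` header drops any deeper header that was still open, which is why text under a later section does not inherit an earlier subsection's metadata.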
1 parent
38e5e25
commit d7de7b8
Showing 2 changed files with 474 additions and 0 deletions.
324 changes: 324 additions & 0 deletions
docs/modules/indexes/text_splitters/examples/markdown_header_metadata.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,324 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "70e9b619", | ||
"metadata": {}, | ||
"source": [ | ||
"# MarkdownHeaderTextSplitter\n", | ||
"\n", | ||
"The objective is to split a markdown file by a specified set of headers.\n", | ||
" \n", | ||
"**Given this example:**\n", | ||
"\n", | ||
"# Foo\n", | ||
"\n", | ||
"## Bar\n", | ||
"\n", | ||
"Hi this is Jim \n", | ||
"Hi this is Joe\n", | ||
"\n", | ||
"## Baz\n", | ||
"\n", | ||
"Hi this is Molly\n", | ||
" \n", | ||
"**Written as:**\n", | ||
"\n", | ||
"```\n", | ||
"md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n", | ||
"```\n", | ||
"\n", | ||
"**If we want to split on specified headers:**\n", | ||
"```\n", | ||
"[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n", | ||
"```\n", | ||
"\n", | ||
"**Then we expect:** \n", | ||
"```\n", | ||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n", | ||
"```\n", | ||
"\n", | ||
"**Options:**\n", | ||
" \n", | ||
"This also includes `return_each_line` in case a user wants to perform other types of aggregation. \n", | ||
"\n", | ||
"If `return_each_line=True`, each line and associated header metadata are returned. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "19c044f0", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.text_splitter import MarkdownHeaderTextSplitter" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ec8d8053", | ||
"metadata": {}, | ||
"source": [ | ||
"`Test case 1`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "5cd0a66c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Hi this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Doc\n", | ||
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n", | ||
" \n", | ||
"# Test case 1\n", | ||
"headers_to_split_on = [\n", | ||
" (\"#\", \"Header 1\"),\n", | ||
" (\"##\", \"Header 2\"),\n", | ||
"]\n", | ||
"\n", | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=True)\n", | ||
"\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "67d25a1c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=False)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f1f74dfa", | ||
"metadata": {}, | ||
"source": [ | ||
"`Test case 2`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "2183c96a", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Text under H3.', 'metadata': {'Header 1': 'H1', 'Header 2': 'H2', 'Header 3': 'H3'}}\n", | ||
"{'content': 'Text under H2_2.', 'metadata': {'Header 1': 'H1_2', 'Header 2': 'H2_2'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"headers_to_split_on = [\n", | ||
" (\"#\", \"Header 1\"),\n", | ||
" (\"##\", \"Header 2\"),\n", | ||
" (\"###\", \"Header 3\"),\n", | ||
"]\n", | ||
"markdown_document = '# H1\\n\\n## H2\\n\\n### H3\\n\\nText under H3.\\n\\n# H1_2\\n\\n## H2_2\\n\\nText under H2_2.'\n", | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=False)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "add24254", | ||
"metadata": {}, | ||
"source": [ | ||
"`Test case 3`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "c3f4690f", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly' \n", | ||
" \n", | ||
"headers_to_split_on = [\n", | ||
" (\"#\", \"Header 1\"),\n", | ||
" (\"##\", \"Header 2\"),\n", | ||
" (\"###\", \"Header 3\"),\n", | ||
"]\n", | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=False)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "20907fb7", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Hi this is Jim', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=True)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9c448431", | ||
"metadata": {}, | ||
"source": [ | ||
"`Test case 4`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "9858ea51", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Hi this is Jim \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n", | ||
"{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n", | ||
"{'content': 'Hi this is John', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo', 'Header 4': 'Bim'}}\n", | ||
"{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"markdown_document = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n #### Bim \\n\\n Hi this is John \\n\\n ## Baz\\n\\n Hi this is Molly'\n", | ||
" \n", | ||
"headers_to_split_on = [\n", | ||
" (\"#\", \"Header 1\"),\n", | ||
" (\"##\", \"Header 2\"),\n", | ||
" (\"###\", \"Header 3\"),\n", | ||
" (\"####\", \"Header 4\"),\n", | ||
"]\n", | ||
" \n", | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=False)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "bba6eb9e", | ||
"metadata": {}, | ||
"source": [ | ||
"`Test case 5`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "8af8f9a2", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'content': 'Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n", | ||
"{'content': 'As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n", | ||
"{'content': 'From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence', 'Header 4': 'Standardization'}}\n", | ||
"{'content': 'Implementations of Markdown are available for over a dozen programming languages.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Implementations'}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"markdown_document = '# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.'\n", | ||
" \n", | ||
"headers_to_split_on = [\n", | ||
" (\"#\", \"Header 1\"),\n", | ||
" (\"##\", \"Header 2\"),\n", | ||
" (\"###\", \"Header 3\"),\n", | ||
" (\"####\", \"Header 4\"),\n", | ||
"]\n", | ||
" \n", | ||
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_line=False)\n", | ||
"chunked_docs = markdown_splitter.split_text(markdown_document)\n", | ||
"for chunk in chunked_docs:\n", | ||
" print(chunk)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.16" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |