This is the official repository for the paper *LumberChunker: Long-Form Narrative Document Segmentation* by André V. Duarte, João D.S. Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo L. Oliveira.
LumberChunker is a method that leverages an LLM to dynamically segment documents into semantically independent chunks. It iteratively prompts the LLM to identify, within a group of sequential passages, the point where the content begins to shift.
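The core loop can be pictured with the minimal sketch below. It is an illustration only: `llm_identify_shift` and the token budget are placeholders for the actual prompt and settings in `LumberChunker-Segmentation.py`.

```python
# Minimal sketch of the LumberChunker loop (illustrative, not the repo's implementation).
# `llm_identify_shift` is a hypothetical helper that wraps the Gemini/ChatGPT prompt
# and returns the index of the paragraph where the content starts to shift.

def count_tokens(text: str) -> int:
    # Rough token estimate; a real implementation would use the model's tokenizer.
    return len(text.split())


def lumberchunker(paragraphs: list[str], llm_identify_shift, token_budget: int = 550) -> list[str]:
    """Greedily group paragraphs, then ask the LLM where the topic shifts."""
    chunks, start = [], 0
    while start < len(paragraphs):
        # 1) Accumulate sequential paragraphs until the group exceeds the token budget.
        end, tokens = start, 0
        while end < len(paragraphs) and tokens < token_budget:
            tokens += count_tokens(paragraphs[end])
            end += 1

        if end - start <= 1:
            # A single (possibly oversized) paragraph becomes its own chunk.
            chunks.append(paragraphs[start])
            start = end
            continue

        # 2) Ask the LLM for the position (relative to `start`) where the content shifts.
        shift = llm_identify_shift(paragraphs[start:end])
        shift = max(1, min(shift, end - start))  # keep the split point in range

        # 3) Everything before the shift is emitted as one chunk;
        #    the next group restarts at the shifting paragraph.
        chunks.append("\n".join(paragraphs[start:start + shift]))
        start += shift
    return chunks
```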
⚠ Important: Whether using Gemini or ChatGPT, don't forget to add the API key / (Project ID, Location) in `LumberChunker-Segmentation.py`.
```bash
python LumberChunker-Segmentation.py --out_path <output directory path> --model_type <Gemini | ChatGPT> --book_name <target book name>
```
📚 GutenQA
GutenQA consists of book passages manually extracted from Project Gutenberg and subsequently segmented using LumberChunker.
It features 100 public-domain narrative books and 30 question-answer pairs per book.
The dataset is organized into the following columns:
- **Book Name**: The title of the book from which the passage is extracted.
- **Book ID**: A unique integer identifier assigned to each book.
- **Chunk ID**: An integer identifier for each chunk of the book. Chunks are listed in the sequence they appear in the book.
- **Chapter**: The name(s) of the chapter(s) from which the chunk is derived. If LumberChunker merged paragraphs from multiple chapters, the names of all relevant chapters are included.
- **Question**: A question pertaining to the specific chunk of text. Note that not every chunk has an associated question, as only 30 questions are generated per book.
- **Answer**: The answer corresponding to the question related to that chunk.
- **Chunk Must Contain**: A specific substring from the chunk indicating where the answer can be found. This ensures that, regardless of the chunking methodology, the correct chunk includes this particular string.
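As a usage illustration, the sketch below loads GutenQA with 🤗 Datasets and keeps one book's question-bearing chunks. The repository id, split name, and book title are assumptions made for the example; the column names follow the list above.

```python
# Illustrative only: repository id, split, and book title below are assumed placeholders.
from datasets import load_dataset

dataset = load_dataset("LumberChunker/GutenQA", split="train")  # assumed dataset id

# Keep only the rows of one (hypothetical) book that carry a question-answer pair.
book = dataset.filter(
    lambda row: row["Book Name"] == "A Christmas Carol" and row["Question"] is not None
)
for row in book:
    print(row["Chunk ID"], "|", row["Question"], "->", row["Answer"])
```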
We also release the same GutenQA corpus at different chunk granularities:
- Paragraph: Books are extracted manually from Project Gutenberg. This is the format of the extraction prior to segmentation with LumberChunker.
- Recursive Chunks: Documents are segmented based on a hierarchy of separators such as paragraph breaks, new lines, spaces, and individual characters, using LangChain's RecursiveCharacterTextSplitter (see the sketch after this list).
- Semantic Chunks: Paragraph chunks are embedded with OpenAI's text-embedding-ada-002. Text is segmented by identifying break points where the embedding distance between adjacent chunks changes significantly.
- Propositions: Text is segmented as introduced in the paper Dense X Retrieval. Generated questions are provided along with the correct Proposition Answer.
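For reference, the Recursive Chunks baseline can be reproduced roughly as sketched below with LangChain's `RecursiveCharacterTextSplitter`; the input file, chunk size, and overlap are illustrative and not necessarily the settings used to build the released files.

```python
# Rough sketch of the Recursive Chunks baseline (settings are illustrative).
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical path to a paragraph-level book extraction.
book_text = open("my_book.txt", encoding="utf-8").read()

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # paragraph breaks, new lines, spaces, characters
    chunk_size=1000,   # illustrative value
    chunk_overlap=0,
)
recursive_chunks = splitter.split_text(book_text)
print(f"{len(recursive_chunks)} chunks produced")
```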
LumberChunker is compatible with any LLM with strong reasoning capabilities.
- In our code, we provide implementations for Gemini and ChatGPT, but models such as LLaMA-3, Mixtral 8x7B, or Command R+ can also be used (see the sketch below).
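Because the model only needs to return the ID of the paragraph where the content shifts, swapping backends amounts to replacing a single completion call. Below is a hedged sketch using the `openai` Python client; the helper name, prompt wording, and model id are illustrative rather than the ones used in `LumberChunker-Segmentation.py`.

```python
# Illustrative backend swap: any chat-completion style API can fill this role.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_identify_shift(paragraphs: list[str]) -> int:
    """Hypothetical helper: returns the index of the paragraph where content starts to shift."""
    numbered = "\n".join(f"ID {i}: {p}" for i, p in enumerate(paragraphs))
    prompt = (
        "Below are numbered passages from a book. Answer only with the ID of the "
        "first passage whose content clearly shifts away from the previous ones.\n\n"
        + numbered
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any strong-reasoning model could stand in here
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model replies with a bare integer, as the prompt requests.
    return int(response.choices[0].message.content.strip())
```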
If you find this work useful, please consider citing our paper:
```bibtex
@misc{duarte2024lumberchunker,
      title={LumberChunker: Long-Form Narrative Document Segmentation},
      author={André V. Duarte and João Marques and Miguel Graça and Miguel Freire and Lei Li and Arlindo L. Oliveira},
      year={2024},
      eprint={2406.17526},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17526},
}
```