# LumberChunker 🪓

This is the official repository for the paper *LumberChunker: Long-Form Narrative Document Segmentation* by André V. Duarte, João D.S. Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo L. Oliveira.

LumberChunker is a method that leverages an LLM to dynamically segment documents into semantically independent chunks. It iteratively prompts the LLM to identify, within a group of sequential passages, the point where the content begins to shift.
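The loop can be pictured with a short sketch. This is **not** the official implementation (that lives in `LumberChunker-Segmentation.py`); the `ask_llm` helper, the token budget, and the prompt wording below are illustrative assumptions only.

```python
# Minimal sketch of the LumberChunker-style loop (illustrative only).
# `ask_llm` is a hypothetical callable wrapping whichever LLM you use.

def lumber_chunk(paragraphs, ask_llm, token_budget=550):
    """Greedily group paragraphs and let the LLM pick the split point."""
    chunks, start = [], 0
    while start < len(paragraphs):
        # 1) Build a window of consecutive paragraphs up to a rough token budget.
        window, tokens, end = [], 0, start
        while end < len(paragraphs) and tokens < token_budget:
            window.append(f"ID {end}: {paragraphs[end]}")
            tokens += len(paragraphs[end].split())  # word count as a token proxy
            end += 1

        if end - start <= 1:                 # single paragraph left: emit it as-is
            chunks.append(paragraphs[start])
            start = end
            continue

        # 2) Ask the LLM where, inside this window, the content starts to shift.
        prompt = (
            "You will receive consecutive passages with IDs. Answer only with "
            "the ID of the first passage whose content clearly shifts away "
            "from the preceding ones.\n\n" + "\n".join(window)
        )
        split_id = int(ask_llm(prompt))                 # assume the reply is a bare ID
        split_id = max(start + 1, min(split_id, end))   # clamp to guarantee progress

        # 3) Everything before the shift becomes one chunk; resume at the shift.
        chunks.append(" ".join(paragraphs[start:split_id]))
        start = split_id
    return chunks
```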



## LumberChunker Example - Segmenting a Book

⚠ Important: Whether using Gemini or ChatGPT, don't forget to add the API key / (Project ID, Location) in `LumberChunker-Segmentation.py`.

```
python LumberChunker-Segmentation.py --out_path <output directory path> --model_type <Gemini | ChatGPT> --book_name <target book name>
```
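For instance, a run with the Gemini backend might look like the line below; the output directory and the book title are placeholders, and the book name should match one of the titles used in the repository.

```
# placeholder values: substitute your own output directory and a book title from the corpus
python LumberChunker-Segmentation.py --out_path ./segmented_books --model_type Gemini --book_name "A_Christmas_Carol"
```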

## 📚 GutenQA

GutenQA consists of book passages manually extracted from Project Gutenberg and subsequently segmented with LumberChunker.
It features **100 public-domain narrative books** and **30 question-answer pairs per book**.

The dataset is organized into the following columns (a short loading sketch follows the list):

- **Book Name**: The title of the book from which the passage is extracted.
- **Book ID**: A unique integer identifier assigned to each book.
- **Chunk ID**: An integer identifier for each chunk of the book. Chunks are listed in the order they appear in the book.
- **Chapter**: The name(s) of the chapter(s) from which the chunk is derived. If LumberChunker merged paragraphs from multiple chapters, the names of all relevant chapters are included.
- **Question**: A question pertaining to the specific chunk of text. Note that not every chunk has an associated question, as only 30 questions are generated per book.
- **Answer**: The answer corresponding to the question related to that chunk.
- **Chunk Must Contain**: A specific substring from the chunk indicating where the answer can be found. This ensures that, despite the chunking methodology, the correct chunk includes this particular string.
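A minimal loading sketch, assuming GutenQA is published on the Hugging Face Hub. The dataset ID, split name, and exact column spellings below are assumptions; check the dataset card linked from this repository for the authoritative values.

```python
# Illustrative only: dataset ID, split, and column names are assumed, not guaranteed.
from datasets import load_dataset

gutenqa = load_dataset("LumberChunker/GutenQA", split="train")  # assumed ID / split

# Inspect one book: count its chunks and its question-answer pairs.
first_title = gutenqa[0]["Book Name"]                           # assumed column name
book = gutenqa.filter(lambda row: row["Book Name"] == first_title)
qa_pairs = book.filter(lambda row: row["Question"] is not None)
print(f"{first_title}: {len(book)} chunks, {len(qa_pairs)} QA pairs")
```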

## 📖 GutenQA Alternative Chunking Formats (Used for Baseline Methods)

We also release the same corpus present in GutenQA at different chunk granularities (a sketch of the recursive baseline follows the list).

- **Paragraph**: Books are extracted manually from Project Gutenberg. This is the format of the extraction prior to segmentation with LumberChunker.
- **Recursive Chunks**: Documents are segmented based on a hierarchy of separators such as paragraph breaks, new lines, spaces, and individual characters, using LangChain's RecursiveCharacterTextSplitter.
- **Semantic Chunks**: Paragraph chunks are embedded with OpenAI's text-embedding-ada-002. Text is segmented by identifying break points based on significant changes in the embedding distances between adjacent chunks.
- **Propositions**: Text is segmented as introduced in the paper Dense X Retrieval. Generated questions are provided along with the correct proposition answer.
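A sketch of the recursive baseline using LangChain's `RecursiveCharacterTextSplitter`. The chunk size, overlap, and input file below are illustrative assumptions, not the exact settings used to build the released baseline.

```python
# Illustrative recursive-chunking baseline; parameter values are assumptions.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # paragraph breaks, new lines, spaces, characters
    chunk_size=1000,                     # illustrative size, in characters
    chunk_overlap=0,
)

with open("book.txt", encoding="utf-8") as f:   # any plain-text book
    recursive_chunks = splitter.split_text(f.read())

print(f"{len(recursive_chunks)} chunks")
```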

## 🤝 Compatibility

LumberChunker is compatible with any LLM with strong reasoning capabilities.

- Our code provides implementations for Gemini and ChatGPT, but models such as LLaMA-3, Mixtral 8x7B, or Command R+ can also be used.

## 💬 Citation

If you find this work useful, please consider citing our paper:

```
@misc{duarte2024lumberchunker,
      title={LumberChunker: Long-Form Narrative Document Segmentation},
      author={André V. Duarte and João Marques and Miguel Graça and Miguel Freire and Lei Li and Arlindo L. Oliveira},
      year={2024},
      eprint={2406.17526},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17526},
}
```