initial commit

parthsarthi03 · Feb 27, 2024 · 71d307c · 71d307c
commit 71d307c
Show file tree

Hide file tree

Showing 21 changed files with 2,854 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,33 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Vim
+*.swp
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# mac
+.DS_Store
+
+# Other
+.vscode
+*.tsv
+*.pt
+gpt*.txt
+*.env
+local/
+local_*
+build/
+*.egg-info/
+!/data/*.json
+/dist/
+checklist.md
+finetuning_ckpts/
+* copy*
+.idea
+assertion.log
+*.log
+*.db
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) Parth Sarthi
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,192 @@
+<p align="center">
+  <img align="center" src="raptor.jpg" width="1000px" />
+</p>
+<p align="left">
+
+## RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
+
+**RAPTOR** introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. This allows for more efficient and context-aware information retrieval across large texts, addressing common limitations in traditional language models. 
+
+
+
+For detailed methodologies and implementations, refer to the original paper:
+
+- [RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval](https://arxiv.org/abs/2401.18059)
+
+[![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm.svg)](https://huggingface.co/papers/2401.18059)
+
+## Installation
+
+Before using RAPTOR, ensure Python 3.8+ is installed. Clone the RAPTOR repository and install necessary dependencies:
+
+```bash
+git clone https://github.com/parthsarthi03/RAPTOR.git
+cd RAPTOR
+pip install -r requirements.txt
+```
+
+## Basic Usage
+
+To get started with RAPTOR, follow these steps:
+
+### Setting Up RAPTOR
+
+First, set your OpenAI API key and initialize the RAPTOR configuration:
+
+```python
+import os
+os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
+
+from raptor import RetrievalAugmentation
+
+# Initialize with default configuration. For advanced configurations, check the documentation. [WIP]
+RA = RetrievalAugmentation()
+```
+
+### Adding Documents to the Tree
+
+Add your text documents to RAPTOR for indexing:
+
+```python
+with open('sample.txt', 'r') as file:
+    text = file.read()
+RA.add_documents(text)
+```
+
+### Answering Questions
+
+You can now use RAPTOR to answer questions based on the indexed documents:
+
+```python
+question = "How did Cinderella reach her happy ending?"
+answer = RA.answer_question(question=question)
+print("Answer: ", answer)
+```
+
+### Saving and Loading the Tree
+
+Save the constructed tree to a specified path:
+
+```python
+SAVE_PATH = "Demo/cinderalla"
+RA.save(SAVE_PATH)
+```
+
+Load the saved tree back into RAPTOR:
+
+```python
+RA = RetrievalAugmentation(tree=SAVE_PATH)
+answer = RA.answer_question(question=question)
+```
+
+
+### Extending RAPTOR with other Models
+
+RAPTOR is designed to be flexible and allows you to integrate any models for summarization, question-answering (QA), and embedding generation. Here is how to extend RAPTOR with your own models:
+
+#### Custom Summarization Model
+
+If you wish to use a different language model for summarization, you can do so by extending the `BaseSummarizationModel` class. Implement the `summarize` method to integrate your custom summarization logic:
+
+```python
+from raptor import BaseSummarizationModel
+
+class CustomSummarizationModel(BaseSummarizationModel):
+    def __init__(self):
+        # Initialize your model here
+        pass
+
+    def summarize(self, context, max_tokens=150):
+        # Implement your summarization logic here
+        # Return the summary as a string
+        summary = "Your summary here"
+        return summary
+```
+
+#### Custom QA Model
+
+For custom QA models, extend the `BaseQAModel` class and implement the `answer_question` method. This method should return the best answer found by your model given a context and a question:
+
+```python
+from raptor import BaseQAModel
+
+class CustomQAModel(BaseQAModel):
+    def __init__(self):
+        # Initialize your model here
+        pass
+
+    def answer_question(self, context, question):
+        # Implement your QA logic here
+        # Return the answer as a string
+        answer = "Your answer here"
+        return answer
+```
+
+#### Custom Embedding Model
+
+To use a different embedding model, extend the `BaseEmbeddingModel` class. Implement the `create_embedding` method, which should return a vector representation of the input text:
+
+```python
+from raptor import BaseEmbeddingModel
+
+class CustomEmbeddingModel(BaseEmbeddingModel):
+    def __init__(self):
+        # Initialize your model here
+        pass
+
+    def create_embedding(self, text):
+        # Implement your embedding logic here
+        # Return the embedding as a numpy array or a list of floats
+        embedding = [0.0] * embedding_dim  # Replace with actual embedding logic
+        return embedding
+```
+
+#### Integrating Custom Models with RAPTOR
+
+After implementing your custom models, integrate them with RAPTOR as follows:
+
+```python
+from raptor import RetrievalAugmentation, RetrievalAugmentationConfig
+
+# Initialize your custom models
+custom_summarizer = CustomSummarizationModel()
+custom_qa = CustomQAModel()
+custom_embedding = CustomEmbeddingModel()
+
+# Create a config with your custom models
+custom_config = RetrievalAugmentationConfig(
+    summarization_model=custom_summarizer,
+    qa_model=custom_qa,
+    embedding_model=custom_embedding
+)
+
+# Initialize RAPTOR with your custom config
+RA = RetrievalAugmentation(config=custom_config)
+```
+
+Check out `demo.ipynb` for examples on how to specify your own summarization/QA models, such as Llama/Mistral/Gemma, and Embedding Models such as SBERT, for use with RAPTOR.
+
+Note: More examples and ways to configure RAPTOR are forthcoming. Advanced usage and additional features will be provided in the documentation and repository updates.
+
+## Contributing
+
+RAPTOR is an open-source project, and contributions are welcome. Whether you're fixing bugs, adding new features, or improving documentation, your help is appreciated.
+
+## License
+
+RAPTOR is released under the MIT License. See the LICENSE file in the repository for full details.
+
+## Citation
+
+If RAPTOR assists in your research, please cite it as follows:
+
+```bibtex
+@inproceedings{sarthi2024raptor,
+    title={RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval},
+    author={Sarthi, Parth and Abdullah, Salman and Tuli, Aditi and Khanna, Shubh and Goldie, Anna and Manning, Christopher D.},
+    booktitle={International Conference on Learning Representations (ICLR)},
+    year={2024}
+}
+```
+
+Stay tuned for more examples, configuration guides, and updates.