feat: Knowledge Graph Retriever trainer using PEFT #39

Merged 7 commits on Aug 25, 2024
143 changes: 143 additions & 0 deletions graph_rag/graph_builder/Example/build_with_relic.MD
@@ -0,0 +1,143 @@
# Knowledge Graph with Relik and Llama-Index

This markdown file demonstrates an experiment in building a knowledge graph using `Relik` and `Llama-Index` Property Graphs. The steps include coreference resolution with `Spacy`, relation extraction with `Relik`, and knowledge graph construction with `llama-index` Property Graphs, stored in `neo4j`.

## Import Necessary Libraries

Import the essential libraries required for the experiment. These include NLP tools (`Spacy`, `coreferee`), document readers, large language models (LLMs), embeddings, and Neo4j for graph storage.

```python
import spacy, coreferee
from llama_index.core import SimpleDirectoryReader
import nest_asyncio
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import PropertyGraphIndex
from llama_index.core import Settings
from llama_index.extractors.relik.base import RelikPathExtractor
from llama_index.graph_stores.neo4j import Neo4jPGStore
```

## Coreference Resolution Function

Sets up a function to resolve coreferences in a text. This is crucial for ensuring that references to entities like "she" or "it" are correctly linked back to their antecedents, which reduces duplicate nodes in the knowledge graph.

```python
coref_nlp = spacy.load('en_core_web_lg')
coref_nlp.add_pipe('coreferee')

def coref_text(text):
    # Run the spaCy + coreferee pipeline over the input text
    coref_doc = coref_nlp(text)
    resolved_text = ""

    for token in coref_doc:
        # Resolve the token against its coreference chain, if it belongs to one
        repres = coref_doc._.coref_chains.resolve(token)
        if repres:
            # Replace the referring token with the resolved mention(s),
            # preferring the full named-entity span when one is available
            resolved_text += " " + " and ".join(
                [
                    t.text
                    if t.ent_type_ == ""
                    else [e.text for e in coref_doc.ents if t in e][0]
                    for t in repres
                ]
            )
        else:
            resolved_text += " " + token.text

    return resolved_text
```

### Example Usage of Coreference Resolution

An example is provided to demonstrate how the `coref_text` function resolves references in the text.

```python
coref_text("alice is great. she can study for long hours and remember")
# Output: alice is great. alice can study for long hours and remember
```

## Load and Process Documents

The documents are loaded from a specified directory and processed with the coreference resolution function to prepare them for knowledge graph construction.

```python
documents = SimpleDirectoryReader(input_dir='/content/data').load_data()
len(documents)

for doc in documents:
    doc.text = coref_text(doc.text)
```

## Initialize Relik Path Extractor

Here, the `RelikPathExtractor` is initialized, which will be used to extract relationships between entities from the processed documents.

```python
relik = RelikPathExtractor(
    model="relik-ie/relik-relation-extraction-small",
    model_config={"skip_metadata": True},
)
```

## Set Up Language Model and Embeddings

This section configures the LLM (`Ollama`), used for querying and response synthesis, and the embedding model (`HuggingFaceEmbedding`), used to generate embeddings for the knowledge graph.

```python
llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
embed_model = HuggingFaceEmbedding(model_name="microsoft/codebert-base")
Settings.llm = llm
```

## Configure Neo4j Graph Store

Sets up the connection to a Neo4j database, where the knowledge graph will be stored. Be sure to replace the password placeholder with your actual Neo4j password.

```python
username = "neo4j"
password = "*****************************"
url = "neo4j+s://45256b03.databases.neo4j.io"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
    refresh_schema=False,
)
```

## Build the Knowledge Graph

Here, the knowledge graph is constructed from the processed documents using the configured tools: `Relik`, `Ollama`, `HuggingFaceEmbedding`, and `Neo4j`.

```python
index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[relik],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)
```
![Alt text](random/visualisation.png)


## Query the Knowledge Graph

Finally, a query engine is created, allowing you to query the knowledge graph. Example queries and their expected outputs are provided.

```python
query_engine = index.as_query_engine(include_text=True)

response = query_engine.query("what is keras nlp?")
print(str(response))

# Output: Keras NLP provides a simple way to fine-tune pre-trained language models for various natural language processing tasks...
```

```python
response = query_engine.query("format for citing keras nlp")
print(str(response))

# Output: To cite Keras NLP, you can refer to the following format: KerasNLP. (n.d.). Retrieved from <https://keras-nlp.github.io/>...
```
33 changes: 33 additions & 0 deletions graph_rag/graph_retrieval/README.MD
@@ -41,4 +41,37 @@ from graph_rag.graph_retrieval.graph_retrieval import graph_query
response = graph_query("Your query here", query_engine)
print(response)
```
## Advanced Training with QLoRA and P-Tuning

> Fine-tuning LLMs on your data (masked-language or next-token prediction) for a few epochs may result in better retrieval and responses.

### 1. Setup

To use QLoRA and P-Tuning, ensure your environment is set up with the required libraries and that your model and dataset configurations are defined in a `config.yaml` file.

### 2. Finetuning with QLoRA

Use the QLoRA method for efficient fine-tuning by passing the appropriate configurations in your `config.yaml`. This method is ideal when working with large models on limited hardware. Execute the training script with the `--config` argument to specify your configuration file:

```bash
python qlora_adapter.py --config path/to/config.yaml
```
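A minimal sketch of what the QLoRA setup inside `qlora_adapter.py` might look like, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack; the model name and LoRA values mirror the `config.yaml` shown later in this PR, and the exact script internals are not part of this diff:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization (the "Q" in QLoRA); mirrors the BNB_CONFIG block in config.yaml
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,          # USE_NESTED_QUANT
    bnb_4bit_compute_dtype=torch.bfloat16,   # BNB_4BIT_COMPUTE_DTYPE
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter on top of the frozen, quantized base model; mirrors the LORA block
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```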

### 3. Fine-Tuning with P-Tuning

P-Tuning allows for parameter-efficient, prompt-based fine-tuning. Adjust the number of virtual tokens and other related parameters in the `config.yaml` to customize the training process. Execute the training script with the `--config` argument to specify your configuration file:

```bash
python p_tuning.py --config path/to/config.yaml
```
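For comparison, a minimal sketch of what a P-Tuning setup inside `p_tuning.py` might look like, assuming `peft`'s `PromptEncoderConfig`; the virtual-token count and encoder size below are illustrative placeholders, not values taken from this PR:

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# Trainable prompt encoder prepended as virtual tokens; base model weights stay frozen
ptuning_config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,     # illustrative; adjust via config.yaml
    encoder_hidden_size=128,   # illustrative
)
model = get_peft_model(model, ptuning_config)
model.print_trainable_parameters()
```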






This will start the training process using the specified method (QLoRA or P-Tuning) and configurations.


@@ -0,0 +1,35 @@
MODEL:
  MODEL: "codellama/CodeLlama-7b-Instruct-hf"
  SEQ_LENGTH: 2048
  LOAD_IN_8BIT: False

DATA:
  REPO_PATH: '/content/keras-io/templates'
  SEED: 0
  EXTENSIONS: [ 'md' ]
  OUTPUT_FILE: 'merged_output.txt'  # File the merged document text is written to

TRAINING_ARGUMENTS:
  BATCH_SIZE: 64
  GR_ACC_STEPS: 1
  LR: 5e-4
  LR_SCHEDULER_TYPE: "cosine"
  WEIGHT_DECAY: 0.01
  NUM_WARMUP_STEPS: 30
  EVAL_FREQ: 100
  SAVE_FREQ: 100
  LOG_FREQ: 10
  OUTPUT_DIR:
  BF16: True
  FP16: False

LORA:
  LORA_R: 8
  LORA_ALPHA: 32
  LORA_DROPOUT: 0.0
  LORA_TARGET_MODULES:

BNB_CONFIG:
  USE_NESTED_QUANT: True
  BNB_4BIT_COMPUTE_DTYPE: "bfloat16"
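A minimal sketch of how the training scripts might consume this file, assuming plain `yaml` parsing; the actual loading code is not shown in this diff:

```python
import yaml

# Parse the nested configuration into a plain dict
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

model_name = cfg["MODEL"]["MODEL"]                    # "codellama/CodeLlama-7b-Instruct-hf"
seq_length = cfg["MODEL"]["SEQ_LENGTH"]               # 2048
lora_r = cfg["LORA"]["LORA_R"]                        # 8
batch_size = cfg["TRAINING_ARGUMENTS"]["BATCH_SIZE"]  # 64
```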
