- Software prerequisites
- GraphRAG initialization and indexation process
- Using the query engine
- Understanding the indexing pipeline output
- References
- Acknowledgements
- License
## Software prerequisites

- Latest version of Docker Desktop
- VSCode
- Visual Studio Code Remote Development Extension
It is strongly recommended to use the dev container to avoid Python package dependency issues.
## GraphRAG initialization and indexation process

If you want to run the indexing process on your local machine to gain a better understanding of how GraphRAG works, you can follow the steps below. Please note that these steps are well described in the original documentation at https://microsoft.github.io/graphrag/posts/get_started/; we have provided a brief overview of them here.

If you don't want to run the indexing process, please skip to the next section to understand the output.
- Clone the repo to your local machine (or download as a zip and extract):

  ```sh
  git clone https://github.com/promisinganuj/graphrag-microhack.git
  ```
- You have two options for running this: either through Dev Containers or within a local virtual environment.
  - **Dev Container approach**: Open the project inside the VSCode dev container. Open the command palette (`Ctrl+Shift+P`) and search for `Dev Containers: Open Folder in Container...`. Select the root folder and confirm.

    **Note**: If you are using a Mac with an Apple Silicon chip (M1/M2/M3), please ensure that you have the latest version of Docker Desktop installed to avoid the hard requirement to install Rosetta 2. This dependency is removed from Docker Desktop 4.3.0 onwards. Check the Docker Desktop release notes for more details.
  - **Virtual environment approach**: Within the VSCode terminal, go to the root directory of this repo. Create a virtual environment, activate it, and install all necessary libraries:

    ```sh
    # Step 1: Create a virtual environment
    python -m venv .venv

    # Step 2: Activate the virtual environment
    # In Linux/macOS environments
    source .venv/bin/activate
    # In Windows
    .venv\Scripts\activate

    # Step 3: Install the required Python libraries
    pip install -r requirements.txt
    ```

    This will install the required packages for running the indexing process.

- Choose a root directory called `sample` (or anything else), and create an `input` folder within it:

  ```sh
  mkdir -p ./sample/input
  ```
- For running the indexing process, you can use any dataset of your choice. As a sample, we have provided a copy of *A Christmas Carol* by Charles Dickens from Project Gutenberg. You can use this dataset to run the indexing process. Copy the `a-christmas-carol.txt` file to the `./sample/input` directory using the following command:

  ```sh
  cp ./datasets/books/a-christmas-carol.txt ./sample/input/
  ```
- Initialize GraphRAG by running the following command:

  ```sh
  python -m graphrag.index --init --root ./sample
  ```
- Set up workspace variables.

  The previous command creates two files, `.env` and `settings.yaml`, in the `./sample` directory.

  The `.env` file contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, `GRAPHRAG_API_KEY=<API_KEY>`. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key.

  The `settings.yaml` file contains the settings for the pipeline. You can modify this file to change the pipeline settings.

  **OpenAI**: To run in OpenAI mode, just make sure to update the value of `GRAPHRAG_API_KEY` in the `.env` file with your OpenAI API key.

  **Azure OpenAI**: In addition, Azure OpenAI users should set the following variables in the `settings.yaml` file. To find the appropriate sections, search for the `llm:` configuration; you should see two sections, one for the chat endpoint and one for the embeddings endpoint. Here is an example of how to configure the chat endpoint:

  ```yaml
  type: azure_openai_chat # Or azure_openai_embedding for embeddings
  model: <azure_model_name>
  api_base: https://<instance>.openai.azure.com
  api_version: 2024-02-15-preview # You can customize this for other versions
  deployment_name: <azure_model_deployment_name>
  ```

  Please follow the documentation for more information on how to set up the Azure OpenAI Service resource and deploy models.
  **Others**: Claim extraction is disabled by default and has to be enabled for the `create_final_covariates` file to get created:

  ```yaml
  claim_extraction:
    ...
    enabled: true # Disabled by default; must be enabled for the `create_final_covariates` file to get created
  ```
- Run the indexing pipeline:

  ```sh
  python -m graphrag.index --root ./sample
  ```

  This process will take some time to run, depending on the size of your input data, the model you're using, and the text chunk size (these can be configured in the `settings.yaml` file). Once the pipeline is complete, you should see a new folder called `./sample/output/<timestamp>/artifacts` with a series of parquet files.

  Here are the screenshots of the output of a sample run:
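As a quick sanity check after a run, you can list the parquet artifacts of the most recent run. The helper below is a minimal sketch, assuming the `./sample/output/<timestamp>/artifacts` layout described above (the function name is illustrative, not part of GraphRAG):

```python
# Sketch: list the parquet artifacts of the most recent indexing run.
# Assumes the ./sample/output/<timestamp>/artifacts layout described above.
from pathlib import Path


def latest_artifacts(output_dir: str = "./sample/output") -> list[str]:
    """Return the parquet file names from the newest run folder."""
    runs = sorted(p for p in Path(output_dir).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"No indexing runs found under {output_dir}")
    # Timestamped folder names sort lexicographically, which matches chronological order
    artifacts = runs[-1] / "artifacts"
    return sorted(p.name for p in artifacts.glob("*.parquet"))
```

If the list is empty or the folder is missing, the indexing run most likely failed; check the logs under the run's `reports` folder.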
## Using the query engine

Once the indexing is complete, you can use the Query Engine to query the knowledge graph. The Query Engine is the retrieval module of the GraphRAG library.

To run a local search, use the following command:

```sh
python -m graphrag.query --root ./sample --method local "Who is Scrooge, and what are his main relationships?"
```

For details, please refer to the local search documentation.

To run a global search, use the following command:

```sh
python -m graphrag.query --root ./sample --method global "What are the top themes in this story?"
```

For details, please refer to the global search documentation.
## Understanding the indexing pipeline output

The output of the indexing process is stored in the `./sample/output/<timestamp>/artifacts` directory. For reference, we have provided a sample output in the `sample-output` directory. Here is a brief overview of the output files:

- `create_base_documents.parquet`
- `create_base_text_units.parquet`
- `create_base_extracted_entities.parquet`
- `create_base_entity_graph.parquet`
- `create_final_communities.parquet`
- `create_final_community_reports.parquet`
- `create_summarized_entities.parquet`
- `create_final_entities.parquet`
- `create_final_nodes.parquet`
- `create_final_covariates.parquet`
- `create_final_relationships.parquet`
- `create_final_text_units.parquet`
- `create_final_documents.parquet`
- `join_text_units_to_entity_ids.parquet`
- `join_text_units_to_relationship_ids.parquet`
- `join_text_units_to_covariate_ids.parquet`
- `stats.json`
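To make the relationship between these files concrete, here is a small illustrative sketch of how a join table such as `join_text_units_to_entity_ids` links text units to entities. The column names and sample rows below are assumptions for illustration, not the exact GraphRAG schema; inspect the real parquet files (e.g. with `pandas.read_parquet`) to confirm the actual columns:

```python
# Sketch: how a join table links text units to entities.
# Column names are illustrative, not the exact GraphRAG schema.
import pandas as pd

entities = pd.DataFrame({"id": ["e1", "e2"], "name": ["SCROOGE", "MARLEY"]})
text_units = pd.DataFrame(
    {"id": ["t1", "t2"], "text": ["Marley was dead...", "Oh! But he was a tight-fisted hand..."]}
)
# One row per (text unit, entity) pair, mimicking join_text_units_to_entity_ids
joins = pd.DataFrame({"text_unit_id": ["t1", "t1", "t2"], "entity_id": ["e1", "e2", "e1"]})

# Resolve each pair to the entity mentioned in each chunk of text
mentions = (
    joins.merge(text_units, left_on="text_unit_id", right_on="id")
    .merge(entities, left_on="entity_id", right_on="id", suffixes=("_tu", "_ent"))
)
print(mentions[["text_unit_id", "name"]])
```

The same merge pattern applies to the relationship and covariate join tables.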
Please follow the `understanding-graphrag-output.ipynb` notebook to understand the content of the output files. The notebook provides step-by-step instructions and code cells to:

- Load the output files as pandas dataframes.
- Extract GraphML data from the output files and save it as files in the `analysis/20240812-215728` directory.
- Explain the content and relationships of the output files.
Please follow the `visualizing-graphrag-output.ipynb` notebook to visualize the graph. The notebook provides step-by-step instructions and code cells to:

- Load and pre-process the text data.
- Reshape and load the knowledge graph data for visualization.
- Run graph-enabled RAG queries against the knowledge graph.
- Visualize the retrieved data as a graph.
The notebook requires certain environment variables to be set. For that, please rename the `.env.sample` file in the `notebooks` folder to `.env` and update the values of the following environment variables:

```sh
GRAPHRAG_API_KEY=
GRAPHRAG_API_BASE=https://<instance>.openai.azure.com/
GRAPHRAG_API_VERSION=2024-05-01-preview
GRAPHRAG_LLM_MODEL=
GRAPHRAG_LLM_DEPLOYMENT=
GRAPHRAG_EMBEDDING_MODEL=
GRAPHRAG_EMBEDDING_DEPLOYMENT=
```
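As a minimal sketch of how a notebook might validate these variables before running queries (the function name and error handling below are illustrative, not part of the GraphRAG library):

```python
# Sketch: fail fast if any of the variables from .env.sample are unset.
# Assumes the variable names listed above; not part of the GraphRAG library.
import os

REQUIRED_VARS = [
    "GRAPHRAG_API_KEY",
    "GRAPHRAG_API_BASE",
    "GRAPHRAG_API_VERSION",
    "GRAPHRAG_LLM_MODEL",
    "GRAPHRAG_LLM_DEPLOYMENT",
    "GRAPHRAG_EMBEDDING_MODEL",
    "GRAPHRAG_EMBEDDING_DEPLOYMENT",
]


def load_graphrag_settings() -> dict:
    """Return the required settings, raising if any variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

A check like this surfaces configuration mistakes immediately, rather than as an opaque authentication error midway through a notebook.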
Please follow the `visualizing-graphrag-output.ipynb` notebook to perform Local Search queries. Refer to the section titled "Run local search on sample queries".

Please follow the `global_search.ipynb` notebook to perform Global Search queries. Refer to the section titled "Run global search on sample queries".
## References

- Project GraphRAG
- Microsoft GraphRAG Documentation
- Microsoft GraphRAG Repo
- Microsoft GraphRAG Accelerator
## Acknowledgements

- The developers of the Python libraries used for graph creation and visualization.
- The developers of Microsoft's `graphrag` library.
- Project Gutenberg for the dataset.
## License

This project is licensed under the MIT License - see the LICENSE file for details.