Welcome! `kg-gen` helps you extract knowledge graphs from any plain text using AI. It can process both small and large text inputs, and it can also handle messages in a conversation format.
Why generate knowledge graphs? `kg-gen` is great if you want to:
- Create a graph to assist with RAG (Retrieval-Augmented Generation)
- Create graph synthetic data for model training and testing
- Structure any text into a graph
- Analyze the relationships between concepts in your source text
We support API-based and local model providers via LiteLLM, including OpenAI, Ollama, Anthropic, Gemini, DeepSeek, and others. We also use DSPy for structured output generation.
- Try it out by running the scripts in `tests/`.
- Instructions to run our KG benchmark MINE are in `MINE/`.
- Read the paper: *KGGen: Extracting Knowledge Graphs from Plain Text with Language Models*
Install the module:

```bash
pip install kg-gen
```
Then import and use `kg-gen`. You can provide your text input in one of two formats:
- A single string
- A list of Message objects (each with a role and content)
Below are some example snippets:
```python
from kg_gen import KGGen

# Initialize KGGen with optional configuration
kg = KGGen(
    model="openai/gpt-4o",  # Default model
    temperature=0.0,        # Default temperature
    api_key="YOUR_API_KEY"  # Optional if set in environment
)

# EXAMPLE 1: Single string with context
text_input = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
graph_1 = kg.generate(
    input_data=text_input,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Josh'}
# edges={'is brother of', 'is father of', 'is mother of'}
# relations={('Ben', 'is brother of', 'Josh'),
#            ('Andrew', 'is father of', 'Josh'),
#            ('Linda', 'is mother of', 'Josh')}

# EXAMPLE 2: Large text with chunking and clustering
with open('large_text.txt', 'r') as f:
    large_text = f.read()

# Example input text:
# """
# Neural networks are a type of machine learning model. Deep learning is a subset of machine learning
# that uses multiple layers of neural networks. Supervised learning requires training data to learn
# patterns. Machine learning is a type of AI technology that enables computers to learn from data.
# AI, also known as artificial intelligence, is related to the broader field of artificial intelligence.
# Neural nets (NN) are commonly used in ML applications. Machine learning (ML) has revolutionized
# many fields of study.
# ...
# """

graph_2 = kg.generate(
    input_data=large_text,
    chunk_size=5000,  # Process text in chunks of 5000 chars
    cluster=True      # Cluster similar entities and relations
)
# Output:
# entities={'neural networks', 'deep learning', 'machine learning', 'AI', 'artificial intelligence',
#           'supervised learning', 'unsupervised learning', 'training data', ...}
# edges={'is type of', 'requires', 'is subset of', 'uses', 'is related to', ...}
# relations={('neural networks', 'is type of', 'machine learning'),
#            ('deep learning', 'is subset of', 'machine learning'),
#            ('supervised learning', 'requires', 'training data'),
#            ('machine learning', 'is type of', 'AI'),
#            ('AI', 'is related to', 'artificial intelligence'), ...}
# entity_clusters={
#   'artificial intelligence': {'AI', 'artificial intelligence'},
#   'machine learning': {'machine learning', 'ML'},
#   'neural networks': {'neural networks', 'neural nets', 'NN'},
#   ...
# }
# edge_clusters={
#   'is type of': {'is type of', 'is a type of', 'is a kind of'},
#   'is related to': {'is related to', 'is connected to', 'is associated with'},
#   ...
# }

# EXAMPLE 3: Messages array
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
graph_3 = kg.generate(input_data=messages)
# Output:
# entities={'Paris', 'France'}
# edges={'has capital'}
# relations={('France', 'has capital', 'Paris')}

# EXAMPLE 4: Combining multiple graphs
text1 = "Linda is Joe's mother. Ben is Joe's brother."
text2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."

graph4_a = kg.generate(input_data=text1)
graph4_b = kg.generate(input_data=text2)

# Combine the graphs
combined_graph = kg.aggregate([graph4_a, graph4_b])

# Optionally cluster the combined graph
clustered_graph = kg.cluster(
    combined_graph,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Joe', 'Joseph', 'Judy'}
# edges={'is mother of', 'is father of', 'is brother of', 'is sister of'}
# relations={('Linda', 'is mother of', 'Joe'),
#            ('Ben', 'is brother of', 'Joe'),
#            ('Andrew', 'is father of', 'Joe'),
#            ('Judy', 'is sister of', 'Andrew')}
# entity_clusters={
#   'Joe': {'Joe', 'Joseph'},
#   ...
# }
# edge_clusters={ ... }
```
We support models via LiteLLM. Check out how to pass in your desired model here: https://docs.litellm.ai/docs/providers
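For example, switching to a locally served model through Ollama only changes the model string, which follows LiteLLM's `ollama/<model-name>` convention. A configuration sketch, assuming an Ollama server running on its default port and with `llama3.1` as a placeholder for whatever model you have pulled locally:

```python
from kg_gen import KGGen

# Point kg-gen at a local Ollama server via LiteLLM's provider prefix.
# No api_key is needed for local inference.
kg = KGGen(
    model="ollama/llama3.1",  # placeholder; use any model Ollama serves
    temperature=0.0,
)
```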
For large texts, you can specify a `chunk_size` parameter to process the text in smaller chunks:

```python
graph = kg.generate(
    input_data=large_text,
    chunk_size=5000  # Process in chunks of 5000 characters
)
```
You can cluster similar entities and relations either during generation or afterwards:
```python
# During generation
graph = kg.generate(
    input_data=text,
    cluster=True,
    context="Optional context to guide clustering"
)

# Or after generation
clustered_graph = kg.cluster(
    graph,
    context="Optional context to guide clustering"
)
```
You can combine multiple graphs using the `aggregate` method:

```python
graph1 = kg.generate(input_data=text1)
graph2 = kg.generate(input_data=text2)
combined_graph = kg.aggregate([graph1, graph2])
```
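Semantically, aggregation merges the entity, edge, and relation sets of the input graphs, so duplicates collapse by set union. A plain-data sketch of that behavior using dicts of sets (not the library's actual `Graph` class):

```python
def union_graphs(graphs: list[dict]) -> dict:
    """Merge graphs represented as dicts of sets by taking set unions."""
    merged = {"entities": set(), "edges": set(), "relations": set()}
    for g in graphs:
        for key in merged:
            merged[key] |= g[key]
    return merged

g1 = {"entities": {"Linda", "Joe"}, "edges": {"is mother of"},
      "relations": {("Linda", "is mother of", "Joe")}}
g2 = {"entities": {"Andrew", "Joe"}, "edges": {"is father of"},
      "relations": {("Andrew", "is father of", "Joe")}}
merged = union_graphs([g1, g2])
# "Joe" appears once in merged["entities"] despite being in both inputs
```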
When processing message arrays, kg-gen:
- Preserves the role information from each message
- Maintains message order and boundaries
- Can extract entities and relationships:
  - Between concepts mentioned in messages
  - Between speakers (roles) and concepts
  - Across multiple messages in a conversation
For example, given this conversation:
```python
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
```
The generated graph might include entities like:
- "user"
- "assistant"
- "France"
- "Paris"
And relations like:
- (user, "asks about", "France")
- (assistant, "states", "Paris")
- (Paris, "is capital of", "France")
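Because the outputs shown above are plain Python sets of strings and triples, downstream code can query them directly. For instance, filtering the relations that mention a given entity (the `entities` and `relations` values below are hand-written to mirror the example, not real kg-gen output):

```python
entities = {"user", "assistant", "France", "Paris"}
relations = {
    ("user", "asks about", "France"),
    ("assistant", "states", "Paris"),
    ("Paris", "is capital of", "France"),
}

def relations_about(relations: set, entity: str) -> set:
    """Return every (subject, predicate, object) triple that mentions entity."""
    return {r for r in relations if entity in (r[0], r[2])}

france_facts = relations_about(relations, "France")
# {('user', 'asks about', 'France'), ('Paris', 'is capital of', 'France')}
```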
`KGGen()` parameters:
- `model`: str = "openai/gpt-4o" - The model to use for generation
- `temperature`: float = 0.0 - Temperature for model sampling
- `api_key`: Optional[str] = None - API key for model access

`generate()` parameters:
- `input_data`: Union[str, List[Dict]] - Text string or list of message dicts
- `model`: Optional[str] - Override the default model
- `api_key`: Optional[str] - Override the default API key
- `context`: str = "" - Description of data context
- `chunk_size`: Optional[int] - Size of text chunks to process
- `cluster`: bool = False - Whether to cluster the graph after generation
- `temperature`: Optional[float] - Override the default temperature
- `output_folder`: Optional[str] - Path to save partial progress

`cluster()` parameters:
- `graph`: Graph - The graph to cluster
- `context`: str = "" - Description of data context
- `model`: Optional[str] - Override the default model
- `temperature`: Optional[float] - Override the default temperature
- `api_key`: Optional[str] - Override the default API key

`aggregate()` parameters:
- `graphs`: List[Graph] - List of graphs to combine
The MIT License.