Welcome! `kg-gen` helps you extract knowledge graphs from any plain text using AI. It can process both small and large text inputs, and it can also handle messages in a conversation format.
Why generate knowledge graphs? `kg-gen` is great if you want to:
- Create a graph to assist with RAG (Retrieval-Augmented Generation)
- Create graph synthetic data for model training and testing
- Structure any text into a graph
- Analyze the relationships between concepts in your source text
We support API-based and local model providers via LiteLLM, including OpenAI, Ollama, Anthropic, Gemini, DeepSeek, and others. We also use DSPy for structured output generation.
- Try it out by running the scripts in `tests/`.
- Instructions to run our KG benchmark MINE are in `MINE/`.
- Read the paper: *KGGen: Extracting Knowledge Graphs from Plain Text with Language Models*
Install the module:

```bash
pip install kg-gen
```
Then import and use `kg-gen`. You can provide your text input in one of two formats:
- A single string
- A list of Message objects (each with a role and content)
Below are some example snippets:
```python
from kg_gen import KGGen

# Initialize KGGen with optional configuration
kg = KGGen(
    model="openai/gpt-4o",  # Default model
    temperature=0.0,        # Default temperature
    api_key="YOUR_API_KEY"  # Optional if set in environment
)

# EXAMPLE 1: Single string with context
text_input = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
graph_1 = kg.generate(
    input_data=text_input,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Josh'}
# edges={'is brother of', 'is father of', 'is mother of'}
# relations={('Ben', 'is brother of', 'Josh'),
#            ('Andrew', 'is father of', 'Josh'),
#            ('Linda', 'is mother of', 'Josh')}

# EXAMPLE 2: Large text with chunking and clustering
with open('large_text.txt', 'r') as f:
    large_text = f.read()

# Example input text:
# """
# Neural networks are a type of machine learning model. Deep learning is a subset of machine learning
# that uses multiple layers of neural networks. Supervised learning requires training data to learn
# patterns. Machine learning is a type of AI technology that enables computers to learn from data.
# AI, also known as artificial intelligence, is related to the broader field of artificial intelligence.
# Neural nets (NN) are commonly used in ML applications. Machine learning (ML) has revolutionized
# many fields of study.
# ...
# """

graph_2 = kg.generate(
    input_data=large_text,
    chunk_size=5000,  # Process text in chunks of 5000 chars
    cluster=True      # Cluster similar entities and relations
)
# Output:
# entities={'neural networks', 'deep learning', 'machine learning', 'AI', 'artificial intelligence',
#           'supervised learning', 'unsupervised learning', 'training data', ...}
# edges={'is type of', 'requires', 'is subset of', 'uses', 'is related to', ...}
# relations={('neural networks', 'is type of', 'machine learning'),
#            ('deep learning', 'is subset of', 'machine learning'),
#            ('supervised learning', 'requires', 'training data'),
#            ('machine learning', 'is type of', 'AI'),
#            ('AI', 'is related to', 'artificial intelligence'), ...}
# entity_clusters={
#   'artificial intelligence': {'AI', 'artificial intelligence'},
#   'machine learning': {'machine learning', 'ML'},
#   'neural networks': {'neural networks', 'neural nets', 'NN'},
#   ...
# }
# edge_clusters={
#   'is type of': {'is type of', 'is a type of', 'is a kind of'},
#   'is related to': {'is related to', 'is connected to', 'is associated with'},
#   ...
# }

# EXAMPLE 3: Messages array
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
graph_3 = kg.generate(input_data=messages)
# Output:
# entities={'Paris', 'France'}
# edges={'has capital'}
# relations={('France', 'has capital', 'Paris')}

# EXAMPLE 4: Combining multiple graphs
text1 = "Linda is Joe's mother. Ben is Joe's brother."
text2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."

graph4_a = kg.generate(input_data=text1)
graph4_b = kg.generate(input_data=text2)

# Combine the graphs
combined_graph = kg.aggregate([graph4_a, graph4_b])

# Optionally cluster the combined graph
clustered_graph = kg.cluster(
    combined_graph,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Joe', 'Joseph', 'Judy'}
# edges={'is mother of', 'is father of', 'is brother of', 'is sister of'}
# relations={('Linda', 'is mother of', 'Joe'),
#            ('Ben', 'is brother of', 'Joe'),
#            ('Andrew', 'is father of', 'Joe'),
#            ('Judy', 'is sister of', 'Andrew')}
# entity_clusters={
#   'Joe': {'Joe', 'Joseph'},
#   ...
# }
# edge_clusters={ ... }
```
We support models via LiteLLM. Check out how to pass in your desired model here: https://docs.litellm.ai/docs/providers
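For example, switching to a locally served model through Ollama only changes the model string, which follows LiteLLM's `ollama/<model-name>` convention. A configuration sketch, assuming an Ollama server running on its default port and with `llama3.1` as a placeholder for whatever model you have pulled locally:

```python
from kg_gen import KGGen

# Point kg-gen at a local Ollama server via LiteLLM's provider prefix.
# No api_key is needed for local inference.
kg = KGGen(
    model="ollama/llama3.1",  # placeholder; use any model Ollama serves
    temperature=0.0,
)
```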
For large texts, you can specify a `chunk_size` parameter to process the text in smaller chunks:

```python
graph = kg.generate(
    input_data=large_text,
    chunk_size=5000  # Process in chunks of 5000 characters
)
```
You can cluster similar entities and relations either during generation or afterwards:
```python
# During generation
graph = kg.generate(
    input_data=text,
    cluster=True,
    context="Optional context to guide clustering"
)

# Or after generation
clustered_graph = kg.cluster(
    graph,
    context="Optional context to guide clustering"
)
```
You can combine multiple graphs using the `aggregate` method:

```python
graph1 = kg.generate(input_data=text1)
graph2 = kg.generate(input_data=text2)
combined_graph = kg.aggregate([graph1, graph2])
```
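Semantically, aggregation merges the entity, edge, and relation sets of the input graphs, so duplicates collapse by set union. A plain-data sketch of that behavior using dicts of sets (not the library's actual `Graph` class):

```python
def union_graphs(graphs: list[dict]) -> dict:
    """Merge graphs represented as dicts of sets by taking set unions."""
    merged = {"entities": set(), "edges": set(), "relations": set()}
    for g in graphs:
        for key in merged:
            merged[key] |= g[key]
    return merged

g1 = {"entities": {"Linda", "Joe"}, "edges": {"is mother of"},
      "relations": {("Linda", "is mother of", "Joe")}}
g2 = {"entities": {"Andrew", "Joe"}, "edges": {"is father of"},
      "relations": {("Andrew", "is father of", "Joe")}}
merged = union_graphs([g1, g2])
# "Joe" appears once in merged["entities"] despite being in both inputs
```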
When processing message arrays, kg-gen:
- Preserves the role information from each message
- Maintains message order and boundaries
- Can extract entities and relationships:
  - Between concepts mentioned in messages
  - Between speakers (roles) and concepts
  - Across multiple messages in a conversation
For example, given this conversation:
```python
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
```
The generated graph might include entities like:
- "user"
- "assistant"
- "France"
- "Paris"
And relations like:
- (user, "asks about", "France")
- (assistant, "states", "Paris")
- (Paris, "is capital of", "France")
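Because the outputs shown above are plain Python sets of strings and triples, downstream code can query them directly. For instance, filtering the relations that mention a given entity (the `entities` and `relations` values below are hand-written to mirror the example, not real kg-gen output):

```python
entities = {"user", "assistant", "France", "Paris"}
relations = {
    ("user", "asks about", "France"),
    ("assistant", "states", "Paris"),
    ("Paris", "is capital of", "France"),
}

def relations_about(relations: set, entity: str) -> set:
    """Return every (subject, predicate, object) triple that mentions entity."""
    return {r for r in relations if entity in (r[0], r[2])}

france_facts = relations_about(relations, "France")
# {('user', 'asks about', 'France'), ('Paris', 'is capital of', 'France')}
```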
`KGGen()` parameters:
- `model`: str = "openai/gpt-4o" - The model to use for generation
- `temperature`: float = 0.0 - Temperature for model sampling
- `api_key`: Optional[str] = None - API key for model access

`generate()` parameters:
- `input_data`: Union[str, List[Dict]] - Text string or list of message dicts
- `model`: Optional[str] - Override the default model
- `api_key`: Optional[str] - Override the default API key
- `context`: str = "" - Description of data context
- `chunk_size`: Optional[int] - Size of text chunks to process
- `cluster`: bool = False - Whether to cluster the graph after generation
- `temperature`: Optional[float] - Override the default temperature
- `output_folder`: Optional[str] - Path to save partial progress

`cluster()` parameters:
- `graph`: Graph - The graph to cluster
- `context`: str = "" - Description of data context
- `model`: Optional[str] - Override the default model
- `temperature`: Optional[float] - Override the default temperature
- `api_key`: Optional[str] - Override the default API key

`aggregate()` parameters:
- `graphs`: List[Graph] - List of graphs to combine
The MIT License.