SyGra: Graph-oriented Synthetic data generation Pipeline

Framework to easily generate complex synthetic data pipelines by visualizing and configuring the pipeline as a computational graph. LangGraph is used as the underlying graph configuration/execution library. Refer to LangGraph examples to get a sense of the different kinds of computational graph which can be configured.

Introduction

SyGra Framework is created to generate synthetic data. As it is a complex process to define the flow, this design simplifies the synthetic data generation process. SyGra platform will support the following:

Defining the seed data configuration
Define a task, which involves graph node configuration, flow between nodes and conditions between the node
Define the output location to dump the generated data

Seed data can be pulled from either Huggingface or file system. Once the seed data is loaded, SyGra platform allows datagen users to write any data processing using the data transformation module. When the data is ready, users can define the data flow with various types of nodes. A node can also be a subgraph defined in another yaml file.

Each node can be defined with preprocessing, post processing, and LLM prompt with model parameters. Prompts can use seed data as python template keys.
Edges define the flow between nodes, which can be conditional or non-conditional, with support for parallel and one-to-many flows.

At the end, generated data is collected in the graph state for a specific record, processed further to generate the final dictionary to be written to the configured data sink.

Installation

Pick how you want to use SyGra:

Which one should I choose?

Framework → Run end-to-end pipelines from YAML graphs + CLI tooling and project scaffolding. (Start here: Installation)
Library → Import SyGra in your own Python app/notebook; call APIs directly. (Start here: SyGra Library)

TL;DR – Framework Setup

See full steps in Installation.

git clone git@github.com:ServiceNow/SyGra.git

cd SyGra

poetry run python main.py --task examples.glaive_code_assistant --num_records=1

TL;DR – Library Setup

See full steps in Sygra Library.

pip install sygra

import sygra

workflow = sygra.Workflow("tasks/examples/glaive_code_assistant")
workflow.run(num_records=1)

Components

The SyGra architecture is composed of multiple components. The following diagrams illustrate the four primary components and their associated modules.

Data Handler

Data handler is used for reading and writing the data. Currently, it supports file handler with various file types and huggingface handler. When reading data from huggingface, it can read the whole dataset and process, or it can stream chunk of data.

Graph Node Module

This module is responsible for building various kind of nodes like LLM node, Multi-LLM node, Lambda node, Agent node etc. Each node is defined for various task, for example multi-llm node is used to load-balance the data among various inference point running same model.

Graph Edge Connection

Once node are built, we can connect them with simple edge or conditional edge. Conditional edge uses python code to decide the path. Conditional edge helps implimenting if-else flow as well as loops in the graph.

Model clients

SyGra doesn't support inference within the framework, but it supports various clients, which helps connecting with different kind of servers. For example, openai client is being supported by Huggingface TGI, vLLM server and Azure services. However, model configuration does not allow to change clients, but it can be configured in models code.

Task Components

SyGra supports extendability and ease of implementation—most tasks are defined as graph configuration YAML files. Each task consists of two major components: a graph configuration and Python code to define conditions and processors. YAML contains various parts:

Data configuration : Configure file or huggingface as source and sink for the task.
Data transformation : Configuration to transform the data into the format it can be used in the graph.
Node configuration : Configure nodes and corresponding properties, preprocessor and post processor.
Edge configuration : Connect the nodes configured above with or without conditions.
Output configuration : Configuration for data tranformation before writing the data into sink.

A node is defined by the node module, supporting types like LLM call, multiple LLM call, lambda node, and sampler node.

LLM-based nodes require a model configured in models.yaml and runtime parameters. Sampler nodes pick random samples from static YAML lists. For custom node types, you can implement new nodes in the platform.

As of now, LLM inference is supported for TGI, vLLM, Azure, Azure OpenAI, Ollama and Triton compatible servers. Model deployment is external and configured in models.yaml.

Contact

To contact us, please send us an email!

License

The package is licensed by ServiceNow, Inc. under the Apache 2.0 license. See LICENSE for more details.

Questions?
Open an issue or start a discussion! Contributions are welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.github		.github
apps		apps
docs		docs
sygra		sygra
tasks/examples		tasks/examples
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
check.mk		check.mk
main.py		main.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
run_tools.sh		run_tools.sh
run_ui.sh		run_ui.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SyGra: Graph-oriented Synthetic data generation Pipeline

Introduction

Installation

Which one should I choose?

Components

Data Handler

Graph Node Module

Graph Edge Connection

Model clients

Task Components

Contact

License

About

Uh oh!

Releases

Packages

Contributors 7

Languages

License

ServiceNow/SyGra

Folders and files

Latest commit

History

Repository files navigation

SyGra: Graph-oriented Synthetic data generation Pipeline

Introduction

Installation

Which one should I choose?

Components

Data Handler

Graph Node Module

Graph Edge Connection

Model clients

Task Components

Contact

License

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 7

Languages

Packages