
Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean-up.

Basic data flow for Data Caterer

Generate data for databases, files, messaging systems or HTTP requests via the UI, the Scala/Java SDK or YAML input, with execution handled by Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.
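As a minimal sketch of that flow in the Scala SDK (the import path and the PlanRun/execute names are assumptions based on the examples repository; verify them against your version):

import io.github.datacatering.datacaterer.api.PlanRun

// A minimal sketch, not the definitive setup: PlanRun exposes the builders
// (postgres, field, configuration, ...) used throughout this README.
class MyDataPlan extends PlanRun {
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .schema(field.name("account_id").regex("ACC[0-9]{10}"))

  val conf = configuration
    .enableGenerateData(true)            // generate step
    .enableDeleteGeneratedRecords(false) // set to true on a later clean-up run

  execute(conf, postgresTask)
}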

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Basic flow

Quick start

  1. Docker
    docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1
    Open localhost:9898.
  2. Run Scala/Java examples
    git clone git@github.com:data-catering/data-caterer-example.git
    cd data-caterer-example && ./run.sh
    # check results in docker/sample/report/index.html
  3. UI App: Mac download
  4. UI App: Windows download
    1. After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
    2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
    3. Click 'More info', then at the bottom, click 'Run anyway'
    4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
    5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
  5. UI App: Linux download

Integrations

Supported data sources

Data Caterer supports the data sources below. Check here for the full roadmap.

Data Source Type   Data Source                          Support
Cloud Storage      AWS S3                               ✅
Cloud Storage      Azure Blob Storage                   ✅
Cloud Storage      GCP Cloud Storage                    ✅
Database           Cassandra                            ✅
Database           MySQL                                ✅
Database           Postgres                             ✅
Database           Elasticsearch                        ❌
Database           MongoDB                              ❌
File               CSV                                  ✅
File               Delta Lake                           ✅
File               JSON                                 ✅
File               Iceberg                              ✅
File               ORC                                  ✅
File               Parquet                              ✅
File               Hudi                                 ❌
HTTP               REST API                             ✅
Messaging          Kafka                                ✅
Messaging          Solace                               ✅
Messaging          ActiveMQ                             ❌
Messaging          Pulsar                               ❌
Messaging          RabbitMQ                             ❌
Metadata           Data Contract CLI                    ✅
Metadata           Great Expectations                   ✅
Metadata           Marquez                              ✅
Metadata           OpenAPI/Swagger                      ✅
Metadata           OpenMetadata                         ✅
Metadata           Open Data Contract Standard (ODCS)   ✅
Metadata           Amundsen                             ❌
Metadata           Datahub                              ❌
Metadata           Solace Event Portal                  ❌
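Each supported source maps to a Scala SDK builder of the same name. As a sketch, using only builders that appear in the examples further below (connection names and URLs are placeholders):

// Connection builders take a name plus connection details; file builders take a path.
val myPostgres = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
val myCassandra = cassandra("ingested_data", "localhost:9042")
val myKafka = kafka("my_kafka", "localhost:29092").topic("account-topic")
val myParquet = parquet("output_parquet", "/data/parquet/customer")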

Sponsorship

Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to become a sponsor of the project.

Find out more about sponsorship here.

Contributing

View details here about how you can contribute to the project.

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations
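The switch between run modes is the configuration builder used in the examples below. A sketch combining the two flags that appear in this README (further flags exist; check the docs):

// Generate-only run.
val generateConf = configuration
  .enableGenerateData(true)
  .enableDeleteGeneratedRecords(false)

// Clean-up-only run, as used in the clean-up examples below.
val deleteConf = configuration
  .enableGenerateData(false)
  .enableDeleteGeneratedRecords(true)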

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

Mildly Quick Start

Generate and validate data

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")  //name and url
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count.isEqual(1000))
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
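To run both tasks together, the SDK examples pass them to execute inside a PlanRun. A sketch, assuming the PlanRun/execute wiring from the examples repository:

class GenerateAndValidatePlan extends PlanRun {
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

  val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
    .validation(validation.count.isEqual(1000))

  // The execute signature is an assumption based on the SDK examples.
  execute(configuration, postgresTask, parquetValidation)
}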

Generate same data across data sources

kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)

plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(kafkaTask -> List("account_id"))
)
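addForeignKeyRelationship returns a plan that can be passed to execute along with the tasks. A sketch, where the execute overload taking a plan is an assumption from the SDK examples:

class SameDataAcrossSourcesPlan extends PlanRun {
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .schema(field.name("account_id").regex("ACC[0-9]{10}"))

  val kafkaTask = kafka("my_kafka", "localhost:29092")
    .topic("account-topic")

  val myPlan = plan.addForeignKeyRelationship(
    postgresTask, List("account_id"),
    List(kafkaTask -> List("account_id"))
  )

  // Assumed overload: plan first, then configuration, then the tasks it links.
  execute(myPlan, configuration, postgresTask, kafkaTask)
}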

Generate data and clean up

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumn(5, "account_id"))
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(1).max(5), "account_id"))
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("account_id"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
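Running the delete plan follows the same shape as the earlier sketches (again assuming the execute overload that accepts a plan):

class CleanUpDownstreamPlan extends PlanRun {
  // postgresTask, cassandraTxns and deletePlan as defined above.
  val deleteConf = configuration
    .enableDeleteGeneratedRecords(true)
    .enableGenerateData(false)

  execute(deletePlan, deleteConf, postgresTask, cassandraTxns)
}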

Generate data with schema from metadata source

parquet("customer_parquet", "/data/parquet/customer")
  .schema(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))
http("my_http")
  .schema(metadataSource.openApi("/data/http/petstore.json"))

Validate data using validations from metadata source

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
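Metadata-backed tasks run like any other task; a sketch reusing the PlanRun/execute wiring assumed in the earlier examples:

class MetadataValidationPlan extends PlanRun {
  val gxValidation = parquet("customer_parquet", "/data/parquet/customer")
    .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))

  execute(configuration, gxValidation)
}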
