data-faker

A Scala Application for Generating Fake Datasets with Spark

The tool can generate any format given a provided schema, for example generate customers, transactions, and products.

The application requires a yaml file specifying the schema of tables to be generated.

Usage

Submit the jar artifact to a spark cluster with hive enabled, with the following arguments:

--database - Name of the Hive database to write the tables to.
--file - Path to the yaml file.

Example Yaml File Tables

tables:
- name: customers
  rows: 10
  columns:
  - name: customer_id
    data_type: Int
    column_type: Sequential
    start: 0
    step: 1
  - name: customer_code
    column_type: Expression
    expression: concat('0000000000', customer_id)

- name: products
  rows: 200
  columns:
  - name: product_id
    data_type: Int
    column_type: Sequential
    start: 0
    step: 1
  - name: product_code
    column_type: Expression
    expression: concat('0000000000', product_id)

- name: transactions
  rows: 100
  columns:
  - name: customer_id
    data_type: Int
    column_type: Random
    min: 0
    max: 10 # number of customers generated
  - name: product_id
    data_type: Int
    column_type: Random
    min: 0
    max: 200
  - name: quantity
    data_type: Int
    column_type: Random
    min: 0
    max: 10
  - name: cost
    data_type: Float
    column_type: Random
    min: 1
    max: 5
    decimal_places: 2
  - name: discount
    data_type: Float
    column_type: Random
    min: 1
    max: 2
    decimal_places: 2
  - name: spend
    column_type: Expression
    expression: round((cost * discount) * quantity, 2)
  - name: date
    data_type: Date
    column_type: Random
    min: 2017-01-01
    max: 2018-01-01
  partitions:
    - date

customers

customer_id	customer_code
0	0000000000
1	0000000001
2	0000000002

products

product_id	product_code
0	0000000000
1	0000000001
2	0000000002

transactions

customer_id	product_id	quantity	cost	discount	spend	date
0	25	1	1.53	1.2	1.83	2018-06-03
1	337	3	0.34	1.64	1.22	2018-04-12
2	550	6	4.84	1.03	29.91	2018-07-09

Example:

Call datafaker with example.yaml

Execute locally:

spark-submit --master local datafaker-assembly-0.1-SNAPSHOT.jar --database test --file example.yaml

Column Types

- Fixed

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, String, Boolean

value - column value

- Random

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, Boolean

min - minimum bound of random data (inclusive)

max - maximum bound of random data (inclusive)

- Selection

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, String

values - set of values to be chosen from

- Sequential

Supported Data Types: Int, Long, Float, Double, Date, Timestamp

start - start value

step - increment between each row

- Expression

expression - a spark sql expression

Build Artifact

This project is written in Scala.

We compile a fat jar of the application, including all dependencies.

Build the jar with sbt assembly from the project's base directory, the artifact is written to target/scala-2.11/

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
example.yaml		example.yaml
pom.xml		pom.xml
scalastyle_config.xml		scalastyle_config.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

data-faker

Usage

Example Yaml File Tables

customers

products

transactions

Example:

Execute locally:

Column Types

- Fixed

- Random

- Selection

- Sequential

- Expression

Build Artifact

About

Releases

Packages

Languages

License

jlfsdtc/data-faker

Folders and files

Latest commit

History

Repository files navigation

data-faker

Usage

Example Yaml File Tables

customers

products

transactions

Example:

Execute locally:

Column Types

- Fixed

- Random

- Selection

- Sequential

- Expression

Build Artifact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages