A Scala Application for Generating Fake Datasets with Spark
The tool can generate data in any shape described by a provided schema, for example customers, transactions, and products. The application requires a YAML file specifying the schema of the tables to be generated.
Submit the jar artifact to a Spark cluster with Hive enabled, with the following arguments:

- `--database`: name of the Hive database to write the tables to
- `--file`: path to the YAML file

An example schema file, `example.yaml`:
```yaml
tables:
  - name: customers
    rows: 10
    columns:
      - name: customer_id
        data_type: Int
        column_type: Sequential
        start: 0
        step: 1
      - name: customer_code
        column_type: Expression
        expression: concat('0000000000', customer_id)
  - name: products
    rows: 200
    columns:
      - name: product_id
        data_type: Int
        column_type: Sequential
        start: 0
        step: 1
      - name: product_code
        column_type: Expression
        expression: concat('0000000000', product_id)
  - name: transactions
    rows: 100
    columns:
      - name: customer_id
        data_type: Int
        column_type: Random
        min: 0
        max: 10 # number of customers generated
      - name: product_id
        data_type: Int
        column_type: Random
        min: 0
        max: 200
      - name: quantity
        data_type: Int
        column_type: Random
        min: 0
        max: 10
      - name: cost
        data_type: Float
        column_type: Random
        min: 1
        max: 5
        decimal_places: 2
      - name: discount
        data_type: Float
        column_type: Random
        min: 1
        max: 2
        decimal_places: 2
      - name: spend
        column_type: Expression
        expression: round((cost * discount) * quantity, 2)
      - name: date
        data_type: Date
        column_type: Random
        min: 2017-01-01
        max: 2018-01-01
    partitions:
      - date
```
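For reference, the schema file maps naturally onto a small set of case classes. The sketch below is a hypothetical model whose field names simply mirror the YAML keys; it is not the application's actual internal representation:

```scala
// Hypothetical model of the schema file; field names mirror the YAML keys.
case class Schema(tables: Seq[Table])

case class Table(
  name: String,
  rows: Long,
  columns: Seq[Column],
  partitions: Option[Seq[String]] = None
)

case class Column(
  name: String,
  column_type: String,                // Sequential, Expression, Random, ...
  data_type: Option[String] = None,   // Int, Float, Date, ...
  // Attributes used by the different column types:
  start: Option[Long] = None,         // Sequential
  step: Option[Long] = None,          // Sequential
  min: Option[String] = None,         // Random; kept as String since bounds may be numbers or dates
  max: Option[String] = None,         // Random
  decimal_places: Option[Int] = None, // Random Float/Double
  expression: Option[String] = None   // Expression
)
```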
This schema produces tables like the following.

customers:

| customer_id | customer_code |
|---|---|
| 0 | 0000000000 |
| 1 | 0000000001 |
| 2 | 0000000002 |

products:

| product_id | product_code |
|---|---|
| 0 | 0000000000 |
| 1 | 0000000001 |
| 2 | 0000000002 |

transactions:

| customer_id | product_id | quantity | cost | discount | spend | date |
|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 1.53 | 1.2 | 1.83 | 2018-06-03 |
| 1 | 337 | 3 | 0.34 | 1.64 | 1.22 | 2018-04-12 |
| 2 | 550 | 6 | 4.84 | 1.03 | 29.91 | 2018-07-09 |
Run datafaker with `example.yaml`:

```sh
spark-submit --master local datafaker-assembly-0.1-SNAPSHOT.jar --database test --file example.yaml
```
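Once the job completes, the generated tables can be inspected from any Hive-aware client. The snippet below is a quick sanity check, assuming the database name `test` used above (in `spark-shell` the `spark` session already exists and the builder boilerplate can be skipped):

```scala
import org.apache.spark.sql.SparkSession

// Attach to the same Hive metastore the job wrote to.
val spark = SparkSession.builder()
  .appName("datafaker-check")
  .enableHiveSupport()
  .getOrCreate()

// List the generated tables and peek at the data.
spark.sql("SHOW TABLES IN test").show()
spark.table("test.transactions").show(5)

// Spot-check the Expression column: spend should equal round((cost * discount) * quantity, 2).
spark.table("test.transactions")
  .selectExpr("spend", "round((cost * discount) * quantity, 2) AS expected_spend")
  .show(5)
```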
Each column has a `column_type` that determines how its values are generated. The supported column types and their attributes are listed below; a sketch of how they map onto Spark follows the list.

Fixed

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, String, Boolean

- `value`: the column value

Random

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, Boolean

- `min`: minimum bound of random data (inclusive)
- `max`: maximum bound of random data (inclusive)
- `decimal_places`: number of decimal places for Float and Double values (as used in the example above)

Selection

Supported Data Types: Int, Long, Float, Double, Date, Timestamp, String

- `values`: set of values to be chosen from

Sequential

Supported Data Types: Int, Long, Float, Double, Date, Timestamp

- `start`: start value
- `step`: increment between each row

Expression

- `expression`: a Spark SQL expression, which may reference other columns in the table
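Conceptually, each column type corresponds to a standard Spark construct. The sketch below shows one way the Sequential, Random, and Expression semantics above could be produced with plain Spark SQL functions; it mirrors the behaviour described here, not necessarily the application's actual implementation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("column-types-sketch")
  .master("local[*]")
  .getOrCreate()

// Sequential: start 0, step 1.
val base = spark.range(0, 100).withColumnRenamed("id", "customer_id")

val df = base
  // Random Int in [0, 10]: floor of rand() scaled over 11 buckets keeps both bounds inclusive.
  .withColumn("quantity", (rand() * 11).cast("int"))
  // Random Float in roughly [1, 5] with 2 decimal places.
  .withColumn("cost", round(rand() * 4 + 1, 2).cast("float"))
  .withColumn("discount", round(rand() * 1 + 1, 2).cast("float"))
  // Random Date between 2017-01-01 and 2018-01-01 (inclusive): 366 possible day offsets.
  .withColumn("date", expr("date_add(to_date('2017-01-01'), cast(rand() * 366 as int))"))
  // Expression: any Spark SQL expression over the columns defined so far.
  .withColumn("spend", expr("round((cost * discount) * quantity, 2)"))

df.show(5)
```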
This project is written in Scala, and the application is compiled into a fat jar that includes all of its dependencies. Build the jar by running `sbt assembly` from the project's base directory; the artifact is written to `target/scala-2.11/`.
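For a fresh checkout, the assembly setup is the standard sbt-assembly plugin. The excerpt below is a minimal sketch; the plugin version, Spark version, and Scala patch version are assumptions, not the project's actual build files:

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
```

```scala
// build.sbt (excerpt)
name := "datafaker"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.12"

// Spark is supplied by spark-submit at runtime, so mark it "provided"
// to keep it out of the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.4.8" % "provided"
)
```

With this naming, `sbt assembly` emits `target/scala-2.11/datafaker-assembly-0.1-SNAPSHOT.jar`, which matches the jar used in the spark-submit example above.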