IBMGenerator

IBM Synthetic Data Generator for Itemsets and Sequences

Type make, which will create the executable file 'gen'.

Type ./gen -help for general help.

For itemsets, type ./gen lit -help. For sequences, type ./gen seq -help.

Itemset Datasets

These datasets mimic the transactions in a retailing environment, where people tend to buy sets of items together; each such set is a so-called potentially maximal frequent itemset. The sizes of these maximal itemsets are clustered around a mean, with a few long itemsets. A transaction may contain one or more such frequent itemsets. Transaction sizes are also clustered around a mean, though a few transactions may contain many items. Let D denote the number of transactions, T the average transaction size, I the average size of a maximal potentially frequent itemset, L the number of maximal potentially frequent itemsets, and N the number of items. The data is generated using the following procedure. We first generate L maximal itemsets of average size I by choosing from the N items. We next generate D transactions of average size T by choosing from the L maximal itemsets.
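
The two-step procedure above can be sketched in a few lines of Python. This is a simplified illustration, not the generator's actual algorithm: the function names are hypothetical, sizes are drawn from a Poisson distribution as an assumption (the text only says they are clustered around a mean), and the correlation and confidence parameters are ignored.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_itemsets(D, T, I, L, N, seed=0):
    """Sketch: build L maximal itemsets, then D transactions from them."""
    rng = random.Random(seed)

    # Step 1: L maximal itemsets of average size I, chosen from N items.
    patterns = []
    for _ in range(L):
        size = min(N, max(1, poisson(rng, I)))
        patterns.append(sorted(rng.sample(range(N), size)))

    # Step 2: D transactions of average size T, built from the patterns.
    transactions = []
    for _ in range(D):
        target = max(1, poisson(rng, T))
        items = set()
        for _ in range(10 * target):  # bounded number of draws
            if len(items) >= target:
                break
            items.update(rng.choice(patterns))
        transactions.append(sorted(items))
    return patterns, transactions


if __name__ == "__main__":
    pats, trans = generate_itemsets(D=5, T=10, I=4, L=20, N=1000, seed=42)
    for t in trans:
        print(len(t), t)
```

Because whole patterns are added at a time, a transaction in this sketch can overshoot its target size slightly.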

Type: ./gen lit -help

for all the parameters to generate itemset datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname <filename> (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii

This will generate a data file named "T10I4D100K.data". In fact, it generates three files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns
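
The example file name appears to encode the parameters of the run: T10 for an average transaction size of 10, I4 for an average pattern length of 4, and D100K for 100 thousand transactions. That reading is inferred from the example command and the letters D, T, and I defined earlier; it is not something the generator enforces. A small helper (hypothetical name `decode_name`) to split such a name into its parts might look like this:

```python
import re


def decode_name(name):
    """Split a name such as 'T10I4D100K' into {letter: value} pairs.

    Assumes each field is one uppercase letter followed by a number,
    with an optional trailing 'K' meaning thousands.
    """
    fields = {}
    for letter, number, kilo in re.findall(r"([A-Z])(\d+(?:\.\d+)?)(K?)", name):
        fields[letter] = float(number) * (1000 if kilo else 1)
    return fields


if __name__ == "__main__":
    print(decode_name("T10I4D100K"))
    # {'T': 10.0, 'I': 4.0, 'D': 100000.0}
```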

Data Format

The generated file has the following format. Each line contains:

TID TID NITEMS ITEMSET

where TID is a transaction identifier, NITEMS is the number of items in that transaction, and ITEMSET is the set of items making up that transaction. All ITEMSETS are sorted lexicographically. Note that TID is repeated for consistency with the sequence generator.
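
As an illustration of the format, the sketch below parses one line of an ASCII data file. It assumes the fields are whitespace-separated integers and that the file was written with -ascii; the function name and the sample line are made up for the example and are not generator output.

```python
def parse_itemset_line(line):
    """Parse 'TID TID NITEMS ITEMSET' into (tid, items)."""
    fields = [int(x) for x in line.split()]
    tid, tid_repeat, nitems = fields[0], fields[1], fields[2]
    items = fields[3:]
    if tid != tid_repeat:
        raise ValueError("expected the TID to be repeated")
    if len(items) != nitems:
        raise ValueError("NITEMS does not match the number of items")
    return tid, items


if __name__ == "__main__":
    # Hand-written sample line: transaction 7 with three items.
    print(parse_itemset_line("7 7 3 12 40 958"))
    # (7, [12, 40, 958])
```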

Sequence Datasets

The generator produces sequence datasets that mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some items from the sequences, or they may buy items from multiple sequences. The input-sequence size and event size are clustered around a mean, though a few of them may have many elements.

The datasets are generated using the following process. First NI maximal events of average size I are generated by choosing from N items. Then NS maximal sequences of average size S are created by assigning events from NI to each sequence. Next a customer (or input-sequence) of average C transactions (or events) is created, and sequences in NS are assigned to different customer elements, respecting the average transaction size of T. The generation stops when D input-sequences have been generated. Default values are NS = 5000, NI = 25000 and N = 10000.
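
The process above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the generator's actual algorithm: the function names are hypothetical, sizes are drawn from a Poisson distribution, each pattern is embedded into randomly chosen transactions in order, and the correlation, confidence, and repetition parameters are ignored.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_sequences(D, C, T, S, I, NS, NI, N, seed=0):
    """Sketch: events -> sequence patterns -> customer input-sequences."""
    rng = random.Random(seed)

    def size(mean):
        return max(1, poisson(rng, mean))

    # NI maximal events of average size I, chosen from N items.
    events = [sorted(rng.sample(range(N), min(N, size(I)))) for _ in range(NI)]
    # NS maximal sequences of average length S, built from the events.
    patterns = [[rng.choice(events) for _ in range(size(S))] for _ in range(NS)]

    customers = []
    for _ in range(D):
        n_trans = size(C)
        transactions = [set() for _ in range(n_trans)]
        for _ in range(10 * n_trans):  # bounded number of embeddings
            if sum(len(t) for t in transactions) >= T * n_trans:
                break
            pattern = rng.choice(patterns)
            k = min(len(pattern), n_trans)
            # Sorted slots keep the pattern's events in temporal order.
            slots = sorted(rng.sample(range(n_trans), k))
            for slot, event in zip(slots, pattern):
                transactions[slot].update(event)
        # Drop transactions that received no items.
        customers.append([sorted(t) for t in transactions if t])
    return customers


if __name__ == "__main__":
    for customer in generate_sequences(3, 10, 2.5, 4, 1.25, 50, 250, 1000, seed=7):
        print(customer)
```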

Type: ./gen seq -help

for all the parameters to generate sequence datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname <filename> (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii

This will generate a data file named "C10T2.5S4I1.25D200K.data". In fact, it generates four files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns

[fname].ntpc -- info on number of trans per customer (ignore this file)

Data Format

The generated file has the following format. Each line contains:

SID TID NITEMS ITEMSET

where SID is the sequence identifier, TID is a transaction/event identifier, NITEMS is the number of items in that transaction, and ITEMSET is the set of items making up that transaction. The TIDs for an SID are listed in temporal order, i.e., TIDs are event ids within that sequence. All ITEMSETS are also sorted lexicographically.
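
To illustrate the format, the sketch below groups the lines of an ASCII data file by SID. It assumes whitespace-separated integer fields and a file written with -ascii; the function name and the sample lines are made up for the example and are not generator output.

```python
def read_sequences(lines):
    """Group 'SID TID NITEMS ITEMSET' lines into {sid: [(tid, items), ...]}."""
    sequences = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sid, tid, nitems, *items = (int(x) for x in line.split())
        if len(items) != nitems:
            raise ValueError("NITEMS does not match the number of items")
        sequences.setdefault(sid, []).append((tid, items))
    return sequences


if __name__ == "__main__":
    # Hand-written sample: sequence 1 has two events, sequence 2 has one.
    sample = ["1 1 2 10 25", "1 2 1 7", "2 1 3 3 10 99"]
    print(read_sequences(sample))
    # {1: [(1, [10, 25]), (2, [7])], 2: [(1, [3, 10, 99])]}
```

To read a real file, the same function can be passed an open file object, since it only iterates over lines.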