Tuning the Generator
Datagen supports tuning some parts of the data generation process. This allows the user to change the way the degree distribution of the Person friendship subgraph is generated, the way the edges between the Persons are created and how data is serialized.
Datagen defines an interface to implement custom ways to generate friendship degree distributions for Persons. This interface can be found in the following file: ldbc.snb.datagen.generator.distribution.DegreeDistribution
This interface defines three methods:
- initialize(Configuration conf): Called once, when an instance implementing the interface is created at the beginning of the person generation process. The conf parameter carries custom configuration parameters set in the params.ini file, in the same way as other Datagen parameters.
- reset(long seed): Called every time Datagen needs to put the generator into a known state. After two calls to reset with the same seed, the same number of calls to nextDegree() must produce exactly the same sequence of numbers. This guarantees determinism within Datagen.
- nextDegree(): Called each time a new degree is needed for a Person.
To tell Datagen to use a particular DegreeDistribution implementation, add the following line to your params.ini file:
ldbc.snb.datagen.generator.distribution.degreeDistribution:<fully qualified class name of the implementation>
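The contract of the three methods can be illustrated with a minimal sketch. It is not Datagen code: DegreeDistributionSketch is a stand-in for the real interface, a Map replaces Datagen's Configuration type so that the example compiles on its own, the long return type of nextDegree() is an assumption, and the uniform distribution and its option names are hypothetical.

```java
import java.util.Map;
import java.util.Random;

// Stand-in for ldbc.snb.datagen.generator.distribution.DegreeDistribution.
// ASSUMPTION: the real initialize() receives a Configuration object; a Map is
// used here only so the sketch compiles without Datagen on the classpath.
interface DegreeDistributionSketch {
    void initialize(Map<String, String> conf);
    void reset(long seed);
    long nextDegree(); // return type is an assumption
}

// Hypothetical distribution: degrees drawn uniformly from [minDegree, maxDegree].
class UniformDegreeDistribution implements DegreeDistributionSketch {
    private int minDegree = 1;
    private int maxDegree = 100;
    private Random random = new Random(0L);

    @Override
    public void initialize(Map<String, String> conf) {
        // Hypothetical option names, standing in for entries of params.ini.
        minDegree = Integer.parseInt(
                conf.getOrDefault("example.UniformDegreeDistribution.min", "1"));
        maxDegree = Integer.parseInt(
                conf.getOrDefault("example.UniformDegreeDistribution.max", "100"));
    }

    @Override
    public void reset(long seed) {
        // Re-creating the generator from the seed makes every sequence of
        // nextDegree() calls after reset(seed) reproducible.
        random = new Random(seed);
    }

    @Override
    public long nextDegree() {
        return minDegree + random.nextInt(maxDegree - minDegree + 1);
    }
}
```

Keeping all random state inside a single generator that reset replaces is one simple way to satisfy the determinism requirement; any state that survives a reset would break it.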
Datagen already includes several different degree distributions with the following subclass relations:
- ldbc.snb.datagen.generator.distribution.DegreeDistribution
  - ldbc.snb.datagen.generator.distribution.BucketedDistribution
    - ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
    - ldbc.snb.datagen.generator.distribution.FacebookDegreeDistribution
  - ldbc.snb.datagen.generator.distribution.CumulativeBasedDegreeDistribution
    - ldbc.snb.datagen.generator.distribution.AltmannDistribution
    - ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution
    - ldbc.snb.datagen.generator.distribution.GeoDistribution
    - ldbc.snb.datagen.generator.distribution.MoeZipfDistribution
    - ldbc.snb.datagen.generator.distribution.ZipfDistribution
The default distribution generator is ldbc.snb.datagen.generator.distribution.FacebookDegreeDistribution, which implements a degree distribution that models the one observed in Facebook.
AltmannDistribution implements the Altmann distribution, which accepts the following parameters:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.distribution.AltmannDistribution.alpha | 0.4577 | The value of the parameter alpha of the Altmann Distribution |
ldbc.snb.datagen.generator.distribution.AltmannDistribution.beta | 0.0162 | The value of the parameter beta of the Altmann Distribution |
DiscreteWeibullDistribution implements the Discrete Weibull distribution, which accepts the following parameters:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution.alpha | 0.8505 | The value of the parameter beta of the Discrete Weibull Distribution |
ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution.p | 0.0205 | The value of the parameter p of the Discrete Weibull Distribution |
GeoDistribution implements the Geometric distribution, which accepts the following parameters:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.distribution.GeoDistribution.alpha | 0.12 | The value of the parameter alpha of the Geometric Distribution |
ZipfDistribution implements the Zipf distribution, which accepts the following parameters:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.distribution.ZipfDistribution.alpha | 1.7 | The value of the parameter alpha of the Zipf Distribution |
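To give an idea of how a parameter such as alpha shapes the generated degrees, the following sketch draws degrees from a truncated Zipf distribution by inverting a precomputed cumulative probability table. It is an illustration only and not Datagen's implementation: the class name ZipfCumulativeSketch, the maxDegree truncation, and the binary-search lookup are all assumptions.

```java
import java.util.Random;

// Hypothetical sketch: P(degree = k) is proportional to k^(-alpha), 1 <= k <= maxDegree.
class ZipfCumulativeSketch {
    private final double[] cumulative; // cumulative[k - 1] = P(degree <= k)
    private Random random = new Random(0L);

    ZipfCumulativeSketch(double alpha, int maxDegree) {
        cumulative = new double[maxDegree];
        double norm = 0.0;
        for (int k = 1; k <= maxDegree; k++) {
            norm += Math.pow(k, -alpha);
        }
        double running = 0.0;
        for (int k = 1; k <= maxDegree; k++) {
            running += Math.pow(k, -alpha) / norm;
            cumulative[k - 1] = running;
        }
        cumulative[maxDegree - 1] = 1.0; // guard against floating-point rounding
    }

    void reset(long seed) {
        random = new Random(seed);
    }

    long nextDegree() {
        double u = random.nextDouble(); // uniform in [0, 1)
        // Binary search for the smallest index whose cumulative probability is >= u.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo + 1; // degrees start at 1
    }
}
```

With alpha set to the default of 1.7, small degrees dominate: in this sketch roughly half of the draws return 1, and larger values of alpha concentrate the mass on small degrees even further.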
MoeZipfDistribution implements the MoeZipf distribution, which accepts the following parameters:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.distribution.MoeZipfDistribution.alpha | 1.7 | The value of the parameter alpha of the MoeZipf Distribution |
ldbc.snb.datagen.generator.distribution.MoeZipfDistribution.delta | 1.5 | The value of the parameter delta of the MoeZipf Distribution |
Similar to the friendship degree distribution, Datagen defines an interface that can be implemented to change the way the knows edges are connected: ldbc.snb.datagen.generator.KnowsGenerator
This interface defines two methods:
- initialize(Configuration conf): Called once, when an instance implementing the interface is created at the beginning of the edge generation process. The conf parameter carries custom configuration parameters set in the params.ini file, in the same way as other Datagen parameters.
- generateKnows(ArrayList<Person> persons, int seed, ArrayList<Float> percentages, int step_index): Called once the edge generation process starts, to generate the edges for a given block of persons. The first parameter is the array of persons to generate edges for. The second parameter seeds any random number generator used by the implementation. The implementation must behave identically for identical seeds, so that two identical sequences of operations produce the same result. The percentages parameter is an array containing the percentage of edges that must be created for each person, out of the maximum number of desired edges. Finally, step_index identifies the current edge generation step and is used to index the percentages array.
To tell Datagen to use a particular KnowsGenerator implementation, add the following line to your params.ini file:
ldbc.snb.datagen.generator.knowsGenerator:<fully qualified class name of the implementation>
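As a rough illustration of the generateKnows contract, the sketch below connects persons within a block at random while staying deterministic for a given seed. Everything in it is hypothetical: PersonSketch is a stand-in for Datagen's Person class with only the fields the example needs, initialize is omitted, and percentages.get(step_index) is interpreted as a cap on each person's edges after this step, which the text above does not confirm.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Stand-in for Datagen's Person class (hypothetical fields).
class PersonSketch {
    final long id;
    final int maxKnows;                       // maximum number of desired edges
    final Set<Long> knows = new HashSet<>();  // ids of connected persons

    PersonSketch(long id, int maxKnows) {
        this.id = id;
        this.maxKnows = maxKnows;
    }
}

class UniformKnowsGeneratorSketch {
    public void generateKnows(ArrayList<PersonSketch> persons, int seed,
                              ArrayList<Float> percentages, int step_index) {
        // All randomness comes from this generator, so equal seeds give equal edges.
        Random random = new Random(seed);
        float share = percentages.get(step_index);
        for (PersonSketch p : persons) {
            int target = (int) (p.maxKnows * share);
            // A bounded number of attempts guarantees termination even when
            // no valid partner is left in the block.
            int maxAttempts = 10 * persons.size();
            for (int attempt = 0; attempt < maxAttempts && p.knows.size() < target; attempt++) {
                PersonSketch q = persons.get(random.nextInt(persons.size()));
                int qTarget = (int) (q.maxKnows * share);
                if (q != p && q.knows.size() < qTarget && !p.knows.contains(q.id)) {
                    // Edges are undirected: record them on both endpoints.
                    p.knows.add(q.id);
                    q.knows.add(p.id);
                }
            }
        }
    }
}
```

Creating the random number generator inside generateKnows, rather than keeping it as a field, is what makes two calls with the same seed and the same input block produce the same edges.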
Available knows generator implementations are:
ldbc.snb.datagen.generator.RandomKnowsGenerator
ldbc.snb.datagen.generator.DistanceKnowsGenerator
ldbc.snb.datagen.generator.ClusteringKnowsGenerator
The default generator is ldbc.snb.datagen.generator.DistanceKnowsGenerator.
RandomKnowsGenerator creates edges between the Persons in the block at random, trying to respect their assigned degrees, using the configuration model graph generator.
DistanceKnowsGenerator is the original LDBC Datagen edge generation process. It creates edges between Persons in the block with a probability based on their distance in the block.
ClusteringKnowsGenerator creates edges with the goal of reaching a target clustering coefficient, based on a community structure. This generator accepts the following parameter:
Option | Default | Description |
---|---|---|
ldbc.snb.datagen.generator.ClusteringKnowsGenerator.clusteringCoefficient | 0.1 | The value of desired clustering coefficient |
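To make the meaning of the target value more concrete, the sketch below computes one common definition of the clustering coefficient, the average local clustering coefficient, on a small undirected graph. Which definition Datagen actually targets is not stated here, so this choice is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ClusteringCoefficientSketch {
    // Adds an undirected edge between nodes a and b.
    static void addEdge(Map<Integer, Set<Integer>> graph, int a, int b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Average over all nodes of (links among neighbors) / (possible links among neighbors).
    // Nodes with fewer than two neighbors contribute 0.
    static double averageLocalClustering(Map<Integer, Set<Integer>> graph) {
        if (graph.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (Set<Integer> neighbors : graph.values()) {
            int d = neighbors.size();
            if (d < 2) {
                continue;
            }
            int links = 0;
            for (int u : neighbors) {
                for (int v : neighbors) {
                    // u < v counts each unordered neighbor pair once.
                    if (u < v && graph.get(u).contains(v)) {
                        links++;
                    }
                }
            }
            sum += links / (d * (d - 1) / 2.0);
        }
        return sum / graph.size();
    }
}
```

For a triangle with one extra node attached to a single corner, the four local values are 1/3, 1, 1 and 0, which gives an average of 7/12. A target of 0.1 therefore describes a graph in which, on average, about one in ten pairs of a person's friends are themselves friends.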
You can customize the way Dates and DateTimes are formatted in Datagen. By implementing the DateFormatter interface, you can control the actual format of the timestamps. In your params.ini file, set the following option to point to your implementation, as in this example:
ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.serializer.formatter.LongDateFormatter
We provide two default formatters:
- ldbc.snb.datagen.serializer.formatter.LongDateFormatter, which outputs both Dates and DateTimes as Unix epoch timestamps in milliseconds.
- ldbc.snb.datagen.serializer.formatter.StringDateFormatter, which outputs Dates and DateTimes as strings.
For the StringDateFormatter, the string format can be customized using the standard Java date and time pattern syntax. For example:
ldbc.snb.datagen.serializer.formatter.StringDateFormatter.dateTimeFormat:"yyyy-MM-dd HH:mm:ss.SSS"
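The effect of such a pattern can be previewed with plain Java. The sketch below assumes the pattern is interpreted with java.text.SimpleDateFormat semantics and that timestamps are rendered in GMT; neither is confirmed by the text above, and DateFormatSketch is a hypothetical helper rather than part of Datagen.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    // Renders a Unix epoch timestamp in milliseconds with the given pattern.
    // ASSUMPTION: GMT time zone and SimpleDateFormat pattern semantics.
    static String formatDateTime(long epochMillis, String pattern) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.ROOT);
        formatter.setTimeZone(TimeZone.getTimeZone("GMT"));
        return formatter.format(new Date(epochMillis));
    }

    // Mirrors the idea of the long formatter: the raw epoch value as text.
    static String formatAsLong(long epochMillis) {
        return Long.toString(epochMillis);
    }
}
```

Under these assumptions, formatDateTime(0L, "yyyy-MM-dd HH:mm:ss.SSS") returns 1970-01-01 00:00:00.000, while formatAsLong(0L) returns 0.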
The computation of the weights on the edges of the person-knows-person subgraph can be customized by implementing the Person.PersonSimilarity interface. Specify your implementation in your params.ini file, as in the following example:
ldbc.snb.datagen.generator.person.similarity:ldbc.snb.datagen.objects.similarity.GeoDistanceSimilarity
Currently provided plugins are GeoDistanceSimilarity (default), which computes the weight based on how geographically close the persons are, and InterestsSimilarity, where the weight is based on the interests the two persons have in common.
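As an illustration of what an interest-based weight could look like, the sketch below scores two persons by the Jaccard index of their interest sets. This is a substitute technique chosen for the example: the formula used by InterestsSimilarity is not described here, and the class name, method name, and integer interest ids are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class InterestsSimilaritySketch {
    // Jaccard index: size of the intersection divided by size of the union.
    // Returns 0 when both persons have no interests.
    static float similarity(Set<Integer> interestsA, Set<Integer> interestsB) {
        if (interestsA.isEmpty() && interestsB.isEmpty()) {
            return 0.0f;
        }
        Set<Integer> common = new HashSet<>(interestsA);
        common.retainAll(interestsB);
        int unionSize = interestsA.size() + interestsB.size() - common.size();
        return (float) common.size() / unionSize;
    }
}
```

For interest sets {1, 2, 3} and {2, 3, 4}, two interests are shared out of four distinct ones, so the weight is 0.5. The value always lies between 0 and 1, which makes it easy to compare with other normalized weights.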