IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
This is an implementation of the itemset miner from our paper:
A Bayesian Network Model for Interesting Itemsets
J. Fowkes and C. Sutton. PKDD 2016.
Simply import as a maven project into Eclipse using the File -> Import... menu option (note that this requires m2eclipse).
It's also possible to export a runnable jar from Eclipse using the File -> Export... menu option.
To compile a standalone runnable jar, simply run
mvn package
in the top-level directory (note that this requires Maven). This will create the standalone runnable jar itemset-mining-1.0.jar in the itemset-mining/target subdirectory. The main class is itemsetmining.main.ItemsetMining (see below).
IIM uses a Bayesian Network Model to determine which itemsets are the most interesting in a given dataset.
Main class itemsetmining.main.ItemsetMining mines itemsets from a specified transaction database file. It has the following command line options:
- -f database file to mine (in FIMI format)
- -i max. no. iterations
- -s max. no. structure steps
- -r max. runtime (min)
- -l log level (INFO/FINE/FINER/FINEST)
- -v print to console instead of log file
See the individual file javadocs in itemsetmining.main.ItemsetMining for information on the Java interface. In Eclipse you can set command line arguments for the IIM interface using the Run Configurations... menu option.
Here is a complete example using the command line interface on a runnable jar. We can mine the provided example dataset example.dat as follows:
$ java -cp itemset-mining/target/itemset-mining-1.0.jar itemsetmining.main.ItemsetMining \
    -i 100 \
    -f example.dat \
    -v
which will output to the console. Omitting the -v flag will redirect output to a log file in /tmp/.
IIM takes as input a transaction database file in FIMI format. The FIMI format is very simple: each line of the input file represents a transaction and each transaction is a space-separated list of items, represented by positive integers. The FIMI format requires the transaction items to be listed in increasing order and does not allow duplicate items (however, IIM is not sensitive to item order and ignores duplicate items). For example, a few lines (transactions) from example.dat are:
6 10 22 31 32 41 52
2 12 14 26 50
3 18 25 31 34 38 63
17 28 30 37
16 19 45 46 49 51 52 54 56 65
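The format rules above can be sketched as a small Java reader. This is only an illustration under stated assumptions: the class FimiReader and its methods are hypothetical and are not part of the IIM code base, and skipping blank lines is a choice made here rather than documented IIM behaviour. Like IIM, the sketch tolerates unordered and duplicated items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of IIM.
final class FimiReader {

    // Parses one line into a sorted, duplicate-free set of positive integers.
    static Set<Integer> parseTransaction(String line) {
        Set<Integer> items = new TreeSet<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // produced by an empty or whitespace-only line
            }
            int item = Integer.parseInt(token);
            if (item <= 0) {
                throw new IllegalArgumentException("Items must be positive integers: " + token);
            }
            items.add(item);
        }
        return items;
    }

    // Parses every line as one transaction; blank lines are skipped (an assumption).
    static List<Set<Integer>> parse(List<String> lines) {
        List<Set<Integer>> transactions = new ArrayList<>();
        for (String line : lines) {
            Set<Integer> transaction = parseTransaction(line);
            if (!transaction.isEmpty()) {
                transactions.add(transaction);
            }
        }
        return transactions;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("6 10 22 31 32 41 52", "2 12 14 26 50", "17 28 30 37");
        for (Set<Integer> transaction : parse(lines)) {
            System.out.println(transaction);
        }
    }
}
```

To read a real database, the lines could be obtained with java.nio.file.Files.readAllLines and passed to parse.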
Note that any other item formats (e.g. words for text corpora) need to be manually mapped to (and from) positive integers by means of a dictionary.
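One possible way to build such a dictionary is sketched below. The class ItemDictionary is hypothetical (IIM does not ship it); it assigns IDs starting at 1 in order of first appearance and writes each transaction as a FIMI line with increasing, duplicate-free IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of IIM.
final class ItemDictionary {
    private final Map<String, Integer> itemToId = new HashMap<>();
    private final List<String> idToItem = new ArrayList<>();

    // Returns the ID of an item, assigning the next positive integer if it is new.
    int encode(String item) {
        Integer id = itemToId.get(item);
        if (id == null) {
            idToItem.add(item);
            id = idToItem.size(); // IDs start at 1
            itemToId.put(item, id);
        }
        return id;
    }

    // Maps an ID (e.g. from IIM's output) back to the original item.
    String decode(int id) {
        return idToItem.get(id - 1);
    }

    // Encodes one transaction as a FIMI line: increasing order, no duplicates.
    String toFimiLine(List<String> transaction) {
        TreeSet<Integer> ids = new TreeSet<>();
        for (String item : transaction) {
            ids.add(encode(item));
        }
        StringBuilder line = new StringBuilder();
        for (int id : ids) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(id);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        ItemDictionary dictionary = new ItemDictionary();
        System.out.println(dictionary.toFimiLine(List.of("milk", "bread", "milk")));
        System.out.println(dictionary.toFimiLine(List.of("eggs", "bread")));
        System.out.println(dictionary.decode(2));
    }
}
```

The same dictionary instance has to be kept (or saved) after encoding so that the integers in the mined itemsets can be decoded back to the original items.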
IIM outputs a list of interesting itemsets, one itemset per line, ordered first by their interestingness (given in the 'int' column) followed by their probability (given in the 'prob' column). For example, the first few lines of output for the usage example above are:
============= INTERESTING ITEMSETS =============
{18} prob: 0.34830 int: 1.00000
{14} prob: 0.13740 int: 1.00000
{5} prob: 0.11740 int: 1.00000
{16} prob: 0.09110 int: 1.00000
{6, 7, 22, 36, 65, 67} prob: 0.08440 int: 1.00000
{17, 28, 30, 37} prob: 0.07830 int: 1.00000
{1, 2, 8, 11, 12, 13, 20, 63, 64} prob: 0.07670 int: 1.00000
{59, 60, 62} prob: 0.06980 int: 1.00000
{43, 46, 55} prob: 0.06890 int: 1.00000
{53} prob: 0.06870 int: 1.00000
See the accompanying paper for details of how to interpret 'interestingness' and 'probability' under IIM's probabilistic model.
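For post-processing, the output lines can be parsed back into Java objects. The sketch below assumes that every itemset line has exactly the shape shown in the sample above ("{items} prob: x int: y"); the class ItemsetLine is hypothetical and is not IIM's own API. Lines that do not match, such as the header, are reported as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of IIM.
final class ItemsetLine {
    // Assumed line shape, taken from the sample output: {a, b, c} prob: x int: y
    private static final Pattern LINE =
            Pattern.compile("^\\{([^}]*)\\}\\s+prob:\\s*(\\S+)\\s+int:\\s*(\\S+)$");

    final List<Integer> items;
    final double probability;
    final double interestingness;

    private ItemsetLine(List<Integer> items, double probability, double interestingness) {
        this.items = items;
        this.probability = probability;
        this.interestingness = interestingness;
    }

    // Returns the parsed itemset, or null if the line is not an itemset line.
    static ItemsetLine parse(String line) {
        Matcher matcher = LINE.matcher(line.trim());
        if (!matcher.matches()) {
            return null;
        }
        List<Integer> items = new ArrayList<>();
        for (String token : matcher.group(1).split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                items.add(Integer.parseInt(trimmed));
            }
        }
        return new ItemsetLine(items,
                Double.parseDouble(matcher.group(2)),
                Double.parseDouble(matcher.group(3)));
    }

    public static void main(String[] args) {
        ItemsetLine parsed = parse("{17, 28, 30, 37} prob: 0.07830 int: 1.00000");
        System.out.println(parsed.items + " " + parsed.probability + " " + parsed.interestingness);
    }
}
```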
IIM also has a (beta) parallel implementation using Spark in Standalone Mode with an HDFS filesystem (see e.g. relevant parts of this tutorial).
Basic IIM configuration for Spark and HDFS must be set in itemset-miner/src/main/resources/spark.properties
(see the example config provided):
- SparkHome Spark home directory
- SparkMaster URL of spark master server
- MachinesInCluster No. machines in the cluster
- HDFSMaster URL of HDFS master server
- HDFSConfFile Location of Hadoop core-site.xml
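As a quick sanity check before launching a job, the five settings can be validated with java.util.Properties. This is a sketch under assumptions: it presumes spark.properties uses the standard Java key=value syntax, the class SparkConfigCheck is hypothetical, and every value in the example is a placeholder rather than a recommended setting. The example config provided with IIM remains the authoritative template.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of IIM.
final class SparkConfigCheck {
    // The five settings listed above.
    static final List<String> REQUIRED_KEYS = List.of(
            "SparkHome", "SparkMaster", "MachinesInCluster", "HDFSMaster", "HDFSConfFile");

    // Parses properties text and fails fast if a required setting is missing.
    static Properties load(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED_KEYS) {
            String value = properties.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("Missing setting: " + key);
            }
        }
        // The number of machines must be an integer.
        Integer.parseInt(properties.getProperty("MachinesInCluster").trim());
        return properties;
    }

    public static void main(String[] args) {
        // All values below are placeholders for a hypothetical cluster.
        String example = "SparkHome=/opt/spark\n"
                + "SparkMaster=spark://master.example.org:7077\n"
                + "MachinesInCluster=4\n"
                + "HDFSMaster=hdfs://master.example.org:9000/\n"
                + "HDFSConfFile=/opt/hadoop/etc/hadoop/core-site.xml\n";
        Properties properties = load(example);
        System.out.println(properties.getProperty("SparkMaster"));
    }
}
```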
Main class itemsetmining.main.SparkItemsetMining mines itemsets using a Standalone Spark Server. It has the following additional command line options:
- -c no. Spark cores to use
- -j location of IIM standalone jar (default is itemset-mining/target/itemset-mining-1.0.jar)
See the individual file javadocs in itemsetmining.main.SparkItemsetMining for information on the Java interface.
A complete Spark example using the command line interface is as follows:
$ java -cp itemset-mining/target/itemset-mining-1.0.jar itemsetmining.main.SparkItemsetMining \
    -c 16 \
    -i 100 \
    -f example.dat \
    -v
which will output to the console. Omitting the -v flag will redirect output to a log file in /tmp/.
Please report any bugs using GitHub's issue tracker.
This algorithm is released under the GNU GPLv3 license.