Different groups of people often exhibit different behaviors and trends. For example, the bones of older people are more porous than those of younger people. Exploring the behaviors and trends of different groups is especially valuable in healthcare, because timely intervention can prevent adverse outcomes. The most straightforward tool for this is cohort analysis.
However, as big data accumulates over the years, query efficiency becomes a major challenge for OnLine Analytical Processing (OLAP) systems, and for cohort analysis in particular. COOL is introduced to address this problem.
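To make this concrete, here is a toy sketch of the kind of aggregation a cohort analysis performs: users are bucketed into cohorts by signup year, and a metric is averaged within each cohort. This is plain Java over in-memory data with invented names, purely for illustration; COOL evaluates such queries natively on its compressed storage layout.

```java
import java.util.*;
import java.util.stream.*;

// Toy cohort aggregation: bucket users into cohorts by signup year,
// then average a per-user metric (spend) within each cohort.
// All names and data here are invented for illustration.
public class CohortToy {
    record User(String id, int signupYear, double spend) {}

    static Map<Integer, Double> avgSpendByCohort(List<User> users) {
        return users.stream().collect(Collectors.groupingBy(
                User::signupYear,                       // cohort key
                TreeMap::new,                           // keep cohorts ordered
                Collectors.averagingDouble(User::spend) // metric per cohort
        ));
    }

    public static void main(String[] args) {
        List<User> users = List.of(
                new User("u1", 2020, 10.0),
                new User("u2", 2020, 30.0),
                new User("u3", 2021, 50.0));
        System.out.println(avgSpendByCohort(users));
        // {2020=20.0, 2021=50.0}
    }
}
```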
COOL is an online cohort analytical processing system that supports various types of data analytics, including cube query, iceberg query and cohort query.
With the support of several newly proposed operators on top of a sophisticated storage layer, COOL can provide high-performance (near real-time) analytical responses for emerging data warehouse domains.
- Easy to use. COOL is easy to deploy locally or on the cloud via Docker.
- Near real-time responses. COOL is highly efficient and can answer cohort queries in near real-time.
- Specialized storage layout. A specialized storage layout is designed for fast query processing and reduced space consumption.
- Self-designed semantics. Novel query semantics simplify the expression of cohort queries and extend their functionality.
- Flexible integration. COOL integrates flexibly with other data systems via common data formats (e.g., CSV, Parquet, Avro, and Arrow).
- Artificial intelligence model. A new neural network model will be introduced soon.
Simply run:
$ mvn clean package
To load a dataset into COOL, the following input files are required:
- dataset file: a CSV file with "," as the delimiter (normally dumped from a database table), with the table header removed.
- dimension file: a CSV file with "," as the delimiter. Each line has two fields: the first is the name of a column in the dataset, and the second is a value of that column. Each distinct value of each column in the dataset shall appear in this dimension file exactly once.
- dataset schema file: a `table.yaml` file specifying the dataset's columns and their measure fields.
- query file: a YAML file specifying the parameters for running the query server.
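For instance, if a dataset had a `country` column with distinct values `SG` and `US`, and a `gender` column with distinct values `M` and `F` (column names invented here for illustration), the dimension file would contain one line per (column, distinct value) pair:

```csv
country,SG
country,US
gender,M
gender,F
```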
Before query processing, we need to load the dataset into COOL's native format. Sample code for loading a CSV dataset with the data loader can be found in CsvLoader.java.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader path/to/your/source/directory path/to/your/.yaml path/to/your/dimensionfile path/to/your/datafile path/to/output/datasource/directory
The five arguments in the command have the following meanings, in order:
- a unique dataset name to be created under the output directory
- the dataset schema file, `table.yaml`
- the dimension file
- the dataset file
- the output directory for the compacted dataset
We provide an example of cohort query processing in CohortAnalysis.java.
There are several types of queries in COOL. The cohort query is processed in two steps.
- Select the specific users.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortSelection path/to/output/datasource/directory path/to/your/queryfile
- Execute the cohort query on the selected users.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis path/to/output/datasource/directory path/to/your/cohortqueryfile
- Execute the funnel query.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.FunnelAnalysis path/to/output/datasource/directory path/to/your/funnelqueryfile
- Execute the iceberg query.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.IcebergLoader path/to/output/datasource/directory path/to/your/queryfile
We have provided examples in the `sogamo` and `health` directories. Here we take `sogamo` as an example.
The COOL system supports the CSV data format by default, and you can load the `sogamo` dataset with the following command.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader sogamo sogamo/table.yaml sogamo/dim.csv sogamo/test.csv datasetSource
In addition, you can run the following commands to load datasets in other formats under the `sogamo` directory.
- Parquet format data
$ java -jar cool-extensions/parquet-extensions/target/parquet-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/dim.csv sogamo/test.parquet datasetSource
- Arrow format data
$ java -jar cool-extensions/arrow-extensions/target/arrow-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/dim.csv sogamo/test.arrow datasetSource
- Avro format data
$ java -jar cool-extensions/avro-extensions/target/avro-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/dim.csv sogamo/avro/test.avro datasetSource sogamo/avro/schema.avsc
Finally, a cube named `sogamo` will be generated under the `datasetSource` directory.
We use the `health` dataset as an example to demonstrate cohort analysis.
- Select the specific users.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortSelection datasetSource health/query1-0.json
where the three arguments are as follows:
- `datasetSource`: the output directory for the compacted dataset
- `health`: the cube name of the compacted dataset
- `health/query1-0.json`: the JSON file for the cohort query
- Display all records of the selected cohort in the terminal for exploration.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortExploration datasetSource health loyal
- Execute cohort query on the selected users.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis datasetSource health/query1-1.json
- Execute cohort query on all the users.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis datasetSource health/query2.json
Partial results for the query `health/query2.json` on the `health` dataset can be found in `result2.json`.
We use the `sogamo` dataset as an example to demonstrate funnel analysis.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.FunnelAnalysis datasetSource sogamo/query1.json
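To illustrate what a funnel query computes, here is a toy plain-Java sketch: given each user's ordered event sequence, it counts how many users reach each stage of the funnel in order. The stage and event names are invented for illustration; COOL evaluates funnel queries natively on its storage layout.

```java
import java.util.*;

// Toy funnel: count how many users reach each stage of an ordered funnel.
// A user counts toward stage k only after matching stages 0..k-1 in order.
public class FunnelToy {
    static int[] funnelCounts(List<List<String>> userEvents, List<String> stages) {
        int[] counts = new int[stages.size()];
        for (List<String> events : userEvents) {
            int stage = 0;
            for (String event : events) {
                if (stage < stages.size() && event.equals(stages.get(stage))) {
                    counts[stage]++;  // user advances to this stage
                    stage++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> stages = List.of("signup", "play", "purchase");
        List<List<String>> users = List.of(
            List.of("signup", "play", "purchase"),
            List.of("signup", "play"),
            List.of("play", "purchase"));  // never signed up: counted nowhere
        System.out.println(Arrays.toString(funnelCounts(users, stages)));
        // [2, 2, 1]
    }
}
```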
We have provided an example in the `olap-tpch` directory.
The COOL system supports the CSV data format by default, and you can load the `tpc-h` dataset with the following command.
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader tpc-h-10g olap-tpch/table.yaml olap-tpch/scripts/dim.csv olap-tpch/scripts/data.csv datasetSource
Finally, a cube named `tpc-h-10g` will be generated under the `datasetSource` directory.
Now we can execute OLAP queries on the generated dataset. For example, suppose we want to run the following query:
SELECT count(*), sum(O_TOTALPRICE)
FROM TPC_H
WHERE O_ORDERPRIORITY = '2-HIGH' AND R_NAME = 'EUROPE'
GROUP BY N_NAME, R_NAME
HAVING O_ORDERDATE >= '1993-01-01' AND O_ORDERDATE <= '1994-01-01'
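To make the query's semantics concrete, here is a plain-Java sketch of the same filter/group-by/aggregate over a handful of in-memory rows. This is illustrative only: COOL executes the query natively over its compressed cube, the row fields here are invented, and the HAVING date predicates are applied as a row filter.

```java
import java.util.*;

// Plain-Java sketch of the TPC-H style query above: filter rows, group by
// (N_NAME, R_NAME), and emit count(*) and sum(O_TOTALPRICE) per group.
public class IcebergSketch {
    record Row(String orderPriority, String regionName, String nationName,
               String orderDate, double totalPrice) {}

    // Returns group key "N_NAME,R_NAME" -> {count, sum}.
    static Map<String, double[]> run(List<Row> rows) {
        Map<String, double[]> out = new TreeMap<>();
        for (Row r : rows) {
            // WHERE O_ORDERPRIORITY = '2-HIGH' AND R_NAME = 'EUROPE'
            if (!r.orderPriority().equals("2-HIGH") || !r.regionName().equals("EUROPE"))
                continue;
            // Date range; lexicographic comparison works for ISO dates.
            if (r.orderDate().compareTo("1993-01-01") < 0
                    || r.orderDate().compareTo("1994-01-01") > 0)
                continue;
            // GROUP BY N_NAME, R_NAME
            String key = r.nationName() + "," + r.regionName();
            double[] agg = out.computeIfAbsent(key, k -> new double[2]);
            agg[0] += 1;               // count(*)
            agg[1] += r.totalPrice();  // sum(O_TOTALPRICE)
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("2-HIGH", "EUROPE", "FRANCE", "1993-06-01", 100.0),
            new Row("2-HIGH", "EUROPE", "FRANCE", "1993-07-01", 50.0),
            new Row("1-URGENT", "EUROPE", "FRANCE", "1993-06-01", 999.0),
            new Row("2-HIGH", "ASIA", "CHINA", "1993-06-01", 999.0));
        run(rows).forEach((k, v) ->
            System.out.println(k + " -> count=" + (long) v[0] + ", sum=" + v[1]));
        // FRANCE,EUROPE -> count=2, sum=150.0
    }
}
```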
- First, define the query in a JSON file, as shown in `query.json`.
- Then execute the following command:
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.IcebergLoader datasetSource olap-tpch/query.json
Results for the query `query.json` on the `tpc-h-10g` dataset can be found in `result.json`.
We can start COOL's query server with the following command:
$ java -jar cool-queryserver/target/cool-queryserver-0.1-SNAPSHOT.jar datasetSource 8080
where the arguments are as follows:
- `datasetSource`: the path to the repository of compacted datasets
- `8080`: the port of the server
This server implements a number of APIs; their corresponding URLs are listed below:
- `[server:port]/v1`: list all available URLs
- `[server:port]/v1/reload?cube=[cube_name]`: reload the cube
- `[server:port]/v1/list`: list existing cubes
- `[server:port]/v1/cohort/list?cube=[cube_name]`: list all cohorts in the selected cube
- `[server:port]/v1/cohort/selection`: perform cohort selection
- `[server:port]/v1/cohort/exploration`: perform cohort exploration
- `[server:port]/v1/cohort/analysis`: perform cohort analysis
- `[server:port]/v1/funnel/analysis`: perform funnel analysis
- `[server:port]/v1/olap/iceberg`: perform an iceberg query
COOL has a StorageService interface, which allows COOL standalone servers/workers (coming soon) to handle data movement between local storage and an external storage service. A sample implementation for HDFS can be found under hdfs-extensions.
- Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, B. C. Ooi. Cool: a COhort OnLine analytical processing system, in 2020 IEEE 36th International Conference on Data Engineering, pp.577-588, 2020.
- Q. Cai, Z. Xie, M. Zhang, G. Chen, H.V. Jagadish and B.C. Ooi. Effective Temporal Dependence Discovery in Time Series Data, in Proceedings of the VLDB Endowment, 11(8), pp.893-905, 2018.
- Z. Xie, Q. Cai, F. He, G.Y. Ooi, W. Huang, B.C. Ooi. Cohort Analysis with Ease, in Proceedings of the 2018 International Conference on Management of Data, pp.1737-1740, 2018.
- D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. Cohort Query Processing, in Proceedings of the VLDB Endowment, 10(1), 2016.