In this tutorial, we demonstrate a typical lifecycle of using OpenMLDB, including creating a database, importing data, offline feature extraction, online SQL deployment, and online feature extraction.
We first need to download the sample data set and start the OpenMLDB CLI. We strongly recommend using our Docker image for a quick start:
- Pull the image (download size: around 500 MB) and start the container:

  ```bash
  docker run -it 4pdosc/openmldb:0.3.2 bash
  ```

  💡 After starting the container successfully, the following commands are all executed in the container.

- Download the sample data:

  ```bash
  curl https://raw.githubusercontent.com/4paradigm/OpenMLDB/main/demo/standalone/data/data.csv --output ./data/data.csv
  ```

- Start the OpenMLDB service and CLI:

  ```bash
  # 1. Initialize the environment
  ./init.sh standalone
  # 2. Start the OpenMLDB CLI for the standalone mode
  ../openmldb/bin/openmldb --host 127.0.0.1 --port 6527
  ```
The figure below illustrates the successful execution of the above commands and the CLI:
💡 The following commands are executed in the OpenMLDB CLI by default, unless specified otherwise (the command-line prompt `>` is used for differentiation).
```sql
> CREATE DATABASE demo_db;
> USE demo_db;
> CREATE TABLE demo_table1(c1 string, c2 int, c3 bigint, c4 float, c5 double, c6 timestamp, c7 date, INDEX(key=c1, ts=c6));
```
The `INDEX` function used in `CREATE TABLE` accepts two important parameters, `key` and `ts`. `key` specifies the index column. If it is not specified, OpenMLDB will automatically use the first applicable column as the index column. Necessary indexes will be built when deploying the online SQL. `ts` specifies the ordered column, which is used by `ORDER BY`. Only `timestamp` and `bigint` columns can be specified as `ts`. `ts` and `key` also determine whether certain SQL queries can be executed offline; please refer to Introduction to OpenMLDB's Performance-Sensitive Mode for more details.
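For illustration, the sketch below defines a table whose ordered column is a `bigint` rather than a `timestamp`; the table name `demo_table2` and its columns are hypothetical and not part of this tutorial's data:

```sql
-- Hypothetical table: the bigint column c2 serves as the ordered column (ts)
> CREATE TABLE demo_table2(c1 string, c2 bigint, INDEX(key=c1, ts=c2));
```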
We import the sample data file (downloaded in the preparation step above) for feature extraction.
```sql
> LOAD DATA INFILE 'data/data.csv' INTO TABLE demo_table1;
```
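To sanity-check the import, we can preview a few rows. This is an optional step, and the values shown depend on the downloaded sample data:

```sql
> SELECT * FROM demo_table1 LIMIT 5;
```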
The following SQL performs the feature extraction and outputs the resulting features into a file, which can be consumed by subsequent model training algorithms.
```sql
> SELECT c1, c2, sum(c3) OVER w1 as w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature.csv';
```
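For intuition, the window `w1` groups rows by `c1`, orders them by `c6`, and sums `c3` over the current row plus up to two preceding rows. The values below are made up purely for illustration:

```sql
-- Hypothetical rows within one partition (c1 = 'aaa'), ordered by c6:
-- c3:        10   20   30   40
-- w1_c3_sum: 10   30   60   90   (current row + up to 2 preceding rows)
```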
Now we are ready to deploy our feature extraction SQL for online serving. Note that the same SQL script should be used for both offline and online feature extraction.
```sql
> DEPLOY demo_data_service SELECT c1, c2, sum(c3) OVER w1 AS w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
```
After that, we can list deployments with the command `SHOW DEPLOYMENTS`, and discard a deployment with `DROP DEPLOYMENT`.
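For example (a sketch; do not actually run the `DROP DEPLOYMENT` statement here if you want to follow the remaining steps of this tutorial):

```sql
> SHOW DEPLOYMENTS;
> DROP DEPLOYMENT demo_data_service;
```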
💡 Note that in this example, the same data set is used for both offline and online feature extraction. In practice, importing a separate, more recent data set is usually necessary.
```sql
> quit;
```
Now we have finished all the development and deployment tasks in the CLI and have returned to the OS command line.
We can use RESTful APIs to execute online feature extraction. The URL format is as follows:
```
http://127.0.0.1:8080/dbs/demo_db/deployments/demo_data_service
       \____________/     \_____/             \_______________/
             |                 |                        |
    APIServer endpoint Database name           Deployment name
```
The input data of an online request is in `json` format, and we should pack a row into the field `input`. For example:
```bash
curl http://127.0.0.1:8080/dbs/demo_db/deployments/demo_data_service -X POST -d'{"input": [["aaa", 11, 22, 1.2, 1.3, 1635247427000, "2021-05-20"]]}'
```
The expected return of the above request is shown below (the returned features are packed in the field `data`):
{"code":0,"msg":"ok","data":{"data":[["aaa",11,22]],"common_cols_data":[]}}
To help you better understand the workflow, we provide another, more complete demo, which can be found here.