Add Kmeans and AD command documentation #493

Merged 2 commits on Mar 15, 2022
1 change: 1 addition & 0 deletions docs/category.json
@@ -7,6 +7,7 @@
"user/admin/settings.rst"
],
"ppl_cli": [
"user/ppl/cmd/ad.rst",
"user/ppl/cmd/dedup.rst",
"user/ppl/cmd/eval.rst",
"user/ppl/cmd/fields.rst",
61 changes: 61 additions & 0 deletions docs/user/ppl/cmd/ad.rst
@@ -0,0 +1,61 @@
=============
ad
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
============
| The ``ad`` command applies the Random Cut Forest (RCF) algorithm from the ml-commons plugin to the search result returned by a PPL command. Depending on the input, one of two RCF variants is used: fixed-in-time RCF for time-series data, and batch RCF for non-time-series data.


Fixed In Time RCF For Time-series Data Command Syntax
=====================================================
ad <shingle_size> <time_decay> <time_field>

* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
* time_field: mandatory. It specifies the time field for RCF to use as time-series data.
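
To illustrate the ``shingle_size`` parameter, here is a minimal Python sketch of how consecutive records can be grouped into shingles with a sliding window. It is an illustration only, not the ml-commons implementation; the helper name ``make_shingles`` and the sample readings are hypothetical.

```python
from typing import List, Sequence


def make_shingles(values: Sequence[float], shingle_size: int = 8) -> List[List[float]]:
    """Return every run of `shingle_size` consecutive values (a sliding window)."""
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    # Fewer values than shingle_size yields an empty list (empty range).
    return [
        list(values[i:i + shingle_size])
        for i in range(len(values) - shingle_size + 1)
    ]


# Hypothetical ridership readings, oldest first.
readings = [10.0, 12.0, 11.0, 50.0, 13.0]
shingles = make_shingles(readings, shingle_size=3)
print(shingles)
```

Each shingle overlaps the previous one by all but one record, so a single unusual reading appears in several consecutive shingles.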


Batch RCF for Non-time-series Data Command Syntax
=================================================
ad <shingle_size> <time_decay>

* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
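
One way to picture ``time_decay`` is as an exponential down-weighting of older records. The Python sketch below assumes a weight of ``exp(-time_decay * age)``, a common formulation for time-decayed sampling. That formula is an assumption made for illustration and is not confirmed by this document as the ml-commons behavior; the function name ``decay_weight`` is hypothetical.

```python
import math


def decay_weight(age: int, time_decay: float = 0.001) -> float:
    """Relative weight of a record that is `age` steps old (assumed model)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return math.exp(-time_decay * age)


# Compare how quickly older records lose influence under the default decay.
for age in (0, 100, 1000, 5000):
    print(age, round(decay_weight(age), 4))
```

Under this assumed model, the default of 0.001 leaves a record 1,000 steps old with roughly 37% of the weight of the newest record; a larger ``time_decay`` makes the model forget older data faster.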


Example 1: Detecting events in New York City from taxi ridership data with time-series data
===========================================================================================

This example trains an RCF model and uses it to detect anomalies in the time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value, timestamp | AD time_field='timestamp' | where value=10844.0
    +----------+---------------+-------+---------------+
    | value    | timestamp     | score | anomaly_grade |
    |----------+---------------+-------+---------------|
    | 10844.0  | 1404172800000 | 0.0   | 0.0           |
    +----------+---------------+-------+---------------+


Example 2: Detecting events in New York City from taxi ridership data with non-time-series data
===============================================================================================

This example trains an RCF model and uses it to detect anomalies in the non-time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value | AD | where value=10844.0
    +----------+--------+-----------+
    | value    | score  | anomalous |
    |----------+--------+-----------|
    | 10844.0  | 0.0    | false     |
    +----------+--------+-----------+
38 changes: 38 additions & 0 deletions docs/user/ppl/cmd/kmeans.rst
@@ -0,0 +1,38 @@
=============
kmeans
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
============
| The ``kmeans`` command applies the k-means algorithm from the ml-commons plugin to the search result returned by a PPL command.


Syntax
======
kmeans <cluster-number>

* cluster-number: mandatory. The number of clusters you want to group your data points into.
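
To show what grouping data points into a given number of clusters means, here is a minimal Python sketch of k-means (Lloyd's algorithm). It is an illustration under simplifying assumptions, not the ml-commons implementation: the seeding is naive and deterministic, the function names are hypothetical, and the sample measurements are illustrative values in the style of the Iris data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Assign each point a cluster ID in [0, k) using Lloyd's algorithm."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Naive deterministic seeding: take k points spread evenly through the input.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Illustrative rows: (sepal length, sepal width, petal length, petal width) in cm.
samples = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
    (5.6, 3.0, 4.1, 1.3), (5.7, 2.8, 4.1, 1.3),
    (6.7, 2.5, 5.8, 1.8), (6.5, 3.0, 5.8, 2.2),
]
labels = kmeans(samples, k=3)
print(labels)
```

The cluster IDs are only labels: rows that share an ID belong to the same group, but the numbering itself carries no meaning and can differ between runs or implementations.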


Example: Clustering of Iris Dataset
===================================

This example shows how to group samples of three Iris species (Iris setosa, Iris virginica, and Iris versicolor) into clusters based on four features measured from each sample: the length and width of the sepals and petals.

PPL query::

    os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3
    +--------------------+-------------------+--------------------+-------------------+-----------+
    | sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
    |--------------------+-------------------+--------------------+-------------------+-----------|
    | 5.1                | 3.5               | 1.4                | 0.2               | 1         |
    | 5.6                | 3.0               | 4.1                | 1.3               | 0         |
    | 6.7                | 2.5               | 5.8                | 1.8               | 2         |
    +--------------------+-------------------+--------------------+-------------------+-----------+
4 changes: 4 additions & 0 deletions docs/user/ppl/index.rst
@@ -36,12 +36,16 @@ The query start with search command and then flowing a set of command delimited

- `Syntax <cmd/syntax.rst>`_

- `ad command <cmd/ad.rst>`_

- `dedup command <cmd/dedup.rst>`_

- `eval command <cmd/eval.rst>`_

- `fields command <cmd/fields.rst>`_

- `kmeans command <cmd/kmeans.rst>`_

- `parse command <cmd/parse.rst>`_

- `rename command <cmd/rename.rst>`_
14 changes: 13 additions & 1 deletion doctest/build.gradle
@@ -3,6 +3,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

import java.util.concurrent.Callable
import org.opensearch.gradle.testclusters.RunTask

plugins {
@@ -49,7 +50,18 @@ clean.dependsOn(cleanBootstrap)

testClusters {
    docTestCluster {
        plugin ':plugin'
        plugin(provider(new Callable<RegularFile>(){
            @Override
            RegularFile call() throws Exception {
                return new RegularFile() {
                    @Override
                    File getAsFile() {
                        return fileTree("resources/ml-commons").getSingleFile()
                    }
                }
            }
        }))

        testDistribution = 'integ_test'
    }
}
Binary file not shown.