Commit: Merging master for gitbook documentation changes

Showing 36 changed files with 704 additions and 0 deletions.

@@ -0,0 +1,15 @@
# Table of contents

* [Overview](README.md)
* [Why CueObserve](why-cueobserve.md)
* [Getting Started](getting-started.md)
* [Installation](installation.md)
* [Anomalies](anomalies.md)
* [Root Cause Analysis](root-cause-analysis.md)
* [Datasets](datasets.md)
* [Anomaly Definitions](anomaly-definitions.md)
* [Anomaly Detection](anomaly-detection.md)
* [Data Sources](sources.md)
* [Development](development.md)
* [Settings](settings.md)

@@ -0,0 +1,16 @@
# Anomalies

The Anomalies screen lists all published anomalies. Click on a row to view its anomaly card.

![](.gitbook/assets/anomalies.png)

Daily anomalies automatically unpublish if there's no anomaly for the next 5 days. Hourly anomalies unpublish after 1 day.

## Anomaly Cards

Anomaly cards follow a template. You can modify the templates if needed.

![Hourly Anomaly card](.gitbook/assets/anomalycard_hourly_cropped.png)

![Daily Anomaly card](.gitbook/assets/anomalycard_daily_cropped.png)

@@ -0,0 +1,90 @@
# Anomaly Definitions

You can define one or more anomaly detection jobs on a dataset. An anomaly detection job can monitor a measure at an aggregate level or split the measure by a dimension.

To define an anomaly job, you:

1. Select a dataset
2. Select a measure from the dataset
3. Select a dimension to split the measure _\(optional\)_
4. Select an anomaly rule

![](.gitbook/assets/anomalydefinitions.png)

## Split Measure by Dimension

`Measure` \[`Dimension` `Limit` \] \[`High/Low`\]

To split a measure by a dimension, select the dimension and then limit the number of unique dimension values you want to split into.

Choose the optional **High/Low** to detect only one type of anomaly. Choose **High** to detect only an increase in the measure, or **Low** to detect only a drop.

![](.gitbook/assets/anomalydefinition_cuel.gif)

### Limit Dimension Values

When you split a measure by a dimension, you must limit the number of unique dimension values. There are 3 ways to limit - **Top N**, **Min % Contribution**, and **Min Avg Value**.

#### Top N

Top N limits the number of dimension values based on the dimension value's contribution to the measure.

Say you want to monitor the Orders measure, but only for your top 10 states. You would then define the anomaly as shown below:

![](.gitbook/assets/topn.png)

#### Min % Contribution

Minimum % Contribution limits the number of dimension values based on the dimension value's contribution to the measure.

Say you want to monitor the Orders measure for every state that contributes at least 2% to total Orders. Your anomaly definition would look like below:

![](.gitbook/assets/mincontribution.png)

#### Min Avg Value

Minimum Average Value limits the number of dimension values based on the measure's average value.

![](.gitbook/assets/minavgvalue.png)

In the example above, only states where _average\(Orders\) >= 10_ will be selected. If your granularity is daily, this means daily average orders. If your granularity is hourly, this means hourly average orders.

## Anomaly Detection Algorithms

CueObserve offers the following algorithms for anomaly detection:

1. Prophet
2. Percentage Change
3. Lifetime High/Low
4. Value Threshold

### Prophet

This algorithm uses the open-source [Prophet](https://github.com/facebook/prophet) procedure to generate a forecast for the timeseries. It then compares the actual value with the forecasted value. If the actual value is outside the forecast's confidence range \(_grey band in the image below_\), it marks the actual value as an anomalous data point.

The metric's percentage deviation \(_45% in the image below_\) is calculated with respect to the threshold of the forecast's confidence range.

![](.gitbook/assets/anomalydeviation.png)

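For intuition, the deviation calculation can be sketched as below. This is an illustrative snippet assuming the forecast exposes lower and upper confidence bounds; it is not CueObserve's actual implementation.

```python
def percentage_deviation(actual, lower, upper):
    """Deviation of the actual value relative to the violated confidence bound.

    Returns 0 when the value lies inside the band (no anomaly).
    """
    if lower <= actual <= upper:
        return 0.0
    threshold = upper if actual > upper else lower  # the bound that was crossed
    return (actual - threshold) / abs(threshold) * 100

# Example: confidence band is [80, 100] and the actual value is 145
print(round(percentage_deviation(145, 80, 100)))  # 45 (45% above the upper bound)
```
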
### Percentage Change

This algorithm compares the metric's actual value with its previous value in the timeseries. For example, for a timeseries with daily granularity, it will compare the value on 15th August with the value on 14th August.

If the percentage change is higher than the specified threshold, it marks the recent value as an anomalous data point.

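A rough illustration of the idea using pandas, with made-up numbers and a made-up 20% threshold (not the actual CueObserve code):

```python
import pandas as pd

def flag_percentage_change(values: pd.Series, threshold_pct: float) -> pd.Series:
    """Flag points whose absolute percentage change from the previous point exceeds the threshold."""
    pct_change = values.pct_change().abs() * 100
    return pct_change > threshold_pct

orders = pd.Series([100, 104, 98, 150], index=pd.date_range("2021-08-12", periods=4, freq="D"))
print(flag_percentage_change(orders, threshold_pct=20))  # only the jump from 98 to 150 is flagged
```
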
### Lifetime High/Low

This algorithm finds the metric's highest and lowest values in the timeseries. It marks these data points as anomalous data points.

### Value Threshold

This algorithm identifies anomalous data points in the timeseries as per the mathematical rule specified in the anomaly definition screen. Below are a few sample rules:

_Anomaly when Value greater than `X`_

_Anomaly when Value not between `X` and `Y`_

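For illustration, the two sample rules above reduce to simple comparisons on each data point (hypothetical helper functions, not CueObserve's rule engine):

```python
def anomaly_when_greater_than(value, x):
    """Anomaly when Value greater than X."""
    return value > x

def anomaly_when_not_between(value, x, y):
    """Anomaly when Value not between X and Y."""
    return not (x <= value <= y)

print(anomaly_when_greater_than(120, 100))    # True
print(anomaly_when_not_between(50, 10, 100))  # False - value lies within the range
```
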
@@ -0,0 +1,32 @@
# Anomaly Detection

When you run an anomaly definition, CueObserve does the following:

### Execute Dataset SQL

As the first step in the anomaly detection process, CueObserve executes the dataset's SQL query and fetches the result as a dataframe. This dataframe acts as the source data both for identifying dimension values and for the anomaly detection process.

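Conceptually, this step boils down to something like the snippet below. The connection string and query are placeholders; the real connection comes from the configured data source.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/db")        # placeholder connection
dataset_sql = "SELECT OrderDate, State, Orders FROM orders_rollup"   # the dataset's SQL query

df = pd.read_sql(dataset_sql, engine)  # source dataframe for the anomaly detection run
```
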
### Generate sub dataframes

Next, CueObserve creates the new dataframes on which the actual anomaly detection process will run. During this step, CueObserve finds the dimension values and creates sub-dataframes by filtering on the dimension. This filtering is done only if a dimension is specified in the anomaly definition.

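In pandas terms, the split might look like this (column names and the list of selected values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "OrderDate": ["2021-08-14", "2021-08-14", "2021-08-15"],
    "State": ["CA", "NY", "CA"],
    "Orders": [120, 80, 135],
})

# One sub-dataframe per selected dimension value, e.g. the states picked by a Top N limit
sub_dataframes = {state: df[df["State"] == state] for state in ["CA", "NY"]}
```
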
### Transform sub dataframe

Once CueObserve has the final dataframes, it drops all columns except the metric and the timestamp, and then aggregates the metric on the timestamp column. That is, the metric is summed after grouping over the timestamp column. CueObserve now has dataframes free of metadata that is unnecessary for the actual anomaly detection process.

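A minimal sketch of that transformation, assuming illustrative column names:

```python
import pandas as pd

sub_df = pd.DataFrame({  # one filtered sub-dataframe from the previous step
    "OrderDate": ["2021-08-14", "2021-08-14", "2021-08-15"],
    "State": ["CA", "CA", "CA"],
    "Orders": [70, 50, 135],
})

# Drop everything except the timestamp and the measure, then sum the measure per timestamp
timeseries = sub_df.groupby("OrderDate", as_index=False)["Orders"].sum()
print(timeseries)  # 2021-08-14 -> 120, 2021-08-15 -> 135
```
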
### Generate Timeseries Forecast

CueObserve now feeds the timeseries dataframe into [Prophet](https://github.com/facebook/prophet). Each dataframe is trained separately on Prophet and a forecast is generated. The number of forecast points is 24 if the granularity is hourly, and 15 if the granularity is daily.

Each dataframe must have at least 20 data points after aggregation, as anything less would be too little training data. For hourly granularity, CueObserve does not consider data older than a week for the training process.

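A simplified sketch of this step with the Prophet library. Prophet expects columns named `ds` and `y`; the training data, options and horizon handling here are illustrative rather than CueObserve's actual code.

```python
import pandas as pd
from prophet import Prophet  # package name may be fbprophet in older versions

# Aggregated daily timeseries in Prophet's expected format
train = pd.DataFrame({
    "ds": pd.date_range("2021-07-01", periods=45, freq="D"),
    "y": [100 + (i % 7) * 5 for i in range(45)],
})

granularity = "day"  # illustrative; "hour" would use a 24-point horizon
model = Prophet()
model.fit(train)

periods = 24 if granularity == "hour" else 15  # forecast horizon as described above
future = model.make_future_dataframe(periods=periods, freq="H" if granularity == "hour" else "D")
forecast = model.predict(future)  # includes yhat, yhat_lower and yhat_upper columns
```
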
### Detect Anomaly

Next, CueObserve combines the actual data with the forecasted data from Prophet, along with the uncertainty interval bands. These bands estimate the trend of the data and are used as the threshold for determining whether a data point is an anomaly. For each data point in the original dataframe, CueObserve checks whether it lies within the predicted bands and classifies it accordingly.

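The per-point check itself is just a band comparison; the column names below follow Prophet's output and the values are made up:

```python
import pandas as pd

points = pd.DataFrame({
    "ds": pd.date_range("2021-08-13", periods=3, freq="D"),
    "y": [120, 135, 260],           # actual values
    "yhat_lower": [100, 110, 115],  # lower uncertainty band
    "yhat_upper": [150, 160, 170],  # upper uncertainty band
})

# A data point is anomalous when it falls outside the predicted band
points["is_anomaly"] = (points["y"] < points["yhat_lower"]) | (points["y"] > points["yhat_upper"])
print(points[["ds", "y", "is_anomaly"]])  # only the last point is flagged
```
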
### Create Card

CueObserve saves the actual data, the bands, and the forecast in its database. If the latest anomalous data point is not older than a certain time threshold, CueObserve publishes it as an anomaly and saves the dimension value and its contribution. This time threshold depends on the granularity: 5 days if the granularity is daily and 1 day if the granularity is hourly.

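The publish decision can be read as a simple recency check (sketch only; the granularity labels are illustrative):

```python
from datetime import datetime, timedelta

def should_publish(latest_anomaly_ts: datetime, granularity: str, now: datetime) -> bool:
    """Publish only if the latest anomalous point is recent enough for its granularity."""
    max_age = timedelta(days=5) if granularity == "daily" else timedelta(days=1)
    return now - latest_anomaly_ts <= max_age

print(should_publish(datetime(2021, 8, 13), "daily", now=datetime(2021, 8, 15)))  # True
```
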
Finally, CueObserve stores all the individual results of the process, along with the metadata, in a format suited to easy visual representation in the UI.

@@ -0,0 +1,33 @@
# Datasets

Datasets are similar to aggregated SQL VIEWS of your data. When you run an anomaly detection job, the associated dataset's SQL query is run and the results are stored as a Pandas dataframe in memory.

![](.gitbook/assets/dataset_sql.png)

You write a SQL GROUP BY query with aggregate functions to roll up your data. You then map the columns as dimensions or measures.

![](.gitbook/assets/dataset_mapping_cropped.png)

1. A dataset must have exactly one timestamp column. This timestamp column is used to generate timeseries data for anomaly detection.
2. A dataset must have at least one aggregate column. CueObserve currently supports only COUNT or SUM as aggregate functions. Aggregate columns must be mapped as measures.
3. A dataset can have one or more dimension columns \(optional\).

## SQL GROUP BY Query

Your SQL must group by timestamp and all dimension columns. You must truncate the timestamp column to HOUR or DAY before grouping. For example, if you want to track hourly anomalies on the dataset, truncate the timestamp to HOUR.

Below is a sample GROUP BY query for BigQuery. See [Data Sources](sources.md) for sample queries on other databases and data warehouses.

```sql
SELECT
  TIMESTAMP_TRUNC(CreatedTS, DAY) as OrderDate, -- HOUR or DAY granularity
  City, State, -- dimensions
  COUNT(1) as Orders, SUM(IFNULL(Order_Amount,0)) as OrderAmount -- measures
FROM ORDERS
WHERE CreatedTS >= TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 400 DAY) -- limit historical data to use for forecasting
GROUP BY 1, 2, 3
ORDER BY 1
```

Since the last time bucket might be partial, CueObserve ignores the last time bucket when generating timeseries.

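For intuition, dropping the potentially partial trailing bucket looks like this on the aggregated result (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "OrderDate": pd.date_range("2021-08-10", periods=4, freq="D"),
    "Orders": [120, 135, 128, 47],  # the last day is still filling up
})

df = df.iloc[:-1]  # ignore the last, possibly partial, time bucket
```
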
@@ -0,0 +1,180 @@
---
description: >-
  If you plan to work on CueObserve code and make changes, this documentation
  will give you a high-level overview of the components used and how to modify
  them.
---

# Development

### Overview

CueObserve has 5 basic components:

1. Frontend single-page application written in [ReactJS](https://reactjs.org/).
2. Backend based on [Django](https://www.djangoproject.com/) \(a Python framework\), which communicates with the frontend application via REST APIs.
3. [Celery](https://docs.celeryproject.org/) to execute tasks asynchronously. Tasks like anomaly detection are handled by Celery.
4. [Celery beat](https://docs.celeryproject.org/en/stable/userguide/periodic-tasks.html) scheduler to trigger the scheduled tasks.
5. [Redis](https://redis.io/documentation) to handle Celery's task queue.

### Getting code

Get the code by cloning our open-source [GitHub repo](https://github.com/cuebook/cueobserve):

```text
git clone https://github.com/cuebook/CueObserve.git
cd CueObserve
```

### Frontend Development

The code for the frontend is in the `/ui` directory. CueObserve uses `npm` as the package manager.

**Prerequisites:**

1. Node >= 12
2. npm >= 6

```bash
cd ui
npm install # install dependencies
npm start # start development server
```

This starts the frontend server on [http://localhost:3000/](http://localhost:3000/).

### Backend Development

The code for the backend is in the `/api` directory. As mentioned in the overview, it is based on the Django framework.

**Prerequisites:**

1. Python 3.7
2. PostgreSQL server running locally or on a remote server \(optional\)

#### Setup Virtual Environment & Install Dependencies

A virtual environment keeps this project's Python libraries isolated, so they don't conflict with other projects.

```bash
cd api
python3 -m virtualenv myenv # Create Python3 virtual environment
source myenv/bin/activate # Activate virtual environment

pip install -r requirements.txt # Install project dependencies
```

#### Configure environment variables

The environment variables required to run the backend server can be found in `api/.env.dev`. The file looks like below:

```bash
export ENVIRONMENT=dev

## DB SETTINGS
export POSTGRES_DB_HOST="localhost"
export POSTGRES_DB_USERNAME="postgres"
export POSTGRES_DB_PASSWORD="postgres"
export POSTGRES_DB_SCHEMA="cue_observe"
export POSTGRES_DB_PORT=5432

## SUPERUSER'S VARIABLE
export DJANGO_SUPERUSER_USERNAME="User"
export DJANGO_SUPERUSER_PASSWORD="admin"
export DJANGO_SUPERUSER_EMAIL="admin@domain.com"

## AUTHENTICATION
export IS_AUTHENTICATION_REQUIRED=False
```

Change the values based on your running PostgreSQL instance. If you do not wish to use PostgreSQL as your database for development, comment out lines 4-8 \(the DB settings\) and CueObserve will create a SQLite database file at `api/db/db.sqlite3`.

After changing the values, source the file to initialize all the environment variables:

```text
source .env.dev
```

Then run the following commands to migrate the schema to your database and load static data required by CueObserve:

```bash
python manage.py migrate # Migrate db schema
python manage.py loaddata seeddata/*.json # Load seed data in database
```

After the above steps are completed successfully, start the backend server by running:

```text
python manage.py runserver
```

This starts the backend server on [http://localhost:8000/](http://localhost:8000/).

#### Celery Development

CueObserve uses Celery to execute asynchronous tasks like anomaly detection. Three components are needed to run an asynchronous task: Redis, Celery and Celery Beat. Redis is used as the message queue by Celery, so the Redis server must be running before you start the Celery services. Celery Beat is the scheduler and is responsible for triggering scheduled tasks. Celery workers execute the tasks.

**Starting Redis Server**

The Redis server can be started easily with its official Docker image:

```bash
docker run -dp 6379:6379 redis # Run redis docker on port 6379
```

#### Start Celery Beat

To start the Celery Beat service, activate the virtual environment created for the backend server and then source the `.env.dev` file to export all required environment variables.

```bash
cd api
source myenv/bin/activate # Activate virtual environment
source .env.dev # Export environment variables
celery -A app beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler --detach # Run celery beat service
```

#### Start Celery

Starting the Celery worker is similar to starting the backend or Celery Beat: first activate the virtual environment, then source the `.env.dev` file to export all required environment variables. The Celery service doesn't reload on code changes, so we install a few additional libraries to make it do so.

```text
cd api
source myenv/bin/activate # Activate virtual environment
source .env.dev # Export environment variables
pip install watchdog pyyaml argh # Additional libraries to reload celery on code changes
watchmedo auto-restart -- celery -A app worker -l info --purge # Run celery
```

After these three services are running, you can trigger a task or wait for a scheduled task to run.

### Building Docker Image

To build the Docker image, run the following command in the root directory:

```text
docker build -t <YOUR_TAG_NAME> .
```

To run the built image exposed on port 3000:

```text
docker run -dp 3000:3000 <YOUR_TAG_NAME>
```

### Testing

At the moment we have test cases only for the backend service; test cases for the UI are on our roadmap.

The backend test environment is lightweight and doesn't depend on services like Redis, Celery or Celery Beat; they are mocked instead. The backend APIs and services are tested using [PyTest](https://docs.pytest.org/en/6.2.x/).

To run the test cases, activate the virtual environment and then source the `.env.dev` file to export all required environment variables.

```text
cd api
source myenv/bin/activate # Activate virtual environment
source .env.dev # Export environment variables
pytest # Run tests
```
