Adding median absolute deviation documentation page #6808

Closed
1 change: 1 addition & 0 deletions _aggregations/metric/index.md
@@ -37,6 +37,7 @@ OpenSearch supports the following metric aggregations:
- [Geobounds]({{site.url}}{{site.baseurl}}/aggregations/metric/geobounds/)
- [Matrix stats]({{site.url}}{{site.baseurl}}/aggregations/metric/matrix-stats/)
- [Maximum]({{site.url}}{{site.baseurl}}/aggregations/metric/maximum/)
- [Median absolute deviation]({{site.url}}{{site.baseurl}}/aggregations/metric/median-absolute-deviation/)
- [Minimum]({{site.url}}{{site.baseurl}}/aggregations/metric/minimum/)
- [Percentile ranks]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile-ranks/)
- [Percentile]({{site.url}}{{site.baseurl}}/aggregations/metric/percentile/)
156 changes: 156 additions & 0 deletions _aggregations/metric/median-absolute-deviation.md
@@ -0,0 +1,156 @@
---
layout: default
title: Median absolute deviation
parent: Metric aggregations
grand_parent: Aggregations
nav_order: 65
redirect_from:
- /query-dsl/aggregations/metric/median-absolute-deviation/
---

# Median absolute deviation aggregations

The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a numeric field. Median absolute deviation is a statistical measure of data variability. Because it measures dispersion from the median, it may be less affected by outliers in a dataset than measures based on the mean, such as standard deviation.
vagimeli marked this conversation as resolved.

Suggested change
The `median_absolute_deviation` metric is a single-value metric aggregation that returns the median absolute deviation of a field. Median absolute deviation is a statistical measure of data variability. Since the median absolute deviation measures dispersion from the median, it provides a more robust measure of variability that is less affected by outliers in a dataset.


Median absolute deviation is calculated as follows:<br>
median_absolute_deviation = median(|X<sub>i</sub> - median(X)|)<br>
Here, X<sub>i</sub> is each value of the field and median(X) is the median of all values.
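
As a rough illustration of the formula above, the following Python sketch computes the exact median absolute deviation of a small in-memory list. The function name `median_absolute_deviation` and the sample values are assumptions made for this example; OpenSearch returns an estimate rather than this exact computation.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the median."""
    center = median(values)
    # Absolute distance of every value from the median of the data.
    deviations = [abs(v - center) for v in values]
    return median(deviations)


# Hypothetical sample with one outlier (100).
# The median is 3, so the deviations are [2, 1, 0, 1, 97].
sample = [1, 2, 3, 4, 100]
mad = median_absolute_deviation(sample)
```

Replacing the outlier `100` with `5` leaves the result unchanged, which is the robustness property the description refers to.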

The following example calculates the median absolute deviation of the `DistanceMiles` field in the `opensearch_dashboards_sample_data_flights` index:
@vagimeli (Contributor), May 10, 2024

Suggested change
The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:



```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_DistanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles"
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 35,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_DistanceMiles": {
      "value": 1829.8993624441966
    }
  }
}
```
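
A minimal sketch of reading the result out of a response like the one above, assuming the response body is already available as JSON text. The helper `aggregation_value` is hypothetical, the response is abridged, and the key is assumed to match the aggregation name used in the request.

```python
import json

# Abridged copy of the example response; only the fields used below are kept.
response_text = """
{
  "took": 35,
  "timed_out": false,
  "aggregations": {
    "median_absolute_deviation_DistanceMiles": {
      "value": 1829.8993624441966
    }
  }
}
"""


def aggregation_value(response, name):
    """Return the single value stored under the aggregation called `name`."""
    return response["aggregations"][name]["value"]


response = json.loads(response_text)
mad = aggregation_value(response, "median_absolute_deviation_DistanceMiles")
```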

### Missing
You can specify a value to use for documents in which the field is missing by setting the `missing` parameter. A field is considered missing when it is absent from the document or has a `null` value.
@sandeshkr419, May 2, 2024

Should we just rephrase this to:

By default, a missing field or a null value in a field of a document is ignored in computations. You can specify a value for such documents by using the `missing` parameter.

I don't like the idea of saying that we are providing a default value. It is more like "treat the missing values as this value that I'm specifying."

@vagimeli (Contributor), May 10, 2024

Suggested change
You can set a default value for fields that are missing from documents by specifying the `missing` parameter. A field counts as missing when it is absent from the document or set to `null`. The following example uses the `missing` parameter:


```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_distanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles",
        "missing": 1000
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_distanceMiles": {
      "value": 1829.6443646143355
    }
  }
}
```
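
A minimal sketch of the `missing` behavior described in this section, using plain Python dicts in place of documents. The helper `mad_of_field` and the sample documents are hypothetical, and the computation is exact rather than the estimate OpenSearch returns.

```python
from statistics import median


def mad_of_field(docs, field, missing=None):
    """Exact MAD of `field` across `docs`, which are modeled as plain dicts."""
    values = []
    for doc in docs:
        value = doc.get(field)  # None covers an absent field and an explicit null
        if value is None:
            if missing is None:
                continue  # no substitute given: leave the document out
            value = missing  # substitute given: count the document with that value
        values.append(value)
    center = median(values)
    return median([abs(v - center) for v in values])


# Hypothetical documents: one lacks the field, one holds an explicit null.
docs = [
    {"DistanceMiles": 1},
    {"DistanceMiles": 2},
    {"DistanceMiles": 3},
    {},
    {"DistanceMiles": None},
]
ignored = mad_of_field(docs, "DistanceMiles")  # computed over [1, 2, 3]
substituted = mad_of_field(docs, "DistanceMiles", missing=10)  # over [1, 2, 3, 10, 10]
```

The two results differ because the substituted value takes part in both the median and the deviations.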

### Compression
The median absolute deviation is calculated using [t-digest](https://github.com/tdunning/t-digest/tree/main), a data structure that trades estimation accuracy for performance. The `compression` parameter controls this trade-off and defaults to `1000`. Decreasing the `compression` value improves performance but reduces the accuracy of the estimate.

Check failure on line 112 in _aggregations/metric/median-absolute-deviation.md (GitHub Actions / style-job)

[vale] reported by reviewdog 🐶: [OpenSearch.Spelling] Error: TDigest's. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. Location: line 112, column 215. Severity: ERROR.
@sandeshkr419, May 2, 2024

I think we need to add a little (very little) detail on compression here.

How about:

The calculation of the median absolute deviation utilizes the t-digest data structure, which controls the balance between performance and accuracy of estimation. T-digest controls this balance through the `compression` parameter, with a default value of 1000. Adjusting the compression value affects the trade-off between computational efficiency and the precision of the estimation. Lowering the compression value improves performance but may lead to slightly less accurate results, while increasing it enhances accuracy at the cost of increased computational overhead.

@vagimeli (Contributor), May 10, 2024

Suggested change
The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances performance and estimation accuracy through the `compression` parameter (default value: `1000`). Adjusting the `compression` value affects the trade-off between computational efficiency and precision. Lower `compression` values improve performance but may reduce estimation accuracy, while higher values enhance accuracy at the cost of increased computational overhead.


```json
GET opensearch_dashboards_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "median_absolute_deviation_DistanceMiles": {
      "median_absolute_deviation": {
        "field": "DistanceMiles",
        "compression": 10
      }
    }
  }
}
```
{% include copy-curl.html %}

#### Example response

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "median_absolute_deviation_DistanceMiles": {
      "value": 1836.265614211182
    }
  }
}
```
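
The request bodies on this page differ only in the optional `missing` and `compression` parameters. As a sketch of how a client might assemble them, the following Python helper builds the body as a dict. The function `build_mad_request` and its naming scheme for the aggregation are assumptions made for this example, not part of any OpenSearch client.

```python
import json


def build_mad_request(field, missing=None, compression=None):
    """Build a search body with one median_absolute_deviation aggregation."""
    options = {"field": field}
    # Optional parameters are only included when the caller sets them.
    if missing is not None:
        options["missing"] = missing
    if compression is not None:
        options["compression"] = compression
    name = f"median_absolute_deviation_{field}"  # hypothetical naming scheme
    return {"size": 0, "aggs": {name: {"median_absolute_deviation": options}}}


body = build_mad_request("DistanceMiles", compression=10)
payload = json.dumps(body, indent=2)  # JSON text to send to the _search endpoint
```

Leaving both optional arguments unset produces the body of the first example on this page.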