Commit 0dba84d

Update matrix-stats aggregation. (#9434) (#9964)
1 parent b8b6ea7 commit 0dba84d

1 file changed

+266
-48
lines changed

_aggregations/metric/matrix-stats.md

Lines changed: 266 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -9,8 +9,24 @@ redirect_from:
99

1010
# Matrix stats aggregations
1111

12-
The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form.
13-
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields:
12+
The `matrix_stats` aggregation is a multi-value metric aggregation that generates covariance statistics for two or more fields in matrix form.
13+
14+
The `matrix_stats` aggregation does not support scripting.
15+
{: .note}
16+
17+
## Parameters
18+
19+
The `matrix_stats` aggregation takes the following parameters.
20+
21+
| Parameter | Required/Optional | Data type | Description |
22+
| :-- | :-- | :-- | :-- |
23+
| `fields` | Required | Array of strings | An array of fields for which the matrix stats are computed. |
24+
| `missing` | Optional | Object | The value to use in place of missing values. By default, missing values are ignored. See [Missing values](#missing-values). |
25+
| `mode` | Optional | String | The value to use as a sample from a multi-valued or array field. Allowed values are `avg`, `min`, `max`, `sum`, and `median`. Default is `avg`. |
26+
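To illustrate what the `mode` parameter does, here is a minimal Python sketch of how a multi-valued field might be collapsed into one sample per document. The `reduce_multi_valued` helper and the `REDUCERS` mapping are hypothetical names used only for illustration. They are not part of OpenSearch and do not reflect its internal implementation.

```python
from statistics import mean, median

# Hypothetical mapping from each documented `mode` value to a reducer function.
REDUCERS = {
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
    "median": median,
}


def reduce_multi_valued(values, mode="avg"):
    """Collapse one document's array of values into a single sample."""
    return REDUCERS[mode](values)


# One document's multi-valued field, for example a list of grades.
grades = [3.0, 3.9, 4.0]
for mode in REDUCERS:
    print(mode, reduce_multi_valued(grades, mode))
```

With the default `avg` mode, the three grades above reduce to a single value of roughly 3.63, which is then treated as that document's sample for the field.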
27+
## Example
28+
29+
The following example returns statistics for the `taxful_total_price` and `products.base_price` fields in the OpenSearch Dashboards e-commerce sample data:
1430

1531
```json
1632
GET opensearch_dashboards_sample_data_ecommerce/_search
@@ -27,60 +43,262 @@ GET opensearch_dashboards_sample_data_ecommerce/_search
2743
```
2844
{% include copy-curl.html %}
2945

30-
#### Example response
46+
The response contains the aggregated results:
3147

3248
```json
33-
...
34-
"aggregations" : {
35-
"matrix_stats_taxful_total_price" : {
36-
"doc_count" : 4675,
37-
"fields" : [
38-
{
39-
"name" : "products.base_price",
40-
"count" : 4675,
41-
"mean" : 34.994239430147196,
42-
"variance" : 360.5035285833703,
43-
"skewness" : 5.530161335032702,
44-
"kurtosis" : 131.16306324042148,
45-
"covariance" : {
46-
"products.base_price" : 360.5035285833703,
47-
"taxful_total_price" : 846.6489362233166
49+
{
50+
"took": 250,
51+
"timed_out": false,
52+
"_shards": {
53+
"total": 1,
54+
"successful": 1,
55+
"skipped": 0,
56+
"failed": 0
57+
},
58+
"hits": {
59+
"total": {
60+
"value": 4675,
61+
"relation": "eq"
62+
},
63+
"max_score": null,
64+
"hits": []
65+
},
66+
"aggregations": {
67+
"matrix_stats_taxful_total_price": {
68+
"doc_count": 4675,
69+
"fields": [
70+
{
71+
"name": "products.base_price",
72+
"count": 4675,
73+
"mean": 34.99423943014724,
74+
"variance": 360.5035285833702,
75+
"skewness": 5.530161335032689,
76+
"kurtosis": 131.1630632404217,
77+
"covariance": {
78+
"products.base_price": 360.5035285833702,
79+
"taxful_total_price": 846.6489362233169
80+
},
81+
"correlation": {
82+
"products.base_price": 1,
83+
"taxful_total_price": 0.8444765264325269
84+
}
4885
},
49-
"correlation" : {
50-
"products.base_price" : 1.0,
51-
"taxful_total_price" : 0.8444765264325268
86+
{
87+
"name": "taxful_total_price",
88+
"count": 4675,
89+
"mean": 75.05542864304839,
90+
"variance": 2788.1879749835425,
91+
"skewness": 15.812149139923994,
92+
"kurtosis": 619.1235507385886,
93+
"covariance": {
94+
"products.base_price": 846.6489362233169,
95+
"taxful_total_price": 2788.1879749835425
96+
},
97+
"correlation": {
98+
"products.base_price": 0.8444765264325269,
99+
"taxful_total_price": 1
100+
}
52101
}
53-
},
54-
{
55-
"name" : "taxful_total_price",
56-
"count" : 4675,
57-
"mean" : 75.05542864304839,
58-
"variance" : 2788.1879749835402,
59-
"skewness" : 15.812149139924037,
60-
"kurtosis" : 619.1235507385902,
61-
"covariance" : {
62-
"products.base_price" : 846.6489362233166,
63-
"taxful_total_price" : 2788.1879749835402
102+
]
103+
}
104+
}
105+
}
106+
```
107+
108+
The following table describes the response fields.
109+
110+
| Statistic | Description |
111+
| :--- | :--- |
112+
| `count` | The number of documents sampled for the aggregation. |
113+
| `mean` | The average value of the field computed from the sample. |
114+
| `variance` | The average of the squared deviations from the mean, a measure of data spread. |
115+
| `skewness` | A measure of the distribution's asymmetry relative to the mean. See [Skewness](https://en.wikipedia.org/wiki/Skewness). |
116+
| `kurtosis` | A measure of the tail-heaviness of a distribution. As the tails become lighter, kurtosis decreases. Kurtosis and skewness can be evaluated together to assess whether a population is likely to be [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution). See [Kurtosis](https://en.wikipedia.org/wiki/Kurtosis). |
117+
| `covariance` | A measure of the joint variability between two fields. A positive value means their values tend to move in the same direction; a negative value means they tend to move in opposite directions. |
118+
| `correlation` | The normalized covariance, a measure of the strength of the linear relationship between two fields. Possible values are from -1 to 1, inclusive, indicating perfect negative to perfect positive linear correlation. A value of 0 indicates no linear relationship between the variables. |
119+
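The relationship between `covariance`, `variance`, and `correlation` can be checked by hand. The following sketch assumes the standard Pearson definition (covariance divided by the product of the two standard deviations) and plugs in the values from the e-commerce example response. The variable names are illustrative only.

```python
import math

# Values copied from the example response for the two fields.
variance_base_price = 360.5035285833702
variance_total_price = 2788.1879749835425
covariance = 846.6489362233169

# Pearson correlation: covariance normalized by both standard deviations.
correlation = covariance / math.sqrt(variance_base_price * variance_total_price)
print(correlation)
```

The result is approximately 0.8445, which agrees with the `correlation` value reported in the response.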
120+
## Missing values
121+
122+
To define how missing values are treated, use the `missing` parameter. By default, missing values are ignored.
123+
124+
For example, create an index in which document 1 is missing the `gpa` and `class_grades` fields:
125+
126+
```json
127+
POST _bulk
128+
{ "create": { "_index": "students", "_id": "1" } }
129+
{ "name": "John Doe" }
130+
{ "create": { "_index": "students", "_id": "2" } }
131+
{ "name": "Jonathan Powers", "gpa": 3.85, "class_grades": [3.0, 3.9, 4.0] }
132+
{ "create": { "_index": "students", "_id": "3" } }
133+
{ "name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8] }
134+
```
135+
{% include copy-curl.html %}
136+
137+
First, run a `matrix_stats` aggregation without providing a `missing` parameter:
138+
139+
```json
140+
GET students/_search
141+
{
142+
"size": 0,
143+
"aggs": {
144+
"matrix_stats_taxful_total_price": {
145+
"matrix_stats": {
146+
"fields": [
147+
"gpa",
148+
"class_grades"
149+
],
150+
"mode": "avg"
151+
}
152+
}
153+
}
154+
}
155+
```
156+
{% include copy-curl.html %}
157+
158+
OpenSearch ignores missing values when calculating the matrix statistics:
159+
160+
```json
161+
{
162+
"took": 5,
163+
"timed_out": false,
164+
"terminated_early": true,
165+
"_shards": {
166+
"total": 1,
167+
"successful": 1,
168+
"skipped": 0,
169+
"failed": 0
170+
},
171+
"hits": {
172+
"total": {
173+
"value": 3,
174+
"relation": "eq"
175+
},
176+
"max_score": null,
177+
"hits": []
178+
},
179+
"aggregations": {
180+
"matrix_stats_taxful_total_price": {
181+
"doc_count": 2,
182+
"fields": [
183+
{
184+
"name": "gpa",
185+
"count": 2,
186+
"mean": 3.684999942779541,
187+
"variance": 0.05444997482300096,
188+
"skewness": 0,
189+
"kurtosis": 1,
190+
"covariance": {
191+
"gpa": 0.05444997482300096,
192+
"class_grades": 0.09899998760223136
193+
},
194+
"correlation": {
195+
"gpa": 1,
196+
"class_grades": 0.9999999999999991
197+
}
64198
},
65-
"correlation" : {
66-
"products.base_price" : 0.8444765264325268,
67-
"taxful_total_price" : 1.0
199+
{
200+
"name": "class_grades",
201+
"count": 2,
202+
"mean": 3.333333333333333,
203+
"variance": 0.1800000381469746,
204+
"skewness": 0,
205+
"kurtosis": 1,
206+
"covariance": {
207+
"gpa": 0.09899998760223136,
208+
"class_grades": 0.1800000381469746
209+
},
210+
"correlation": {
211+
"gpa": 0.9999999999999991,
212+
"class_grades": 1
213+
}
214+
}
215+
]
216+
}
217+
}
218+
}
219+
```
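As a rough cross-check of this response, the following Python sketch recomputes the statistics for the three documents. It assumes that a document missing any requested field is skipped, that `avg` mode averages each array, and that variance and covariance use the sample (n - 1) divisor. These assumptions are inferred from the numbers in the response, not taken from the OpenSearch implementation, and the helper names are hypothetical.

```python
from statistics import mean

docs = [
    {"name": "John Doe"},
    {"name": "Jonathan Powers", "gpa": 3.85, "class_grades": [3.0, 3.9, 4.0]},
    {"name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8]},
]
fields = ["gpa", "class_grades"]


def sample(value):
    """Reduce an array to its average (mode = avg); pass scalars through."""
    return mean(value) if isinstance(value, list) else value


def covariance(xs, ys):
    """Sample covariance (n - 1 divisor); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Keep only documents that contain every requested field.
rows = [[sample(d[f]) for f in fields] for d in docs if all(f in d for f in fields)]
gpa = [r[0] for r in rows]
grades = [r[1] for r in rows]

print("doc_count:", len(rows))
print("gpa mean:", mean(gpa))
print("gpa variance:", covariance(gpa, gpa))
print("gpa/class_grades covariance:", covariance(gpa, grades))
```

The computed values are close to the response (a `doc_count` of 2, a `gpa` mean near 3.685, and a covariance near 0.099). Small differences in the trailing digits are likely due to how the field values are stored as floating-point numbers.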
220+
221+
To substitute `0` for missing values, provide the `missing` parameter as a map of field names to replacement values. Even though `class_grades` is an array field, the `matrix_stats` aggregation reduces each multi-valued numeric field to a single per-document value (the average, because `mode` is `avg`), so you must supply a single number as the missing value:
222+
223+
```json
224+
GET students/_search
225+
{
226+
"size": 0,
227+
"aggs": {
228+
"matrix_stats_taxful_total_price": {
229+
"matrix_stats": {
230+
"fields": ["gpa", "class_grades"],
231+
"mode": "avg",
232+
"missing": {
233+
"gpa": 0,
234+
"class_grades": 0
68235
}
69236
}
70-
]
237+
}
71238
}
72-
}
73239
}
74240
```
241+
{% include copy-curl.html %}
75242

76-
The following table lists all response fields.
77-
78-
Statistic | Description
79-
:--- | :---
80-
`count` | The number of samples measured.
81-
`mean` | The average value of the field measured from the sample.
82-
`variance` | How far the values of the field measured are spread out from its mean value. The larger the variance, the more it's spread from its mean value.
83-
`skewness` | An asymmetric measure of the distribution of the field's values around the mean.
84-
`kurtosis` | A measure of the tail heaviness of a distribution. As the tail becomes lighter, kurtosis decreases. As the tail becomes heavier, kurtosis increases. To learn about kurtosis, see [Wikipedia](https://en.wikipedia.org/wiki/Kurtosis).
85-
`covariance` | A measure of the joint variability between two fields. A positive value means their values move in the same direction and the other way around.
86-
`correlation` | A measure of the strength of the relationship between two fields. The valid values are between [-1, 1]. A value of -1 means that the value is negatively correlated and a value of 1 means that it's positively correlated. A value of 0 means that there's no identifiable relationship between them.
243+
OpenSearch substitutes `0` for any missing `gpa` or `class_grades` values when calculating the matrix statistics:
244+
245+
```json
246+
{
247+
"took": 23,
248+
"timed_out": false,
249+
"terminated_early": true,
250+
"_shards": {
251+
"total": 1,
252+
"successful": 1,
253+
"skipped": 0,
254+
"failed": 0
255+
},
256+
"hits": {
257+
"total": {
258+
"value": 3,
259+
"relation": "eq"
260+
},
261+
"max_score": null,
262+
"hits": []
263+
},
264+
"aggregations": {
265+
"matrix_stats_taxful_total_price": {
266+
"doc_count": 3,
267+
"fields": [
268+
{
269+
"name": "gpa",
270+
"count": 3,
271+
"mean": 2.456666628519694,
272+
"variance": 4.55363318017324,
273+
"skewness": -0.688130006360758,
274+
"kurtosis": 1.5,
275+
"covariance": {
276+
"gpa": 4.55363318017324,
277+
"class_grades": 4.143944374667273
278+
},
279+
"correlation": {
280+
"gpa": 1,
281+
"class_grades": 0.9970184390038257
282+
}
283+
},
284+
{
285+
"name": "class_grades",
286+
"count": 3,
287+
"mean": 2.2222222222222223,
288+
"variance": 3.793703722777191,
289+
"skewness": -0.6323693521730989,
290+
"kurtosis": 1.5000000000000002,
291+
"covariance": {
292+
"gpa": 4.143944374667273,
293+
"class_grades": 3.793703722777191
294+
},
295+
"correlation": {
296+
"gpa": 0.9970184390038257,
297+
"class_grades": 1
298+
}
299+
}
300+
]
301+
}
302+
}
303+
}
304+
```
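The effect of the `missing` substitution can be approximated in a few lines of Python. This sketch assumes that the replacement value stands in for the whole field (after `avg` reduction) and that the sample (n - 1) divisor is used. Both assumptions are inferred from the response values, and the helper names are hypothetical.

```python
import math
from statistics import mean

docs = [
    {"name": "John Doe"},
    {"name": "Jonathan Powers", "gpa": 3.85, "class_grades": [3.0, 3.9, 4.0]},
    {"name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8]},
]
# Same shape as the `missing` object in the request.
missing = {"gpa": 0, "class_grades": 0}


def sample(doc, field):
    """Return the per-document sample, falling back to the `missing` value."""
    value = doc.get(field, missing[field])
    return mean(value) if isinstance(value, list) else value


def covariance(xs, ys):
    """Sample covariance (n - 1 divisor); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


gpa = [sample(d, "gpa") for d in docs]
grades = [sample(d, "class_grades") for d in docs]

cov = covariance(gpa, grades)
corr = cov / math.sqrt(covariance(gpa, gpa) * covariance(grades, grades))

print("count:", len(gpa))
print("gpa mean:", mean(gpa))
print("gpa variance:", covariance(gpa, gpa))
print("covariance:", cov)
print("correlation:", corr)
```

Because document 1 now contributes a `0` for both fields, all three documents are counted, and the means drop to roughly 2.46 and 2.22, in line with the response above.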