Conversation

@dwelsch-esi
Contributor

Description

Updated descriptions. Added explanation of result statistics.

This PR will probably require some major revisions before it's published. I have several questions. See line comments.

Issues Resolved

Version

Frontend features

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dave Welsch <dwelsch@expertsupport.com>
@github-actions

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

| Parameter | Required/Optional | Data type | Description |
| :-- | :-- | :-- | :-- |
| `fields` | Required | String | An array of fields for which the matrix stats are computed. |
| `mode` | Optional | String | Which value to use as a sample from a multi-valued or array field. Allowed values are `avg`, `min`, `max`, `sum`, and `median`. |
Contributor Author

mode did not seem to affect the result when I tested it with array data. Maybe it doesn't work when one field is a single value and another is an array? But that doesn't make much sense.

What is the default mode?

Collaborator

@jainankitk Could you please answer this question?

Contributor

@dwelsch-esi - Can you share the query that you ran?

Contributor Author

@dwelsch-esi Apr 11, 2025

@jainankitk Try the following:

DELETE /students

POST _bulk
{ "create": { "_index": "students", "_id": "1" } }
{ "name": "John Doe", "gpa": 3.89, "class_grades": [3.0, 3.9, 4.0], "grad_year": 2022}
{ "create": { "_index": "students", "_id": "2" } }
{ "name": "Jonathan Powers", "gpa": 3.85, "class_grades": [4.1, 3.0, 4.0], "grad_year": 2025 }
{ "create": { "_index": "students", "_id": "3" } }
{ "name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8], "grad_year": 2024 }

GET students/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": ["gpa", "class_grades"],
        "mode": "avg"
      }
    }
  }
}

I tried different modes (min, max) instead of avg. In all cases, the correlation between gpa and class_grades is the same, 0.9820867239098596. Surely it should be different if the value used for class_grades is computed differently for different mode values?
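
The expectation in this comment can be checked offline. The following is a minimal sketch, not the OpenSearch implementation: it assumes that `mode` reduces each `class_grades` array to a single number before a plain Pearson correlation is computed against `gpa`. The `REDUCERS` table and the function names are hypothetical stand-ins introduced only for illustration.

```python
import math

# Sample data copied from the bulk request in the comment above.
students = [
    {"gpa": 3.89, "class_grades": [3.0, 3.9, 4.0]},
    {"gpa": 3.85, "class_grades": [4.1, 3.0, 4.0]},
    {"gpa": 3.52, "class_grades": [3.2, 2.1, 3.8]},
]

# Hypothetical client-side stand-ins for some `mode` values (assumption:
# the server reduces each array to one sample in a comparable way).
REDUCERS = {
    "avg": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "sum": sum,
}


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_for_mode(mode):
    """Reduce each class_grades array with the given mode, then correlate with gpa."""
    reducer = REDUCERS[mode]
    gpas = [s["gpa"] for s in students]
    grades = [reducer(s["class_grades"]) for s in students]
    return pearson(gpas, grades)


if __name__ == "__main__":
    for mode in REDUCERS:
        print(mode, round(correlation_for_mode(mode), 4))
```

Under these assumptions, the `avg` reduction gives roughly 0.982, which is consistent with the value reported above, while `min` and `max` give different coefficients. `sum` matches `avg` here only because every array has the same length.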

Contributor

Thanks, @dwelsch-esi, for the close observation. I noticed that the first response is cached, which causes this discrepancy. To avoid it, you can add the request_cache=false parameter (GET students/_search?request_cache=false) or, alternatively, run POST /_cache/clear before each search request.
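
For anyone repeating the test across modes, the workaround can be sketched as a small helper that builds one uncached search request per `mode` value. This is only an illustration: it constructs the path and body without sending anything, the helper name `build_search_request` and the aggregation name `grade_stats` are hypothetical, and the list of modes is taken from the parameter table in this PR.

```python
from urllib.parse import urlencode

# Allowed `mode` values as listed in the parameter table under review.
MODES = ["avg", "min", "max", "sum", "median"]


def build_search_request(mode, use_request_cache=False):
    """Return (path, body) for a matrix_stats search on the students index.

    Nothing is sent over the network; pass the result to the HTTP client
    of your choice.
    """
    query = urlencode({"request_cache": str(use_request_cache).lower()})
    path = f"/students/_search?{query}"
    body = {
        "size": 0,
        "aggs": {
            # Hypothetical aggregation name, chosen for this sketch.
            "grade_stats": {
                "matrix_stats": {
                    "fields": ["gpa", "class_grades"],
                    "mode": mode,
                }
            }
        },
    }
    return path, body


if __name__ == "__main__":
    for mode in MODES:
        path, body = build_search_request(mode)
        print("GET", path, body["aggs"]["grade_stats"]["matrix_stats"])
```

Building the requests this way keeps `request_cache=false` on every call, so a cached response from an earlier mode cannot mask the difference.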

Contributor

We have a PR to fix this discrepancy, opensearch-project/OpenSearch#18254, in case anyone is curious.


The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form.
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields:
The `matrix_stats` metric is a multi-value metric that generates covariance statistics for two or more fields in matrix form.
Contributor Author

  1. Does matrix-stats always sample every available document? If not, how is sampling done?

  2. What assumptions does matrix-stats make about independence of data samples? Is it sufficient to say that it assumes documents are independent of each other?

Contributor

Again, I would probably keep it as "multi-value metric aggregation" and maybe link to the metric aggregation page.

Contributor Author

@dwelsch-esi Mar 18, 2025

@jainankitk Please see my technical questions throughout this file. Can you provide any insights?

Collaborator

@jainankitk Could you please answer this question?

Contributor

Does matrix-stats always sample every available document? If not, how is sampling done?

I don't think there is any sampling; all the documents are processed.

What assumptions does matrix-stats make about independence of data samples? Is it sufficient to say that it assumes documents are independent of each other?

Matrix-stats does not make any assumptions about the data. Why would it need to make any?

Contributor Author

@jainankitk There are several assumptions that have to be met for a correlation to be valid: https://www.statology.org/pearson-correlation-assumptions/. However, that level of detail is probably beyond the scope of the documentation here. At some level, the user is assumed to know what they're doing if they use an aggregation.

@kolchfa-aws
Collaborator

@jainankitk Could you please review this PR? Thanks!


Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Dave Welsch <116022979+dwelsch-esi@users.noreply.github.com>
kolchfa-aws and others added 3 commits May 21, 2025 09:37
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@kolchfa-aws
Collaborator

@jainankitk Thanks for your feedback. I addressed your comments and added a section about the missing parameter.

Collaborator

@natebower natebower left a comment

@kolchfa-aws One comment and a couple of changes. Thanks!

kolchfa-aws and others added 2 commits May 21, 2025 11:36
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws kolchfa-aws merged commit c8abc5e into opensearch-project:main May 22, 2025
8 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 22, 2025
* Update matrix-stats aggregation

Signed-off-by: Dave Welsch <dwelsch@expertsupport.com>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Dave Welsch <116022979+dwelsch-esi@users.noreply.github.com>

* Doc review

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Add missing parameter section

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Update _aggregations/metric/matrix-stats.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Dave Welsch <dwelsch@expertsupport.com>
Signed-off-by: Dave Welsch <116022979+dwelsch-esi@users.noreply.github.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit c8abc5e)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request May 22, 2025
KishoreKicha14 pushed a commit to KishoreKicha14/documentation-website that referenced this pull request Jun 13, 2025
epugh pushed a commit to o19s/documentation-website that referenced this pull request Jul 2, 2025
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
4 participants