[Feature request] Add field mapping correlation type metadata concept #7082

YANG-DB · 2023-04-10T23:02:50Z

Is your feature request related to a problem?
As part of the Integration campaign and the [Integration RFC](https://github.com/opensearch-project/OpenSearch-Dashboards/issues/3412), we have introduced the SimpleSchema for the Observability domain, which is based on the concept of a well-structured index backed by a schema.

Schema
A schema is associated with an index through its mapping configuration.

This mapping structure is also composable through the composed_of template capability, which is used extensively to assemble the mappings of the various log types.

Another concept behind the schema is the ability to express relationships.
Currently, these relationships are declared in a proprietary way, by adding them to the metadata of the index mapping template.

In the Observability domain, the relationship of a log entity to a trace entity, (:log)-[:associated]-(:trace), through the traceId correlation field is described in the metadata section of the log mapping:

"_meta": {
  "description": "Simple Schema For Observability",
  "catalog": "observability",
  "type": "logs",
  "correlations": [
    {
      "field": "spanId",
      "foreign-schema": "traces",
      "foreign-field": "spanId"
    },
    {
      "field": "traceId",
      "foreign-schema": "traces",
      "foreign-field": "traceId"
    }
  ]
}
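
As an illustration, a client could read these correlations out of the `_meta` section roughly as follows. This is a minimal sketch: the function name `extract_correlations` and the `LOG_MAPPING` dictionary are hypothetical and only mirror the example above; they are not part of any existing OpenSearch API.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping body that mirrors the `_meta` section shown above.
LOG_MAPPING = {
    "_meta": {
        "description": "Simple Schema For Observability",
        "catalog": "observability",
        "type": "logs",
        "correlations": [
            {"field": "spanId", "foreign-schema": "traces", "foreign-field": "spanId"},
            {"field": "traceId", "foreign-schema": "traces", "foreign-field": "traceId"},
        ],
    }
}


def extract_correlations(mapping: Dict) -> List[Tuple[str, str, str]]:
    """Return (field, foreign_schema, foreign_field) triples declared in `_meta`."""
    meta = mapping.get("_meta", {})
    return [
        (c["field"], c["foreign-schema"], c["foreign-field"])
        for c in meta.get("correlations", [])
    ]


for field, schema, foreign_field in extract_correlations(LOG_MAPPING):
    print(f"logs.{field} -> {schema}.{foreign_field}")
```

A mapping without a `_meta` section or without `correlations` simply yields an empty list, which matches the idea that correlations are optional metadata.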

What solution would you like?

I would like the field mapping API to be extended with this metadata information.

Recently, the conceptual scope of OpenSearch as a search engine has been extended considerably.
These extensions include:

Federation of queries across different external sources (datasource)
Adding materialized views backing external data-lake storage (Query S3 in OpenSearch Observability)
Adding the BloomFilter data sketch as a new data type (bloomFilter data-type)

The evolution of a knowledge layer on top of the data layer is an existing trend, both in OpenSearch and in other storage engines.

A key part of any knowledge layer is the concept of relationships between the different entities.

P1 - The First Step

This step includes the introduction of the correlations concept into the field mapping.

Even though the concept of index relationships does exist today:

Both options imply a physical, explicit relationship between indices, which has a strong side effect on the physical storage of the index and on query time.
In addition, the mapping of the specific field does not reflect this join, which is only present at the higher, index mapping level.

The new field-mapping-correlation feature addresses the metadata aspect of the relationship between well-structured entities residing in different indices.

A correlation is a weaker constraint, in the sense that it doesn't impose a relational-DB-like foreign key constraint, but rather implies that such a correlation exists and that the entities may be joined using a query engine.

Another difference from the existing join fields is that this correlation will at first be a declarative metadata definition that is not enforced against the actual data inside the indices; only the mapping correlation metadata will be enforced, as detailed below.

New Correlation Section in Field mapping

Requesting the field mapping for a field that has a relationship to a foreign field in the target entity's index:
GET logs/_mapping/field/traceId

will respond with:

{
  "logs": {
    ...
    "mappings": {
      ...
      "traceId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "spanId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "traceIdFk": {
        "type": "correlation",
        "path": "traceId",
        "target_schema": "traces",
        "target_field": "traceId"
      },
      "spanIdFk": {
        "type": "correlation",
        "path": "spanId",
        "target_schema": "traces",
        "target_field": "spanId"
      }
    }
  }
}

This metadata will be used by the SQL / PPL query engine to allow explicit correlation between different data streams or datasources.
Having this information stated explicitly will improve understanding of the data and enhance investigation capabilities.

Once a SQL / PPL correlation (join) query is submitted against the corresponding index, it will be translated into a regular SQL join query.
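
To make the idea concrete, here is a minimal sketch of how a `correlation` field definition could be turned into a plain SQL join. The helper `build_join_sql`, the `LOG_PROPERTIES` dictionary, and the exact shape of the generated SQL are assumptions for illustration only; how the SQL / PPL engine would actually plan such a query is not specified in this proposal.

```python
from typing import Dict

# Hypothetical field definitions, copied from the proposed mapping response above.
LOG_PROPERTIES: Dict[str, Dict] = {
    "traceId": {"type": "keyword", "ignore_above": 256},
    "spanId": {"type": "keyword", "ignore_above": 256},
    "traceIdFk": {
        "type": "correlation",
        "path": "traceId",
        "target_schema": "traces",
        "target_field": "traceId",
    },
}


def build_join_sql(source_index: str, properties: Dict[str, Dict], correlation_name: str) -> str:
    """Build a regular SQL join from a `correlation` field definition."""
    corr = properties[correlation_name]
    if corr.get("type") != "correlation":
        raise ValueError(f"{correlation_name} is not a correlation field")
    # Join the source index to the target schema on the correlated fields.
    return (
        f"SELECT * FROM {source_index} AS s "
        f"JOIN {corr['target_schema']} AS t "
        f"ON s.{corr['path']} = t.{corr['target_field']}"
    )


print(build_join_sql("logs", LOG_PROPERTIES, "traceIdFk"))
```

The point of the sketch is that the join condition is derived entirely from the mapping metadata, so the user does not have to know which fields link logs to traces.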

Enforcement

In the first step (P1), the mapping API would enforce the following when a field mapping correlation is requested:

Validate that the target schema named by foreign-schema exists (in the above example, "foreign-schema": "traces" implies that an index template named traces must exist).
Validate that the target field named by foreign-field exists in that schema (in the above example, "foreign-field": "traceId" implies that a field named traceId must exist).
Validate that the field type is the same for the source and target fields.
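
The three checks above can be sketched as follows. This is only an illustration under stated assumptions: `validate_correlation`, the `Properties` alias, and the in-memory `SCHEMAS` lookup are hypothetical stand-ins for whatever the mapping API would use internally, and the sketch uses the `target_schema` / `target_field` names from the proposed field mapping rather than the `foreign-schema` / `foreign-field` names from `_meta`.

```python
from typing import Dict, List

Properties = Dict[str, Dict[str, str]]


def validate_correlation(
    source_props: Properties,
    correlation: Dict[str, str],
    schemas: Dict[str, Properties],
) -> List[str]:
    """Return a list of violations; an empty list means the correlation is valid."""
    target_schema = correlation["target_schema"]
    target_field = correlation["target_field"]
    source_field = correlation["path"]

    # Check 1: the target schema (index template) must exist.
    target_props = schemas.get(target_schema)
    if target_props is None:
        return [f"target schema '{target_schema}' does not exist"]

    # Check 2: the target field must exist in that schema.
    if target_field not in target_props:
        return [f"field '{target_field}' does not exist in '{target_schema}'"]

    # Check 3: the source and target field types must match.
    if source_field not in source_props:
        return [f"source field '{source_field}' does not exist"]
    source_type = source_props[source_field].get("type")
    target_type = target_props[target_field].get("type")
    if source_type != target_type:
        return [
            f"type mismatch: {source_field} is {source_type}, "
            f"{target_schema}.{target_field} is {target_type}"
        ]
    return []


# Hypothetical schemas mirroring the logs/traces example.
LOGS: Properties = {"traceId": {"type": "keyword"}, "spanId": {"type": "keyword"}}
SCHEMAS = {"traces": {"traceId": {"type": "keyword"}, "spanId": {"type": "keyword"}}}
CORRELATION = {
    "type": "correlation",
    "path": "traceId",
    "target_schema": "traces",
    "target_field": "traceId",
}

print(validate_correlation(LOGS, CORRELATION, SCHEMAS))
```

Note that nothing here inspects the documents themselves, which is consistent with P1 enforcing only the mapping metadata and not the data.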

The correlations field may accept multiple correlations, pointing to additional remote indices, remote tables, or datasources.

P2 - The Next Step

The next phase of the correlation capability would include actually precomputing the correlated data using auxiliary data structures or indices.
The auxiliary data structure may take the form of an eager correlation task which precomputes the join and materializes it into secondary storage.
An additional skipping index can be introduced to further optimize filter-based queries using a Bloom filter or another probabilistic data sketch.

SQL queries would return much faster thanks to these auxiliary structures, enabling faster, investigation-driven use cases on top of huge indices and even data-lake-based correlations.
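
The skipping-index idea can be sketched with a small Bloom filter per data chunk: before probing a chunk for a given traceId, the engine asks the filter, and skips the chunk when the filter says the value is definitely absent. Everything below is an assumption for illustration (the `BloomFilter` class, the `chunks_to_scan` helper, the chunk names, and the sizing parameters); it does not describe the actual bloomFilter data type or any planned implementation.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def chunks_to_scan(chunk_filters: Dict[str, BloomFilter], trace_id: str) -> List[str]:
    """Return only the chunks whose filter cannot rule out the given traceId."""
    return [name for name, bf in chunk_filters.items() if bf.might_contain(trace_id)]


# Hypothetical chunks of the traces index, each with its own filter.
filters = {"chunk-a": BloomFilter(), "chunk-b": BloomFilter()}
filters["chunk-a"].add("trace-1")
filters["chunk-b"].add("trace-2")

print(chunks_to_scan(filters, "trace-1"))
```

Because a Bloom filter never reports a stored value as absent, skipping is always safe; a false positive only costs an unnecessary scan of one chunk.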

What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context?


saratvemulapalli · 2023-05-05T20:20:51Z

@YANG-DB are you looking for feedback, or would you contribute these changes?

YANG-DB · 2023-05-05T20:22:40Z

I wanted to get feedback on this suggestion and how it fits with the current correlation initiative

YANG-DB added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 10, 2023

RyanL1997 added the feature New feature or request label Apr 27, 2023

mch2 added the discuss Issues intended to help drive brainstorming and decision making label May 9, 2023

This was referenced May 11, 2023

[RFC] Security Analytics Correlation Engine opensearch-project/security-analytics#369

Closed

[RFC] OpenSearch Events Correlation Engine #6779

Open

YANG-DB mentioned this issue May 26, 2023

[FEATURE]Adding Observability Catalog opensearch-project/opensearch-catalog#14

Closed

anasalkouz added Indexing Indexing, Bulk Indexing and anything related to indexing and removed untriaged labels Jun 1, 2023

YANG-DB mentioned this issue Jul 16, 2023

create a correlation field mapper #8712

Open

6 tasks
