[FEATURE] Flatten result index mapping for visualizing nested objects in Dashboards #1306

Closed
jackiehanyang opened this issue Sep 10, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request v2.19.0

Comments

@jackiehanyang
Collaborator

jackiehanyang commented Sep 10, 2024

Flatten Result Index

Problem Statement

In Anomaly Detection, many values are not flattened, making it difficult to view them on the dashboard. For instance, entity values are nested objects, and features are arrays. The requirement is to reference a feature by name and apply conditions like f1 > 3. Additionally, there is a need to perform terms aggregation on categorical fields. This will require adjustments to the mapping and the addition of new fields in the result index.
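
As a rough illustration of the goal, the sketch below builds a search body that filters on a single feature value and runs a terms aggregation on a categorical field. The field names follow the flattened naming shown later in this issue. The helper `build_search_body` is hypothetical, and the sketch assumes the entity field is mapped as `keyword` so that a terms aggregation is allowed.

```python
import json


def build_search_body(feature_field, threshold, category_field, size=10):
    """Filter on a flattened feature value and bucket by a flattened entity field."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {feature_field: {"gt": threshold}}},
        "aggs": {
            "by_category": {"terms": {"field": category_field, "size": size}}
        },
    }


# Field names are taken from the flattened example in this issue.
body = build_search_body(
    feature_field="feature_data_processing_bytes_max",
    threshold=3,
    category_field="entity_process_name_value",
)
print(json.dumps(body, indent=2))
```

With the nested layout, neither the range condition nor the terms aggregation can address a feature or entity by name this directly.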

What to Flatten

Original result index document when a detector has anomalies:
{
"detector_id": "fylE53wBc9MCt6q12tKp",
"schema_version": 0,
"data_start_time": 1635927900000,
"data_end_time": 1635927960000,
"feature_data": [
{
"feature_id": "processing_bytes_max",
"feature_name": "processing bytes max",
"data": 2291
},
{
"feature_id": "processing_bytes_avg",
"feature_name": "processing bytes avg",
"data": 1677.3333333333333
},
{
"feature_id": "processing_bytes_min",
"feature_name": "processing bytes min",
"data": 1054
},
{
"feature_id": "processing_bytes_sum",
"feature_name": "processing bytes sum",
"data": 5032
},
{
"feature_id": "processing_time_max",
"feature_name": "processing time max",
"data": 11422
}
],
"anomaly_score": 1.1986675882872033,
"anomaly_grade": 0.26806225550178464,
"confidence": 0.9607519742565531,
"entity": [
{
"name": "process_name",
"value": "process_3"
}
],
"approx_anomaly_start_time": 1635927900000,
"relevant_attribution": [
{
"feature_id": "processing_bytes_max",
"data": 0.03628638020431366
},
{
"feature_id": "processing_bytes_avg",
"data": 0.03384479053991436
},
{
"feature_id": "processing_bytes_min",
"data": 0.058812549572819096
},
{
"feature_id": "processing_bytes_sum",
"data": 0.10154576265526988
},
{
"feature_id": "processing_time_max",
"data": 0.7695105170276828
}
],
"expected_values": [
{
"likelihood": 1,
"value_list": [
{
"feature_id": "processing_bytes_max",
"data": 2291
},
{
"feature_id": "processing_bytes_avg",
"data": 1677.3333333333333
},
{
"feature_id": "processing_bytes_min",
"data": 1054
},
{
"feature_id": "processing_bytes_sum",
"data": 6062
},
{
"feature_id": "processing_time_max",
"data": 23379
}
]
}
],
"threshold": 1.0993584705913992,
"execution_end_time": 1635898427895,
"execution_start_time": 1635898427803,
"past_values": [
{
"feature_id": "processing_bytes_max",
"data": 905
},
{
"feature_id": "processing_bytes_avg",
"data": 479
},
{
"feature_id": "processing_bytes_min",
"data": 128
},
{
"feature_id": "processing_bytes_sum",
"data": 1437
},
{
"feature_id": "processing_time_max",
"data": 8440
}
]
}
After flattening:
{
......SAME ORIGINAL CONTENT AS ABOVE......

// flattened feature_data fields
"feature_data_processing_bytes_max": 2291,
"feature_data_processing_bytes_avg": 1677.3333333333333,
"feature_data_processing_bytes_min": 1054,
"feature_data_processing_bytes_sum": 5032,
"feature_data_processing_time_max": 11422,

// flattened entity fields
"entity_process_name_value": "process_3",

// flattened relevant_attribution fields
"relevant_attribution_processing_bytes_max": 0.03628638020431366,
"relevant_attribution_processing_bytes_avg": 0.03384479053991436,
"relevant_attribution_processing_bytes_min": 0.058812549572819096,
"relevant_attribution_processing_bytes_sum": 0.10154576265526988,
"relevant_attribution_processing_time_max": 0.7695105170276828,

// flattened expected_values fields
"expected_values_processing_bytes_max": 2291,
"expected_values_processing_bytes_avg": 1677.3333333333333,
"expected_values_processing_bytes_min": 1054,
"expected_values_processing_bytes_sum": 6062,
"expected_values_processing_time_max": 23379,

// flattened past_values fields
"past_values_processing_bytes_max": 905,
"past_values_processing_bytes_avg": 479,
"past_values_processing_bytes_min": 128,
"past_values_processing_bytes_sum": 1437,
"past_values_processing_time_max": 8440
}
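
The naming convention in the example above can be sketched in Python as follows. This is only an illustration of the transformation, not the plugin's implementation. It assumes the key suffix is the `feature_id`, and that only the first `expected_values` entry is flattened. The function name `flatten_result` is hypothetical.

```python
def flatten_result(doc):
    """Return a copy of an AD result document with flattened fields added."""
    flat = dict(doc)  # shallow copy; the nested fields are kept as-is

    # Lists of {"feature_id": ..., "data": ...} become <field>_<feature_id>.
    for field in ("feature_data", "relevant_attribution", "past_values"):
        for item in doc.get(field, []):
            flat[f"{field}_{item['feature_id']}"] = item["data"]

    # Entities become entity_<name>_value.
    for entity in doc.get("entity", []):
        flat[f"entity_{entity['name']}_value"] = entity["value"]

    # Assumption: only the first expected_values entry is flattened.
    expected = doc.get("expected_values") or []
    if expected:
        for item in expected[0].get("value_list", []):
            flat[f"expected_values_{item['feature_id']}"] = item["data"]

    return flat


sample = {
    "feature_data": [
        {"feature_id": "processing_bytes_max",
         "feature_name": "processing bytes max", "data": 2291}
    ],
    "entity": [{"name": "process_name", "value": "process_3"}],
    "relevant_attribution": [
        {"feature_id": "processing_bytes_max", "data": 0.03628638020431366}
    ],
    "expected_values": [
        {"likelihood": 1,
         "value_list": [{"feature_id": "processing_bytes_max", "data": 2291}]}
    ],
    "past_values": [{"feature_id": "processing_bytes_max", "data": 905}],
}
flat = flatten_result(sample)
print(sorted(k for k in flat if k not in sample))
```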

Difficulties:

The following outlines the difficulties encountered during this project, with each point logically flowing as a consequence of the previous one.
  • When OpenSearch Visualization loads index data, it relies on the static mapping of the index rather than the actual content structure. Consequently, to enable accurate visualizations, we need to create a separate result index as a flattened copy of the original result index. This flattened index ensures that the data structure aligns with our visualization requirements.
  • While dynamic index mapping could alleviate concerns about the mapping structure, it is unsuitable for cases where the flattening process is dynamically influenced by detector configurations. Therefore, if dynamic mapping is not an option, we must carefully put together a static index mapping that accommodates these dynamic flattening requirements.
  • OpenSearch Dashboards currently lacks support for aggregation on nested fields in visualizations, which presents additional challenges. Although nested fields can appear in dot-path format on the Index Pattern and Discover pages, they are unavailable for aggregation on the Visualization page. Even if dot-path formats were supported in Visualization, that would not fulfill our need to flatten result indices for AD and enable customer-friendly aggregations. To achieve this, we need to extract specific field values as keys during the flattening process, which requires a Painless script to dynamically flatten and reconstruct the data.
  • However, Painless scripts in OpenSearch do not support making client calls within the script. This restriction means it is impossible to ingest transformed data from one index into another from within the script itself. As a result, Painless scripts can only handle flattening nested fields, and we must hydrate the separate result index outside the script.
  • To hydrate the separate result index with flattened data, we could use the reindex API to copy flattened results from the original result index. However, the reindex API operates as a one-time action, meaning it cannot accommodate cases where the data flattening needs to occur on a recurring or scheduled basis.
  • To perform reindexing on a schedule, we could utilize an ISM policy that includes a reindex action, associating it with the original result index. This approach would enable scheduled reindexing to keep the separate result index up to date. However, this introduces a dependency on ISM, which we want to avoid in order to maintain flexibility and reduce reliance on additional OpenSearch components.
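
For reference, the one-time reindex option mentioned in the list above could be expressed as the request body sketched below, which would be sent to `POST _reindex`. The index names and pipeline name are hypothetical placeholders, and the helper `build_reindex_body` is not part of any client library. The sketch relies only on the `source.index`, `dest.index`, and `dest.pipeline` fields of the reindex API.

```python
def build_reindex_body(source_index, dest_index, pipeline=None):
    """Build a reindex request body, optionally routing through an ingest pipeline."""
    dest = {"index": dest_index}
    if pipeline is not None:
        # The destination pipeline is where flattening would happen.
        dest["pipeline"] = pipeline
    return {"source": {"index": source_index}, "dest": dest}


# Hypothetical names, for illustration only.
reindex_body = build_reindex_body(
    source_index="my-ad-result-index",
    dest_index="my-ad-result-index-flattened",
    pipeline="flatten-ad-results",
)
print(reindex_body)
```

Each call copies the documents present at that moment, which is why a scheduler such as ISM would be needed to keep the flattened index current.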

Solutions:

| | Separate index needed? | Dynamic mapping enabled for the separate index? | Ingest pipeline needed? | Index processor needed? | Script processor needed? | When to hydrate the separate index | Complexity/LOE |
| -- | -- | -- | -- | -- | -- | -- | -- |
| Approach 1 | Y | Y | Y | Y | Y | The index processor will take care of it | medium |
| Approach 2 | Y | Y | Y | N | Y | When writing to the existing result index, directly write results into this separate index | small |
| Approach 3 | Y | N | N | N | N | When writing to the existing result index, dynamically write results into this separate index according to its mapping | large |
| Approach 4 | N | N | N | N | N | N/A | extra large to unknown |


Approach 1. Set up a separate index and an ingest pipeline. Use an index processor to hydrate the separate index, and a script processor to flatten its nested fields.
OpenSearch does not currently support an index processor in its ingest pipeline.

Approach 2 (proposed). Set up a separate index and hydrate it alongside the existing result index. Use an ingest pipeline with a script processor to flatten the nested fields in the separate index.
Set up a separate index alongside the custom result index, using the same mapping as the result index but with dynamic mapping enabled. After creating the index, configure an ingest pipeline with a script processor that uses a Painless script to flatten all five nested list fields into the desired flattened format. Whenever results are written to the existing result index, also write them to this separate index so that the two stay consistent. The ingest pipeline and its script processor run on each write to the separate index and flatten the nested fields.
Pros:

  • requires the least effort of all approaches
Cons:
  • an additional index will be created for customers
  • an ingest pipeline will be created for customers
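
A minimal sketch of what the Approach 2 pipeline definition might look like is shown below, built as a Python dictionary that could be sent to `PUT _ingest/pipeline/<name>`. The Painless source is an untested illustration of the five-field flattening and is not the script used by the plugin. The helper names and the choice of `feature_id` as the key suffix are assumptions.

```python
FEATURE_KEYED_FIELDS = ("feature_data", "relevant_attribution", "past_values")


def build_painless_source():
    """Assemble an illustrative Painless script that flattens the nested fields."""
    parts = []
    # Lists of {feature_id, data} objects: one flattened key per feature.
    for field in FEATURE_KEYED_FIELDS:
        parts.append(
            f"if (ctx.{field} != null) {{ "
            f"for (def item : ctx.{field}) {{ "
            f"ctx['{field}_' + item.feature_id] = item.data; }} }}"
        )
    # Entity list: one flattened key per entity name.
    parts.append(
        "if (ctx.entity != null) { "
        "for (def e : ctx.entity) { "
        "ctx['entity_' + e.name + '_value'] = e.value; } }"
    )
    # Assumption: only the first expected_values entry is flattened.
    parts.append(
        "if (ctx.expected_values != null && ctx.expected_values.size() > 0) { "
        "for (def item : ctx.expected_values[0].value_list) { "
        "ctx['expected_values_' + item.feature_id] = item.data; } }"
    )
    return " ".join(parts)


def build_flatten_pipeline():
    """Return an ingest pipeline body with a single script processor."""
    return {
        "description": "Flatten nested AD result fields (illustrative)",
        "processors": [
            {"script": {"lang": "painless", "source": build_painless_source()}}
        ],
    }


pipeline = build_flatten_pipeline()
print(pipeline["processors"][0]["script"]["source"])
```

Generating the script text from a list of field names keeps the three structurally identical fields in one place; `entity` and `expected_values` need their own branches because their shapes differ.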

Approach 3. Set up a separate index and programmatically generate its index mapping. Hydrate the separate index alongside the existing result index.

Set up a separate index alongside the custom result index without defining a static mapping. Instead, programmatically generate the mapping by iterating through the config file (detector settings) to extract information from nested fields, such as the Feature list. During the hydration process, cross-compare the data with the config file to ensure results are appropriately written into this separate index.

Pros:
  • no ingest pipeline will be created for customers (only the separate index)
Cons:
  • requires a large amount of effort to implement
  • requires adding numerous if-else branches throughout the codebase to ensure we programmatically handle this optional feature correctly
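
To make the mapping-generation step of Approach 3 concrete, the sketch below derives flattened mapping properties from a detector's feature IDs and category fields. The function name, the `double` type for feature values, and the `keyword` type for entity values are assumptions for illustration, and how the feature list is read from the detector config is left out.

```python
FLATTENED_FEATURE_FIELDS = (
    "feature_data",
    "relevant_attribution",
    "expected_values",
    "past_values",
)


def build_flattened_properties(feature_ids, category_fields):
    """Generate mapping properties for the flattened fields of one detector."""
    props = {}
    # One numeric field per (nested field, feature) pair.
    for field in FLATTENED_FEATURE_FIELDS:
        for feature_id in feature_ids:
            props[f"{field}_{feature_id}"] = {"type": "double"}
    # One keyword field per category field, so terms aggregations work.
    for name in category_fields:
        props[f"entity_{name}_value"] = {"type": "keyword"}
    return props


props = build_flattened_properties(
    feature_ids=["processing_bytes_max", "processing_time_max"],
    category_fields=["process_name"],
)
print(sorted(props))
```

Because the property names depend on each detector's features, the mapping has to be regenerated whenever a detector's feature list changes, which is part of the effort listed under Cons.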

Approach 4. No action is needed on the AD side. The flattening happens entirely on the visualization side.
Pros:
  • the best practice solution for customers
  • brings broader impact for OpenSearch as a whole
Cons:
  • requires an extra-large amount of effort and involves many unknowns

@jackiehanyang jackiehanyang added enhancement New feature or request untriaged labels Sep 10, 2024
@jackiehanyang jackiehanyang self-assigned this Sep 10, 2024
@dblock dblock removed the untriaged label Sep 30, 2024
@dblock
Member

dblock commented Sep 30, 2024

[Catch All Triage - 1, 2, 3, 4]

@minalsha minalsha moved this from New to In Progress in OpenSearch Roadmap Oct 14, 2024
@minalsha minalsha added v2.19.0 Issues targeting release v2.19.0 and removed v2.18.0 labels Oct 23, 2024
@jackiehanyang
Collaborator Author

After setting up the ingest pipeline to flatten the nested fields, I can see the new flattened fields on the index pattern page. However, on the visualization side, the Field dropdown list is not loading the newly added flattened fields. I have created an issue on the OSD side regarding this matter - opensearch-project/OpenSearch-Dashboards#8722

Projects
Status: In Progress
Development

No branches or pull requests

3 participants