[RAC] Alerts as Data Schema Definition #93728

Closed
spong opened this issue Mar 5, 2021 · 70 comments

@spong
Member

spong commented Mar 5, 2021

This issue is for finalizing the Alerts as Data Schema definition.

The most recent proposal is as follows:

{
  @timestamp,           // The time the alert was detected
  ...ecsMapping,        // Schema for context about the alert
  alert: {              // Namespace for all non-ECS fields
    rule,               // Rule resulting in alert
    ruleTypeId,         // The id of the rule type for grouping purposes
    consumer,           // The consumer of the rule (for RBAC purposes)
    params,             // The parameters given during the execution of the rule
    alertId/instanceId, // The id of the alert instance for grouping purposes
    original_alert,     // If source document contains 'alert' field, original contents stored here
    workflow: {         // Namespace for mutable workflow fields  
      status,           // Current status of the alert open/closed/etc
      assignees,        // User(s) the alert is assigned to
    },             
  }  
}
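
For illustration, a hypothetical document following this proposal might look like the below (all values are made up, and the ECS fields shown are only examples):

{
  "@timestamp": "2021-03-05T10:00:00.000Z",   // time the alert was detected
  "host": { "name": "web-01" },               // ...ecsMapping: ECS context about the alert
  "event": { "kind": "signal" },
  "alert": {
    "rule": { "id": "some-rule-uuid", "name": "CPU threshold" },  // rule snapshot (only a couple of fields shown)
    "ruleTypeId": "example.threshold",        // hypothetical rule type id
    "consumer": "siem",
    "params": { "threshold": 500 },           // rule params at execution time
    "alertId": "web-01",                      // alert instance id for grouping
    "workflow": {
      "status": "open",
      "assignees": ["some-user"]
    }
  }
}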

The current Detection Alert Schema is as follows:

.siem-signals schema

{
  ...ecsMapping,     // Schema for context about the alert
  signal: {          // Isolated field for capturing all non-ecs fields
    _meta: {
      version,
    },
    parent: {
      rule,
      index,
      id,
      type,
      depth,
    },
    parents: {
      rule,
      index,
      id,
      type,
      depth,
    },
    ancestors: {
      rule,
      index,
      id,
      type,
      depth,
    },
    group: {
      id,
      index,
    },
    rule: {
      id,
      rule_id,
      author,
      building_block_type,
      false_positives,
      saved_id,
      timeline_id,
      timeline_title,
      max_signals,
      risk_score,
      risk_score_mapping: {
        field,
        operator,
        value,
      }
      output_index,
      description,
      from,
      immutable,
      index,
      interval,
      language,
      license,
      name,
      rule_name_override,
      query,
      references,
      severity,
      severity_mapping: {
        field,
        operator,
        value,
        severity,
      },
      tags,
      threat: {
        framework,
        tactic: {
          id,
          name,
          reference,
        },
        technique: {
          id,
          name,
          reference,
          subtechnique: {
            id,
            name,
            reference,
          }
        }
      },
      threshold: {
        field,
        value,
      },
      threat_mapping: {
        entries: {
          field,
          value,
          type,
        }
      },
      threat_filters,
      threat_indicator_path,
      threat_query,
      threat_index,
      threat_language,
      note,
      timestamp_override,
      type,
      size,
      to,
      enabled,
      filters,
      created_at,
      updated_at,
      created_by,
      updated_by,
      version,
    },
    original_time,
    original_signal: {
      type: object,
      dynamic: false,
      enabled: false
    },
    original_event: {
      action,
      category,
      code,
      created,
      dataset,
      duration,
      end,
      hash,
      id,
      kind,
      module,
      original: {
        doc_values: false,
        index: false,
      },
      outcome,
      provider,
      risk_score,
      risk_score_norm,
      sequence,
      severity,
      start,
      timezone,
      type,
    },
    status,
    threshold_count,
    threshold_result: {
      terms: {
        field,
        value,
      },
      cardinality: {
        field,
        value,
      },
      count,
    }
    depth,
  }
}

Features in the .siem-signals schema to take note of:
  • alerts on alerts via parent/parents/ancestors/depth/original_time/original_signal (I believe there's some deprecating we can do here)
  • _meta.version, because knowing your place in time is a good thing :)
  • Field overrides on Rules (i.e. risk_score_mapping, severity_mapping, rule_name_override, timestamp_override)
    • Can be discussed with the base Rule Schema, but notable for determining a calculated field's original source value
  • ...
Features that don't currently exist that would've been nice in hindsight:
  • Unique ID field for tracking individual rule executions
  • ...
Open questions?
  1. Does original_event have any lingering features tied to it? (PR)
  2. How are we going to resolve our different rule types in the mappings?
  3. Will parents/ancestor paradigm be used in alerting core?
  4. We've got rule-specific fields within signal.* as well, like threshold_count/threshold_result. Should these be stored in _meta if search isn't needed?
  5. Options around mutability for o11y use case (self-healing, multiple alerts or single?) (Same vein as open question in [RAC] Alerts as Data Bulk Insert #93730)

tl;dr on the long debate in the comments here

The contract that we have is that each Rule type, when executed, creates Alert documents for each “bad” thing. A “bad” thing could be a security violation, a metric going above a threshold, a service detected down, an ML anomaly, a move out of a map region, etc. These Alert documents use the ECS schema and have the fields required for workflow (e.g. in progress/close, acknowledge, assign to user). The common Alerts UI displays them in a table, typically one Alert per row. The user can filter and group by any ECS field. This is common for all solutions and rule types.

In addition to these Alert documents, the Rule type code is allowed to add other documents in the Alert indices (with a different event.kind), as long as they don't cause mapping conflicts or field explosion. These extra documents are typically immutable and provide extra details for the Alert.

For example, for a threshold-based alert, they can contain the individual measurements (evaluations) over time as well as any state changes (alert is over warning watermark, alert is over critical watermark). These documents will be used by the Alert details fly-out/page, which is Rule-type specific, to display a visual timeline for each alert.

Curated UIs, like the Synthetics one, can use both Alerts and the evaluations docs to build the UIs that they need.

@spong spong added discuss Team:Observability Team label for Observability Team (for things that are handled across all of observability) Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Theme: rac label obsolete labels Mar 5, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Mar 5, 2021

I assume ...ecsMappings is all the ECS fields at the usual top-level. But then we add @timestamp (at the bottom), that means we lose the original @timestamp from the ECS fields. That ok?

I'm not sure what rule is, especially given id and ruleTypeId. Feels like rule should be an object with id and ruleTypeId as fields. Storing the rule name will probably be useful for Discover / Lens purposes.

Do we want producer in here instead of consumer? The current alert types only register a producer. I get this stuff confused sometimes, and I guess it depends on if you're looking at this from a reading POV or writing, and reading makes more sense (customer facing vs Kibana devs), so consumer makes more sense, but ... like I said, I get this stuff confused sometimes :-)

For current alerting purposes, we are mapping params as flattened, which gives us non-strict capabilities but only supports keyword-style searchability. Is the plan to do the same with params here?

For alertId/instanceId, is the intention this is a single string field? Instance id's are certainly the wild-west right now (specific to the rule type), but it's possible in the future these could be aligned across rules. Eg, imagine different rules whose instance id's are a host name / service name. It might be useful to be able to do a search of instance id without an alert id or rule id - "show me all the alerts for instance id 'elastic.co'".

For the current event log, we store the current Kibana's server uuid in the docs. Turns out this has been useful to identify problematic Kibanas (eg, configured with the wrong encryption key) and for other diagnostic purposes, since this is all happening on the back-end and we have nothing like HTTP logs for traceability.

@spong
Member Author

spong commented Mar 9, 2021

I assume ...ecsMappings is all the ECS fields at the usual top-level. But then we add @timestamp (at the bottom), that means we lose the original @timestamp from the ECS fields. That ok?

Within the .siem-signal mapping we store the timestamp of the source event as signal.original_time. Whether or not this is the right spot, we definitely want to capture the original timestamp so we can easily calculate MTTD without a join (among other uses within dashboards, triage workflow, etc).
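
For example (rough, untested sketch), mean time to detect could be computed over existing signals without a join via a scripted avg aggregation, assuming signal.original_time is mapped as a date and the index pattern below is taken as a placeholder:

POST .siem-signals-*/_search
{
  "size": 0,
  "aggs": {
    "mean_time_to_detect_ms": {
      "avg": {
        "script": {
          "source": "doc['@timestamp'].value.toInstant().toEpochMilli() - doc['signal.original_time'].value.toInstant().toEpochMilli()"
        }
      }
    }
  }
}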

I'm not sure what rule is, especially given id and ruleTypeid. Feels like rule should be an object with id and ruleTypeid as fields. Storing the rule name will probably be useful for Discover / Lens purposes.

Rule is a copy of all the fields from the Rule at the time of execution, and serves as a record of what the Rule configuration was at the time the alert was generated. This is necessary for auditing, identifying configuration differences between when it was detected and the Rule's current state, and for searching/filtering and dashboarding the resultant alerts.

@pmuellr
Member

pmuellr commented Mar 9, 2021

I don't think I've seen yet how we propose to handle RBAC concerns here. For Kibana alerting, we use standard feature controls and a producer field in the saved object, to limit access to alerts. We have some additional filtering we add for the producer field as part of the otherwise standard saved object query capabilities.

Currently, there are at least the following producers in use:

  • apm
  • logs
  • infrastructure
  • ml
  • monitoring
  • siem
  • stackAlerts (this covers index threshold, elasticsearch query, and maps rule types)

I'm guessing SIEM hasn't had to deal with this since:

  • it only has one producer to deal with: siem
  • it creates alert data indices per space

We can "roll our own story" here, like we did with event log, where we don't provide open-ended access to the indices, but gate access by requiring queries to specify the saved objects they want to do queries over for the event data. We ensure the user can "read" the requested saved objects, then do the queries filtering over those saved object ids (and the current space the query was targeted at). But this doesn't lend itself to a "use the alert data in Discover or Lens", because of that saved objects filter barrier.

@spong
Member Author

spong commented Mar 9, 2021

Good question @pmuellr! On the SIEM/Security side it's been achieved through both the space-awareness of the indices and, for users who want to keep everything within a single space but maintain some sort of multi-tenancy, document-level security that restricts access based on a namespace-like field within the alert document. We've had users report success with both methods, with the only real complaint being that it needs to be manually configured as spaces are added. This sort of flexibility has proved to be really nice for users as it makes the whole "use the alert data in Discover or Lens" workflow pretty frictionless.
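
As a rough sketch of the document-level security approach (the role name and the namespace-like field/value are just placeholders), a role restricting which alert documents a user can read looks something like:

PUT _security/role/tenant_a_alerts_read
{
  "indices": [
    {
      "names": [ ".siem-signals-*" ],
      "privileges": [ "read" ],
      "query": { "term": { "organization.name": "tenant-a" } }   // any namespace-like field present in the alert docs
    }
  ]
}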

@mikecote
Contributor

Here are a few more questions from the unified architecture workstream doc:

  1. Do we add workflow-related fields to the schema?
  2. Are the rule parameters indexed? (searchable, sortable, filterable, aggregatable)
  3. What do we call the alert identifier/deduplication field? (alertId, instanceId, groupId)

@spong
Member Author

spong commented Mar 10, 2021

Do we add workflow-related fields to the schema?

I'm thinking so -- these fields will be extremely useful for tracking/dashboarding, so keeping them mapped alongside the alert will ensure that's easy to do (no join). Question is how do you track assignment/status over time? Can we rely on the Kibana Audit Log for that data (logged API calls to assign/change status)? Other question is where do they belong. @MikePaquette, are there any ECS plans for capturing workflow fields like these?

Are the rule parameters indexed? (searchable, sortable, filterable, aggregatable)

I think this is a must in effort to allow users and UI's to break down alerts by rule-specific fields, e.g. show all alerts where rule.type: threshold and threshold.field: host.name and threshold.value > 500, or for the security domain, show all alerts where threat.tactic.name: Defense Evasion (TA0005) and threat.technique.name: Abuse Elevation Control Mechanism (T1548).
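
For example (sketch only, assuming these fields are indexed as keyword/numeric under the signal.rule.* paths shown above), the threshold case could be a plain bool filter:

GET .siem-signals-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "signal.rule.type": "threshold" } },
        { "term": { "signal.rule.threshold.field": "host.name" } },
        { "range": { "signal.rule.threshold.value": { "gt": 500 } } }
      ]
    }
  }
}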

What do we call the alert identifier/deduplication field? (alertId, instanceId, groupId)

Would default to alertId, but whatever is easiest to grok. Would prefer against groupId as we're currently using that for identifying all the alerts in an EQL sequence at the moment, but it'll be namespaced so not a big deal, just starts becoming overloaded.

@dgieselaar
Member

I've been exploring various options over the last week to see what we can do with alerts as data. This is more of a brain dump than a complete proposal. I might not use the right terminology in certain places, or the right field names, and my intent is not to cover all scenarios. I would need to learn more about Security/Maps/ML rules to figure out what is generally useful and what isn't, but I've tried to keep whatever I know about other teams' approaches in the back of my mind.

I think we have the following rule type categories:

  • Threshold rule types (alert when aggregate of x exceeds threshold of y)
  • Anomaly rule types (alert when an anomaly was detected for series x when it exceeded the threshold of y - arguably a sub-category of threshold rule types)
  • Log or event rule types (alert when a single event matches a query)
  • Composite rule types (e.g., alert when multiple conditions evaluate to true)

I think up until now we’ve mostly been talking about log or event rule types, which are one-offs: if an event violates the rule, it’s indexed, needs a human to close it, and should only notify (execute actions) once. I.e., the alert is an event, without a lifecycle.

For the other rule types, there is a lifecycle. An alert can activate, then recover. It can also become more or less severe over time. I think it’d be valuable to capture and display that progression as well.

We also need grouping - because of the underlying progression, but also because one rule can generate many alerts, and they all might point to the same underlying problem. Consider an infrastructure alert for CPU usage of Kubernetes pods for a user that has 10k pods. The execution interval is set to 1m. If there is a GCP outage, the rule will generate 10k alerts per minute. In this scenario, we'd like to group by rule (id), and display a summary of the alerts as "Reason". We also might want to group by hostname, or show alerts that recovered as different entries in the table. Here's an example of the latter (which also shows the progression of the severity over time).

image

To make this work, we would need another layer of data: alert events. One alert can generate multiple alert events during its lifespan. If an alert recovers, it reaches the end of its lifespan. The next time the violation happens, a new alert happens. Changes in severity do not create a new alert.

Alerts can be grouped together by a grouping key. E.g., in the aforementioned infrastructure rule, the grouping key would be host.hostname. In the UI, we could then allow the user to investigate the history of alerts for host.hostname (or any other grouping key).

Conceptually, a rule type, a rule, an alert, and a violation would all map to event data, and for the alert event, this data would be merged together.

Each alert event captures the following data about the alert:

  • alert.id: this id is regenerated every time an alert starts. If it recovers, and activates, a new id is generated.
  • alert.created: the start time of the alert.
  • alert.grouping_key: the grouping key of the alert. Unique in the context of the rule. This field allows the framework to keep an alert “alive” over the course of multiple executions of the rule.
  • alert.active: whether a violation occurred in the last execution of the rule.

Additionally, information specific to the violation is stored. This allows us to visualize the progression of the severity over time.

  • alert.violation.level: the severity of the violation: low, minor, major, warning, critical. Perhaps good to index as a numeric value so we can sort on it. Optional, defaults to warning.
  • alert.violation.value: for threshold-based alerts, this would be the actual value of whatever metric is being monitored. For security, this could perhaps be the risk score (?). Optional.
  • alert.violation.threshold: a numeric value that indicates the threshold for the alert. This could be a static value, defined in the rule, but it could also be specific to the alert, or even specific to the violation (e.g., for an anomaly based rule, it might be the upper bound of the expected range for which data is considered non-anomalous). Also optional.
  • @timestamp: the timestamp of the violation.

On all events, some data about the rule type and the rule itself should be added as well:

  • rule.id: the uuid of the rule instance.
  • rule.name: the human-readable label of the rule instance.
  • rule_type.id: the static id of the rule type
  • rule_type.name: the human-readable label of the rule type.
  • rule_type.description: the human-readable description of the rule type.

I would consider open/closed to be different from active/recovered. E.g., an alert could auto-close, meaning that it is closed immediately after the alert recovers. Or, it could auto-close 30m after the alert recovers. The alert would be kept “alive” by the framework until the timeout expires. For now, I’ve left open/closed out of the equation, as it might not be necessary now, but a possibility is that if an alert closes later than it recovers, a new event is indexed, with the state from the last alert event, but with alert.closed: true.

Additionally, there's one other concept that I've been experimenting with that might be interesting: influencers. Similar to ML, these would be suspected contributors to the reason of the violation. These values would be indexed under alert.violation.influencers, which can be a keyword field or a flattened field. For instance, if I have a transaction duration rule for the opbeans-java service, and it creates an alert, one of the influencers would be service.name:opbeans-java. But we also might want to run a significant terms aggregation, compare data that exceeded the threshold with data that didn't, surface terms that occur more in one dataset than the other, and index them as influencers. This could surface host.hostname:x as an influencer, and would allow us to correlate data between alerts from different solutions. It could also be multiple hostnames. These fields might be ECS-compatible, but I would still index them under alert.violation.influencers. The difference is that I see fields stored as ECS as a guarantee that the alert relates to that field, while influencers are a suspicion. E.g., if we suspect that a host contributed to the violation of the transaction duration rule for opbeans-java, we would index it as an influencer only. If we have an infrastructure rule for a host name, and a violation of the rule occurs, it is indexed both under host.hostname and alert.violation.influencers. This allows the Infrastructure UI to recall alerts that are guaranteed to be relevant for the displayed host, but also display possibly related alerts in a different manner.

The event-based approach would also allow us to capture state changes over time, e.g. we could use them to answer @spong’s question about tracking changes in open/closed over time. Any state change would be a new event, inheriting its state from the previous event.

Querying this data generally means: get me the values of the last alert event if I group by x/y. The easiest way to do this is a terms aggregation with a nested top_metrics aggregation, sorted by @timestamp descending. It maps quite well to tools like Lens and Discover. E.g., here's how I can visualize in lens if an alert is flip-flapping:

image

As you can see, there are gaps, which meant that the alert recovered, but then activated again. In this case, that's because I'm disabling and enabling the rule. But this could also show that the threshold is too low, allowing the user to adjust that. Or, hypothetically, the framework could keep an alert alive for a short time period, to prevent it from flip-flapping.
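
For reference, the "last alert event per group" query I have in mind is roughly this shape (untested sketch; the index pattern is a placeholder and the fields are the ones proposed above):

GET alerts-*/_search
{
  "size": 0,
  "aggs": {
    "alerts": {
      "terms": { "field": "alert.grouping_key", "size": 100 },
      "aggs": {
        "latest": {
          "top_metrics": {
            "metrics": [
              { "field": "alert.violation.level" },
              { "field": "alert.violation.value" }
            ],
            "sort": { "@timestamp": "desc" }
          }
        }
      }
    }
  }
}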

@mikecote
Contributor

Are the rule parameters indexed? (searchable, sortable, filterable, aggregatable)

I think this is a must in effort to allow users and UI's to break down alerts by rule-specific fields, e.g. show all alerts where rule.type: threshold and threshold.field: host.name and threshold.value > 500, or for the security domain, show all alerts where threat.tactic.name: Defense Evasion (TA0005) and threat.technique.name: Abuse Elevation Control Mechanism (T1548).

I would highly recommend applying lessons learned from #50213 to the schema. We've explored different paths but the one that works is to create an index per rule type. This would mean having ...-<ruleTypeId>-... somewhere in the index pattern and explicit rule param mappings for each property for each rule type.
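
As a sketch (the template name, index pattern, and params below are purely illustrative), that could look like one index template per rule type with the rule params mapped explicitly:

PUT _index_template/alerts-apm-transaction-duration
{
  "index_patterns": [ ".alerts-apm.transaction_duration_threshold-*" ],   // ...-<ruleTypeId>-... in the pattern
  "template": {
    "mappings": {
      "properties": {
        "rule": {
          "properties": {
            "params": {
              "properties": {
                "threshold": { "type": "double" },      // hypothetical param for this rule type
                "environment": { "type": "keyword" }    // hypothetical param for this rule type
              }
            }
          }
        }
      }
    }
  }
}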

@mikecote
Contributor

mikecote commented Mar 15, 2021

I would like to capture the "why" that the schema has to be in ECS format (from an observability and Security perspective). The reasons will be useful input for the saved-objects as data story (#94502).

@mikecote
Contributor

Also, I don't see workflow-related fields in the schema above (status, etc), are they part of ...ecsMapping?

@spong
Member Author

spong commented Mar 16, 2021

I would like to capture the "why" that the schema has to be in ECS format (from an observability and Security perspective).

I believe @FrankHassanabad captured it best in a demo awhile back, but the power of using ECS (in these docs) across solutions is in the use and re-use of visualizations and workflows, being able to view these alerts in custom solution-based views like an authentication table or transaction log, and allowing users to quickly and easily filter on fields that are common and familiar.

Also, I don't see workflow-related fields in the schema above (status, etc), are they part of ...ecsMapping?

I don't think they were captured in the initial doc(?). For now I was thinking we'd namespace them under alert while we discuss with the ECS folks whether there's any plan to capture workflows like this.


I made a few small updates to the schema in the above description: namespaced the fields under alert, added workflow fields, and removed baseid as it was intended to be the id of the rule that created the alert, which will be captured in rule.id. I also think we can remove ruleTypeId as it'll be captured within rule.type.

As mentioned earlier, we'll need to capture the alerts-on-alerts fields as well -- will finalize those as part of implementing the different rule types within the rac test plugin. I've created a draft PR of what I've had a chance to put together so far (not much), but it at least covers the bootstrapping of the index/template/ilm and has the same script for generating the mapping as the event_log. I did the initial generation off of ECS 1.9 and just pasted the existing signal mapping in there as a placeholder while I get to know the script generation logic, but we're closer to having a spot for iterating on the schema and bulk-indexing strategies. 🙂

@mikecote
Contributor

I would like to capture the "why" that the schema has to be in ECS format (from an observability and Security perspective).

I believe @FrankHassanabad captured it best in a demo awhile back, but the power of using ECS (in these docs) across solutions is in the use and re-use of visualizations and workflows, being able to view these alerts in custom solution-based views like an authentication table or transaction log, and allowing users to quickly and easily filter on fields that are common and familiar.

@sqren @dgieselaar is the above the same for Observability's use case to use ECS?

I don't think they were captured in the initial doc(?). For now I was thinking we'd namespace them under alert while we discuss with the ECS folks whether there's any plan to capture workflows like this.

@spong are the workflow fields something used by the Security solution or only by Observability? I would like to capture and discuss why we wouldn't use Cases instead of duplicating their workflows.

I made a few small updates to the schema in the above description

Some thoughts:

  • Would it be better to have rule as its own root object? So we have rule.id, rule.type.id, alert.id, etc
  • I would like to discuss original_alert first before committing to a structure in the schema
    • Is it just an id?
    • Is it content from another alert?
    • Is this something observability needs?

@mikecote
Contributor

mikecote commented Mar 16, 2021

cc @kobelb, @stacey-gammon

@dgieselaar
Member

dgieselaar commented Mar 16, 2021 via email

@jasonrhodes
Member

I had a good meeting with @dgieselaar, @spong, @smith, and a few others last week to discuss alert concepts to make sure we are all on the same page with how to talk about all of these moving parts. I think we all have a common understanding of the general issues around merging our two sets of concepts, but there are still a few things I'm not clear on.

Here are some images to try to help push this discussion forward.

These are the concepts that I understand for observability rules + alerts:

o11y-alert-concepts

These are the related concepts that I (admittedly very vaguely) understand for security rules + alerts:

security-alert-concepts

My questions are about this concept I've called an "alert stream" or that Dario refers to as "alert events", and how observability treats each detected violation as an immutable event, with user-centric "alerts" being aggregations of those events (including an eventual recovery event). Whereas security, I think, creates a mutable violation/alert and then updates that object with different workflow states (open|in-progress|closed). It's not yet clear to me how we are planning to merge these concepts in a schema we both share.

Also, I have a question about security violations/alerts as depicted in my crude drawing: Can a given security detection generate multiple violations/alerts in this way? What am I misunderstanding about this model still?

@tsg
Contributor

tsg commented Mar 16, 2021

Also, I have a question about security violations/alerts as depicted in my crude drawing: Can a given security detection generate multiple violations/alerts in this way? What am I misunderstanding about this model still?

Yes, one rule execution can result in multiple alerts. Each alert has its own status and can be marked in progress/closed independently. The typical example is that a search rule matches multiple documents and we create an alert for each. This example is our most simple/common rule type, but it's worth looking at our more complex rule types as well.

In particular, the Event Correlation rule type has some similarities with the Observability alerts. A correlation rule detects a sequence of events (each event respects some condition, in a particular order). To use your terminology, we create "alert events" for each individual event and then also a single user-centric mutable alert that we display to the user. In our terminology, we call the events "building-block alerts" and the user-centric alert, simply Alert.

The "alert events" are good because they save the documents in case they get deleted by ILM and capture the state of the docs as they were at rule execution time. The user-centric alert is good because it contains the mutable state and makes it easy to page through in the UI and create custom visualizations on top.

I'm thinking the same model can apply to Observability alerts. WDYT?

@dgieselaar
Member

dgieselaar commented Mar 16, 2021

@tsg:

I'm thinking the same model can apply to Observability alerts. WDYT?

To some extent, yes. But I'm not sure if we need to mutate things. We can "just" write the latest alert state to an index, and then use aggregations or collapse to get the last value. Otherwise, we would either need some kind of job that cleans up old alerts, or have a user do that manually (and maybe bulk it). For the mutable alerts in security, how do they get deleted?
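
For example, a sketch of the collapse approach (the index pattern is a placeholder; the field name is from my earlier comment):

GET alerts-*/_search
{
  "collapse": { "field": "alert.grouping_key" },   // one hit per alert group, taking the latest event
  "sort": [ { "@timestamp": "desc" } ]
}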

(fwiw I think there's definitely common ground here, and at least for APM we will be looking at rules that might be more similar to Security rules, e.g. detection of new errors, so I'm not too worried about diverging).

@tsg
Contributor

tsg commented Mar 16, 2021

To some extent, yes. But I'm not sure if we need to mutate things. We can "just" write the latest alert state to an index, and then use aggregations or collapse to get the last value.

To make sure I understand, you are saying that you could only index the "alert event", and that the "user-centric alert" doesn't need to be explicitly present in the index?

I think we'll need mutations for marking in progress/closed/acknowledged/etc., right? That is related to the MTTx discussion earlier.

Also, I think indexing a "user-centric" alert makes querying and visualizations easier because you don't need an aggregation layer at query time.

Otherwise, we would either need some kind of job that cleans up old alerts, or have a user do that manually (and maybe bulk it). For the mutable alerts in security, how do they get deleted?

The .alerts indices have ILM policies. By default, they are not deleted, but users can configure a Delete phase in ILM.

(fwiw I think there's definitely common ground here, and at least for APM we will be looking at rules that might be more similar to Security rules, e.g. detection of new errors, so I'm not too worried about diverging).

++, I think though that the decision of always having a user-centric alert indexed is important because then we know we can rely on it in any new UI.

@dgieselaar
Member

I think we'll need mutations for marking in progress/closed/acknowledged/etc., right? That is related to the MTTx discussion earlier.

In my head, we wouldn't need mutations. We'd just append a new event, with the updated state.

Also, I think indexing a "user-centric" alert makes querying and visualizations easier because you don't need an aggregation layer at query time.

Yeah, maybe? I don't know. I put a lot of faith in aggregations and I think ES can help here as well. I think for Observability we almost always just want to aggregate over events. My perception is that there are better ways to surface the most important data than asking the user to paginate through the whole dataset. Aggregation/grouping might be an interesting default for the Security alerts table as well.

The .alerts indices have ILM policies. By default, they are not deleted, but users can configure a Delete phase in ILM.

I'm not sure if I fully understand how ILM works, but suppose the user configures their policy to delete alerts after 30 days, corresponding to the retention period of their machine data, and an alert is in progress for longer than 30 days, is it deleted? That seems like an edge case, but the answer would help me understand the implications of mutating alert documents better.

Maybe mutating data is easier. But I think that you'll end up having to manage (sync?) two data sources (the events, and the user-centric alert), and things might get ugly quickly.

@dgieselaar
Member

Not sure if this is the right place so feel free to slack or email about it, but what are use cases today that cannot be satisfied by aggregations or collapse?

@dgieselaar
Member

dgieselaar commented Mar 28, 2021

Maybe the right word is "evaluation" instead of "check". Thinking out loud: an evaluation of the rule might be a violation of the rule, but not always (e.g., you'd have ok, violation, unknown). An evaluation would be a metric. A rule execution might result in multiple evaluations, but no violations (for all services monitored by this rule, the average latency was below the threshold). Or multiple violations, and no alerts (e.g., the alert will only be created after three consecutive violations). Or no evaluations, and only an alert (e.g. when extracting events).

Maybe there are two distinct phases to rule execution: evaluate, and alert.

The severity belongs on the alert, not on the evaluation. Whatever the last severity level is will be stored on the alert. If the changes in severity are important we can query the metrics - we should be able to store most if not all alert fields on the evaluation metric document. If not, we can just query the alerts.

Here's an example of a rule that monitors latency for all production services: a violation occurs at 13:00, opening an alert; the alert stays active at 13:01 when the latency is above the threshold, and closes at 13:02 when the latency is below the threshold.

[
  {
    "@timestamp": "2021-03-28T13:00:00.000Z",
    "event.kind": "metric",
    "event.action": "evaluate",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.duration.us": 0,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "open",
    "evaluation.value": 1000,
    "evaluation.threshold": 900,
    "evaluation.status": "violation",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:00:00.000Z",
    "event.kind": "alert",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.duration.us": 0,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "open",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:01:00.000Z",
    "event.kind": "metric",
    "event.action": "evaluate",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.duration.us": 60000000,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "open",
    "evaluation.value": 1050,
    "evaluation.threshold": 900,
    "evaluation.status": "violation",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:01:00.000Z",
    "event.kind": "alert",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.duration.us": 60000000,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "open",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:02:00.000Z",
    "event.kind": "metric",
    "event.action": "evaluate",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.end": "2021-03-28T13:02:00.000Z",
    "alert.duration.us": 120000000,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "close",
    "evaluation.value": 500,
    "evaluation.threshold": 900,
    "evaluation.status": "ok",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:02:00.000Z",
    "event.kind": "alert",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
    "alert.id": "opbeans-java",
    "alert.start": "2021-03-28T13:00:00.000Z",
    "alert.end": "2021-03-28T13:02:00.000Z",
    "alert.duration.us": 120000000,
    "alert.severity.value": 90,
    "alert.severity.level": "critical",
    "alert.status": "close",
    "service.name": "opbeans-java",
    "service.environment": "production"
  },
  {
    "@timestamp": "2021-03-28T13:03:00.000Z",
    "event.kind": "metric",
    "event.action": "evaluate",
    "rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
    "rule.id": "apm.transaction_duration_threshold",
    "rule.name": "Transaction duration for production services",
    "rule.category": "Transaction duration",
    "producer": "apm",
    "alert.id": "opbeans-java",
    "evaluation.value": 300,
    "evaluation.threshold": 900,
    "evaluation.status": "ok",
    "service.name": "opbeans-java",
    "service.environment": "production"
  }
]

@jasonrhodes
Member

So as best as I can tell, we've discussed a few different possible document types we could index:

[A] Signal/Alert Document: (event.kind: signal)
A document that represents the user-facing alert itself, with all of the necessary information attached such as duration, current severity, etc. This document is mutable and would be updated on subsequent re-evaluations, state changes, etc. (I believe this is what Security indexes today?)

[B] Evaluation Document: (event.kind: metric)
Represents an individual evaluation (not user-facing on its own) and is stored discretely for each evaluation (TBD whether all "OK" evaluations would generate such a document). The information stored here would include the threshold, the actual value detected, severity of the violation, specific error message, and/or any other point-in-time information specific to this individual evaluation.

[C] State-change Document: (event.kind: event)
Represents any other kind of state change of the Signal/Alert document, such as human workflow ("open" -> "in progress"). Would be useful for tracking the history of such changes. (Would severity change be a state change too?) Security does not currently index these kinds of changes, but rather just updates the Signal/Alert document directly.

Our indexing options include:

  1. Only store [A]. The Alert Table query would only look for these Alert/Signal documents. It may be difficult to look at changes over time to states like severity, workflow, etc.
  2. Only store [B]. The Alert Table query could either (i) always aggregate Evaluation documents or (ii) query all signal documents in a given range, grouped by some id, and only return the last of the group -- where states like duration, etc. could be stored and passed through by each executor.
  3. Store [A] and ([B] and/or [C]). The Alert Table query could still just look for Alert/Signal documents, but could drill into the changes over time by doing a separate query for related evaluation/state change documents. This involves the most writes/updates and the most storage, but the most flexibility at query time. Evaluations would result in either two new documents (one [A] and one [B]) or one create ([B]) and one update ([A]).
  4. Allow Rule Types to dictate which of these documents they store. This means you would not be able to consistently query for alerts, so I'm not sure this is a viable option.
  5. Require storing [A], Rule Types can opt-in to store [B] and/or [C]. This has flexibility but is still somewhat complicated re: knowing things about a Rule Type before you can query for its alerts, in some cases.

My biggest question right now, I think, is do we really need more than indexing option (1) here?

@andrewvc from Uptime has a lot of thoughts about this exact thing, so I'd like him to chime in from another Observability perspective, as well.

Thanks!

@dgieselaar
Member

My biggest question right now, I think, is do we really need more than indexing option (1) here?

We cannot do the severity log. But I want to challenge the idea that pre-aggregating and updating a single document is the simpler approach. From what I can tell there are reports of performance issues where bulk updates of tens of thousands of alerts can take longer than 60s (@spong correct me if I misunderstood that). For Observability alerts this could happen every minute or even at a higher rate.

@jasonrhodes
Member

@dgieselaar I had a good talk with @andrewvc yesterday where he talked at length about the simplicity gains of updates over relying on create-only + aggregations. I'll let him elaborate.

The Metrics UI is already a massive tangle of giant aggregations for which we are constantly wishing we had more stable and easily queryable documents, so I'm at least open to the idea that we may need to get over our fear of updates, but I'm not deep enough in the weeds to have a solid opinion.

@simianhacker @weltenwort do either of you have insights on this specific part of the conversation, re: storing lots of individual documents and aggregating for queries vs updating a single document and querying for that?

@jasonrhodes
Member

But I want to challenge the idea that pre-aggregating and updating a single document is the simpler approach.

Also, the read would definitely be simpler, in the sense that it's one less level of aggregation to think through (and be limited by in some ways). The update may be more expensive than create/append-only, but I'm not sure to what extent.

@dgieselaar
Member

Also, the read would definitely be simpler, in the sense that it's one less level of aggregation to think through (and be limited by in some ways).

If we don't need to visualize severity or state changes over time, agree. But from what I know we do want to have that.

The update may be more expensive than create/append-only, but I'm not sure to what extent.

Me neither. I asked for guidance from the ES folks but from what I understand there are no benchmarks for frequent updates because that is not recommended in general. We should probably chat to them at some point soon.

@andrewvc
Contributor

andrewvc commented Apr 1, 2021

What a thread, I'm finally caught up reading the whole thing :). Conclusive answers here are tricky, and while I have a number of open questions at this point I'm leaning toward the 'signal' doc approach as superior (though I'm not 100% certain there).

The main question that kept coming into my head reading through this thread is 'what are the actual queries we want to run on this data?'. The challenge here is that we want a schema that works for multiple future UI iterations, where the queries are unknown. Any choice we make here amounts to a bet on the sorts of features we'll develop in the future. There's no one right answer.

Given that, if the main thing we want to represent in our UIs is the aggregated data (signals), that points toward using these mutable signal docs IMHO. Both approaches can work, but append only is more limiting in many read-time scenarios.

Synthetics

I have some further specific questions about how this would all play with Uptime/Synthetics. A common question we have to answer is 'give me a list of monitors filtered by up/down status'. It would be great to shift this to a list of monitors by alert active/not active, since that aligns the UI behavior with the alert behavior, and really makes alerts the definition of whether a monitor is good / bad.

Some questions about how this impacts us:

  1. Querying for down monitors is easy, just search the alerts index, but how do we query for monitors that are up? Do we need signal documents for the up state? Do we only let users filter by down (e.g. alerting) in the UI?
  2. How do we prevent overlap? Currently alerts are scoped by query, meaning two alerts can affect the same monitor. Ideally a monitor would have a single definition of healthy/unhealthy. We can probably solve this internally with some sort of hierarchy or priority (setting alerts on a per monitor basis is onerous at scale).
  3. To confirm, we'd need to copy any uptime fields a user might search on into the alerts index right? This will have implications for storage needs, but is probably acceptable.

One other option we've discussed is embedding the alert definition in the Heartbeat config with Fleet/Yaml, and doing the sliding window status evaluation on the client side, in Heartbeat, embedding the status / state in each heartbeat document. The performance cost here is essentially zero to record both contiguous up and down states. This process would let us look at a monitor's history as a series of state changes, as in the image below:

image

This effect can be achieved by the equivalent of embedding an 'evaluation' doc in each indexed doc, but with very little data overhead since it's just a few fields in each doc. We could probably derive a 'signal' doc in a variety of ways (dual write from the beat, some bg process in Kibana etc), or defer that work till needed.
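
To illustrate (the state.* field names here are purely hypothetical, not an existing Heartbeat schema), each indexed check could carry something like:

{
  "@timestamp": "2021-04-01T12:00:00.000Z",
  "monitor": { "id": "my-monitor", "status": "down" },
  "state": {                                   // hypothetical embedded evaluation/state fields
    "status": "down",
    "started_at": "2021-04-01T11:58:00.000Z",
    "checks_down": 3
  }
}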

That said, there's a lot to be said for the elegance of doing it all centrally in Kibana, you have much more power at your fingertips in terms of data context, but then you have more perf concerns as well (like scaling Kibana workers).

Performance

Performance is a great thing to be thinking about now, and it's a tough comparison, because while we may save at index time by going append-only we may pay a large cost in software complexity and slow queries doing weird filtering, perhaps being forced to do that on composite aggs. We have to do this style of query on the Uptime overview page, where it's slow, complex, and painful.

One thing that I would like us to consider generally is that any system that gets more active as things get worse is somewhat scary. When half your datacenter dies and an alert storm ensues you don't want your ES write load to jump by 1000% etc. One advantage of a system that reports both up and down states (nice for Synthetics use cases) would be that it would have less performance volatility. This also points toward mutable docs since steady states would be less onerous in terms of storage requirements (if you implement a flapping state they can be damped even further).

I'll also add that it's hard to compare perf of updates vs append-only if the read-time costs of append-only wind up higher.

@dgieselaar
Member

dgieselaar commented Apr 2, 2021

Cheers for chiming in @andrewvc, good stuff - your experience with heartbeat/uptime is very valuable in the context of this discussion.

The main question that kept coming into my head reading through this thread is 'what are the actual queries we want to run on this data?'. The challenge here is that we want a schema that works for multiple future UI iterations, where the queries are unknown. Any choice we make here amounts to a bet on the sorts of features we'll develop in the future. There's no one right answer.

++. But, that's also why I feel we should store more, not less, precisely because we don't know.

Querying for down monitors is easy, just search the alerts index, but how do we query for monitors that are up? Do we need signal documents for the up state? Do we only let users filter by down (e.g. alerting) in the UI?

I think you'd have to have one changeable alert document per monitor, that you continuously update with the latest status. At least, I don't see how you avoid the issues you mention if you create a new changeable alert document on up/down changes. Having one long-running changeable alert document is somewhat concerning to me though.

How do we prevent overlap? Currently alerts are scoped by query, meaning two alerts can affect the same monitor. Ideally a monitor would have a single definition of healthy/unhealthy. We can probably solve this internally with some sort of hierarchy or priority (setting alerts on a per monitor basis is onerous at scale).

I'm not sure what the "alert" is you're referring to here. In the new terminology, is it the rule? E.g., the user can create multiple rules that evaluate the same (Uptime) monitor?

To confirm, we'd need to copy any uptime fields a user might search on into the alerts index right? This will have implications for storage needs, but is probably acceptable.

Yep, I'm hoping ES compression will help us here :).

One other option we've discussed is embedding the alert definition in the Heartbeat config with Fleet/Yaml, and doing the sliding window status evaluation on the client side, in Heartbeat, embedding the status / state in each heartbeat document. The performance cost here is essentially zero to record both contiguous up and down states. This process would let us look at a monitor's history as a series of state changes, as in the image below:

[...]

This effect can be achieved by the equivalent of embedding an 'evaluation' doc in each indexed doc, but with very little data overhead since it-s just a few fields in each doc. We could probably derive a 'signal' doc if in a variety of ways (dual write from the beat, some bg process in Kibana etc), or defer that work till needed.

I don't entirely follow :) what documents are you indexing and / or updating here? One document per state change, and update that?

That said, there's a lot to be said for the elegance of doing it all centrally in Kibana, you have much more power at your fingertips in terms of data context, but then you have more perf concerns as well (like scaling Kibana workers).

++. I think eventually we want something that can be applied both as a rule in Kibana, or as a rule in a beat (PromQL recording/alerting rules come to mind). I think Security also has something like this where rule definitions are shared between Kibana rules and the endpoint agents. @tsg is that correct?

Performance is a great thing to be thinking about now, and it's a tough comparison, because while we may save at index time by going append-only we may pay a large cost in software complexity and slow queries doing weird filtering, perhaps being forced to do that on composite aggs. We have to do this style of query on the Uptime overview page, where it's slow, complex, and painful.

Agree, and I want to emphasise again that I am not suggesting we repeat the problems that Uptime and Metrics UI are running into. Which is why I think we should re-open elastic/elasticsearch#61349 (comment) :). If ES supports something like give me the top document of value x for field y, and only filter/aggregate on those documents, that would be a huge power-play.

One thing that I would like us to consider generally is that any system that gets more active as things get worse is somewhat scary. When half your datacenter dies and an alert storm ensues you don't want your ES write load to jump by 1000% etc. One advantage of a system that reports both up and down states (nice for Synthetics use cases) would be that it would have less performance volatility. This also points toward mutable docs since steady states would be less onerous in terms of storage requirements (if you implement a flapping state they can be damped even further).

++ on having consistent output (ie, evaluations). But not sure if mutable docs are reasonable here? If a very large percentage of your documents are continuously being updated, will that not create merge pressure on Elasticsearch because it tries to merge segments all the time due to the deleted documents threshold being reached?

@tsg
Contributor

tsg commented Apr 2, 2021

Thanks @andrewvc for chiming in (and for the patience of reading the whole thing :) ).

I have some further specific questions about how this would all play with Uptime/Synthetics. A common question we have to answer is 'give me a list of monitors filtered by up/down status'. It would be great to shift this to a list of monitors by alert active/not active, since that aligns the UI behavior with the alert behavior, and really makes alerts the definition of whether a monitor is good / bad.

Interesting. I'm curious about the advantages of this approach of using the alerts data as source of truth versus maintaining your own state in the app and creating alerts to reflect that state. Is it a matter of consolidating the logic in a single place, and that place is the Rule type? I think that can work, I'm just considering if this will bring a new set of requirements on the RAC :)

Btw, Heartbeat might have some similarities with Endpoint in this regard. The Endpoint knows already what is an alert as soon as it happens on the edge (e.g. malware model was triggered). It indexes an immutable document in its normal data streams. Then we have a "Promotion Rule", that's supposed to be always on, which "promotes" the immutable alert documents to Alerts with workflow (event.kind: signal) in the alerting indices. The Promotion Rule is a normal Alerting/Detection Engine rule, which has the advantage that users can tune it to some degree, for example by adding exceptions or overriding severity.
Brainstorming further:

Querying for down monitors is easy, just search the alerts index, but how do we query for monitors that are up? Do we need signal documents for the up state? Do we only let users filter by down (e.g. alerting) in the UI?

I'm thinking easiest would be that for each transition to down status you would store an Alert (event.kind: signal) and for each up status transition you would store another document (event.kind: event, perhaps) that you can also update over time if you want to.

Then the Alerting view in Obs or Kibana top-level will only show the down status alerts by default, but the custom UI in the Synthetics app can query both down and up status alerts.

In other words, the contract is that event.kind: signal is marking what is considered "bad" by the Rule type and that's what we display by default in the generic Alerting views. The Rule type code can store any other documents in the alerting indices as long as they don't cause mapping conflicts or field explosion, and the custom UIs can use them as they please.
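
To make that contract a bit more concrete, here is a rough sketch of what a rule type could write and what the generic views could filter on (the index name, event.action values, and monitor fields are made up for illustration):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const ALERTS_INDEX = '.alerts-observability-uptime'; // hypothetical index name

async function onStatusChange(monitor: { id: string; name: string }, status: 'up' | 'down') {
  if (status === 'down') {
    // Down transition: store an Alert ("bad" state) document.
    await client.index({
      index: ALERTS_INDEX,
      body: {
        '@timestamp': new Date().toISOString(),
        event: { kind: 'signal', action: 'open' },
        monitor,
      },
    });
  } else {
    // Up transition: store a plain event document, which the custom
    // Synthetics UI can query (and update later if needed).
    await client.index({
      index: ALERTS_INDEX,
      body: {
        '@timestamp': new Date().toISOString(),
        event: { kind: 'event', action: 'recovered' },
        monitor,
      },
    });
  }
}

// Generic Alerting views would only show the "bad" documents by default:
const defaultAlertingViewFilter = { term: { 'event.kind': 'signal' } };
```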

@jasonrhodes

@dgieselaar, @andrewvc, and I just had a short zoom call to discuss observability schema stuff, and I think we settled on:

  • All rule types store document [A] (Alert/Signal) which is updated with various state values
  • Individual rule types can opt-in to storing documents [B] (evaluation) and/or [C] (state-change), which are immutable point-in-time events

(For reference, I made up these document letter names in an earlier comment above.)

Observability will likely start out storing [A] and [B], we expect Security may continue storing only [A], and that should work fine for all of our needs and create a consistent query API for an alert table.
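
For readers skimming the thread, a rough TypeScript sketch of the three document shapes we have in mind (field names are illustrative only, not an agreed schema):

```ts
// [A] Alert/Signal: one document per alert, mutable, updated with state values.
interface AlertDocument {
  '@timestamp': string;              // when the alert was (last) evaluated
  'event.kind': 'signal';
  'alert.status': 'open' | 'closed'; // mutable workflow state
  'rule.id': string;
  // ...plus ECS context fields
}

// [B] Evaluation: immutable point-in-time record of a single rule check,
// written whether or not the check breached the threshold.
interface EvaluationDocument {
  '@timestamp': string;
  'event.kind': 'event';
  'rule.id': string;
  'evaluation.value': number;     // what was observed
  'evaluation.threshold': number; // what it was compared against
}

// [C] State change: immutable record of a transition (e.g. ok -> alerting).
interface StateChangeDocument {
  '@timestamp': string;
  'event.kind': 'event';
  'event.action': 'open' | 'close';
  'rule.id': string;
}
```

With only [A] required, both solutions can share one query shape for an alerts table, while [B] and [C] stay opt-in per rule type.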

We also talked about further engaging the Elasticsearch team about update performance at scale, among other things, which @dgieselaar and @andrewvc will continue looking into in parallel.

@jasonrhodes

In other words, the contract is that event.kind: signal is marking what is considered "bad" by the Rule type and that's what we display by default in the generic Alerting views. The Rule type code can store any other documents in the alerting indices as long as they don't cause mapping conflicts or field explosion, and the custom UIs can use them as they please.

@tsg this feels like it aligns perfectly with what I wrote in my last comment, but I hadn't seen this bit in yours. If so, then I think we've got good alignment here! cc @spong

@tsg

tsg commented Apr 3, 2021

@jasonrhodes Yes, I think we're aligned 👍. Because this ticket is so long, I'll write a tl;dr of this conclusion and add it to the description early next week.

@tsg

tsg commented Apr 5, 2021

Individual rule types can opt-in to storing documents [B] (evaluation) and/or [C] (state-change), which are immutable point-in-time events

One thing that I want to point out about these [B] and [C] documents: if we store them in the Alerts indices, they will be bound by the Alerts indices mapping. That means ECS + some fields that we agree on.

@jasonrhodes @andrewvc @dgieselaar for the use cases that you have in mind, are there fields that you expect to need to sort/filter by that are not in ECS? If yes, can you list them to see if we could still include them in the mapping? As long as they don't risk conflicting with future ECS fields and there are not too many of them, we can probably just add them to the mapping.

@dgieselaar

dgieselaar commented Apr 5, 2021

@tsg the plan is to create separate indices for solutions or even specific rule types. The fields needed for a unified experience should be in the root mapping that is shared between all indices. So, any specific fields we need might start out being defined in some of the Observability indices, and they can be "promoted" to the root mapping once we feel it's mature enough and widely useful.

In my head, any kind of event can be stored in these indices. That means state changes, evaluations, but also rule execution events.
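
For illustration, one way the "shared root mapping plus per-solution additions" could be wired up with composable templates (all names here are hypothetical):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function installAlertTemplates() {
  // Fields every alerts index agrees on (ECS subset + common alert fields).
  await client.cluster.putComponentTemplate({
    name: 'alerts-root-mappings',
    body: {
      template: {
        mappings: {
          properties: {
            '@timestamp': { type: 'date' },
            event: { properties: { kind: { type: 'keyword' } } },
            rule: { properties: { id: { type: 'keyword' } } },
          },
        },
      },
    },
  });

  // Solution- or rule-type-specific fields start out in their own component
  // template and can later be "promoted" into the root one.
  await client.cluster.putComponentTemplate({
    name: 'alerts-observability-mappings',
    body: {
      template: {
        mappings: {
          properties: {
            evaluation: {
              properties: {
                value: { type: 'double' },
                threshold: { type: 'double' },
              },
            },
          },
        },
      },
    },
  });

  await client.indices.putIndexTemplate({
    name: 'alerts-observability',
    body: {
      index_patterns: ['.alerts-observability*'],
      composed_of: ['alerts-root-mappings', 'alerts-observability-mappings'],
    },
  });
}
```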

I do wonder if we should store signal documents in another index that is not ILMed (with the same mappings). From what I know, ILM requires one and only one write index. If we roll over an index on a stack upgrade, presumably this puts the rolled-over index into read-only mode. How do we then update signal documents in indices that have just rolled over?

(Added this to the agenda for tomorrow)

@tsg

tsg commented Apr 5, 2021

the plan is to create separate indices for solutions or even specific rule types. The fields needed for a unified experience should be in the root mapping that is shared between all indices. So, any specific fields we need might start out being defined in some of the Observability indices, and they can be "promoted" to the root mapping once we feel it's mature enough and widely useful.

Ok, yeah, we have individual indices anyway for other reasons, so maybe this is a non-issue. It would still be good to have an overview of the fields so that we get ahead of potential conflicts and confusion. For example, if two rule types use the same field name but with different types, it might be a good idea to resolve one way or the other to avoid future pain.

I do wonder if we should store signal documents in another index that is not ILMed (with the same mappings). From what I know ILM requires one and only one write index. If we roll over an index on a stack upgrade, presumably this puts the rolled over index in read mode. How do you then update signal documents from indices that just rolled over?

Are you perhaps thinking of data streams? I think plain ILM doesn't have this restriction; I did a quick test on the signal indices and it seems to work.
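
For reference, this is roughly how an update can still reach a document that lives in a rolled-over backing index with a plain alias + ILM (no data stream). The kibana.rac.alert.* field names below are placeholders, not an agreed schema:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function closeAlert(alertUuid: string) {
  // Look the alert up through the alias, which spans all backing indices.
  const { body } = await client.search({
    index: '.alerts-observability*', // alias / index pattern
    body: { query: { term: { 'kibana.rac.alert.uuid': alertUuid } } },
  });

  const hit = body.hits.hits[0];
  if (!hit) return;

  // Issue the update against the concrete backing index the document lives
  // in. A rolled-over index is no longer the write index for the alias, but
  // it still accepts updates by _id as long as ILM hasn't made it read-only.
  await client.update({
    index: hit._index,
    id: hit._id,
    body: { doc: { 'kibana.rac.alert.status': 'closed' } },
  });
}
```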

@dgieselaar

For example, if two rule types use the same field name but with different types, it might be a good idea to resolve one way or the other to avoid future pain.

++, my idea was to throw an error on startup if two registries try to register the same field. We could also add a precompile step that checks for any incompatibilities between different registries + ecs mappings.
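
A minimal sketch of that startup check, assuming each registry contributes a flat field map (names are made up for illustration):

```ts
type FieldMap = Record<string, { type: string }>;

// Merge one registry's fields into the combined map, failing fast on any
// type conflict with fields already registered (including the ECS mappings).
function mergeFieldMaps(base: FieldMap, addition: FieldMap, source: string): FieldMap {
  const merged: FieldMap = { ...base };
  for (const [field, def] of Object.entries(addition)) {
    const existing = merged[field];
    if (existing && existing.type !== def.type) {
      throw new Error(
        `Registry "${source}" defines "${field}" as "${def.type}", ` +
          `but it is already registered as "${existing.type}"`
      );
    }
    merged[field] = def;
  }
  return merged;
}

// e.g. at plugin setup time:
// let fields = mergeFieldMaps(ecsFieldMap, securityFieldMap, 'security');
// fields = mergeFieldMaps(fields, observabilityFieldMap, 'observability');
```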

Are you perhaps thinking of data streams? I think plain ILM doesn't have this restriction; I did a quick test on the signal indices and it seems to work.

Hmm, maybe? It's mentioned in the ILM docs [1] and the rollover API docs [2]:

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#ilm-gs-alias-bootstrap

Designates the new index as the write index and makes the bootstrap index read-only.

[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html#rollover-index-api-desc

If the specified rollover target is an alias pointing to a single index, the rollover request:

  • Creates a new index
  • Adds the alias to the new index
  • Removes the alias from the original index

If the specified rollover target is an alias pointing to multiple indices, one of these indices must have is_write_index set to true. In this case, the rollover request:

  • Creates a new index
  • Sets is_write_index to true for the new index
  • Sets is_write_index to false for the original index

I'm not sure what scenarios are applicable here, to be honest.

@weltenwort

@simianhacker @weltenwort do either of you have insights on this specific part of the conversation, re: storing lots of individual documents and aggregating for queries vs updating a single document and querying for that?

Most aspects have already been discussed in this comprehensive thread, so I have little to add. The dual indexing strategy sounds like the most reasonable and least paint-ourselves-into-a-corner solution to me too.

As for query complexity, I think a well-designed mapping can alleviate many of the pains. The situation is not really comparable to the Metrics UI IMHO, since we fully control the mapping here and can choose the field semantics to match our queries.

@andrewvc

@tsg WRT extra fields in the mapping, I can only think of two right now (but we'll probably want more later, of course). The ones I can definitely see us adding are below, with a rough mapping sketch after the list:

  • monitor.name: Human readable (non-unique) name of uptime monitor
  • monitor.id: Unique identifier for associated uptime monitor
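
For concreteness, adding those two fields to a (hypothetical) uptime alerts index could look roughly like this:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function addMonitorFields() {
  await client.indices.putMapping({
    index: '.alerts-observability-uptime', // hypothetical index name
    body: {
      properties: {
        monitor: {
          properties: {
            id: { type: 'keyword' },   // unique identifier of the monitor
            name: { type: 'keyword' }, // human readable, non-unique name
          },
        },
      },
    },
  });
}
```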

CC @dominiqueclarke

@jasonrhodes

This ticket is very long and has a lot of complicated parts to it, but I think @tsg has done a good job of summarizing in the actual ticket. For further discussion of RAC alerts as data, let's open a new ticket or refer to another document. I'm locking this for right now, but please feel free to unlock/re-open if anyone needs to add anything.

@elastic elastic locked as resolved and limited conversation to collaborators May 4, 2021
@peluja1012

Closing in favor of "Alerts as Data" RFC doc.
