Endpoint Telemetry: Agents Metrics + Policy Config / Response #102171

Merged: 37 commits merged into master from pjhampton/endpoint-telemetry on Jun 30, 2021

@pjhampton pjhampton commented Jun 15, 2021

Summary

This PR retrieves and transmits Endpoint agent telemetry when cluster permissions allow it.
There have been auxiliary PRs / Issues opened:

We are currently sharing the telemetry with the Endpoint team. We will be making changes to the final payload.


Implementation

The implementation is not entirely straightforward. Here is a high-level overview of how it works:

  • Every 24 hours, grab an available Kibana worker
  • Aggregate the endpoint metrics data stream over the last 24 hours by unique endpoint id (agent.id), keeping the latest metrics document for each
GET /.ds-metrics-endpoint.metrics*/_search?expand_wildcards=open,hidden
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "endpoint_agents": {
      "terms": { 
        "size": 10000,
        "field": "agent.id" 
      },
      "aggs": {
        "latest_metrics": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
  • If there are no endpoint metrics, assume no endpoint is running and bail; otherwise
  • Get the full list of Fleet agents
  • Iterate through the full list of agents; if an agent has a policy, check that it is of type endpoint, then add it to a cache associating the Fleet agent id with the policy id
  • Aggregate the endpoint policy data stream over the last 24 hours on a unique endpoint id for the latest failed policy response (if one exists)
GET /.ds-metrics-endpoint.policy*/_search?expand_wildcards=open,hidden
{
  "size": 0,
  "query": {
    "match": {
      "Endpoint.policy.applied.status": "failure"
    }
  },
  "aggs": {
    "policy_responses": {
      "terms": {
        "field": "Endpoint.policy.applied.id",
        "size": 10
      },
      "aggs": {
        "latest_response": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
  • Combine the metrics documents with the correct policy configuration and response failure (if applicable)
  • Send to the endpoint-meta telemetry channel
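
The combine step above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the interfaces, the function name `combineEndpointTelemetry`, and the record shape are all hypothetical, and it assumes the latest metrics documents and failed policy responses have already been pulled out of the two aggregations.

```typescript
// Hypothetical, simplified shapes; the real Kibana / Fleet types are richer.
interface EndpointMetricsDoc {
  agentId: string; // Fleet agent id, taken from `agent.id`
  metrics: Record<string, unknown>;
}

interface FailedPolicyResponse {
  policyId: string; // taken from `Endpoint.policy.applied.id`
  response: Record<string, unknown>;
}

interface EndpointTelemetryRecord {
  agentId: string;
  metrics: Record<string, unknown>;
  policyId: string | null;
  policyConfig: Record<string, unknown> | null;
  policyResponse: Record<string, unknown> | null;
}

function combineEndpointTelemetry(
  metricsDocs: EndpointMetricsDoc[],
  agentToPolicy: Map<string, string>, // Fleet agent id -> policy id
  policyConfigs: Map<string, Record<string, unknown>>, // policy id -> config
  failedResponses: FailedPolicyResponse[]
): EndpointTelemetryRecord[] {
  // No endpoint metrics: assume no endpoint is running and bail.
  if (metricsDocs.length === 0) {
    return [];
  }

  // Index the failed responses by policy id for constant-time lookup.
  const failuresByPolicy = new Map<string, Record<string, unknown>>();
  for (const failure of failedResponses) {
    failuresByPolicy.set(failure.policyId, failure.response);
  }

  // One record per metrics document; policy details stay null when the
  // agent has no cached policy or the policy has no failed response.
  return metricsDocs.map((doc) => {
    const policyId = agentToPolicy.get(doc.agentId) ?? null;
    return {
      agentId: doc.agentId,
      metrics: doc.metrics,
      policyId,
      policyConfig: policyId !== null ? (policyConfigs.get(policyId) ?? null) : null,
      policyResponse: policyId !== null ? (failuresByPolicy.get(policyId) ?? null) : null,
    };
  });
}
```

Each returned record would then be a candidate payload for the endpoint-meta channel. Keeping the join in a pure function like this makes it testable without a cluster.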

Follow up

I'm hosting a call week beginning 28/Jun re this telemetry + design.
Let me know if you want me to swing you an invite.

Checklist

@pjhampton pjhampton added the Feature:Telemetry, v8.0.0, Team: SecuritySolution, v7.14.0, and auto-backport labels Jun 15, 2021
@pjhampton pjhampton self-assigned this Jun 15, 2021
@pjhampton pjhampton requested a review from a team as a code owner June 15, 2021 09:36
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@pjhampton pjhampton marked this pull request as draft June 15, 2021 09:36
@pjhampton pjhampton added the release_note:skip label Jun 15, 2021
@pjhampton
Contributor Author

@elasticmachine merge upstream

@pjhampton
Contributor Author

@elasticmachine merge upstream

@pjhampton pjhampton changed the title from "Endpoint Agent Telemetry: Fleet Config + Policy Response" to "Endpoint Telemetry: Agents Metrics + Policy Config / Response" Jun 29, 2021
Member

@Bamieh Bamieh left a comment

LGTM


const endpointPolicyCache = new Map<string, FullAgentPolicyInput>();
for (const policyInfo of fleetAgents.values()) {
if (policyInfo.policy_id !== null && policyInfo.policy_id !== undefined) {
Contributor

Just a personal preference here, but might make it a bit easier to read intent if we store the boolean logic in a variable

const shouldCachePolicy = 
	policyInfo.policy_id !== null &&
	policyInfo.policy_id !== undefined &&
	!endpointPolicyCache.has(policyInfo.policy_id)
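
A minimal sketch of how that variable might slot into the caching loop. The `FleetAgentInfo` and `PolicyConfig` types, the `buildPolicyCache` wrapper, and the `fetchPolicy` lookup are hypothetical stand-ins, not the Fleet types or services used in this PR.

```typescript
// Hypothetical stand-ins for the Fleet types referenced in the PR.
interface FleetAgentInfo {
  policy_id?: string | null;
}
type PolicyConfig = Record<string, unknown>;

function buildPolicyCache(
  fleetAgents: Map<string, FleetAgentInfo>,
  // Hypothetical lookup: returns undefined when the policy is not an endpoint policy.
  fetchPolicy: (policyId: string) => PolicyConfig | undefined
): Map<string, PolicyConfig> {
  const endpointPolicyCache = new Map<string, PolicyConfig>();

  for (const policyInfo of fleetAgents.values()) {
    const policyId = policyInfo.policy_id;

    // Name the condition so the intent reads at a glance.
    const shouldCachePolicy =
      policyId !== null &&
      policyId !== undefined &&
      !endpointPolicyCache.has(policyId);

    if (shouldCachePolicy) {
      // The cast is safe: shouldCachePolicy guarantees policyId is a string.
      const config = fetchPolicy(policyId as string);
      if (config !== undefined) {
        endpointPolicyCache.set(policyId as string, config);
      }
    }
  }

  return endpointPolicyCache;
}
```

With the `has` check folded into the condition, agents that share a policy trigger only one lookup.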

Contributor Author

Yeah, that is slick. Thanks for the feedback!

Contributor

@michaelolo24 michaelolo24 left a comment

Looks good! Thanks for making these changes. We can work on performance improvements on a follow up PR 👍🏾

@pjhampton pjhampton enabled auto-merge (squash) June 29, 2021 19:13
@pjhampton
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @pjhampton @donaherc

@pjhampton pjhampton merged commit eca1460 into master Jun 30, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jun 30, 2021
…c#102171)

* [PH] Initial setup for endpoint task telemetry.

* Refactor / Add daily task for collecting fleet detail / policy resp / EP metrics

* [PH CD] Code walkthrough. Start fetching fleet policy configs.

* [PH] pass in fleet agent service rather than homebrew kuerys.

* [PH] prepare to move away from legacy es client. Get fleet ep agents.

* Fetch agent policy configs.

* Stub ep policy responses.

* Fix CI + Types. Fix dep injection. Reimagine SO client creation.

* Create SO client properly

* Fetch EP Policy responses.

* Fetch EP Policy responses.

* Remove unused import

* Fetch failed policy responses from EP data stream.

* Remove unused imports.

* Combine failed policy responses with policy configs.

* Attach fleet agent + ep agent ids

* Add dedicated channel sender. Temp disable with feature flag.

* Remove ublock from the failed policy response.

* Fetch endpoint metrics.

* Fix bad merge commit.

* Get EP telemetry.

* Record last execution time of endpoint task

* Remove send on demand feature flag.

* Simplify cache conditional.

* Refactor into Promise.allSettled

* Fix type error.

* Bail if there is no endpoint metrics

* Bump interval to 24h.

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Jun 30, 2021
… (#103851)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Pete Hampton <pjhampton@users.noreply.github.com>
jloleysens added a commit to jloleysens/kibana that referenced this pull request Jun 30, 2021
…-png-pdf-report-type

* 'master' of github.com:elastic/kibana: (178 commits)
  [test] Migrating to kbn_archiver from es_archiver - for the Maps app (elastic#103028)
  [Reporting] Reintroduce "ILM policy for managing reporting indices" (elastic#103850)
  [Security Solution][Endpoint] Allow activity log scrolling on small screens (elastic#103852)
  Allow zero (0) to unset unenroll_timeout field (elastic#103790)
  [TSVB] Metric count is depicted as `-` instead of 0 (elastic#103717)
  [Query] Es query/field base (elastic#103177)
  Remove add data button from nav (elastic#103810)
  Fix telemetry advanced setting style (elastic#103838)
  [Transform] Fix default naming and sorting fields suggestion for `top_metrics` agg (elastic#103690)
  [APM] use conventional error rate color for correlations (elastic#103500)
  Endpoint Telemetry: Agents Metrics + Policy Config / Response (elastic#102171)
  [Alerting] Fixed search results are not updated when search term is removed on Rules and Connectors page (elastic#103663)
  fix too many rernders (elastic#103672)
  [APM] Add “Analyze Data” button (elastic#103485)
  [Lens] Fix value popover spacing (elastic#103081)
  [TSVB] Fix TSVB is not reporting all categories of Elasticsearch error (elastic#102926)
  [SECURITY] Adds security links to doc link service (elastic#102676)
  Update dependency @elastic/charts to v31 (elastic#102078)
  [Security Solution][CTI] Investigation time enrichment UI (elastic#103383)
  Adds ECS guide to doc links service (elastic#102246)
  ...

# Conflicts:
#	x-pack/plugins/reporting/public/share_context_menu/register_pdf_png_reporting.tsx
@spalger spalger added the v7.15.0 label Jul 7, 2021
@pjhampton pjhampton deleted the pjhampton/endpoint-telemetry branch February 15, 2022 09:19
Labels
auto-backport, Feature:Telemetry, release_note:skip, Team: SecuritySolution, v7.14.0, v7.15.0, v8.0.0
7 participants