
[Monitoring] Only do a single date_histogram agg for get_nodes calls #43481

Merged

Conversation

chrisronline
Contributor

@chrisronline chrisronline commented Aug 16, 2019

Resolves #43477

This PR changes the query executed when we fetch the nodes listing page from within the Stack Monitoring UI.

Currently, the query runs multiple sub date_histogram aggregations when only one is necessary. Unfortunately, the change is more complicated than it sounds: the code flow involved touches multiple shared areas of code that are not easily changed to support this PR. As a result, we do a bit of shuffling and unshuffling to ensure we can perform a single date_histogram query while still returning the same, expected structure to the rest of the monitoring code.

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💚 Build Succeeded

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@chrisronline chrisronline marked this pull request as ready for review August 20, 2019 18:07
* This work stemmed from this issue: https://github.com/elastic/kibana/issues/43477
*
* Historically, the `get_nodes` function created an aggregation with multiple sub `date_histogram`
* aggregations for each metric aggregation. From a top down view, the entire aggregations look liked:
Contributor


Looked like :)

@cachedout
Contributor

This looks reasonably straightforward and the gains are pretty obvious. Are you thinking that down the road we'd continue by changing the rest of the code to avoid having to do this convert/unconvert strategy?

@chrisronline
Contributor Author

This looks reasonably straightforward and the gains are pretty obvious. Are you thinking that down the road we'd continue by changing the rest of the code to avoid having to do this convert/unconvert strategy?

I don't think it's that simple.

Historically, the aggregation part of the query looked like this (I intentionally removed a few of the aggregations to shorten it):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "node_cgroup_quota": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "usage": {
              "max": {
                "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
              }
            },
            "periods": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
              }
            },
            "quota": {
              "min": {
                "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
              }
            },
            "usage_deriv": {
              "derivative": {
                "buckets_path": "usage",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "periods_deriv": {
              "derivative": {
                "buckets_path": "periods",
                "gap_policy": "skip",
                "unit": "1s"
              }
            }
          }
        },
        "node_cgroup_throttled": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "metric": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.time_throttled_nanos"
              }
            },
            "metric_deriv": {
              "derivative": {
                "buckets_path": "metric",
                "unit": "1s"
              }
            }
          }
        }
      }
    }
  }
}

The code then takes out the named aggregation buckets, which makes each aggregation structure look like (again, removing redundant parts):

"node_cgroup_throttled": {
  "buckets": [
    {
      "key_as_string": "2019-08-21T14:27:00.000Z",
      "key": 1566397620000,
      "doc_count": 3,
      "metric": {
        "value": null
      },
      "metric_deriv": {
        "value": null,
        "normalized_value": null
      }
    },
    {
      "key_as_string": "2019-08-21T14:27:30.000Z",
      "key": 1566397650000,
      "doc_count": 3,
      "metric": {
        "value": null
      },
      "metric_deriv": {
        "value": null,
        "normalized_value": null
      }
    }
  ]
}

The key part is that, under each named aggregation, the structure is identical (or rather consistent): it has keys like metric and metric_deriv, or, in some cases like node_cgroup_quota, extra aggregated metrics like usage and periods. Either way, in the response, the structure is normalized under the named aggregation into a consistent format.

This consistent format is expected in other areas of the code. I'll use the node_cgroup_quota example again. Here is the code that reads one of those normalized metric aggregation structures.

Some other examples of shared code expecting this normalized structure:
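
(The permalinked snippets aren't reproduced here; roughly speaking, though, a consumer of this normalized shape looks something like the following sketch. The names are illustrative, not the actual Kibana helpers.)

interface MetricBucket {
  key: number;
  key_as_string: string;
  doc_count: number;
  metric?: { value: number | null };
  metric_deriv?: { value: number | null; normalized_value: number | null };
}

// Illustrative only: turn the normalized `{ buckets: [...] }` shape into
// [timestamp, value] pairs, preferring the derivative for rate-style metrics.
function bucketsToSeries(
  buckets: MetricBucket[],
  useDerivative: boolean
): Array<[number, number | null]> {
  return buckets.map((bucket): [number, number | null] => {
    const value = useDerivative
      ? bucket.metric_deriv?.normalized_value
      : bucket.metric?.value;
    return [bucket.key, value ?? null];
  });
}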

Now, let's look at how this changes with a top-level date_histogram aggregation.

That might look like (again, I removed redundant parts):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "node_cgroup_quota": {
              "usage": {
                "max": {
                  "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
                }
              },
              "periods": {
                "max": {
                  "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
                }
              },
              "quota": {
                "min": {
                  "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
                }
              },
              "usage_deriv": {
                "derivative": {
                  "buckets_path": "usage",
                  "gap_policy": "skip",
                  "unit": "1s"
                }
              },
              "periods_deriv": {
                "derivative": {
                  "buckets_path": "periods",
                  "gap_policy": "skip",
                  "unit": "1s"
                }
              }
            },
            "node_cgroup_throttled": {
              "metric": {
                "max": {
                  "field": "node_stats.os.cgroup.cpu.stat.time_throttled_nanos"
                }
              },
              "metric_deriv": {
                "derivative": {
                  "buckets_path": "metric",
                  "unit": "1s"
                }
              }
            }
          }
        }
      }
    }
  }
}

This won't work, because our named aggregations no longer have a sub-aggregation of their own (it used to be the date_histogram), so we'd need to remove the named aggregations entirely, which would make our structure look like this (NOT shortened):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "usage": {
              "max": {
                "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
              }
            },
            "periods": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
              }
            },
            "quota": {
              "min": {
                "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
              }
            },
            "usage_deriv": {
              "derivative": {
                "buckets_path": "usage",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "periods_deriv": {
              "derivative": {
                "buckets_path": "periods",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "metric": {
              "max": {
                "field": "node_stats.fs.total.available_in_bytes"
              }
            },
            "metric_deriv": {
              "derivative": {
                "buckets_path": "metric",
                "unit": "1s"
              }
            }
          }
        }
      }
    }
  }
}

Even though we are supposed to have six named aggregations, we only see one metric and one metric_deriv, because we can't have multiple keys with the same name in a JSON object.
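
(To illustrate with plain JSON parsing; this snippet is just a demonstration, not code from the PR:)

// When the same key appears twice, only the last value survives.
const collided = JSON.parse(
  '{"metric": {"max": {"field": "a"}}, "metric": {"max": {"field": "b"}}}'
);
console.log(collided.metric.max.field); // "b" -- the first "metric" is gone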

This is the main issue.

We need to convert the structure to maintain the named aggregation, but rather than keeping it as a higher-level aggregation, we prefix each metric aggregation's name with the named aggregation, producing something like node_cgroup_quota_periods.

Of course, once we do that, we need to undo that structure on the way back out, as the rest of the shared code (linked earlier) would not know to look for, or care about, the prefix naming.
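
To make the convert/unconvert idea a bit more concrete, here is a minimal sketch of the two steps (illustrative only; these helpers are hypothetical and not the actual code in this PR):

type AggsBody = Record<string, any>;

// "Convert": flatten each metric's sub-aggregations under one date_histogram by
// prefixing every sub-aggregation name with its metric name, e.g. `usage` under
// `node_cgroup_quota` becomes `node_cgroup_quota_usage`.
function prefixMetricAggs(metricAggs: Record<string, AggsBody>): AggsBody {
  const flattened: AggsBody = {};
  for (const [metricName, subAggs] of Object.entries(metricAggs)) {
    for (const [aggName, body] of Object.entries(subAggs)) {
      const prefixed: AggsBody = { ...body };
      // Pipeline aggregations (the derivatives) reference sibling aggregations
      // by name, so their buckets_path has to be rewritten to the prefixed name.
      if (prefixed.derivative) {
        prefixed.derivative = {
          ...prefixed.derivative,
          buckets_path: `${metricName}_${prefixed.derivative.buckets_path}`,
        };
      }
      flattened[`${metricName}_${aggName}`] = prefixed;
    }
  }
  return flattened;
}

// "Unconvert": regroup one `by_date` bucket back into the historical per-metric
// shape so the shared downstream code still sees `usage`, `usage_deriv`, etc.
function unprefixBucket(bucket: AggsBody, metricName: string): AggsBody {
  const prefix = `${metricName}_`;
  const restored: AggsBody = {
    key: bucket.key,
    key_as_string: bucket.key_as_string,
    doc_count: bucket.doc_count,
  };
  for (const [key, value] of Object.entries(bucket)) {
    if (key.startsWith(prefix)) {
      restored[key.slice(prefix.length)] = value;
    }
  }
  return restored;
}

With that shape, a single date_histogram (a by_date aggregation, say) can carry all of the metric aggregations side by side, and each response bucket can be unshuffled back into the per-metric structure the shared code expects.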

I'm all for a better approach if anyone has any thoughts, but I didn't see a path forward outside of what I did in the PR.

@cachedout
Contributor

Thanks for the explanation.

I thought about this, and I can't see a better way given how the application is structured right now. Another approach might be to back away from trying to fetch a set of metrics for a set of nodes in a single ES request and instead break that up into multiple smaller requests, but I'm pretty wary of going down that road, especially since it's hard to believe it wouldn't be dramatically less performant for this case.

@elasticmachine
Contributor

💚 Build Succeeded

@igoristic
Contributor

Awesome stuff @chrisronline! This fix alone will have a great impact on performance

@elasticmachine
Contributor

💚 Build Succeeded

@chrisronline
Contributor Author

@cachedout I think this should be all ready for you to review again when you get the chance


@cachedout cachedout left a comment


I think this is good. 👍 on the thorough documentation and explanation in the comments.

@elasticmachine
Contributor

💔 Build Failed

@chrisronline
Contributor Author

retest

@elasticmachine
Contributor

💚 Build Succeeded

@chrisronline chrisronline merged commit 3489274 into elastic:master Aug 27, 2019
@chrisronline chrisronline deleted the monitoring/no_more_date_histograms branch August 27, 2019 18:13
chrisronline added a commit to chrisronline/kibana that referenced this pull request Aug 27, 2019
…lastic#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit to chrisronline/kibana that referenced this pull request Aug 27, 2019
…lastic#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 27, 2019
…43481) (#44133)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 27, 2019
…43481) (#44134)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 28, 2019
…calls (#43481) (#44137)

* [Monitoring] Only do a single date_histogram agg for get_nodes calls (#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export

* Update snapshots

* normalize this across branches

* This is just interval in 6.8
@chrisronline
Contributor Author

Backport:

7.x: d4f0efc
7.3: 87ec6ed
6.8: c814843

@cachedout
Contributor

As a related side-note, the ES team is tracking a severe performance regression in date_histogram in 7.3.

@chrisronline
Contributor Author

@cachedout Nice find! I'm not quite sure how this information relates to this PR, though. Do you mind providing more details about the connection?

@cachedout
Contributor

I only put it here because it was one of the places where we've been working with date_histogram aggs recently and I figured that if somebody searched for "monitoring" and date_histogram they'd come across this issue.
