
[Monitoring] Only do a single date_histogram agg for get_nodes calls #43481

Merged

Conversation

chrisronline
Contributor

@chrisronline chrisronline commented Aug 16, 2019

Resolves #43477

This PR changes the query executed when we fetch the nodes listing page from within the Stack Monitoring UI.

Currently, the query runs multiple sub date_histogram aggregations when only one is necessary. Unfortunately, the change is more complicated than it sounds: the code flow involved touches multiple shared areas of code that are not easily changed to support this PR. As a result, we do a bit of shuffling and unshuffling to ensure we can perform a single date_histogram query while still returning the same, expected structure to the rest of the monitoring code.

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💚 Build Succeeded

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@chrisronline chrisronline marked this pull request as ready for review August 20, 2019 18:07
* This work stemmed from this issue: https://github.com/elastic/kibana/issues/43477
*
* Historically, the `get_nodes` function created an aggregation with multiple sub `date_histogram`
* aggregations for each metric aggregation. From a top down view, the entire aggregations look liked:
Contributor


Looked like :)

@cachedout
Contributor

This looks reasonably straightforward and the gains are pretty obvious. Are you thinking that down the road we'd continue by changing the rest of the code to avoid having to do this convert/unconvert strategy?

@chrisronline
Contributor Author

This looks reasonably straightforward and the gains are pretty obvious. Are you thinking that down the road we'd continue by changing the rest of the code to avoid having to do this convert/unconvert strategy?

I don't think it's that simple.

Historically, the aggregation part of the query looked like this (I intentionally removed a few of the aggregations to shorten it):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "node_cgroup_quota": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "usage": {
              "max": {
                "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
              }
            },
            "periods": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
              }
            },
            "quota": {
              "min": {
                "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
              }
            },
            "usage_deriv": {
              "derivative": {
                "buckets_path": "usage",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "periods_deriv": {
              "derivative": {
                "buckets_path": "periods",
                "gap_policy": "skip",
                "unit": "1s"
              }
            }
          }
        },
        "node_cgroup_throttled": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "metric": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.time_throttled_nanos"
              }
            },
            "metric_deriv": {
              "derivative": {
                "buckets_path": "metric",
                "unit": "1s"
              }
            }
          }
        }
      }
    }
  }
}

The code then takes out the named aggregation buckets, which makes each aggregation structure look like (again, removing redundant parts):

"node_cgroup_throttled": {
  "buckets": [
    {
      "key_as_string": "2019-08-21T14:27:00.000Z",
      "key": 1566397620000,
      "doc_count": 3,
      "metric": {
        "value": null
      },
      "metric_deriv": {
        "value": null,
        "normalized_value": null
      }
    },
    {
      "key_as_string": "2019-08-21T14:27:30.000Z",
      "key": 1566397650000,
      "doc_count": 3,
      "metric": {
        "value": null
      },
      "metric_deriv": {
        "value": null,
        "normalized_value": null
      }
    }
  ]
}

The key part is that, under each named aggregation, the structure is identical (or rather consistent): it has keys like metric and metric_deriv, or, in some cases like node_cgroup_quota, extra aggregated metrics like usage and periods. Either way, in the response, the structure is normalized under the named aggregation into a consistent format.

This consistent format is expected in other areas of the code. I'll use the node_cgroup_quota example again. Here is the code that reads one of those normalized metric aggregation structures.

Some other examples of shared code expecting this normalized structure:
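
(The permalinked snippets aren't reproduced here; roughly speaking, though, a consumer of this normalized shape looks something like the following sketch. The names are illustrative, not the actual Kibana helpers.)

interface MetricBucket {
  key: number;
  key_as_string: string;
  doc_count: number;
  metric?: { value: number | null };
  metric_deriv?: { value: number | null; normalized_value: number | null };
}

// Illustrative only: turn the normalized `{ buckets: [...] }` shape into
// [timestamp, value] pairs, preferring the derivative for rate-style metrics.
function bucketsToSeries(
  buckets: MetricBucket[],
  useDerivative: boolean
): Array<[number, number | null]> {
  return buckets.map((bucket): [number, number | null] => {
    const value = useDerivative
      ? bucket.metric_deriv?.normalized_value
      : bucket.metric?.value;
    return [bucket.key, value ?? null];
  });
}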

Now, let's look at how this changes with a top-level date_histogram aggregation.

That might look like (again, I removed redundant parts):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "node_cgroup_quota": {
              "usage": {
                "max": {
                  "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
                }
              },
              "periods": {
                "max": {
                  "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
                }
              },
              "quota": {
                "min": {
                  "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
                }
              },
              "usage_deriv": {
                "derivative": {
                  "buckets_path": "usage",
                  "gap_policy": "skip",
                  "unit": "1s"
                }
              },
              "periods_deriv": {
                "derivative": {
                  "buckets_path": "periods",
                  "gap_policy": "skip",
                  "unit": "1s"
                }
              }
            },
            "node_cgroup_throttled": {
              "metric": {
                "max": {
                  "field": "node_stats.os.cgroup.cpu.stat.time_throttled_nanos"
                }
              },
              "metric_deriv": {
                "derivative": {
                  "buckets_path": "metric",
                  "unit": "1s"
                }
              }
            }
          }
        }
      }
    }
  }
}

This won't work, because our named aggregations no longer have a sub-aggregation of their own (it used to be the date_histogram), so we'd need to remove the named aggregations entirely, which would make our structure look like this (NOT shortened):

{
  "aggs": {
    "nodes": {
      "terms": {
        "field": "source_node.uuid",
        "size": 10000
      },
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          },
          "aggs": {
            "usage": {
              "max": {
                "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
              }
            },
            "periods": {
              "max": {
                "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
              }
            },
            "quota": {
              "min": {
                "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
              }
            },
            "usage_deriv": {
              "derivative": {
                "buckets_path": "usage",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "periods_deriv": {
              "derivative": {
                "buckets_path": "periods",
                "gap_policy": "skip",
                "unit": "1s"
              }
            },
            "metric": {
              "max": {
                "field": "node_stats.fs.total.available_in_bytes"
              }
            },
            "metric_deriv": {
              "derivative": {
                "buckets_path": "metric",
                "unit": "1s"
              }
            }
          }
        }
      }
    }
  }
}

Even though we are supposed to have six named aggregations, we only see one metric and one metric_deriv, because we can't have multiple keys with the same name in a JSON object.
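
(To illustrate with plain JSON parsing; this snippet is just a demonstration, not code from the PR:)

// When the same key appears twice, only the last value survives.
const collided = JSON.parse(
  '{"metric": {"max": {"field": "a"}}, "metric": {"max": {"field": "b"}}}'
);
console.log(collided.metric.max.field); // "b" -- the first "metric" is gone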

This is the main issue.

We need to convert the structure to maintain the named aggregation, but rather than keeping it as a higher-level aggregation, we prefix each metric aggregation's name with the named aggregation, producing something like node_cgroup_quota_periods.

Of course, once we do that, we need to undo that structure on the way back out, as the rest of the shared code (linked earlier) would not know to look for, or care about, the prefix naming.
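
To make the convert/unconvert idea a bit more concrete, here is a minimal sketch of the two steps (illustrative only; these helpers are hypothetical and not the actual code in this PR):

type AggsBody = Record<string, any>;

// "Convert": flatten each metric's sub-aggregations under one date_histogram by
// prefixing every sub-aggregation name with its metric name, e.g. `usage` under
// `node_cgroup_quota` becomes `node_cgroup_quota_usage`.
function prefixMetricAggs(metricAggs: Record<string, AggsBody>): AggsBody {
  const flattened: AggsBody = {};
  for (const [metricName, subAggs] of Object.entries(metricAggs)) {
    for (const [aggName, body] of Object.entries(subAggs)) {
      const prefixed: AggsBody = { ...body };
      // Pipeline aggregations (the derivatives) reference sibling aggregations
      // by name, so their buckets_path has to be rewritten to the prefixed name.
      if (prefixed.derivative) {
        prefixed.derivative = {
          ...prefixed.derivative,
          buckets_path: `${metricName}_${prefixed.derivative.buckets_path}`,
        };
      }
      flattened[`${metricName}_${aggName}`] = prefixed;
    }
  }
  return flattened;
}

// "Unconvert": regroup one `by_date` bucket back into the historical per-metric
// shape so the shared downstream code still sees `usage`, `usage_deriv`, etc.
function unprefixBucket(bucket: AggsBody, metricName: string): AggsBody {
  const prefix = `${metricName}_`;
  const restored: AggsBody = {
    key: bucket.key,
    key_as_string: bucket.key_as_string,
    doc_count: bucket.doc_count,
  };
  for (const [key, value] of Object.entries(bucket)) {
    if (key.startsWith(prefix)) {
      restored[key.slice(prefix.length)] = value;
    }
  }
  return restored;
}

With that shape, a single date_histogram (a by_date aggregation, say) can carry all of the metric aggregations side by side, and each response bucket can be unshuffled back into the per-metric structure the shared code expects.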

I'm all for a better approach if anyone has any thoughts, but I didn't see a path forward outside of what I did in the PR.

@cachedout
Contributor

Thanks for the explanation.

I thought about this, and I can't see a better way given how the application is structured right now. Another approach might be to back away from trying to fetch a set of metrics for a set of nodes in a single ES request and instead break that up into multiple smaller requests, but I'm pretty wary of going down that road, especially since it's hard to believe it wouldn't be dramatically less performant for this case.

@elasticmachine
Contributor

💚 Build Succeeded

@igoristic
Contributor

Awesome stuff @chrisronline! This fix alone will have a great impact on performance

@elasticmachine
Contributor

💚 Build Succeeded

@chrisronline
Contributor Author

@cachedout I think this should be all ready for you to review again when you get the chance


@cachedout cachedout left a comment


I think this is good. 👍 on the thorough documentation and explanation in the comments.

@elasticmachine
Contributor

💔 Build Failed

@chrisronline
Contributor Author

retest

@elasticmachine
Contributor

💚 Build Succeeded

@chrisronline chrisronline merged commit 3489274 into elastic:master Aug 27, 2019
@chrisronline chrisronline deleted the monitoring/no_more_date_histograms branch August 27, 2019 18:13
chrisronline added a commit to chrisronline/kibana that referenced this pull request Aug 27, 2019
…lastic#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit to chrisronline/kibana that referenced this pull request Aug 27, 2019
…lastic#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 27, 2019
…43481) (#44133)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 27, 2019
…43481) (#44134)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export
chrisronline added a commit that referenced this pull request Aug 28, 2019
…calls (#43481) (#44137)

* [Monitoring] Only do a single date_histogram agg for get_nodes calls (#43481)

* I think this is working now

* Add a way to uncovert, and then fix tests

* Remove unnecessary export

* Update snapshots

* normalize this across branches

* This is just interval in 6.8
@chrisronline
Contributor Author

Backport:

7.x: d4f0efc
7.3: 87ec6ed
6.8: c814843

@cachedout
Contributor

As a related side-note, the ES team is tracking a severe performance regression in date_histogram in 7.3.

@chrisronline
Contributor Author

@cachedout Nice find! I'm not quite sure how this information relates to this PR, though. Do you mind providing more details about the connection?

@cachedout
Contributor

I only put it here because it was one of the places where we've been working with date_histogram aggs recently and I figured that if somebody searched for "monitoring" and date_histogram they'd come across this issue.
