Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Observability] Load hasData call asynchronously #80644

Merged
merged 21 commits into from
Nov 23, 2020

Conversation

cauemarcondes
Copy link
Contributor

This PR intends to improve the performance of the Observability page.

The current architecture waits for all hasData requests to be completed to then verify to which page it should redirect the user.

The new architecture waits until at least one app returns true to the hasData request and immediately redirects the user to the correct page, while the remaining calls are being finished in background.
Before:
obs-perf-before 2
After:
obs-perf-after 2

Before:
obs-perf-before
After:
obs-perf-after

@cauemarcondes cauemarcondes added release_note:skip Skip the PR/issue when compiling release notes v7.11.0 labels Oct 15, 2020
@cauemarcondes cauemarcondes marked this pull request as ready for review October 15, 2020 12:48
@cauemarcondes cauemarcondes requested review from a team as code owners October 15, 2020 12:48
@cauemarcondes cauemarcondes requested a review from a team October 15, 2020 12:48
@botelastic botelastic bot added Team:APM All issues that need APM UI Team support Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability labels Oct 15, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@elasticmachine
Copy link
Contributor

Pinging @elastic/uptime (Team:uptime)

() => {
apps.forEach(async (app) => {
try {
const params =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the UX app have a different interface for its hasData handler? And looks like it will not be updated because it's not a dependency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @shahzad31 is the right person to answer this question.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @justinkambic knows?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgieselaar i believe it was a compromise between performance and consistency. For UX case we are displaying data for the service with most traffic, so in the same query where we are getting if it has rum data , it is also getting service name with most traffic via terms aggs.

I believe we need better types here to clear this to avoid this if conditions. Or maybe we can get service name via a different query or maybe make two es queries in data request. IMO consistency here isn't worth an additional query.

There was also discussion about this in original PR where this was added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would expect that to be unlikely. The fact that you're running a terms aggregation to get the service name means that you cannot terminate early, because all documents have to be collected and aggregated (within the given time range). That will make your hasData call significantly slower. I don't think it's worth it, in fact, it's probably worse in terms of performance. Instead of running things in parallel, and getting certain parts of the data when they're ready, it's now one long blocking request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be a case that terminate early flag isn't being passed correctly to es request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dev-next doesn't have a lot of data. IIRC, half of our users have more data, and significantly less nodes. So, if it's slow on dev-next it's most likely an issue, but if it's fast, it doesn't really mean that it will be fast for our users. In this example, you'll likely have relatively high overhead of security checks etc, which happen for each request, so it's hard to tell how efficient the search is based on network timings.

I think it's good here to take a step back and analyze what's happening with this search:

{
  size: 0,
  query: {
    bool: {
      filter: [{ term: { [TRANSACTION_TYPE]: TRANSACTION_PAGE_LOAD } }],
    },
  },
  aggs: {
    services: {
      filter: {
        range: rangeFilter(start, end),
      },
      aggs: {
        mostTraffic: {
          terms: {
            field: SERVICE_NAME,
            size: 1,
          },
        },
      },
    },
  },
}
  • First all page load transactions in all indices are collected, because there is no range query on @timestamp.
  • Then, a filter aggregation runs to select only documents that match the given time range.
  • Then, another aggregation runs on all the documents for the given time range. size only removes the bucket from the response - it will still be calculated.

All these steps run sequentially (per document).

Especially the first step could be pretty slow for users that have a lot of RUM data.

If you separate this however, you'd have two searches that you can run in parallel, plus, even the total duration will be faster.

  • For the hasData search, you can set terminate_after to 1, meaning it will exit as soon as it has found a document. The performance cost of this search is negligible in most cases.
  • For the getServiceName search, you can simply add a range query in your bool clause. This will remove the need for a filter aggregation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example from DPP:

  • one dual-purpose search: 2000ms
  • hasData: 40ms
  • getServiceName: 90ms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Queries:

GET apm-*/_search?request_cache=false
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transaction.type": "page-load"
          }
        }
      ]
    }
  },
  "aggs": {
    "services": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "2020-05-12T10:00:00.000Z",
            "lt": "2020-05-12T11:00:00.000Z"
          }
        }
      },
      "aggs": {
        "mostTraffic": {
          "terms": {
            "field": "service.name",
            "size": 1
          }
        }
      }
    }
  }
}

GET apm-*/_search?request_cache=false
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transaction.type": "page-load"
          }
        }
      ]
    }
  },
  "size": 0,
  "track_total_hits": true,
  "terminate_after": 1
}

GET apm-*/_search?request_cache=false
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transaction.type": "page-load"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2020-05-12T10:00:00.000Z",
              "lt": "2020-05-12T11:00:00.000Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "mostTraffic": {
      "terms": {
        "field": "service.name",
        "size": 1
      }
    }
  }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

}

export function DataSections({ bucketSize, hasData, absoluteTime, relativeTime }: Props) {
return (
<EuiFlexItem grow={false}>
<EuiFlexGroup direction="column">
{hasData?.infra_logs && (
{hasData?.infra_logs?.hasData && (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove these statements, and add a hidden prop on the EUIFlexItem component. That will ensure the component itself will start loading the data.

Copy link
Contributor Author

@cauemarcondes cauemarcondes Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure the hidden property works? I haven't found any description of it here https://elastic.github.io/eui/#/layout/flex, and I tried it and it always shows the child element.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a native HTML attribute. It should work, but if it doesn't, there might be some styles from EUI overriding it, or React special-casing it. We don't have to use hidden, but the point is to render the component so it starts the request, but hide it from the UI.

Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, looks good. I agree with some of the previous comments that we can probably improve code clarity a bit.

@cauemarcondes
Copy link
Contributor Author

@elasticmachine merge upstream

@cauemarcondes
Copy link
Contributor Author

Looks like there are two issues still outstanding, namely that the hasData interface for the UX app has a different interface,

@dgieselaar @shahzad31 since I'm going to be on PTO for the next weeks, I'd rather doing this change in another PR. I also want to refactor the fetchData, today it's called inside each component, I want to change it to also use the new Context created.

and the components still wait until hasData has returned for at least one app

I'm going to fix it.

@cauemarcondes
Copy link
Contributor Author

retest

@cauemarcondes
Copy link
Contributor Author

@elasticmachine merge upstream

Comment on lines 47 to 49
if (app !== 'alert') {
const params =
app === 'ux' ? { absoluteTime: { start: absStart, end: absEnd } } : undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the special handling for alert and ux? I thought we had the same interface for all apps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ux there's a tread where @dgieselaar and @shahzad31 discussed it, #80644 (comment). To remove it is necessary some refactor on the UX side, which would be better if we could fo it in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For alert, I can solve it if I manually register the alert service with registerDataHandler. WDYT?

Comment on lines +23 to +36
if (id === 'alert') {
const { status, hasData: alerts } = hasData.alert || {};
return (
status === FETCH_STATUS.FAILURE ||
(status === FETCH_STATUS.SUCCESS && (alerts as Alert[]).length === 0)
);
} else {
const app = hasData[id];
if (app) {
const _hasData = id === 'ux' ? (app.hasData as UXHasDataResponse)?.hasData : app.hasData;
return app.status === FETCH_STATUS.FAILURE || !_hasData;
}
}
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get rid of the special logic for alert and ux here if possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the alert logic based on my comment here. Ux will demand more changes with I'm not familiar to, maybe we could leave it like this for now and improve it in another PR.

}
}

fetchAlerts();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the right place to fetch alerts in the HasDataContextProvider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought would be nice to centralize all requests to fill the page in a single place. Alert would have behaved in the same way as the other apps if I registered it. Maybe that is the change I should do.

Copy link
Member

@sorenlouv sorenlouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm. Just a few comments

@cauemarcondes
Copy link
Contributor Author

@dgieselaar @sqren @shahzad31 I don't want to spend more time on this PR, it has already consumed me a lot. I have improved the way hasData is fetched, now it doesn't wait until all APPs return to navigate, it waits until the first app that returns true. That has already improved the loading time.

I agree we must fix the UX hasData, to use the same signature like the others, but it can be done on another PR since this is already big.

Copy link
Member

@sorenlouv sorenlouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@weltenwort weltenwort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs UI changes LGTM

and nice job at improving the responsiveness of the overview 👏

@cauemarcondes
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
observability 108 111 +3

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
observability 160.7KB 169.1KB +8.4KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
observability 72.1KB 68.6KB -3.5KB

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@cauemarcondes cauemarcondes merged commit ac73b6a into elastic:master Nov 23, 2020
@cauemarcondes cauemarcondes deleted the obs-improve-perf branch November 23, 2020 10:58
cauemarcondes added a commit to cauemarcondes/kibana that referenced this pull request Nov 23, 2020
* obs perf

* fixing unit tests

* fixing ts issues

* fixing empty state

* addressing pr comments

* addressing pr comments

* fixing TS issue

* fixing some stuff

* refactoring

* fixing ts issues and unit tests

* addressing PR comments

* fixing TS issues

* fixing eslint issue

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
# Conflicts:
#	x-pack/plugins/observability/public/pages/overview/index.tsx
gmmorris added a commit to rudolf/kibana that referenced this pull request Nov 23, 2020
* master: (67 commits)
  [Observability] Load hasData call asynchronously (elastic#80644)
  Implement AnonymousAuthenticationProvider. (elastic#79985)
  Deprecate `visualization:colorMapping` advanced setting (elastic#83372)
  [TSVB] [Rollup] Table tab not working with rollup indexes (elastic#83635)
  Revert "[Search] Search batching using bfetch (elastic#83418)" (elastic#84037)
  skip flaky suite (elastic#83772)
  skip flaky suite (elastic#69849)
  create kbn-legacy-logging package (elastic#77678)
  [Search] Search batching using bfetch (elastic#83418)
  [Security Solution] Refactor Timeline flyout to take a full page (elastic#82033)
  Drop use of console-stamp (elastic#83922)
  skip flaky suite (elastic#84011 , elastic#84012)
  Fixed usage of `isReady` for usage collection of alerts and actions (elastic#83760)
  [maps] support URL drilldowns (elastic#83732)
  Revert "Added default dedupKey value as an {{alertInstanceId}} to provide grouping functionality for PagerDuty incidents. (elastic#83226)"
  [code coverage] Update jest config to collect more data (elastic#83804)
  Added default dedupKey value as an {{alertInstanceId}} to provide grouping functionality for PagerDuty incidents. (elastic#83226)
  [Security Solution] Give notice when endpoint policy is out of date (elastic#83469)
  [Security Solution] Sync url state on any changes to query string (elastic#83314)
  [CI] Initial TeamCity implementation (elastic#81043)
  ...
cauemarcondes added a commit that referenced this pull request Nov 23, 2020
* obs perf

* fixing unit tests

* fixing ts issues

* fixing empty state

* addressing pr comments

* addressing pr comments

* fixing TS issue

* fixing some stuff

* refactoring

* fixing ts issues and unit tests

* addressing PR comments

* fixing TS issues

* fixing eslint issue

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
# Conflicts:
#	x-pack/plugins/observability/public/pages/overview/index.tsx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:APM All issues that need APM UI Team support Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability v7.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants