Trace Elasticsearch query to the origin #97934
Pinging @elastic/kibana-core (Team:Core)
To reverse the question: which additional information would you like us to send to ES here? I don't think we can automate/detect more than the calling plugin from Core. Do you have something particular in mind? Should we let callers of the ES API send their own custom 'meta' headers in addition to the headers Kibana already sends?
Do you want Kibana to forward them to the Elasticsearch server?
To be honest, I was intentionally being vague here because I don't have a solution in mind. Users are primarily looking for a way to answer the question of "What in the world is causing this expensive query to be executed?". I'm open to all solutions that fully or partially allow this question to be answered, depending on their feasibility and maintainability.
I believe we need to be more specific than the "feature", if I understand that correctly. It would be incredibly useful to pinpoint the "thing" that called Elasticsearch, like "Vijay's Visualization" in dashboard "foo" vs. "Lens". @kobelb Can you clarify the level of detail you are proposing for "feature"? The goal is for an admin looking at an ES slow/expensive query to rapidly find the culprit in Kibana so they can address the stability of their environment.
For sure, happy to revise the original description to make this clearer.
I agree it would be useful. However, how does this play with PII concerns? Sending the name of the visualization/dashboard, and/or even the name of the user who performed the action, could unintentionally leak private information. What do you think? @kobelb @VijayDoshi
We could just send the id of the SO instead (when working with SOs). Note that tracking the feature/app that initiated SO requests doesn't seem like the part with the most value. I would think that it's currently harder to correlate requests performed against data indices.
@mikecote my understanding is that alerts are not SOs in the traditional sense; would we need special handling for alerts if we used SO ids?
If we're using the new terminology: correct, alerts are not SOs, but they do have unique identifiers. We can, however, use the rule ID because rules are SOs.
@VijayDoshi ^^
The challenging part here is not only the async nature of JS but also the need to trace an operation across different environments: browser --> (proxy) --> Kibana server --> (proxy) --> Elasticsearch. Ideally, there would be a single mechanism covering all the aforementioned use cases. It looks like a classic Observability problem, where we want to answer questions such as:
Given that we already ship the built-in APM agent in the product, how realistic does it look to use it for debugging? It's not a simple task and involves several steps (I hope I didn't miss anything):
But it sounds like we could provide a debug tool that automates all these steps.
Almost all the requests to the Kibana server go through
The number of sources is more diverse on the server-side: SO client, ES client, data plugin, task manager. With multiple abstraction layers in between, passing the context through several layers might be tedious. That's why I'm thinking about reusing the APM agent infrastructure, or at least borrowing implementation ideas from it.
Not sure all that information is needed. We are interested in an operation source (for example, …).
That's a great question and one that'll require additional effort to get an answer to. I'm on board with investigating the feasibility of using APM for these types of situations, as it's the ideal solution for distributed tracing. @joshdover and I just recently discussed this, and there's a cross-team effort that we're trying to get going to answer this question.
Long-term, absolutely! For the time being, I think we should figure out whether we're comfortable recommending that users enable APM integration in Kibana and Elasticsearch, pointed at a dedicated cluster, to diagnose these difficult issues.
We're certainly comfortable enabling our agents in any RUM + Node.js application. That said, the results are probably not well suited for user consumption as currently instrumented - the results are most useful to the developers of the application, mostly us in the case of Kibana. I'd be happy to help coordinate pulling the right folks in from APM depending on your goals. @dgieselaar in particular has put a lot of effort into improving both Kibana instrumentation and Node agent capabilities and can likely provide expertise across topics.
I think we need to be adding human-readable transactions and spans for key user interactions. For example, visualization renders, dashboard renders, async search, discover search, alert executions, action executions, etc. With these in place, we should be able to give users much more detailed information about what user-created data may be causing problems in their cluster. Some folks from the Kibana Core team and APM team are meeting today to discuss what's possible and craft a plan of next steps for leveraging APM more widely. I'll report back here with our results.
While APM doesn't yet provide a curated UI for doing this type of analysis, the raw data is indeed present, and with a few manual steps in Discover we can easily trace a slow Elasticsearch query to a specific HTTP route or Task Manager task in Kibana today, without any changes to Elasticsearch. For many use cases, this should solve the problem.

APM doesn't have a UI that allows you to filter for specific span types; however, it's possible to drop into Discover and find Elasticsearch queries originating from Kibana that are slower than 60s with a filter like the one sketched below.

Because a large portion of our HTTP routes are plugin-specific, this should allow us to easily determine which plugin is triggering ES slow queries in many cases. For Kibana HTTP routes that are used by many different UIs, we'll need to do some additional work in adding instrumentation to the client-side in order to be able to tease out which UI triggered the underlying slow query. I think we can solve this with a combination of Core-level changes (e.g. adding the plugin ID information to the HTTP client exposed to plugins) and further client-side instrumentation.

All this said, this is dependent on the customer being able to turn on the APM instrumentation. There are plenty of things we could do to make this easier for customers. Is there any reason to believe that we'd still need an alternative non-APM solution for this? If not, I think focusing our efforts on improving the existing APM instrumentation would give us the best ROI on our time.
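A hedged sketch of what such a Discover filter could look like, written as a KQL string; the field names assume the standard APM span data model (service.name, span.type, span.subtype, span.duration.us), and the 60s threshold matches the example above.

```ts
// Hypothetical KQL filter to paste into Discover against the APM span indices:
// Elasticsearch spans issued by Kibana that took longer than 60 seconds
// (span.duration.us is in microseconds, so 60s = 60,000,000).
const slowEsSpansKql = [
  'service.name : "kibana"',
  'span.type : "db"',
  'span.subtype : "elasticsearch"',
  'span.duration.us > 60000000',
].join(' and ');
```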
Yesterday we had a meeting with the Elasticsearch team about improving request traceability across the Stack.
In the long term, we are considering building a custom solution to overcome the problems listed above.
During a standup today, @cyrille-leclerc mentioned the W3C distributed trace context: Baggage as a potential solution for communicating the plugin-specific information from the client-side in the distributed traces. I was told that @dgieselaar knows the details of this. Have we looked into taking advantage of this? If it would work for the APM solution, could we use it for the long-term solution where this context is included in the appropriate Elasticsearch logs?
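As a reference point, a minimal sketch of attaching W3C Baggage entries to a browser request. The header format (comma-separated key=value pairs) follows the W3C Baggage specification, while the entry names (kibana.plugin, kibana.app) and the request URL are purely illustrative assumptions, not an agreed schema.

```ts
// Hypothetical baggage entries describing where the request originated in the UI.
const baggageEntries: Record<string, string> = {
  'kibana.plugin': 'dashboard',
  'kibana.app': 'my-dashboard-id',
};

// Serialize per the W3C Baggage format: key1=value1,key2=value2
const baggage = Object.entries(baggageEntries)
  .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
  .join(',');

// Send it alongside the traceparent header so intermediaries and Elasticsearch
// could, in principle, log or forward the same context.
await fetch('/internal/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', baggage },
  body: JSON.stringify({ params: {} }),
});
```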
Thanks @kobelb . @jasonrhodes and I will be happy to help. |
The primary challenge we have right now is not the transmission of the context from the client to the server, but the actual passing of the context data itself through all the application layers to the client-side code that initiates requests to the server. This is a challenge in JavaScript due to its async, event-driven nature, which prevents us from easily being able to determine which user action triggered which Elasticsearch queries without manual code changes.

On the server, Node.js's AsyncLocalStorage API gives us a way to propagate this context implicitly. On the browser, no such API exists, so we'll need to manually instrument each call path to ensure context is traced through each layer in our JavaScript services on the client. There is some degree of out-of-the-box global context tracing we can achieve, but for heavily used common services we'll likely still need explicit instrumentation.

We can definitely leverage the Baggage proposal for transmission of this context; however, I don't believe the Baggage proposal will help us with actually delivering this context from the UI layers to the JS service layers where we make the requests to Kibana's server side. The OpenTelemetry (aka OpenTracing) protocol that APM already supports does use the Baggage proposal, but at this time APM uses its own format for sending key-value pairs. I do think using Baggage for sending key-value pair context information for direct use by Elasticsearch's slow log would be a better option than overloading the x-opaque-id header.
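To make the server-side part concrete, here is a minimal sketch of implicit context propagation with Node.js's AsyncLocalStorage; the names (ExecutionContext, runWithContext, currentOpaqueId) are illustrative assumptions rather than Kibana's actual APIs. The point is that context set at the HTTP layer is visible wherever the Elasticsearch call is eventually made, without threading it through every intermediate layer.

```ts
import { AsyncLocalStorage } from 'async_hooks';

// Hypothetical shape of the context we want to track per user action.
interface ExecutionContext {
  pluginId: string;
  description?: string;
}

// One storage instance shared by the HTTP layer and the ES client wrapper.
const executionContextStorage = new AsyncLocalStorage<ExecutionContext>();

// HTTP layer: establish the context once per request. Every async operation
// spawned inside the callback (awaits, promises, timers) inherits it.
export function runWithContext<T>(context: ExecutionContext, fn: () => Promise<T>): Promise<T> {
  return executionContextStorage.run(context, fn);
}

// ES client wrapper: read the context at the point where the query is issued.
export function currentOpaqueId(): string | undefined {
  const ctx = executionContextStorage.getStore();
  return ctx ? `kibana:${ctx.pluginId}` : undefined;
}
```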
This initial conversation about instrumentation overhead was in reference to the overhead of using APM as a storage output for this context data. Now that we know we can leverage Elasticsearch's slow log directly, without necessarily needing to use APM, we need to investigate whether or not we can implement this tracing without the same overhead as the APM agent. It's likely that we'll want to use that same mechanism. In summary:
Thanks for the great explanation, @joshdover. I appreciate you and your team putting such careful thought into this.
@joshdover do you have everything you need from the observability side on this? I know @graphaelli has been involved, but feel free to bring myself and/or @Mpdreamz (APM tech lead) in at any point if we can help support this effort. Thanks for the update!
Problem statement
Elastic Stack has multiple components, and Kibana itself has multiple layers of abstraction. When a Kibana user wants to understand why it takes a minute to render a dashboard, or why requests to Elasticsearch get stuck in the request queue, they need to answer the following questions:
In the current issue, we focus on the use case covering 80% of slowness problems in the Stack: detecting a slow Elasticsearch query and tracing it back to the Kibana plugin that triggered it. In the long term, Kibana should provide tooling that gives insights into Stack performance and Stack component interactions, with the ability to drill down into every component in the system. That functionality is out of the scope of the current issue.
Functional requirements
A Kibana user should be able to:
Implementation options
APM agent
Kibana already ships with a built-in Elastic APM agent. The agent instruments Kibana code to collect runtime stats. The runtime stats are sent to an APM server to be processed and stored in the APM cluster within Elasticsearch.
Cons:
Custom solution
Right now, the APM agent doesn't support integration with Elasticsearch slow query logging, nor async tracing in the browser. These are critical requirements for the current issue and, most likely, it will not be possible to fix these drawbacks quickly. That's why it may make sense to start with a custom solution and consider switching to the APM agent later.
Cons:
Client-side
const executionContext = createExecutionContext({
  name: 'plugin name',
  type: 'plugin element name',
  url: '/path/to/element/in/ui',              // optional
  description: 'human readable description',  // optional
});
Server-side
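A hedged sketch of one way the server side could forward such a context to Elasticsearch. The x-kibana-execution-context header name is a made-up placeholder, and the per-request headers option is the generic @elastic/elasticsearch client mechanism rather than a settled design; Elasticsearch echoes x-opaque-id in its logs, including the slow log, which is what would let an admin trace a slow query back to its Kibana origin.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical header set by the Kibana browser app carrying the serialized
// execution context created on the client side.
const EXECUTION_CONTEXT_HEADER = 'x-kibana-execution-context';

// Forward the context on the outgoing Elasticsearch call via x-opaque-id.
export async function searchWithContext(
  params: { index: string; body: Record<string, unknown> },
  incomingHeaders: Record<string, string | undefined>
) {
  const context = incomingHeaders[EXECUTION_CONTEXT_HEADER];
  return client.search(params, context ? { headers: { 'x-opaque-id': context } } : {});
}
```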
Open questions:
@kobelb @joshdover and others. I'd appreciate your input on the implementation options and the Open questions section. My thoughts:
Depends on the final implementation overhead. Maybe it should be enabled by default to be accessible by Kibana or Elasticsearch. Elasticsearch slow logging is configured via logging level settings.
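For context, a minimal sketch of configuring the Elasticsearch search slow log from the JS client; the index name and thresholds are illustrative, and index.search.slowlog.threshold.* are standard Elasticsearch index settings.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Enable search slow logging on an example index. Queries slower than these
// thresholds are written to the slow log, where the headers Kibana sends
// (e.g. x-opaque-id) can be used to trace them back to their origin.
await client.indices.putSettings({
  index: 'my-index',
  body: {
    'index.search.slowlog.threshold.query.warn': '10s',
    'index.search.slowlog.threshold.query.info': '5s',
    'index.search.slowlog.threshold.fetch.warn': '1s',
  },
});
```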
Required:
not at the current stage
in order of priority:
If we encounter such a use case.
To speed up the process, it makes sense to have the Core team do it, consulting with the owners of the code in advance. @jasonrhodes @graphaelli @Mpdreamz does it make sense to you if the Core team implements a custom solution instead of relying on APM for this specific use case of tracing an Elasticsearch query to a Kibana source? What do we lose by giving up the APM solution?
Short reply, will go into more depth tomorrow, but I feel strongly that we should be each other's worst/best customer. If we build something custom now, it'll be a missed opportunity to invest in a deep Stack integration. For instance, @SylvainJuge explored first-class support for APM within Elasticsearch at several levels. Even if Elasticsearch does not ship with APM anytime soon, we could look at it supporting trace headers and persisting these to Elasticsearch logs in addition to x-opaque-id. This way, even if APM is not installed, Kibana can still manually correlate traces rather than building a custom scheme for this.
Let me back up a bit to make sure I understand some goals/decisions. A large part of this effort is about getting some tracing info ("where did this slow ES query originate?") without relying on having an APM server set up. Is this based on the assumption/understanding/agreement that in many scenarios it isn't practical or likely that the owner of that stack deployment will set up and use the APM system? And that is still true for something like
I am not challenging, just trying to understand.
No. At least the way I think we'd implement "hey APM agent, do your thing, but there is no APM server for now", there would be no buffering of events at all.
To be clear, the current Node.js APM agent will work without an APM server; it will just log.error a bit and have some wasted overhead. Are you thinking of using …? There are two separate Node.js APM agent pieces of work here:
I will bring it up with my team so we can discuss prioritization and get back here. Running without an APM server and adding baggage support will, I expect, be fine. Adding AsyncLocalStorage-based context management is probably unlikely in the short term.
@joshdover, did you mean "the RUM agent cannot commit"?
Yes, Elasticsearch already provides
As I understand it, the APM client hasn't been created with this use case in mind. I would wait until this mode is officially supported. Even if …
Probably, the mechanism of data transfer between the Kibana browser app and the server app is an internal implementation detail. I don't think that …
I started looking at
No, I meant that we should probably avoid building our own async context tracing solution in the browser if the RUM agent can commit to providing one for us in the future. That way we don't duplicate any efforts. However, given the current state of the art, it doesn't appear that there is any viable solution to this yet. I'd love to find out more about why the TC39 proposal for Zones was withdrawn and whether there's a replacement proposal in the works.
Regarding changes for the Node.js agent:
These should be addressed with elastic/apm-agent-nodejs#2101 and by adding traceparent-based log correlation to both Kibana and Elasticsearch. Using log correlation, it will be possible to correlate the logs from Kibana and Elasticsearch to a particular request. If and when APM is enabled (I'm advocating for turning it on by default on Cloud, but that's a different discussion), we wouldn't even necessarily need the slow log but could query the APM index for slow ES queries. It would then be possible to view the full request from frontend to backend, so that we know exactly where a slow query originates from, rather than just which component.
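For readers unfamiliar with it, a small sketch of the W3C traceparent header that this kind of log correlation keys on; the ids in the example are arbitrary placeholders.

```ts
// W3C trace context header: version-traceId-parentSpanId-traceFlags.
// The same trace id would appear in the Kibana log entry and in the
// corresponding Elasticsearch log entry, making the two joinable.
const traceparent = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01';

const [version, traceId, parentSpanId, traceFlags] = traceparent.split('-');
console.log({ version, traceId, parentSpanId, traceFlags });
```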
Seems like this needs manual context propagation in the short to mid-term. But instead of doing something completely custom for Kibana, I'd suggest using the RUM agent's API to do the context propagation. I believe those APIs don't exist today but I think it's worth investing in adding them rather than doing something outside of APM.
@felixbarny we are going to switch to the APM solution as soon as elastic/apm-agent-nodejs#2101 is done. As far as I can see, the APM agent team is planning to work on it during
Agree, until it's provided by the APM RUM agent, we will have to use a custom solution. Is there an issue to track the progress of the enhancement?
It's planned to be done at a later stage as well: #101708
What's your timeline to add that to Kibana? If we'd add that in the early 7.14 or early 7.15 time frame, can we avoid the interim solution?
I fear that working on a custom solution now will make it much harder to migrate to an APM-based solution in the future.
I don't think it's completely clear what functionality you'd need from the RUM agent. Could you open a ticket on https://github.com/elastic/apm-agent-rum-js or another Kibana issue to elaborate on the things you need?
@mshustov we reprioritized elastic/apm-agent-nodejs#2101 and will address this next (after current in-progress stuff on Node.js Agent), so you can hopefully move forward with that very soon.
We've spent several days talking with @felixbarny about implementing this functionality and improving the observability for the whole Stack at the same time. I updated the #101587 description to reflect the current plan. Feel free to chime in.
The Elasticsearch performance team is interested in analyzing the timing of requests issued by Kibana to make sure any changes in the Elasticsearch server do not lead to regressions in Kibana UI performance. Yesterday we had a meeting with @gingerwizard, @danielmitterdorfer, @vigneshshanmugam, @wylieconlon, @ppisljar about improving the Kibana client-side instrumentation with the APM RUM agent. Some notes from the meeting: A more detailed tracing mechanism might require Kibana plugins to instrument their data management code manually.
Context: Recent important escalation:
@arisonl I think it depends on what specific question we are trying to answer.

If the question is "Today can we trace slow Elasticsearch queries to their origin in Kibana?", the answer is yes: in 8.1+, you can trace queries back to Dashboards/Visualizations/Lens/Discover/Alerts/Maps using Kibana's execution context.

If the question is "Today can we trace slow Elasticsearch queries back to a specific user?", the answer is yes, with some caveats (you need to be using audit logs, and can't be using …).

If the question is "Today can we trace slow Elasticsearch queries using …?" …

(@rudolf @kobelb keep me honest if I've missed anything on this) 😉
What are the main blockers that prevent you from using APM for these use cases? We've made several improvements in the Node.js agent to improve its performance and to make it possible to use it only for context propagation purposes, without it reaching out to the server. In the meantime, Elasticsearch has also adopted APM, so we can have full distributed tracing across Kibana and Elasticsearch, which should IMO be superior to manual correlation of a slowlog file. The APM Team has a big interest in making this a success and we're here to support you.
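For reference, a minimal sketch of the context-propagation-only mode referred to here, assuming the elastic-apm-node agent's contextPropagationOnly configuration option (check the agent documentation for the exact option name and the minimum agent version that supports it).

```ts
// Must run before any other modules are loaded so the agent can instrument them.
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'kibana',
  // Propagate trace context (traceparent, etc.) on outgoing requests,
  // but never buffer events or talk to an APM Server.
  contextPropagationOnly: true,
});
```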
@felixbarny thank you, this sounds great. Some problems related to the APM approach were summarized earlier in this thread but this is over a year old now. And APM distributed tracing does sound like the right end state. We have many requests asking to provide a way for ES queries to be traceable to user actions in Kibana or system initiated tasks (such as alerts) and so I think that we should pick this conversation up and review where we stand and what the most reasonable next steps would be. In addition, we also have use cases that involve being able to trace back in cross cluster search scenarios. |
I agree. It's hard to keep track of this long thread. Maybe we should close this and create a new one which describes the current state, the remaining issues, and the next steps. Who can own doing that?
Generally, CCS should just work™️ with APM.
At this point I don't think we are blocked on anything here. It mostly comes down to:
I think we have already achieved the outcome in the issue description. As per https://www.elastic.co/guide/en/kibana/master/kibana-troubleshooting-trace-query.html, we can attribute an Elasticsearch slowlog query to the dashboard/visualisation that caused it. Not all apps capture the execution context, but there is a generic solution being tracked in #132629. While APM can provide even richer data, I'm not sure it is the right tool for most users' use case:
There is one piece of critical information that's currently missing, which is who created the query; the Elasticsearch team is investigating exposing the user ids from the slowlog. So I'm going ahead and closing this issue; feel free to re-open if there are shortcomings in the current approach that I missed and that we should address. Having said that, there might be benefit in Kibana switching to APM's execution context instead of our own implementation.
Kibana executes a lot of queries against Elasticsearch. Some of them originate from internal sources (task-manager claiming tasks, queries that are run on Kibana startup, etc.) and some of them are a result of a user's interaction with Kibana (opening a Visualization, using the Index-Management UI, etc).
Some of these queries can be slow and show up in the Elasticsearch slow log. However, when they do, it's not obvious what caused them to occur in the first place. This makes it very difficult for administrators to do anything to prevent them from occurring in the future.
It'd be incredibly helpful for us to be able to correlate a slow Elasticsearch query to the specific usage of a feature. For example, it should be possible to correlate an Elasticsearch query to a specific Visualization that's embedded in a Dashboard. This would allow the administrator to take corrective action.
Orthogonal or related issues
#77214 proposes custom request headers for the sake of usage analysis; however, there's potentially some overlap with a solution here.
#60122 proposes including the "execution context" to all Kibana log entries, but it doesn't propose forwarding these to Elasticsearch for correlation.