Tooling to measure xsnap performance and bottlenecks #7068

Open
raphdev opened this issue Feb 24, 2023 · 2 comments
Labels
enhancement (New feature or request), performance (Performance related issues), tooling (repo-wide infrastructure), xsnap (the XS execution tool)

Comments

@raphdev
Contributor

raphdev commented Feb 24, 2023

What is the Problem Being Solved?

Investigations into issues like #6661 can be expensive and time-consuming. Over time we've developed tooling to assist with troubleshooting. We can extend that tooling to produce artifacts and data that would let us more easily explore performance, identify bottlenecks, and root-cause issues.

Being able to easily profile xsnap and render the profiling information would improve our ability to identify and address performance issues.

Description of the Design

If necessary, modify xsnap to meet the prerequisites for producing tracing information (e.g. instrumentation).

Integrate a tool like replay-transcript with optional profiling, something like:

  • Replay a transcript with instrumentation (perf, eBPF, llvm-xray) and produce xsnap traces.
  • Allow rendering the traces (graphs, flame/icicle graphs), potentially through additional tooling that converts the resulting traces to standard formats (e.g. trace-event).
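As an illustration of the conversion step, here is a minimal sketch that turns function entry/exit records into the Chrome Trace Event Format, where each entry becomes a duration-begin event (`"ph": "B"`) and each exit a duration-end event (`"ph": "E"`), with `ts` in microseconds. The `CallRecord` shape, its field names, and the demo function names are hypothetical; real input would come from whatever the chosen instrumentation emits. For XRay specifically, `llvm-xray convert` documents a `trace_event` output format, so a converter like this would mostly matter for other sources.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallRecord:
    """Hypothetical function entry/exit record (field names are illustrative)."""
    func: str
    kind: str   # "enter" or "exit"
    ts_ns: int  # timestamp in nanoseconds
    tid: int    # thread id


# Trace Event Format duration phases: "B" = begin, "E" = end.
PHASES = {"enter": "B", "exit": "E"}


def to_trace_events(records: Iterable[CallRecord], pid: int) -> dict:
    """Build a Trace Event Format document (JSON object form)."""
    events = []
    for r in records:
        if r.kind not in PHASES:
            raise ValueError(f"unknown record kind: {r.kind!r}")
        events.append({
            "name": r.func,
            "ph": PHASES[r.kind],
            "ts": r.ts_ns / 1000,  # the format expects microseconds
            "pid": pid,
            "tid": r.tid,
        })
    return {"traceEvents": events}


if __name__ == "__main__":
    # Hypothetical nested calls on a single thread.
    demo = [
        CallRecord("deliver", "enter", 1_000, 1),
        CallRecord("parse", "enter", 2_000, 1),
        CallRecord("parse", "exit", 5_000, 1),
        CallRecord("deliver", "exit", 9_000, 1),
    ]
    print(json.dumps(to_trace_events(demo, pid=42), indent=2))
```

Saved to a `.json` file, the output can be opened in the Perfetto UI or `chrome://tracing`, which render nested B/E pairs on a time axis.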

Security Considerations

On data collection

In general, the kinds of profiling and instrumentation we use should be unobtrusive and have minimal privacy impact. The data (or any potential telemetry) should not contain sensitive information; most of what we need is where time is spent, and possibly high-level information about memory layout.

Scaling Considerations

Sampling rates and trace volume should be tuned to avoid excessive data dumps, which consume a lot of disk space, are harder to transfer, and are more expensive to process and render. This is also a devX consideration: the more compact the data, the easier and faster it is to manipulate and explore.
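As a rough sizing aid: a timer-based sampler collects at most about one stack per on-CPU thread per tick, so the raw sample count grows linearly with rate, duration, and thread count. The helpers below are a minimal sketch of that arithmetic; the function names, the sample budget, and the single-thread assumption in the demo are illustrative, and real counts will be lower whenever threads are off-CPU.

```python
def estimated_samples(rate_hz: float, duration_s: float, threads: int) -> int:
    """Rough upper bound: one stack per thread per sampling tick."""
    return int(rate_hz * duration_s * threads)


def max_rate_for_budget(budget: int, duration_s: float, threads: int) -> float:
    """Highest sampling rate whose upper-bound sample count fits the budget."""
    if duration_s <= 0 or threads <= 0:
        raise ValueError("duration and thread count must be positive")
    return budget / (duration_s * threads)


if __name__ == "__main__":
    # Assumed scenario: a 10-minute replay of one worker thread at 99 Hz.
    print(estimated_samples(99, 600, 1))                  # 59400
    # Assumed budget of 20,000 samples for the same replay.
    print(round(max_rate_for_budget(20_000, 600, 1), 1))  # 33.3
```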

Test Plan

TODO

raphdev added the enhancement, tooling, performance, and xsnap labels Feb 24, 2023
raphdev self-assigned this Feb 24, 2023
@raphdev
Contributor Author

raphdev commented Mar 30, 2023

After getting XRay working, I was able to try out eBPF with some success; I had less success with perf. Most of my testing was done in two kinds of environments: cloud and Docker.

  • XRay works pretty much anywhere you can compile with Clang. It is probably the most detailed option, since it records function entry and exit along with timing information. However, the performance hit can be quite noticeable, even after tuning and minimizing the set of patched functions, and without hardware counters it is much slower. The traces are also quite large, and unfortunately we can't control that much, as the API is C++ only.
  • eBPF seems ideal. It was quick to set up, the performance impact is minimal, and the output size is quite reasonable. Most of the tooling I've used is sampling-based and aggregates stack populations, so the x-axis on the resulting graphs is not time. It even worked in Docker with some hacks.
  • perf did not work in the cloud, for the same reason XRay was much slower there (no hardware counters); it only worked in Docker, with the same hacks used for eBPF.

Following the instructions in https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Instructions was enough to get profiles of the xsnap worker processes orchestrated by the replay tool.
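For perf, those instructions boil down to `perf record -F <Hz> -g`, then `perf script`, then the FlameGraph repository's `stackcollapse-perf.pl`. The sketch below only builds the commands, targeting specific PIDs via `perf record -p` (which accepts a comma-separated list). It assumes a Linux `/proc`, that the worker's process name contains `xsnap`, and that the FlameGraph scripts sit in the current directory; `find_pids`, `perf_record_argv`, and `collapse_pipeline` are hypothetical helper names, and whether perf works at all depends on the environment, as noted above.

```python
import shlex
from pathlib import Path


def find_pids(name_fragment: str = "xsnap") -> list[int]:
    """Find PIDs whose /proc/<pid>/comm contains name_fragment (Linux only)."""
    pids = []
    for comm in Path("/proc").glob("[0-9]*/comm"):
        try:
            if name_fragment in comm.read_text():
                pids.append(int(comm.parent.name))
        except OSError:
            continue  # the process exited or is unreadable; skip it
    return sorted(pids)


def perf_record_argv(pids, freq_hz=99, seconds=30, output="perf.data"):
    """argv for sampling the given PIDs with call graphs for a fixed duration."""
    if not pids:
        raise ValueError("need at least one PID")
    return ["perf", "record", "-F", str(freq_hz), "-g",
            "-p", ",".join(str(p) for p in pids),
            "-o", output, "--", "sleep", str(seconds)]


def collapse_pipeline(output="perf.data", folded="xsnap.folded"):
    """Shell pipeline turning perf.data into folded stacks."""
    return (f"perf script -i {shlex.quote(output)} "
            f"| ./stackcollapse-perf.pl > {shlex.quote(folded)}")


if __name__ == "__main__":
    pids = find_pids()
    if pids:
        print(shlex.join(perf_record_argv(pids)))
        print(collapse_pipeline())
    else:
        print("no xsnap processes found")
```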

The resulting folded stacks could then be turned into SVGs or loaded into speedscope.
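The folded format itself is simple: one line per distinct stack, frames joined with `;` from root to leaf, followed by a space and a sample count. That is why it stays compact, since repeated stacks collapse into a single counted line. The minimal sketch below shows folding and a quick "hottest leaf frames" summary without rendering anything; the function names and demo frames are hypothetical, and frame names containing `;` or spaces would need escaping that this sketch omits.

```python
from collections import Counter
from typing import Iterable, Sequence


def fold_stacks(samples: Iterable[Sequence[str]]) -> list[str]:
    """Collapse sampled stacks (root frame first) into folded-stack lines."""
    counts = Counter(";".join(stack) for stack in samples if stack)
    # Sorted for stable, diff-friendly output.
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def top_self(folded: Iterable[str], k: int = 5) -> list[tuple[str, int]]:
    """Rank leaf frames by sample count (a rough 'self time' measure)."""
    leaves = Counter()
    for line in folded:
        stack, _, count = line.rpartition(" ")
        leaves[stack.split(";")[-1]] += int(count)
    return leaves.most_common(k)


if __name__ == "__main__":
    # Hypothetical samples from a single worker.
    samples = [
        ["main", "run", "parse"],
        ["main", "run", "parse"],
        ["main", "run", "gc"],
    ]
    folded = fold_stacks(samples)
    print("\n".join(folded))  # main;run;gc 1 / main;run;parse 2
    print(top_self(folded))   # [('parse', 2), ('gc', 1)]
```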

@raphdev
Contributor Author

raphdev commented Mar 30, 2023

Still exploring tooling to shortcut the manual steps. The measurements have already helped us make findings such as #7276.
