diff --git a/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md b/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md
new file mode 100644
index 00000000..8f63f4e9
--- /dev/null
+++ b/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md
@@ -0,0 +1,475 @@

> [!IMPORTANT]
>
> This document is from Elastic times and precedes the open-sourcing of the
> profiling agent. It is provided as an example of what a design document
> could look like. Many of the links had to be removed because the original
> git history of the profiling agent had to be erased for legal reasons when
> it was open-sourced.

Process Manager: Move Per-FileID Info Into Dedicated Manager
============================================================

## Meta

- **Author(s)**: Joel Höner
- **Start Date**: Jul 13, 2022
- **Goal End Date**: Jul 29, 2022
- **Primary Reviewers**: Christos Kalkanis, Florian Lehner

## Problem

The process manager in the host agent mixes multiple related but still
distinct concerns in a single type. It manages both information that is kept
per process ID (PID) and information that is kept per file ID. This
entanglement makes it hard to reason about the consistency expectations
between the different pieces of information being stored. The per-file-ID
information additionally has underdocumented and inconsistent lifetimes: part
of it is stored in append-only maps (inserted and kept forever) while other
parts are reference-counted. The reference counting for the latter is mixed
in with the other business logic of the process manager without any form of
abstraction.

## Success Criteria

- Clear separation of per-PID and per-file-ID information in two separate
  managers
- Clear and consistent lifetime for all per-file-ID information being managed
- Clearly communicate mutability/constness of the data being stored
- New manager must not call back into the PM that owns it
  - We have too many circular references already

## Scope

- This document is concerned only with first splitting the per-file-ID
  information storage and management out from the process manager and
  improving on how it is managed
- Improvements on any of the per-PID information within the process manager
  are not in scope

## Context

In the following sections we'll take inventory of the per-file-ID information
that is currently being stored in the process manager: its use, lifetime, and
mutability.

### The `interpreterData` map

The [`interpreterData` map][interpreter-data-field] serves as a cache of
per-executable interpreter data
([`interpreter.Data`][interpreter-data-struct]).

[interpreter-data-field]: #
[interpreter-data-struct]: #

When a new mapping is reported, it is used to quickly determine whether we
have previously concluded that the `FileID` associated with the new mapping
is definitely not an interpreter [by checking for a `nil` record in
`interpreterData`][interpreter-data-check]. If this is not the case (either
no entry exists or the entry is non-`nil`), the [`pm.interpreterLoader`
array] is [walked][array-walking], asking each interpreter unwinder to detect
whether it wants to handle the file. If an interpreter unwinder detects the
file as being supported by it, we store the corresponding helper information
that the interpreter extracted in the `interpreterData` map.
If the search for a suitable interpreter unwinder is exhausted
unsuccessfully, we instead store `nil`.

[interpreter-data-check]: #
[`pm.interpreterLoader` array]: #
[array-walking]: #

Entries are inserted into the map in `attachInterpreter` via the following
call-chains:

- [`ProcNewMapping`](#) → [`handleNewInterpreter`](#) → [`attachInterpreter`](#)
- [`ProcNewMapping`](#) → [`handleNewInterpreter`](#) → go [`handleNewInterpreterRetry`](#) → [`attachInterpreter`](#)

The map is read in:

- [`attachInterpreter`](#)
- [`handleNewInterpreter`](#)

Records are never deleted from this map, nor are records that were inserted
ever mutated.

### The `fileIDInfo` map

The [`fileIDInfo` map] is harder to reason about. Its main purpose is to
store the information required to delete all information for a given FileID
from the two BPF maps `stack_delta_page_to_info` and
`exe_id_to_%d_stack_deltas`.

[`fileIDInfo` map]: #

It further stores information about gaps in the unwinding information of
executables, which is passed to [`interpreter.New`]. This information is
currently only consumed by the V8 unwinder, which uses it to find the
pre-compiled JIT block within the binary.

[`interpreter.New`]: #

Entries in the map are managed via reference counting: incrementing the RC is
performed inline in the business logic of the process manager ([link
1][rc-inc-1], [link 2][rc-inc-2]) while decrementing the RC is performed in
the [`decreaseReferenceCounter`][rc-dec] function.

[rc-inc-1]: #
[rc-inc-2]: #
[rc-dec]: #

Entries are inserted into the `fileIDInfo` map in either
[`ProcNewMapping`][file-id-info-insert-1] or
[`AddSynthIntervalData`][file-id-info-insert-2]. In `ProcNewMapping`, if a
record for a file ID already exists, we instead just increase the RC.
`AddSynthIntervalData`, on the other hand, errors out when passed a file ID
that is already known. [`ProcNewMapping`][file-id-info-insert-1] is the
typical code-path taken for the majority of executables whereas
[`AddSynthIntervalData`][file-id-info-insert-2] is more of an edge-case: it
provides the ability to override unwinding information for executables with
broken `.eh_frame` sections. It's currently only used for `[vdso]` handling
on aarch64.

[file-id-info-insert-1]: #
[file-id-info-insert-2]: #

The value to be stored in the map -- for both code paths -- is returned from
the [`loadDeltas`][load-deltas-return] function.
[`loadDeltas`][load-deltas-return] loads unwinding information into the
`stack_delta_page_to_info` and `exe_id_to_%d_stack_deltas` BPF maps and then
returns a newly created instance of the [`fileIDInfo`
struct][file-id-info-struct].

[load-deltas-return]: #
[file-id-info-struct]: #

The reference counter can be decreased via the following code-paths:

- [`ProcExit`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
- [`ProcMunmap`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)

In `deletePIDAddress`, once the RC reaches 0, the information from the map is
used to remove the entries corresponding to the FileID from the BPF maps
([link 1](#), [link 2](#)) and then also the entry itself from the
`fileIDInfo` map.
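For illustration, this decrement path boils down to roughly the following
minimal sketch; all field and helper names besides `decreaseReferenceCounter`
are hypothetical and only serve to illustrate the current lifetime logic:

```go
// Minimal sketch of the RC-decrement flow. Assumes fileIDInfo is a map of
// pointers so that the decrement persists; names other than
// decreaseReferenceCounter are hypothetical.
func (pm *ProcessManager) decreaseReferenceCounter(fileID host.FileID) error {
	info, ok := pm.fileIDInfo[fileID]
	if !ok {
		return fmt.Errorf("file ID %v has no entry in fileIDInfo", fileID)
	}

	info.referenceCount--
	if info.referenceCount > 0 {
		return nil
	}

	// The last mapping referencing this executable is gone: use the stored
	// bookkeeping information to remove the corresponding entries from the
	// stack_delta_page_to_info and exe_id_to_%d_stack_deltas BPF maps, then
	// drop the entry itself.
	pm.removeStackDeltaEntries(fileID, info)
	delete(pm.fileIDInfo, fileID)
	return nil
}
```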
The map is read via the following code-paths:

- [`New`](#) → [`collectInterpreterMetrics`](#) → anonymous spawned goroutine
- [`ProcExit`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
- [`ProcMunmap`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
- [`ProcNewMapping`](#)

With the exception of the reference counter field, entries in this map are
never mutated after insertion.

### The `unwindInfoIndex` map

[The `unwindInfoIndex` map][unwind-info-index] and the `unwind_info_array`
BPF map that it is tethered to provide deduplication of
[`UnwindInfo`][unwind-info-struct] records in the `exe_id_to_%d_stack_deltas`
BPF map. The map in Go assigns each unique `UnwindInfo` the `uint16` index of
the corresponding information in the BPF map. It is both read and written
exclusively from [`getUnwindInfoIndex`][get-unwind-info-func]. Entries are
only appended and never deleted.

[unwind-info-index]: #
[unwind-info-struct]: #
[get-unwind-info-func]: #

Call-chains:

- [`ProcNewMapping`](#) → [`loadDeltas`](#) → [`getUnwindInfoIndex`](#)
- [`AddSynthIntervalData`](#) → [`loadDeltas`](#) → [`getUnwindInfoIndex`](#)

Because the map is exclusively used as a helper for deduplicating data in the
per-file-ID maps, I suggest moving it into the new manager as well.

### The `stackdeltaprovider` field

The [`stackdeltaprovider` field][stackdeltaprovider-field] is used
exclusively to [construct the data that is then stored in the `fileIDInfo`
map][construct-data-for-fileidinfo] and the associated BPF maps. Because the
SDP is concerned only with per-file-ID information, we should move this field
to the new manager as well.

[stackdeltaprovider-field]: #
[construct-data-for-fileidinfo]: #

### The `interpreterLoaders` field

The `interpreterLoaders` field is used to [populate the data][populate-data-1]
in the `interpreterData` map if it doesn't exist yet. The field and [the
constructor logic for populating it][populate-data-2] should be moved to the
new manager.

[populate-data-1]: #
[populate-data-2]: #

### The `FileIDMapper` field

`FileIDMapper` is a public field of the process manager. It provides
functionality for setting and getting the mapping from user-mode to
kernel-mode file IDs and is [implemented as an LRU](#). The implementation is
swappable to simplify testing.

Entries are read in:

- [`ProcessManager.ConvertTrace`](#)
- [`ebpf.reportNewMapping`](#)

Entries are added in:

- [`ebpf.getKernelStack`](#)
- [`elfinfo.getELFInfoForFile`](#)

Entries are never explicitly deleted. Instead, they are garbage-collected by
falling out of the LRU. It is unclear what the intended behavior is when
binaries that are still loaded in a process fall out of the LRU. Converting
traces for such binaries will no longer be possible.

The field stores per-file-ID information and should probably be moved as
well. However, the fact that it is public, swappable for testing, and mutated
from outside the process manager complicates this considerably. The actual
implementation of the mapper is already abstracted behind a dedicated type
and doesn't introduce a lot of complexity into the PM code itself, so I
suggest leaving it where it is for the moment. Consolidating it into the new
manager might be desirable at a later point in time but is declared out of
scope for this document.

## Proposed Solution

### New Manager Design

This section makes concrete suggestions for the design of the new manager.
#### Lifetime Management

The two main maps `interpreterData` and `fileIDInfo` discussed in the
previous section currently have diverging lifetimes: one is append-only
whereas the other uses reference counting. In order to merge the two and to
make the lifetime of per-file-ID information easier to reason about, it is
desirable to move them towards a common lifetime. There are multiple possible
strategies to pursue here.

##### All Append-Only

In this strategy we merge both maps and manage them according to an
append-only lifetime: entries that are added once are never released.

While this strategy worked reasonably well for `interpreterData`, it is
doubtful that it is desirable to also apply it to `fileIDInfo` and the
connected BPF maps: the amount of memory required to store them is
significantly higher. On systems that build and execute a lot of executables
on the fly (e.g. the wasmtime runtime with the dylib loader) we'd quickly run
out of memory.

##### All Reference-Counted

In this variant, we merge the maps and manage both of them via
reference-counting. While it entails a bit more complexity in the new
manager's internals, it ensures that no memory is leaked regardless of the
workload.

For `fileIDInfo` the lifetime remains unchanged. For the `interpreterData`
information the logic changes slightly when compared to the previous
append-only approach: the per-executable interpreter information is lost once
the last interpreter running it exits. In a workload that spawns a lot of
very short-lived interpreter processes in sequence, this might increase the
number of cycles spent on re-gathering the interpreter data.

As an optional extension to combat this, we could refrain from immediately
removing objects whose RC reaches 0. Instead, we store a timestamp of when
the RC reached 0. A goroutine then periodically cleans all entries that have
been at RC = 0 for more than N seconds. Lookup and deletion methods will
pretend that objects with RC = 0 do not exist whereas insertion methods will
pick up the dying object and revive it by incrementing the RC.

It is unclear whether the overhead introduced by these additional
`interpreter.Loader` calls is substantial. `fileIDInfo` has the same lifetime
management already and -- from intuition -- would appear to be more expensive
to re-create than the interpreter data: it requires reading stack-deltas from
disk, decompressing them, and inserting them into the BPF maps.

We could also consider storing the per-executable interpreter information on
disk.

##### Keep Maps and Lifetime Separate

This is the simplest variant to implement, but it also reduces the
opportunities for internal simplification significantly. If we keep their
lifetimes separate, we'll have to add accessors to read from and write to
each respective map individually.

##### Author's Preference

My preference is managing all information with reference counting. I suggest
going without the delayed deletion approach first to keep things simple. If
this turns out to be a performance problem in practice, it should be
reasonably easy to extend the system as required.

##### Decisions

The reviewers agreed with the author's preference.

#### Public Interface

```go
type ExecutableInfo struct {
    // Data holds per-executable interpreter information if the file ID that this
    // instance belongs to was previously identified as an interpreter. Otherwise,
    // this field is nil.
    Data interpreter.Data
    // Gaps contains information about large gaps in the unwinding information.
    Gaps []libpf.Range
    // [elided: various private fields required to keep track of RC and BPF map entries]
}

type executableInfoManagerState struct {
    info                  map[FileID]ExecutableInfo
    unwindInfoIndex       map[sdtypes.UnwindInfo]uint16
    sdp                   StackDeltaProvider
    ebpf                  EbpfHandler
    opener                interpreter.ResourceOpener
    numStackDeltaMapPages uint64
}

type ExecutableInfoManager struct {
    state xsync.RWMutex[executableInfoManagerState]
}

// Alias to keep the definitions below short. Not to be included in the actual code.
type EIM = ExecutableInfoManager

// New creates a new instance of the executable info manager.
func New(sdp StackDeltaProvider, ebpf EbpfHandler, opener interpreter.ResourceOpener) *EIM

// AddOrIncRef either adds information about an executable to the internal cache (when first
// encountering it) or increments the reference count if the executable is already known.
// The memory mapping and PID are only borrowed in order to allow interpreter loaders to read
// memory (e.g. to determine the version or to resolve other per-executable information).
func (mgr *EIM) AddOrIncRef(mapping *Mapping, pid util.PID) (*ExecutableInfo, error)

// AddSynth adds synthetic interval data to the manager. A call to this function must
// precede all `AddOrIncRef` calls for the same file ID, otherwise an error is returned.
func (mgr *EIM) AddSynth(fileID host.FileID, data sdtypes.IntervalData) error

// Get retrieves information about an executable. If the executable isn't known, nil is
// returned instead.
func (mgr *EIM) Get(fileID FileID) *ExecutableInfo

// RemoveOrDecRef decrements the reference counter of the file being tracked. Once the RC
// reaches zero, information about the file is removed from the manager.
func (mgr *EIM) RemoveOrDecRef(fileID FileID) error

// UpdateMetricSummary updates the metrics in the given map.
func (mgr *EIM) UpdateMetricSummary(summary map[metrics.MetricID]metrics.MetricValue)
```

The suggested interface assumes that we go with the consolidated ref-counting
approach for lifetime management. Should we decide to go with a different
approach, it will have to be reworked.

The ref-counting semantics are intentionally exposed in the public interface:
callers must understand them, because otherwise it would be tempting to call
`AddOrIncRef` only once and to check for existence via `Get`.

All function-, type- and field-names are suggestions and up for debate.

#### Ownership of the New Manager

The instance of the new manager will be owned by the process manager in the
form of a private struct field. The code continues to live in the
`processmanager` package, although in a different file: `execinfomanager.go`.

#### BPF Map Management

The responsibility for managing the following per-file-ID BPF maps is
transferred to the new manager:

- `stack_delta_page_to_info`
- `exe_id_to_%d_stack_deltas`
- `unwind_info_array`

#### BPF Map Documentation

BPF maps pose a bit of a challenge with regard to where and how to document
them. They are always visible to all code and their ownership can be opaque.
In essence, they can be reasoned about as if they were public fields of the
type that manages them. Consequently, this might also be a way forward in
documenting them: on the type that manages them, as if they were public
fields.
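For illustration, a doc comment in that style could look roughly like this
(a sketch; the exact wording is up for debate like the rest of the
interface):

```go
// ExecutableInfoManager owns the following BPF maps. Conceptually they can be
// treated like public fields of this type: all other code may read them, but
// only this manager inserts and removes entries.
//
//   - stack_delta_page_to_info: populated when an executable is added;
//     entries are removed once the executable's reference count reaches zero
//   - exe_id_to_%d_stack_deltas: same lifetime as stack_delta_page_to_info
//   - unwind_info_array: append-only deduplication table for UnwindInfo
//     records; entries are never removed
type ExecutableInfoManager struct {
    state xsync.RWMutex[executableInfoManagerState]
}
```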
We should document what the maps are populated with, by whom, and whether
other code is allowed to mutate them. In this particular case, the maps are
managed exclusively by the new manager and other code can only read from
them.

#### Locking

The new manager is locked internally and public methods can be called from an
arbitrary number of threads at the same time. A global `xsync.RWMutex` is
used to lock all fields. It is acquired and released in all public functions;
internal helpers that need access to the manager's state rely on their caller
for locking by taking an `*executableInfoManagerState` argument. Where
applicable, they can instead be implemented directly with the state struct as
their receiver.

#### Metrics

The `numStackDeltaMapPages` metric is concerned with keeping count of the
size of the per-file-ID BPF maps and should thus be moved to the new manager.
The responsibility for reporting metric information of the various fields
that were moved into the new manager is transferred as well.

#### Moving Code

Multiple methods of the process manager are exclusively concerned with
per-file-ID information:

- `AddSynthIntervalData`
- `loadDeltas`
- `getUnwindInfoIndex` (only called from `loadDeltas`)
- `calculateMergeOpcode` (only called from `loadDeltas`)
- `decreaseReferenceCounter`

These will be moved to the new manager in their entirety.

### Porting Process

#### Option #1

- Copy the per-file-ID maps to a separate type
- Create methods from the relevant code snippets that -- as far as possible --
  contain exact copies of the code that previously lived in the process
  manager
  - Will produce a lot of bogus methods with questionable names and semantics
  - However, it allows reviewers to easily tell whether the code is equivalent
- Remove the per-file-ID maps from the process manager, replacing them with an
  instance of the new manager
- Replace the previously extracted code snippets with corresponding method
  calls
- Get that reviewed & merged
- Unify the two maps in the new manager and move towards the new proposed
  public interface
- Get that reviewed & merged

#### Option #2

- Implement a manager with the public interface documented in this document
  immediately
  - Duplicate all required helper functions and fields
  - Place a lot of comments that help explain where code was pasted from, how
    it was changed, and where new code was added
  - Get that reviewed and approved as a draft, but don't actually merge it
    into `main` yet
- Port the PM to use the new manager, replacing and deleting all previously
  duplicated code
- Get that reviewed, approved, and then merged

#### Author's Preference

This is going to be a bit of a challenge to review either way: even when
going with Option #1, there will still be one rather large commit to
consolidate the lifetimes of the two maps and to move towards the new
interface. There's also a chance that all the bogus methods created to factor
out the code snippets from the PM might actually make things harder to review
instead of aiding it, because reviewers will then also have to check how
these snippet methods are called. I thus have a slight preference for Option
#2, but will leave it to the reviewers to decide which variant they prefer.

#### Decision

The reviewers either agreed with the preference for Option #2 or didn't have
a preference at all. The port will be conducted using Option #2.
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md b/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md
new file mode 100644
index 00000000..1f676f25
--- /dev/null
+++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md
@@ -0,0 +1,736 @@

> [!IMPORTANT]
>
> This document is from Elastic times and precedes the open-sourcing of the
> profiling agent. It is provided as an example of what a design document
> could look like. Many of the links had to be removed because the original
> git history of the profiling agent had to be erased for legal reasons when
> it was open-sourced.

HA: remove BPF hashing
======================

# Meta

- **Author(s)**: Joel Höner
- **Start Date**: Nov 9, 2023
- **Goal End Date**: Dec 5, 2023
- **Primary Reviewers**: Christos Kalkanis, Timo Teräs

## Abstract

Our host agent uses a relatively complex data processing pipeline that
enriches, aggregates and reshapes raw traces into the format expected by the
collector. Statistics show that a significant portion of this complexity
might be unwarranted. Upcoming features further change some important
requirements for the system as a whole. This document proposes a rapid
simplification of the BPF ↔ user-mode communication portion of the pipeline.

## Context

The relevant portion of our trace aggregation in the host agent currently
looks roughly like this:

![old-trace-pipe](./trace-pipe.drawio.svg)

> [!NOTE]
>
> Kernel stack transfer is not included here because it will remain unchanged.

## Problem

The BPF and `tracehandler` portions of this pipeline are complex pieces of
code that have already cost multiple engineers a lot of time to reason about
whenever they needed to be changed.

The majority of the complexity arises from the fact that our BPF code
transfers traces and counts separately, but `tracehandler` needs to transfer
information from the trace over to the count updates that it sends out. The
count updates from BPF are essentially just `(BPFHash, Count)` tuples, but
`reporter` expects count updates to be annotated with thread-, container- and
pod name. BPF sends the information required to infer these values with the
trace instead of the count.

For `tracehandler` this means that it has to keep a cache of `(BPFHash,
TraceMeta)` mappings in order to make sure that it always has the required
information for the outgoing count updates. Moreover, it cannot make any
assumptions about the trace info already being present when corresponding
counts arrive, so it must support deferring count updates until the trace
info for them arrives.

BPF currently uses 5 different BPF maps with different lifetimes and
aggregation behaviors. Maintaining the cache from `BPFHash` to `TraceMeta`
further needs to be implemented very efficiently to not eat up too much
memory, so we're currently using a hand-rolled, domain-specific LRU
implementation for this. This all adds up to a non-trivial amount of
complexity.

Another problem with this approach is that all information that is present in
the trace must also be part of the BPF hash. This in turn means that every
high-cardinality field that we add to the trace diminishes the efficiency of
the aggregation that occurs in the BPF maps.
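To make the bookkeeping described above concrete, the following sketch shows
what `tracehandler` is forced into today; the types and names are simplified
illustrations and not the actual implementation:

```go
// Simplified sketch of today's tracehandler bookkeeping.
type traceMeta struct {
	Comm, PodName, ContainerName string
}

type traceHandler struct {
	// hashMapperCache remembers the meta information from trace messages so
	// that later count updates can be annotated with it.
	hashMapperCache map[host.TraceHash]traceMeta
	// deferredUpdates parks counts whose trace hasn't arrived yet.
	deferredUpdates map[host.TraceHash]uint64
}

func (h *traceHandler) handleCountUpdate(hash host.TraceHash, count uint64) {
	meta, ok := h.hashMapperCache[hash]
	if !ok {
		// The trace for this hash hasn't been seen yet: defer the count
		// update until the trace information arrives.
		h.deferredUpdates[hash] += count
		return
	}
	h.reportCount(hash, meta, count)
}
```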
### Aggregation is rare already

Testing with real-world applications has shown that it's pretty rare for an
entry in `hash_to_count` to ever have a count >1 within the 5-second interval
that we use for the BPF ↔ user-mode communication. Inserting into and
removing from a hash map is quite a bit more work than simply sending a perf
event that is mmap'ed into user-land. Particularly for traces with the
"symbolize this now" flag this is currently rather wasteful because we even
send an additional perf event to force the map to be read and cleared
immediately.

Trace resends via `hash_to_trace` are additionally suppressed via the
`known_traces` map that is tethered to `hashMapperCache` in `tracehandler`.
This suppresses the resend for ~45% of traces. We can monitor this rate for
all profiling agent deployments globally:

![known-trace-metrics](./known-trace-metrics.png)

While ~45% might seem like a significant chunk at first glance, any caching
scheme must always be evaluated in the context of:

- what the cost of just redoing the work all the time would be
- the number of cycles burned in vain on the cache lookup itself when cache
  misses occur
- the cost of maintaining the cache

The complex system with traces being split across multiple maps is in place
in large parts **just to support this caching scheme**, and without it we
could go with a much simpler approach (detailed in ["Proposed
Solution"](#proposed-solution)).

In this simpler scheme, the cost of just resending the whole trace is a
0.3 - 2.5 KiB `memcpy` and, for a percentage of traces, a syscall to fetch
the kernel portion of the trace. The cache lookup in the existing system
involves two passes through hashing functions[^2] that will in all likelihood
be slower than an actual `memcpy`[^3], though perhaps not slower than a
syscall.

With that context in mind, ~45% isn't really all that good anymore. In 55% of
cases we pay the price of the hashing **plus** the cost of sending the trace.

[^2]: First our custom hashing function, then putting the hash through
    whatever hash function the kernel uses for their hash maps.
[^3]: Even heavily SIMDed hash functions like XXH3 are typically [quite a bit
    slower] than a `memcpy`, and our custom hash function is very basic
    scalar code. Also see the ["Cost of memcpy" section](#cost-of-memcpy).

[quite a bit slower]: https://github.com/rurban/smhasher

### Upcoming features

There are two upcoming features that may increase the cardinality of the BPF
trace significantly. Increased cardinality in turn further reduces the chance
of two traces being aggregated under one BPF hash.

#### Nanosecond timestamps

Christos' planned change towards nanosecond precision timestamps will require
recording the timestamp in kernel-mode. Previously the timestamp was not part
of the BPF trace and was added once it arrived in user-mode. This worked
reasonably well for second precision timestamps, but will definitely not
transfer to nanosecond precision.

Assuming that we'd implement this by adding a timestamp field to the trace
(and thus to the hash), practically every BPF trace would be unique for all
but a synthetic load that puts all cores into an infinite loop.

For this particular problem, there's an alternative route where we'd switch
`hash_to_count` to a per-CPU `nanotime_to_hash` map and let user-space do the
aggregation/counting of events (in this case, hashes). This is what Christos
actually prototyped.
This requires fewer changes at the cost of adding more complexity to the
existing system.

#### APM integration

We're currently working on an integration between the APM and profiling
products. To allow associating profiling traces with APM traces, we need to
read the APM transaction-, span-, and trace IDs that were active while we
collected each trace.

The APM IDs are randomly generated, and the spans and their IDs in particular
typically only last for a few milliseconds. This is thus very similar to the
problem with the nanosecond timestamps: if we just include them in the BPF
trace hash, that will make it exceedingly unlikely that any aggregation takes
place.

Unlike the nanosecond precision timestamps, this problem will only affect
deployments that actually use the APM integration feature.

## Success Criteria

The primary goal is to unlock adding new high-cardinality fields to the trace
in the future without too much effort and without having to worry about
constantly deteriorating performance with every added field.

A secondary goal is to achieve the primary goal without adding even more
complexity. In fact, preferably we use the opportunity to shed some.

The changes further cannot put the host agent over the 1% CPU usage barrier.

## Scope

In scope:

- trace processing in BPF, tracer, and tracehandler

Not in scope:

- any breaking changes to the reporter or the network protocol
- changing the user mode hash (tackled in a separate document)
- experiments with per-frame (instead of per-trace) caching

## Proposed Solution

Remove BPF hashing and aggregation entirely and just resend the whole trace
in a perf event or BPF ring buffer for every sample that we unwind.

Because with this approach all information is directly present in the event,
`tracehandler` no longer needs to keep a mapping from BPF hashes to trace
meta information. Worries about whether traces or counts arrive first become
irrelevant. `tracehandler`'s job is essentially reduced to just keeping track
of which traces were already previously reported to the collector and
splitting counts and traces.

The caches that still need to be kept are reduced to:

- `bpfTraceCache: BPFHash → UMHash`
  - Suppresses unnecessary trace conversion
  - Roughly comparable to the previous `hashMapperCache` (and `known_traces`)
  - Pure performance optimization: not actually strictly required with this
    approach
- `umTraceCache: UMHash → Void`
  - Suppresses trace resends to the collector
  - Equivalent to the previous `traceCache`

The following diagram shows the portion of the trace aggregation pipeline
that was previously introduced in the ["Context"](#context) section, but with
the suggested changes applied.

![trace-pipe-reworked](./trace-pipe-reworked.drawio.svg)

### Kernel memory napkin math

Some basic math suggests that we'll actually need significantly less kernel
memory to keep the same number of traces in memory as with the current
approach.

The following uses the map sizes that the host agent dynamically allocated on
my 10-core VM. The buffers in both implementations should scale linearly with
the number of cores, so the concrete core counts do not impact the ratio
between both approaches.
| Map               | Key Size | Value Size | # Entries |
| ----------------- | -------- | ---------- | --------- |
| hash_to_framelist | 16       | 272        | 0x4000    |
| hash_to_trace     | 8        | 40         | 0x800     |
| hash_to_count     | 8        | 4          | 0x800     |
| known_traces      | 8        | 1          | 0x10000   |
```
(16+272)*0x4000
  + (8+40)*0x800
  + (8+4)*0x800
  + (8+1)*0x10000
  ≈ 5304 KiB
```

> [!NOTE]
>
> The above ignores the per-record memory overhead that the individual data
> structures have. For example, the hash map has to store the hashes
> somewhere, so the actual memory overhead will likely be a bit higher than
> calculated here.

With the proposed solution, we only have one map.

| Map          | Key Size | Value Size             | # Entries                          |
| ------------ | -------- | ---------------------- | ---------------------------------- |
| trace_events | 0        | `sizeof(Trace)` = 2216 | `ncpu * freq * max_buf_secs` = 200 |

```
2216 * 200 ≈ 432 KiB
```

The above selects the buffer size to allow storing traces of maximum length
with all cores maxed out for 1 second. I expect that we'll read the buffers
multiple times per second to ensure timely interpreter symbolization, so this
should be a reasonable amount of slack to overcome small system hiccups. We
could also easily ensure space for a 5-second period and still stay way below
the current memory footprint.

### Proof of Concept

During my OnWeek project in Oct 2023 I prototyped the proposed approach on
the [`joel/no-bpf-hashes`] branch and the results look quite promising. It
looks like we can achieve the same or better performance than the existing
approach while at the same time getting rid of a major piece of complexity.
The prototype also still works on our oldest supported kernel version (Linux
4.15).

The diff stats for the prototype currently look like this:

```
Showing 26 changed files with 283 additions and 3,179 deletions.
```

In the process, however, I also discovered some challenges that we'll discuss
in the following sections.

[`joel/no-bpf-hashes`]: #

## Detailed change proposals

### Abolish hash maps, embrace ring buffers

This is the core of this design document: I suggest that we stop hashing
traces in BPF and instead resend both the meta information and the frame list
for every event. This eliminates the need for the `hash_to_count`,
`hash_to_trace` and `known_traces` maps. The existing `report_events` buffer
that is currently used for both communicating PID events and forcing
immediate map reads in UM (via `ha_symbolization_needed`) will now only be
concerned with PID events. All that we need is one new perf event buffer
`trace_events` that all trace events are sent through:

``

Perf events are essentially per-CPU ring-buffers. The events don't have to be
copied into user-mode memory: the ring buffer is mmap'ed into the UM
process[^1].

Since we're currently only suppressing about 45% of trace resends anyway, we
don't expect the overhead of this to be massive. Hashing and splitting the
trace across multiple hash maps is definitely more expensive than just
sending a perf event. With the existing system we are paying the price of the
caching itself plus the price of resending the trace for the 55% of cases
where the cache is missed.

[^1]: [`perf_event_open`] manual page, CTRL+F for the "MMAP layout" headline

[`perf_event_open`]: https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html

I was originally worried that copying the events to the Go heap and the
corresponding GC pressure would be an issue, but testing with the prototype
shows that this was unwarranted: the corresponding code barely shows up in
the profiler (see the ["Cost of memcpy" section](#cost-of-memcpy)).
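For reference, consuming the proposed `trace_events` buffer from Go with the
Cilium library would look roughly like this (a sketch: buffer sizing, error
handling, and the `parseRawTrace` signature are simplified placeholders):

```go
// Sketch of draining the proposed trace_events perf event buffer.
func readTraceEvents(traceEvents *ebpf.Map, out chan<- *host.Trace) error {
	rd, err := perf.NewReader(traceEvents, 64*1024 /* per-CPU buffer size */)
	if err != nil {
		return err
	}
	defer rd.Close()

	for {
		rec, err := rd.Read() // epolls until data is available
		if err != nil {
			return err
		}
		if rec.LostSamples != 0 {
			// The ring overran because user-mode didn't read fast enough.
			continue
		}
		// Copy out of the mmap'ed ring into the Go heap and decode.
		out <- parseRawTrace(rec.RawSample)
	}
}
```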
We're also doing 45% more syscalls to retrieve the kernel half of our traces
with the new approach. Just like the `memcpy`, this also largely turned out
to be a non-issue: in the prototype we only spend about 0.92% of CPU cycles
on this ([pprof profile](./prototype-option-2-ghidra-workload.pprof)). This
is likely at least in part because only a fraction of traces actually have
kernel traces associated with them.

What did turn out to be an issue, however, is that we're now waking user-mode
a lot more frequently than we used to with the old approach. The old HA only
forces an immediate read in UM when:

- the trace is not previously known. This eliminates about ~45% of wakes.
- the unwinder sets `ha_symbolization_needed`. This eliminates 100% of wakes
  that aren't due to PID events in a pure native workload and a noticeable
  chunk in mixed workloads.

This essentially means that we're waking from our `epoll` call that waits for
perf events a lot more frequently than before. The kernel itself seems to be
pretty good at handling this kind of wake efficiently, but the Go runtime
isn't. With all other optimizations suggested in this document (minus the
ones in the following "options" sections) applied, this almost doubles the
CPU time that the host agent consumes in the average case. For interpreter
workloads the difference is not as significant. In the profile we're
observing a slight (but by no means deal-breaking) increase in time spent in
`epoll` and a massive increase in time spent in the goroutine scheduler.

There are a number of possible routes that we can pursue to reduce the
overhead from the constant waking. We'll discuss each one in the following
sub-sections.

> [!IMPORTANT]
>
> We recently learned that there is a bug with event inhibition that causes
> the "symbolize this now" events to not actually be delivered as intended.
>
> This likely explains why the previous approach was more performant even in
> heavy interpreter workloads despite the fact that the new approach should
> intuitively perform better here: constantly reading and clearing multiple
> maps really shouldn't be faster than reading events from shared memory and
> bumping atomics. Under a heavy interpreter load the amount of waking should
> be about the same between both variants if inhibition is released
> correctly.
>
> We did some experiments and confirmed that fixing the issue to restore the
> code's intended behavior indeed increases CPU usage significantly. However,
> we also noticed that interpreter symbolization clearly worked pretty well
> **despite this being broken for years now**, so we converged towards simply
> keeping the inhibition bug in place for the moment.

#### Option 1: perf event buffer with event counting watermarks

Perf event buffers support a feature called "watermarks". This essentially
allows us to specify a minimum number of either events or bytes that must be
in the buffer before `epoll` will wake our process. Cilium defaults this to
"1 byte". Setting up the watermark so that it only wakes us about 4 times per
second per core when all cores are maxed out eliminates the overhead
entirely.

However, there are some issues with this approach that we'd need to get
around:

1) **The frequency at which we read events becomes unpredictable.**\
   On a system that is almost completely idle, it might take minutes to ever
   reach the watermark.
   If a user opts out of fair scheduling by pinning their application to a
   fixed list of cores, the frequency will also vary from core to core.
2) **Cilium's BPF library only supports the byte-based watermarks.**\
   The length of our trace events can vary wildly depending on the workload,
   so a byte-size based watermark intensifies problem 1 even more: the
   reporting frequency will additionally depend on the stack trace length.

We can get around this by:

1) **teaching Cilium to use the watermark variant that counts events.**\
   I have already implemented this to make sure that the kernel actually
   accepts it with the BPF event kind. It's only a few lines of diff and
   works as expected. It should be easy to get this merged upstream if we
   want.
2) **sending empty dummy traces when we interrupt the idle process.**\
   It's easy to forget that we have that option because we generally tend to
   think of our BPF code as only being invoked when the CPU is under load,
   but that is not actually the case. We get called at a constant frequency
   regardless of the workload; we just immediately bail out when we [detect
   that we're in the idle process][idle-check]. We could change this code
   path to always send an empty dummy record and thus ensure that our
   watermark is reached in a predictable and timely manner.

[idle-check]: #

#### Option 2: polling the perf event buffer

The Cilium BPF library currently [always does an `epoll`][cilium-epoll] to
get notified about whether a watermark has been reached before attempting to
read from the ring. However, this is not actually required to safely read
from the buffer: it's also perfectly sound to just try to read and see
whether it contains data. Data access is synchronized with atomics. The Rust
BPF library, for example, supports this (and in fact offers it as the
default).

Cilium currently doesn't provide a function to do that, but we can quite
easily add one. All the error-prone logic for correctly reading and updating
the atomics is in place already; we mostly just need to call their existing
functions without doing an `epoll` first. I also implemented that variant in
the prototype and it works great. This in turn allows us to just read from
our perf event buffers at a constant frequency.

The drawback here is that we're paying a certain constant overhead for the
polling. However, in my experiments with a 250 ms polling interval, this is
essentially negligible. On the positive side, the HA CPU usage will be very
consistent regardless of the workload.

I measured this to be ever so slightly slower than the current regular host
agent on a completely idle system (~32 minutes wall time):

![idle-workload](./option2-idle-workload.png)

And a bit more efficient on a system under a heavy mixed HotSpot/native
workload (~5 minutes wall time):

![ghidra-workload](./option2-ghidra-workload.png)

`pf-host-agent` is the prototype HA, `pf-normal` is a HA built from the
latest `main` as of writing. Ignore the CPU percentage and look at the core
seconds under "TIME+". One additional core second in 32 minutes of wall time
in the idle case seems perfectly acceptable to me.

[cilium-epoll]: https://github.com/cilium/ebpf/blob/da24e832beede03ec73966c7b92748d370c1937d/perf/reader.go#L354C7-L354C7

#### Option 3: option 2 + 80% watermarks

We could extend option 2 by additionally setting a very high byte-based
watermark (e.g.
80% of the buffer size) and `epoll`ing on it to ensure that our buffer
doesn't run full while we are sleeping between timer ticks. This would
potentially allow us to use a much smaller perf event buffer while still
having relatively high confidence that it never runs over while we are
sleeping.

#### Option 4: BPF ring buffers

While perf event buffers have one ring buffer per core, BPF ring buffers are
shared between all cores. This has the obvious benefit that user-mode only
ever has to manually poll a single mmap'ed ring or `epoll` on a single file
descriptor (instead of `num_cpu` FDs). This is beneficial for both
approaches: the overhead is constant and doesn't scale linearly with the core
count as it does with perf event buffers.

BPF ring buffers further allow the BPF code to decide whether it wants to
wake user-land on each message sent via the `BPF_RB_NO_WAKEUP` and
`BPF_RB_FORCE_WAKEUP` flags. It can further query the buffer fill status via
`bpf_ringbuf_query`. This essentially allows us to implement whatever custom
waking scheme we want, and later switching or refining the approach would be
quite easy.

The drawback here is that BPF ring buffers were only added in kernel 5.8. I
ran a few queries on our HA metrics cluster and we still have about 3% of
customers running kernels that are older than that. It's further possible
that certain features (like waking and inspecting the buffer state from BPF)
were only added in even newer kernel versions. If there is reviewer
preference towards exploring this approach further, I'd be happy to research
this in more detail.

#### Option 5: option 2 or 4 + early wakes for interpreter frames

Options 2 and 4 can both optionally be extended to allow for early wake-ups
that ensure timely user-mode symbolization of interpreter frames. Examples of
factors that might request an early read could be (simply quoting Timo's
suggestions):

- if it's a long-living process, and we know the VM is unlikely to overwrite
  memory structures, we could just throttle to send this event once every few
  seconds per PID at maximum
- if it's a new process, or the VM is known to be potentially overwriting
  things, wake up for every frame as needed
- have a per-VM-type or per-{PID,framedata} LRU to throttle sending the
  wakeup

**With option 2** we'd simply have to leave the old "send an event to the PID
event perf buffer for early flush" mechanism in place. Instead of flushing
the old BPF hash maps it'd cause an early read of the perf event buffers.

**With option 4** we could set the flag argument on `bpf_ringbuf_output` to
`BPF_RB_FORCE_WAKEUP` to force a wake-up in user-mode.

#### Author's preference

My preference is option 2. Options 3 and 5 can be considered optional
improvements to option 2 that we may or may not apply later.

Option 1 seems like a bit of a hack that introduces mental complexity while
providing no upside at all compared to option 2.

Option 4 is also excellent and what I'd like to see us using in the future,
but I'm not sure whether we're willing to kick out 3% of customers. I suspect
that we'll discover that particularly the big corporate customers that are
waiting for on-prem might be running older kernels.

Starting with option 2 now and migrating to option 4 maybe a year down the
line should not present a huge amount of work, so this is my preference.

#### Decision

Reviewers agreed with the author's preference.
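To make the decision concrete, the polling loop from option 2 would look
roughly like the sketch below. `readAvailable` stands in for the hypothetical
non-blocking read function that we'd contribute to the Cilium library;
everything else is a simplified illustration as well:

```go
// Sketch of the constant-frequency polling approach from option 2.
func pollTraceEvents(rd *perf.Reader, out chan<- *host.Trace) {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for range ticker.C {
		// Drain whatever the ring currently holds without epoll'ing first.
		// readAvailable is the hypothetical non-blocking read we'd add to
		// Cilium; ok is false once the ring is empty.
		for {
			rec, ok := readAvailable(rd)
			if !ok {
				break
			}
			out <- parseRawTrace(rec.RawSample)
		}
	}
}
```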
### Replace `hashmapperlru` with `freelru`

The hand-rolled `hashmapperlru` mostly exists because we had to persistently
store various string fields (comm, pod name, container name) of the trace
that need to be transferred over to the count updates, while maintaining high
memory efficiency with string interning.

The remaining LRUs in `tracehandler` no longer need to store any strings or
other heap-allocated objects: both keys and values are either just hashes or
`libpf.Void`. Since we always just resend everything in the new approach,
there's also no chance that we'll ever have to add back any such field to the
LRU when we add more fields later.

Pod and container name are currently still part of the UM hash. They are not
sent from the kernel directly, but derived from the PID. Since the names
cannot change after process start, there is no need to worry about
invalidating caches.

We can thus retire our complicated custom LRU implementation and replace it
with the generic `freelru` without sacrificing memory efficiency.

**Validated in prototype:** yes\
**Independent issue:** no

### Frame list in `per_cpu_record`

During unwinding the frame list is currently chunked and temporarily stored
in the dedicated BPF map `per_cpu_frame_list`:

``

I honestly have no idea why we are doing this even in the current approach:
it adds quite a bit of complexity to both the "push" function and the trace
hashing code. As far as I can tell there doesn't seem to be any technical
limitation that would require this chunking, even on old kernels.

I suggest that we simply add the frame list to our existing [`Trace`] struct,
in struct-of-arrays columnar representation:

```c
typedef struct Trace {
  // [...] (existing meta-data fields omitted)

  struct {
    FileID file_ids[MAX_FRAME_UNWINDS];
    u64 addr_or_lines[MAX_FRAME_UNWINDS];
    u8 kinds[MAX_FRAME_UNWINDS];
  } frames;
} Trace;
```

The columnar representation is chosen to reduce alignment padding. The
alternative of an `__attribute__((packed))` `Frame` type in array-of-structs
representation was considered, but turned out not to be supported by CGo: it
ignores the attribute and ends up accessing memory incorrectly.

The `Trace` instance lives in [`PerCPURecord`] and was already being used to
store the trace meta-information. The proposed change thus simply
consolidates the two maps into one.

**Validated in prototype:** yes\
**Independent issue:** yes, but simpler to do in the main PR

[`Trace`]: #
[`PerCPURecord`]: #

### Move container- and pod name lookup into `tracehandler`

The lookup for container- and pod names currently happens in
`ProcessManager.ConvertTrace`:

``

The BPF trace only contains the PID, not the actual container- and pod name.
To have both of these available when sending out the count, we thus need to
move the lookup to `tracehandler`. `tracehandler` can then pass the names
that it looked up to `ConvertTrace` for new traces and skip trace conversion
for known ones.

**Validated in prototype:** yes\
**Independent issue:** no

### Extended container- and pod name caching

Moving the call to `tracehandler` means that we'll now be doing these lookups
more often than before. Previously we'd only call into `containermetadata`
when a trace wasn't present in `hashMapperCache` and needed conversion; now
we'll be doing it for every trace.
`containermetadata` has caching in place already:

``

But that cache is only hit after looking up the file identifier. The
`Sprintf` and the `stat` syscall originating in
`libpf.GetOnDiskFileIdentifier` eat a noticeable amount of CPU time.

It wasn't immediately obvious to me why this indirection via the file
identifier is taken instead of simply using the PID as the key, but the
[comments on the PR that added the cache][pid-reuse] shed light on that: it's
meant to provide resilience against PID reuse.

We essentially have three options here:

1) Accept a slight increase in cycles burned: about 3-5%.
2) Add a second, short-lived `PID → (PodName, ContainerName)` cache in
   `tracehandler`. This is the route that I pursued in the prototype. A
   256-entry cache with a lifetime of 1 minute did the job.
3) Reduce the resilience against PID reuse and change the key of the existing
   cache to be PID-based, but add a timeout on the LRU.

**Author's preference:** I'd like to give option 3 a try: I suspect that it
should do the job nicely without the added complexity of a new layer of
caching. No strong opinion on this, though.

**Validated in prototype:** yes (options 1 and 2)\
**Independent issue:** yes

[pid-reuse]: #

#### Improvements on newer kernels

Kernel versions v4.18+ actually have a function `bpf_get_current_cgroup_id`
that we could use to obtain a unique ID that is guaranteed not to be re-used:

https://github.com/torvalds/linux/blob/9b6de136b5f0158c60844f85286a593cb70fb364/include/linux/kernfs.h#L217-L221

I believe the inode ID corresponds to the directory node in `/sys/fs/cgroup`.
To my knowledge there is no way to open a file descriptor from an inode ID,
so we'll still have to keep our existing PID-based logic to read the
container info. The ID could only serve as a better cache key.

### BPF trace hashing in UM

#### Hash inputs

We're no longer hashing traces within BPF, but we still need to construct a
hash for the LRU that suppresses repeated trace conversion. However, if we
continue to feed all fields of the BPF trace into it, we'll suffer the same
diminishing returns of caching efficiency with every high-cardinality field
that we add, as discussed in the [Problem](#problem) section.

Since the purpose of the hash is now only to avoid repeatedly doing user-mode
symbolization for the same trace (and no longer to keep a unique mapping to
other existing and upcoming meta-fields like the NS timestamp or APM info),
we can reduce the hash input to just the frame list, COMM, kernel stack ID,
and PID. The PID still needs to be included because the frame list can
contain per-PID identifiers.

**Validated in prototype:** no\
**Independent issue:** no (unless we start without a trace translation cache)

#### Hash algorithm

BPF is currently using a hand-rolled, custom hashing scheme. The scheme was
chosen primarily for simplicity and is likely neither the fastest nor the
most collision-resistant function imaginable. We already have XXH3 as a
dependency anyway, so I suggest that we just use that. It's very fast and
provides a pretty reasonable output distribution.

**Validated in prototype:** yes\
**Independent issue:** no (unless we start without a trace translation cache)

### Row-based trace record in UM

We're currently using a columnar (struct-of-arrays) representation for the
frames in `host.Trace`:

``

As far as I can tell, there's no upside to this at all.
Allocating space for a trace will in fact take 4 allocs, whereas with a
row-based (array-of-structs) implementation we'd only need 2. Initializing
and accessing it is more cumbersome and requires more bounds checks.

I suggest that we switch the frames to array-of-structs representation.

This suggestion is largely independent of the goals in this document, except
that with the new approach we're allocating more `host.Trace` instances:
that's why I'm mentioning it here.

**Validated in prototype:** yes\
**Independent issue:** yes

## Alternatives considered

### Perf buffer with just `(hash, meta_field_a, ...)`

This would keep most of the existing maps in place and only replace
`hash_to_count` with a perf event buffer that transfers e.g. `(hash,
nano_timestamp, apm_span_id, ...)` tuples. We'd continue hashing the
low-cardinality portion of the trace and send the high-cardinality fields
with the trace events. The intuition behind this approach is that we could
avoid unnecessary `memcpy`s at the cost of keeping most of the complexity.

The idea was abandoned because we realized that the price of the `memcpy` is
more or less negligible (see the ["Cost of `memcpy`"](#cost-of-memcpy)
section), so paying the extra complexity and maintenance effort of the more
complex solution seemed like a bad trade-off. This variant would also have
required dealing with the same complications as the [proposed
solution](#proposed-solution) on top of that.

## Misc remarks

### Cost of `memcpy`

Since the new approach resends the whole trace all the time, intuitively we
tend to assume that it must come with a hefty amount of CPU cycles spent in
`memcpy`. The wish to avoid `memcpy` was brought up by multiple reviewers,
and I myself also assumed this to be a bottleneck when starting this
experiment.

Fortunately we can remedy these worries quite easily by profiling the
prototype host agent. The following profile is the result of a heavy mixed
HotSpot/native workload, letting Ghidra analyze the entire `devfiler`
executable (29 MiB):

![memcpy-cost](./memcpy-cost.png)

The [`parseRawTrace`] function is responsible for hashing, allocating and
copying the BPF trace into the Go heap. We can see that the whole function
only accounts for roughly 1.61% of execution time. 0.23% of that is spent in
XXH3 hashing and 1.15% in memory allocation.

Funnily (or sadly) enough, the internal allocs in the timer that is used for
polling the perf buffer are actually about as expensive as copying and
hashing the whole trace.
[`parseRawTrace`]: #
[original location]: #

diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png
new file mode 100644
index 00000000..5d5b16d5
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png
new file mode 100644
index 00000000..9a04aa09
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png
new file mode 100644
index 00000000..798c8b90
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png
new file mode 100644
index 00000000..cc3d3b40
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg
new file mode 100644
index 00000000..64aa54b6
--- /dev/null
+++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg
@@ -0,0 +1,4 @@
[drawio SVG: text labels of the reworked trace pipeline diagram. KERNEL: BPF UNWINDER — kernel interrupts process → BPF unwinds stack → store trace in per_cpu_records (BPF map) → tail call → unwind_stop sends trace → event with full trace → trace_events (perf event buffer) → resume process. UM: TRACER — receive trace events → traceOutChan (Go channel). UM: TRACE HANDLER — receive BPF trace → "BPF trace known by XXH3 hash?" via bpfTraceCache (LRU[BPFHash]UMHash); if not: symbolize interpreter frames, calculate UM hash → "UM hash known?" via umTraceCache (LRU[UMHash]Void); if not: report & cache trace; add container and pod name for PID → counts → ReportCountForTrace, frames + frame metadata → ReportFramesForTrace. UM: REPORTER — FrameMetadata (out of document scope).]
\ No newline at end of file diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg new file mode 100644 index 00000000..7803dde4 --- /dev/null +++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg @@ -0,0 +1,4 @@ + + + +
[drawio diagram: current trace pipeline. Swim lanes: KERNEL: BPF UNWINDER →
UM: TRACER → UM: TRACE HANDLER → UM: REPORTER. The kernel interrupts a process;
BPF unwinds the stack and unwind_stop (reached via tail call) hashes the trace.
If the hash is not in known_traces, the trace meta and frame list are persisted
into the hash_to_trace and hash_to_framelist BPF hash maps, and the count is
inserted or incremented in hash_to_count. If the trace contains interpreter
frames, a perf event on report_events forces user mode to read the maps
immediately; otherwise the process simply resumes. The tracer reads and clears
the BPF maps (alongside per_cpu_records and per_cpu_frame_list) on a timer,
receives and splits the trace and count update batches, and forwards them via
the traceCountOutChan Go channel. The trace handler categorizes count updates
into counts for known vs. unknown traces, with deferredUpdates
(map[BPFHash]uint64) holding previously deferred count updates; it maps BPF
hashes to UM hash, thread-, pod- and container-name via hashMapperCache
(LRU[BPFHash]TraceMeta) and traceCache (a custom LRU implementation, synced
with the BPF map). For each raw trace it symbolizes interpreter frames,
calculates the UM hash, and reports and caches unknown traces
(ReportFramesForTrace; for interpreter frames, frame meta is sent via
FrameMetadata). Counts are enriched with meta-fields and container/pod name per
PID and sent via ReportCountForTrace. Reporter internals are out of document
scope.]
\ No newline at end of file diff --git a/design-docs/00000-example-docs/README.md b/design-docs/00000-example-docs/README.md new file mode 100644 index 00000000..e29a0808 --- /dev/null +++ b/design-docs/00000-example-docs/README.md @@ -0,0 +1,3 @@

This directory contains two old design documents from before the agent was
donated to OTel and open-sourced in the process. The documents are provided
primarily as examples of how a design document can be structured.

diff --git a/design-docs/00000-templates/long-form-design-doc-template.md b/design-docs/00000-templates/long-form-design-doc-template.md new file mode 100644 index 00000000..41d3e6ca --- /dev/null +++ b/design-docs/00000-templates/long-form-design-doc-template.md @@ -0,0 +1,163 @@

Long-Form Design Doc Template
=============================

> [!NOTE]
>
> This template is intended for use when a long-form design document is
> necessary. A document using this template should provide sufficient
> high-level context to understand the problem as well as detailed proposals
> with supporting data.

# Meta

- **Author(s)**: Name of the author(s)
- **Start Date**: Start date of the design doc
- **Goal End Date**: Intended end date for the design process
- **Primary Reviewers**: A list of at least two primary reviewers

# Abstract

A reasonably concise abstract so that a reader can get an idea of what
the document is about.

# Introduction

A good introduction to the problem that we are trying to solve. This
introduction should not assume much background knowledge from the reader, but
it also shouldn't re-hash everything from the ground up; whenever external
documentation can be referenced, prefer that over re-hashing things. The
introduction should have the following sections:

## Context

Present ideas, dependencies, systems and general context referred to in
the doc.

### Constraints

Document any relevant constraints that any/all solutions must adhere to.
Try to be as explicit as possible here, as constraints that are obvious
to you may be less so to another reader.

As an example, when working on a task that relates to the eBPF tracers
there are a number of constraints that come into play depending on the
kernel version one wants to support. In this case it would be
appropriate to list each of these constraints and the kernel versions
they are relevant to.

### Related (Sub-)Problems

If there is a set of sub-problems that need to be explained, and
possibly solved, as part of one or more of your solutions, it can make
sense to address them up-front and independently. Include solutions to
such problems here as well, if appropriate.

## Problem Statement

What the problem is, why it exists, and why we need to solve it.

**For particularly significant tasks where it is important to solicit
feedback as early as possible, it often makes sense to write this section
and the success criteria first. This allows feedback to be solicited
before writing the rest of the document, ensuring everyone is on the
same page before a major time investment.**

## Success Criteria

By far the most common pitfall in design docs is not clearly defining what
"success" means. This can lead the author to meander into less-important
areas, or to ignore an important aspect completely. It can also lead the
reviewers onto a completely different line of thought, orthogonal to the doc.
This section is more or less what defines the outline and sets the
stage for the sections/subsections of the document. It's a reality
check that allows the authors and the reviewers to agree on the basis
for the design and on how it should be reviewed. It should help make
sure everybody is focused on the same thing.

Usually this should be presented as a concise series of bullet points.

## Scope

This expands on the "success criteria", but helps to clear up possible
confusion about what the design doc is and is not about. It's usually
extremely short, just a few bullet points. The most useful info here is
usually "XYZ is not in scope", as it avoids ambiguity and ensures we
don't make incorrect assumptions when reading the doc.

# Proposed Solution(s)

Ideally, there should be more than one proposed solution. The document
should list the various solutions and then go into some depth about
their drawbacks and advantages.

Something to watch out for here is a situation where a problem has N
sub-problems, each with their own M alternative solutions. In that case
it is usually unnecessary, and unreadable, to enumerate all possible
$N \cdot M$ permutations. Assuming it is feasible, outline the possible
solutions to each sub-problem separately, and present a very limited
number of combined solutions, if any.

## Author's Preferred Solution

After presenting the merits of each solution, you should ideally give a
hint as to what your preferred option is, and why. This achieves two
goals:

- It ensures solutions are evaluated against each other, despite
  having pros and cons that are not easily comparable. Discussing this
  can provide helpful context to the reader.
- Writing a justification is a good reality check, ensuring the design
  doc achieves its goal and provides enough information to make a
  decision.

In case you do not have a preferred solution, you can still help the
reader by describing an appropriate thought process for comparing the
solutions:

- Which pros/cons are more important than others?
- Which features or constraints are more important?

We have a finite time/energy budget for assessing designs, and providing
this information will help you and the reviewers to prioritize.

# Testing Strategy

Document here how the solution itself will be tested, as well as how the
solution may impact the testability of other components that make use of
it.

The testability of a component, as well as its impact on the testability
of other components that make use of it, are important factors to weigh
when choosing between alternate solutions. If your proposed solutions
differ in how testable they are, it may make sense to lift this section
into the proposed solutions and repeat it for each.

## Testing of Proposed Solution Itself

Outline how the proposed solution will be unit tested and integration
tested. For many implementations this is straightforward and obvious,
but not always. If there are likely to be any difficulties in testing
the solution, outline them here, along with possible remedies, if any.

## Impact on Testing of Other Systems/Components

Discuss whether or not the proposed solution will impact the testability
of systems that make use of it, and how. For example, if the solution
involves producing a component that would be difficult to mock out for
the purposes of testing something that makes use of it, explain that
here.
# Plan to Acquire Missing Data (Optional)

Often, some of the data needed to properly evaluate the advantages and
disadvantages is missing; this section details what data needs to be
obtained and how.

# Decision

Once all the data is available, add a "Decision" section that details
which solution was decided on.

diff --git a/design-docs/00000-templates/short-form-design-doc-template.md b/design-docs/00000-templates/short-form-design-doc-template.md new file mode 100644 index 00000000..6f4f576e --- /dev/null +++ b/design-docs/00000-templates/short-form-design-doc-template.md @@ -0,0 +1,50 @@

Short-Form Design Doc Template
==============================

> [!NOTE]
>
> This template is intended to propose problems and solutions for which a
> fully fledged design document is likely not necessary. A document made from
> this template will have less high-level context and detail than a long-form
> document.

# Meta

- **Author(s)**: Name of the author(s)
- **Start Date**: Start date of the design doc
- **Goal End Date**: Intended end date for the design process
- **Primary Reviewers**: A list of at least two primary reviewers

# Problem

Clearly describe the problem.

# Success Criteria

By far the most common pitfall in design docs is not clearly defining what
"success" means. This can lead the author to meander into less-important
areas, or to ignore an important aspect completely. It can also lead the
reviewers onto a completely different line of thought, orthogonal to the doc.

This section is more or less what defines the outline and sets the
stage for the sections/subsections of the document. It's a reality
check that allows the authors and the reviewers to agree on the basis
for the design and on how it should be reviewed. It should help make
sure everybody is focused on the same thing.

Usually this should be presented as a concise series of bullet points.

## Scope

This expands on the "success criteria", but helps to clear up possible
confusion about what the design doc is and is not about. It's usually
extremely short, just a few bullet points. The most useful info here is
usually "XYZ is not in scope", as it avoids ambiguity and ensures we
don't make incorrect assumptions when reading the doc.

# Proposed Solution

A proposed solution for the problem. Even though this is a short-form
design document, you must still provide enough information to allow the
reviewers to understand and assess what you are proposing.

diff --git a/design-docs/README.md b/design-docs/README.md new file mode 100644 index 00000000..1b47661c --- /dev/null +++ b/design-docs/README.md @@ -0,0 +1,58 @@

Profiling agent design documents
================================

This directory contains design documents that discuss major additions or
changes to the profiling agent's architecture.

### When to write a design document?

Invoking the design document process can be beneficial if

- you are proposing changes that have significant impact on multiple
  sub-components of the profiling agent.
- you are proposing a major addition and want to make sure that it will
  actually be accepted upstream the way that you plan to implement it, without
  writing the code first (and thus risking the need for major reworks later).
- your proposed changes require a significant amount of context for reviewers
  to understand.
  In these cases, a short design document can help to clarify the current
  state and the state that you'd like to move towards.
- you'd like to incrementally apply reworks over the course of multiple
  PRs, to provide extra context for reviewers to understand what the end
  goal is supposed to look like. In simpler cases this can also be achieved
  with a tracking issue.

The above are guidelines: there is no hard rule on when a design document is
necessary. When in doubt, please feel free to create an issue and quickly
outline what you want to do in order to clarify whether a document is needed!

### Creating a New Document

- Create a new git branch forking from the latest `main`
- Create a directory in the `design-docs` folder
  - The directory name should follow a format like `00000-my-doc-title`
  - The intention behind creating a directory per document is to allow bundling
    media (images, diagrams, drawings) with the document
  - The 5-digit number is included to make it easy to spot recent documents and
    to order the documents by recency when viewed in the GitHub UI
  - When initially writing the document, the ID is set to `00000`
- Copy one of the design document templates from the
  [`00000-templates`](./00000-templates) directory into the newly created
  directory, naming it `README.md`
- Write the design document by following the instructions in the template
- Once it reaches draft quality, create a draft PR
  - Add the `design-document` label
  - Include a link to the rendered document. GitHub hides the ability to
    properly view rendered documents behind multiple clicks, so adding a direct
    link spares others some work. After creating the PR, this link can be
    obtained by clicking "Files changed", searching for the `README.md` file,
    clicking the three dots on the right, selecting "View file", and then
    switching the current view to track your branch via the drop-down to the
    left of the file path. The last step ensures that the link automatically
    tracks updates when you push fixup commits later.
- Rename the directory according to the allocated PR ID, e.g. if the allocated
  PR ID is 1234, then `00000-my-doc-title` becomes `01234-my-doc-title`
- Once you're happy with the state of the document:
  - Mark the PR as ready for review and tag at least 2 maintainers for review
  - Additional people who should be aware of the document but whose review
    isn't mandatory can be notified by mentioning their username in the PR's
    description as `CC @username`