diff --git a/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md b/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md
new file mode 100644
index 00000000..8f63f4e9
--- /dev/null
+++ b/design-docs/00000-example-docs/02612-process-manager-fileid-pid-split/README.md
@@ -0,0 +1,475 @@
+> [!IMPORTANT]
+>
+> This document is from Elastic times and precedes the open-sourcing of the
+> profiling agent. It is provided as an example of what a design document
+> could look like. Many of the links had to be removed because the original
+> git history of the profiling agent had to be erased for legal reasons when
+> it was open-sourced.
+
+Process Manager: Move Per-FileID Info Into Dedicated Manager
+============================================================
+
+## Meta
+
+- **Author(s)**: Joel Höner
+- **Start Date**: Jul 13, 2022
+- **Goal End Date**: Jul 29, 2022
+- **Primary Reviewers**: Christos Kalkanis, Florian Lehner
+
+## Problem
+
+The process manager in the host agent mixes up multiple related but still
+distinct concerns in a single type. It manages both information that is kept per
+process ID (PID) and information that is kept per file ID. The entanglement
+makes the consistency expectations between the different pieces of information
+being stored harder to fathom. The per-file-ID information being stored
+additionally has underdocumented and inconsistent lifetimes, with part of the
+information being stored in append-only maps (inserted and kept forever), while
+other information is reference-counted. The reference counting for the latter is
+mixed in with the other business logic of the process manager without any form
+of abstraction.
+
+## Success Criteria
+
+- Clear separation of per-PID and per-file-ID information in two separate
+  managers
+- Clear and consistent lifetime for all per-file-ID information being managed
+- Clearly communicate mutability/constness of the data being stored
+- New manager must not call back into the PM that owns it
+  - We have too many circular references already
+
+## Scope
+
+- This document is concerned only with first splitting the per-file-ID
+  information storage and management out from the process manager and improving
+  on how it is managed
+- Improvements on any of the per-PID information within the process manager are
+  not in scope
+
+## Context
+
+In the following sections we'll start by taking inventory of the per-file-ID
+information that is currently being stored in the process manager, its use,
+lifetime and mutability.
+
+### The `interpreterData` map
+
+The [`interpreterData` map][interpreter-data-field] serves as a cache of
+per-executable interpreter data ([`interpreter.Data`][interpreter-data-struct]).
+
+[interpreter-data-field]: #
+[interpreter-data-struct]: #
+
+When a new mapping is reported, the map is used to quickly determine whether we
+have already previously concluded that the `FileID` associated with the new
+mapping is definitely not an interpreter [by checking for a `nil` record in
+`interpreterData`][interpreter-data-check]. If this is not the case (either no
+entry exists or the entry is non-`nil`), the [`pm.interpreterLoader` array] is
+[walked][array-walking], asking each interpreter unwinder to attempt to detect
+whether it wants to handle the file. If an interpreter unwinder detects the
+file as being supported by it, we store the corresponding helper information
+that the interpreter extracted in the `interpreterData` map.
+If the search for a suitable interpreter unwinder is exhausted unsuccessfully,
+we instead store `nil`.
+
+[interpreter-data-check]: #
+[`pm.interpreterLoader` array]: #
+[array-walking]: #
+
+The map is written in `attachInterpreter`, which is reached via the following
+call-chains:
+
+- [`ProcNewMapping`](#) → [`handleNewInterpreter`](#) → [`attachInterpreter`](#)
+- [`ProcNewMapping`](#) → [`handleNewInterpreter`](#) → go [`handleNewInterpreterRetry`](#) → [`attachInterpreter`](#)
+
+The map is read in:
+
+- [`attachInterpreter`](#)
+- [`handleNewInterpreter`](#)
+
+Records are never deleted from this map, nor are records that were inserted ever
+mutated.
+
+### The `fileIDInfo` map
+
+The [`fileIDInfo` map] is harder to reason about. Its main purpose is to store
+the information required to delete all information for a given FileID from the
+two BPF maps `stack_delta_page_to_info` and `exe_id_to_%d_stack_deltas`.
+
+[`fileIDInfo` map]: #
+
+It further stores information about gaps in the unwinding information of
+executables, which is passed to [`interpreter.New`]. The information is currently
+only consumed by the V8 unwinder, which uses it to find the pre-compiled JIT
+block within the binary.
+
+[`interpreter.New`]: #
+
+Entries in the map are managed via reference counting: incrementing the RC is
+performed inline in the business logic of the process manager ([link
+1][rc-inc-1], [link 2][rc-inc-2]) while decrementing the RC is performed in the
+[`decreaseReferenceCounter`][rc-dec] function.
+
+[rc-inc-1]: #
+[rc-inc-2]: #
+[rc-dec]: #
+
+Entries are inserted into the `fileIDInfo` map in either
+[`ProcNewMapping`][file-id-info-insert-1] or
+[`AddSynthIntervalData`][file-id-info-insert-2]. In `ProcNewMapping`, if a
+record for a file ID already existed previously, we instead just increase the
+RC. `AddSynthIntervalData`, on the other hand, errors out when passed a file ID
+that is already known. [`ProcNewMapping`][file-id-info-insert-1] is the typical
+code-path taken for the majority of executables whereas
+[`AddSynthIntervalData`][file-id-info-insert-2] is more of an edge-case; it
+provides the ability to override unwinding information for executables with
+broken `.eh_frame` sections. It's currently only used for `[vdso]` handling on
+aarch64.
+
+[file-id-info-insert-1]: #
+[file-id-info-insert-2]: #
+
+The value to be stored in the map – for both code paths – is returned from the
+[`loadDeltas`][load-deltas-return] function. [`loadDeltas`][load-deltas-return]
+loads unwinding information into the `stack_delta_page_to_info` and
+`exe_id_to_%d_stack_deltas` BPF maps and then returns a newly created instance
+of the [`fileIDInfo` struct][file-id-info-struct].
+
+[load-deltas-return]: #
+[file-id-info-struct]: #
+
+The reference counter can be decreased via the following code-paths:
+
+- [`ProcExit`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
+- [`ProcMunmap`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
+
+In `deletePIDAddress`, once the RC reaches 0, the information from the
+map is used to remove entries corresponding to the FileID from the BPF
+maps ([link 1](#), [link 2](#)) and then also the entry itself from the
+`fileIDInfo` map.
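+
+In simplified Go pseudo-code, and with hypothetical helper names, the deletion
+path behaves roughly like this:
+
+```go
+// fileIDInfo holds pointers, so the decrement below is visible in the map.
+func (pm *ProcessManager) decreaseReferenceCounter(fileID host.FileID) {
+    info := pm.fileIDInfo[fileID]
+    info.refCount--
+    if info.refCount == 0 {
+        // Remove the stack delta pages and the exe_id_to_%d_stack_deltas
+        // entries that this record keeps track of, then drop the record.
+        pm.deleteStackDeltaRanges(info)
+        delete(pm.fileIDInfo, fileID)
+    }
+}
+```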
+
+The map is read using the following code-paths:
+
+- [`New`](#) → [`collectInterpreterMetrics`](#) → anonymous spawned goroutine
+- [`ProcExit`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
+- [`ProcMunmap`](#) → [`deletePIDAddress`](#) → [`decreaseReferenceCounter`](#)
+- [`ProcNewMapping`](#)
+
+With the exception of the reference counter field, entries in this map
+are never mutated after insertion.
+
+### The `unwindInfoIndex` map
+
+[The `unwindInfoIndex` map][unwind-info-index] and the `unwind_info_array` BPF
+map that it is tethered with provide deduplication of
+[`UnwindInfo`][unwind-info-struct] records in the `exe_id_to_%d_stack_deltas`
+BPF map. The map in Go assigns each unique `UnwindInfo` the `uint16` index of
+the corresponding information in the BPF map. It is both read and written
+exclusively from [`getUnwindInfoIndex`][get-unwind-info-func]. Entries are only
+appended and then never deleted.
+
+[unwind-info-index]: #
+[unwind-info-struct]: #
+[get-unwind-info-func]: #
+
+Call-chains:
+
+- [`ProcNewMapping`](#) → [`loadDeltas`](#) → [`getUnwindInfoIndex`](#)
+- [`AddSynthIntervalData`](#) → [`loadDeltas`](#) → [`getUnwindInfoIndex`](#)
+
+Because the map is exclusively used as a helper for deduplicating data
+in the per-file-ID maps, I suggest moving it into the new manager as well.
+
+### The `stackdeltaprovider` field
+
+The [`stackdeltaprovider` field][stackdeltaprovider-field] is used exclusively
+to [construct the data that is then stored in the `fileIDInfo`
+map][construct-data-for-fileidinfo] and the associated BPF maps. Because the SDP
+is concerned only with per-file-ID information, we should move this field to the
+new manager as well.
+
+[stackdeltaprovider-field]: #
+[construct-data-for-fileidinfo]: #
+
+### The `interpreterLoaders` field
+
+The `interpreterLoaders` field is used to [populate the data][populate-data-1]
+in the `interpreterData` map if it doesn't exist yet. The field and [the
+constructor logic for populating it][populate-data-2] should be moved to the new
+manager.
+
+[populate-data-1]: #
+[populate-data-2]: #
+
+### The `FileIDMapper` field
+
+`FileIDMapper` is a public field of the process manager. It provides
+functionality for setting and getting the mapping from user-mode to
+kernel-mode file IDs and is [implemented as an LRU](#).
+The implementation is swappable to simplify testing.
+
+Entries are read in:
+
+- [`ProcessManager.ConvertTrace`](#)
+- [`ebpf.reportNewMapping`](#)
+
+Entries are added in:
+
+- [`ebpf.getKernelStack`](#)
+- [`elfinfo.getELFInfoForFile`](#)
+
+Entries are never explicitly deleted. Instead, they are
+garbage-collected by falling out of the LRU. It is unclear what the
+intended behavior is when binaries that are still loaded in a process
+fall out of the LRU. Converting traces for such binaries will no longer
+be possible.
+
+The field stores per-file-ID information and should probably be moved as
+well. However, it being public, swappable for testing, and mutated remotely
+complicates this considerably. The actual implementation of the mapper
+is already abstracted behind a dedicated type and doesn't introduce a
+lot of complexity into the PM code itself, so I suggest leaving it where
+it is for the moment. Consolidating it into the new manager might be
+desirable at a later point in time but is declared out of scope for this
+document.
+
+## Proposed Solution
+
+### New Manager Design
+
+This section makes concrete suggestions for the design of the new manager.
+ +#### Lifetime Management + +The two main maps `interpreterData` and `fileIDInfo` discussed in the previous +chapter currently have diverging lifetimes: one is append-only whereas the other +one uses reference counting. In order to merge the two and to make the lifetime +of per-file-ID information easier to reason about, it is desirable to move them +towards a common lifetime. There are multiple possible strategies to pursue +here. + +##### All Append-Only + +In this strategy we merge both maps and manage them according to an append-only +lifetime: entries that are added once will never be released. + +While this strategy worked reasonably well for `interpreterData`, it is doubtful +that it is desirable to also apply it to `fileIDInfo` and the connected BPF +maps: the amount of memory required to store it is significantly higher. On +systems that build and execute a lot of executables on the fly (e.g. wasmtime +runtime with the dylib loader) we'd quickly run out of memory. + +##### All Reference-Counted + +In this variant, we merge the maps and manage both of them via +reference-counting. While it entails a bit more complexity in the new manager's +internals, it ensures that no memory is being leaked regardless of the workload. + +For `fileIDInfo` the lifetime remains unchanged. For the `interpreterData` +information the logic changes slightly when compared to the previous append-only +approach: the per-executable interpreter information is lost once the last +interpreter running it exits. In a workload that spawns a lot of very +short-lived interpreter processes in sequence, this might increase the amount of +cycles spent on re-gathering the interpreter data. + +As an optional extension to combat this, we could refrain from removing objects +whose RC reaches 0 immediately. Instead, we store a timestamp of when the RC +reached 0. A goroutine then periodically cleans all entries that have been at RC += 0 for more than N seconds. Lookup and deletion methods will pretend that +objects with RC = 0 do not exist whereas insertion methods will pick up the +dying object and revive it by incrementing the RC. + +It is unclear whether the overhead introduced by these additional +`interpreter.Loader` calls is substantial. `fileIDInfo` has the same lifetime +management already and – from intuition – would appear to be more expensive to +re-create than the interpreter data: it requires reading stack-deltas from disk, +decompressing them and inserting them into the BPF maps. + +We could also consider storing the per-executable interpreter information on +disk as well. + +##### Keep Maps and Lifetime Separate + +This is the simplest variant to implement, but it also reduces the opportunities +for internal simplification significantly. If we keep their lifetime separate, +we'll have to add accessors to read from and write to each respective map +individually. + +##### Author's Preference + +My preference is managing all information with reference counting. I suggest +going without the delayed deletion approach first to keep things simple. If this +turns out to be a performance problem in practice, it should be reasonably easy +to extend the system as required. + +##### Decisions + +The reviewers agreed with the author's preference. + +#### Public Interface + +```go +type ExecutableInfo struct { + // Data holds per-executable interpreter information if the file ID that this + // instance belongs to was previously identified as an interpreter. Otherwise, + // this field is nil. 
Data interpreter.Data
+    // Gaps contains information about large gaps in the unwinding information.
+    Gaps []libpf.Range
+    // [elided: various private fields required to keep track of RC and BPF map entries]
+}
+
+type executableInfoManagerState struct {
+    info map[FileID]ExecutableInfo
+    unwindInfoIndex map[sdtypes.UnwindInfo]uint16
+    sdp StackDeltaProvider
+    ebpf EbpfHandler
+    opener interpreter.ResourceOpener
+    numStackDeltaMapPages uint64
+}
+
+type ExecutableInfoManager struct {
+    state xsync.RWMutex[executableInfoManagerState]
+}
+
+// Alias to keep the definitions below short. Not to be included in the actual code.
+type EIM = ExecutableInfoManager;
+
+// New creates a new instance of the executable info manager.
+func New(sdp StackDeltaProvider, ebpf EbpfHandler, opener interpreter.ResourceOpener) *EIM;
+
+// AddOrIncRef either adds information about an executable to the internal cache (when first
+// encountering it) or increments the reference count if the executable is already known.
+// The memory mapping and PID are only borrowed in order to allow interpreter loaders to read
+// memory (e.g. to determine the version or to resolve other per-executable information).
+func (mgr *EIM) AddOrIncRef(mapping *Mapping, pid util.PID) (*ExecutableInfo, error);
+
+// AddSynth adds synthetic interval data to the manager. A call to this function must
+// precede all `AddOrIncRef` calls for the same file ID, otherwise an error is returned.
+func (mgr *EIM) AddSynth(fileID host.FileID, data sdtypes.IntervalData) error;
+
+// Get retrieves information about an executable. If the executable isn't known, nil is
+// returned instead.
+func (mgr *EIM) Get(fileID FileID) *ExecutableInfo;
+
+// RemoveOrDecRef decrements the reference counter of the file being tracked. Once the RC
+// reaches zero, information about the file is removed from the manager.
+func (mgr *EIM) RemoveOrDecRef(fileID FileID) error;
+
+// UpdateMetricSummary updates the metrics in the given map.
+func (mgr *EIM) UpdateMetricSummary(summary map[metrics.MetricID]metrics.MetricValue);
+```
+
+The suggested interface here assumes that we went with the consolidated
+ref-counting approach for lifetime management. Should we decide to go with a
+different approach, this will have to be reworked.
+
+The ref-counting semantics are intentionally exposed in the public interface:
+callers must understand them; otherwise it might be tempting to ask why one
+shouldn't just call `Add` once and check for existence via `Get`.
+
+All function-, type- and field-names are suggestions and up for debate.
+
+#### Ownership of the New Manager
+
+The instance of the new manager will be owned by the process manager in the form
+of a private struct field. The code continues to live in the `processmanager`
+package, although in a different file: `execinfomanager.go`.
+
+#### BPF Map Management
+
+The responsibility for managing the following per-file-ID BPF maps is
+transferred to the new manager:
+
+- `stack_delta_page_to_info`
+- `exe_id_to_%d_stack_deltas`
+- `unwind_info_array`
+
+#### BPF Map Documentation
+
+BPF maps pose a bit of a challenge with regard to where and how to document
+them. They are always public to all code and their ownership situation can be
+opaque. In essence, they can be reasoned about as if they were public
+fields of the type that manages them. Consequently, this might also be a way
+forward in documenting them: on the type that manages them, as if they were
+public fields.
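+
+For illustration, this could look roughly as follows (the wording is just a
+sketch):
+
+```go
+// ExecutableInfoManager owns and manages the following BPF maps:
+//
+//   - stack_delta_page_to_info: populated from loadDeltas, read-only for
+//     all other code
+//   - exe_id_to_%d_stack_deltas: populated from loadDeltas, read-only for
+//     all other code
+//   - unwind_info_array: populated from getUnwindInfoIndex, read-only for
+//     all other code
+type ExecutableInfoManager struct {
+    // ...
+}
+```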
+
+We should document what the maps are populated with, by whom, and whether
+other code is allowed to mutate them. In this particular case, the maps
+are managed exclusively by the new manager and other code can only read from the
+maps.
+
+#### Locking
+
+The new manager is locked internally and public methods can be called from an
+arbitrary number of threads at the same time. A global `xsync.RWMutex`
+is used to lock all fields. It is acquired and released in all public functions;
+internal helpers that need access to the manager’s state rely on their caller
+for locking by taking an `*executableInfoManagerState` argument. If applicable,
+they can also be directly implemented as having the state struct as the
+receiver.
+
+#### Metrics
+
+The `numStackDeltaMapPages` metric is concerned with keeping count of the size
+of per-file-ID BPF maps and should thus be moved to the new manager. The
+responsibility for reporting metric information of various fields that were
+moved into the new manager is transferred as well.
+
+#### Moving Code
+
+Multiple methods of the process manager are exclusively concerned with
+per-file-ID information:
+
+- `AddSynthIntervalData`
+- `loadDeltas`
+- `getUnwindInfoIndex` (only called from `loadDeltas`)
+- `calculateMergeOpcode` (only called from `loadDeltas`)
+- `decreaseReferenceCounter`
+
+These will be moved to the new manager in their entirety.
+
+### Porting Process
+
+#### Option #1
+
+- Copy per-file-ID maps to a separate type
+- Create methods for all code-snippets that essentially – as far as possible –
+  contain exact copies of the code that previously lived in the process manager
+  - Will produce a lot of bogus methods with questionable names and semantics
+  - It however allows reviewers to easily tell whether code is equivalent
+- Remove per-file-ID maps from the process manager, replacing them with an
+  instance of the new manager
+- Replace the previously extracted code snippets with corresponding method
+  calls
+- Get that reviewed & merged
+- Unify the two maps in the new manager and move towards the new proposed public
+  interface
+- Get that reviewed & merged
+
+#### Option #2
+
+- Implement a manager with the public interface documented in this document
+  immediately
+  - Duplicate all required helper functions and fields
+  - Place a lot of comments that help explain where code was pasted from,
+    how it was changed and where new code was added
+  - Get that reviewed and approved as a draft, but don’t actually merge it
+    into `main` yet
+- Port the PM to use the new manager, replacing and deleting all previously
+  duplicated code
+- Get that reviewed, approved and then merged
+
+#### Author’s Preference
+
+This is going to be a bit of a challenge to review either way: even when going
+with Option #1, there will still be one rather large commit to consolidate the
+lifetimes of the two maps and to move towards the new interface. There’s also a
+chance that all the bogus methods created to factor out the code snippets from
+the PM might actually make things harder to review instead of aiding with it,
+because reviewers will now also have to check on how these snippet-methods are
+called. I thus have a slight preference for Option #2, but will leave it to the
+reviewers to decide which variant they prefer.
+
+#### Decision
+
+The reviewers either agreed with the preference for Option #2 or didn’t have a
+preference at all. The port will be conducted using Option #2.
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md b/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md
new file mode 100644
index 00000000..1f676f25
--- /dev/null
+++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/README.md
@@ -0,0 +1,736 @@
+> [!IMPORTANT]
+>
+> This document is from Elastic times and precedes the open-sourcing of the
+> profiling agent. It is provided as an example of what a design document
+> could look like. Many of the links had to be removed because the original
+> git history of the profiling agent had to be erased for legal reasons when
+> it was open-sourced.
+
+HA: remove BPF hashing
+======================
+
+# Meta
+
+- **Author(s)**: Joel Höner
+- **Start Date**: Nov 9, 2023
+- **Goal End Date**: Dec 5, 2023
+- **Primary Reviewers**: Christos Kalkanis, Timo Teräs
+
+## Abstract
+
+Our host agent uses a relatively complex data processing pipeline that enriches,
+aggregates and reshapes raw traces into the format expected by the collector.
+Statistics show that a significant portion of this complexity might be
+unwarranted. Upcoming features further change some important requirements for
+the system as a whole. This document proposes a rapid simplification of the
+BPF ↔ user-mode communication portion of the pipeline.
+
+## Context
+
+The relevant portion of our trace aggregation in the host agent currently looks
+roughly like this:
+
+
+
+> [!NOTE]
+>
+> Kernel stack transfer is not included here because it will remain unchanged.
+
+## Problem
+
+The BPF and `tracehandler` portions of this pipeline are complex pieces
+of code that have already cost multiple engineers a lot of time to reason about
+whenever they need to be changed.
+
+The majority of the complexity arises from the fact that our BPF code transfers
+traces and counts separately, but `tracehandler` needs to transfer information
+from the trace over to the count updates that it sends out. The count updates from
+BPF are essentially just `(BPFHash, Count)` tuples, but `reporter` expects count
+updates to be annotated with thread-, container- and pod name. BPF sends the
+information required to infer these values with the trace instead of the count.
+
+For `tracehandler` this means that it has to keep a cache of `(BPFHash,
+TraceMeta)` mappings in order to make sure that it always has the required
+information for the outgoing count updates. Moreover, it cannot make any
+assumptions about the trace info already being present when corresponding counts
+arrive, so it must support deferring count updates until the trace info for them
+arrives.
+
+BPF currently uses 5 different BPF maps with different lifetimes and aggregation
+behaviors. Maintaining the cache from `BPFHash` to `TraceMeta` further needs to
+be implemented very efficiently to not eat up too much memory, so we're
+currently using a hand-rolled, domain-specific LRU implementation for this.
+This all adds up to a non-trivial amount of complexity.
+
+Another problem with this approach is that all information that is present in
+the trace must also be part of the BPF hash. This in turn means that every
+high-cardinality field that we add to the trace diminishes the efficiency of the
+aggregation that occurs in the BPF maps.
+
+### Aggregation is rare already
+
+Testing with real-world applications has shown that it's pretty rare for an
+entry in `hash_to_count` to ever have a count >1 within the 5-second interval
+that we use for the BPF ↔ user-mode communication.
+Inserting into and removing from a hash map is quite a bit more work than
+simply sending a perf event that is mmap'ed into user-land. Particularly for
+traces with the "symbolize this now" flag this is currently rather wasteful
+because we even send an additional perf event to force the map to be read and
+cleared immediately.
+
+Trace resends via `hash_to_trace` are additionally suppressed via the
+`known_traces` map that is tethered to `hashMapperCache` in `tracehandler`. This
+suppresses the resend for ~45% of traces. We can monitor this rate for all
+profiling agent deployments globally:
+
+
+
+While ~45% might seem like a significant chunk at first glance, any caching
+scheme must always be evaluated in the context of:
+
+- what the cost of just redoing the work all the time would be
+- the number of cycles burned in vain on the cache lookup itself when cache misses occur
+- the cost of maintaining the cache
+
+The complex system with traces being split across multiple maps is in place in
+large parts **just to support this caching scheme**, and without it we could go
+with a much simpler approach (detailed in ["Proposed
+Solution"](#proposed-solution)).
+
+In this simpler scheme, the cost of just resending the whole trace is a
+0.3–2.5 KiB `memcpy` and, for a percentage of traces, a syscall to fetch the kernel
+portion of the trace. The cache lookup in the existing system involves two
+passes through hashing functions[^2] that will in all likelihood be slower than
+an actual `memcpy`[^3], though perhaps not slower than a syscall.
+
+With that context in mind, ~45% isn't really all that good anymore. In 55% of
+cases we will pay the price of the hashing **plus** the cost of sending the
+trace.
+
+[^2]: First our custom hashing function, then putting the hash through whatever
+      hash function the kernel uses for their hash maps.
+[^3]: Even heavily SIMDed hash functions like XXH3 are typically [quite a bit
+      slower] than a `memcpy`, and our custom hash function is very basic scalar
+      code. Also see ["Cost of memcpy" section](#cost-of-memcpy).
+
+[quite a bit slower]: https://github.com/rurban/smhasher
+
+### Upcoming features
+
+There are two upcoming features that may increase the cardinality of the BPF
+trace significantly. Increased cardinality in turn further reduces the chance of
+two traces being aggregated under one BPF hash.
+
+#### Nanosecond timestamps
+
+Christos' planned change towards nanosecond precision timestamps will require
+recording the timestamp in kernel-mode. Previously the timestamp was not part of
+the BPF trace and was added once it arrived in user-mode. This worked reasonably
+well for second precision timestamps, but will definitely not transfer to
+nanosecond precision.
+
+Assuming that we'd implement this by adding a timestamp field to the trace (and
+thus to its hash), it is almost guaranteed that every BPF trace is unique for
+all but a synthetic load that puts all cores into an infinite loop.
+
+For this particular problem, there's an alternative route where we'd switch
+`hash_to_count` to a per-CPU `nanotime_to_hash` map and let user-space do
+aggregation/counting of events (in this case, hashes). This is what Christos
+actually prototyped. This requires fewer changes at the cost of adding more
+complexity to the existing system.
+
+#### APM integration
+
+We're currently working on integration between the APM and profiling products.
+To allow associating profiling traces with APM traces, we need to read the APM
+transaction-, span-, and trace IDs that were active while we collected each
+trace.
+
+The APM IDs are randomly generated, and particularly the spans and their IDs
+typically only last for a few milliseconds. This is thus very similar to the
+problem with the nanosecond timestamps: if we just include them in the BPF trace
+hash, that will make it exceedingly unlikely that any aggregation takes place.
+
+Unlike with the nanosecond precision timestamps, this problem will only
+affect deployments that actually use the APM integration feature.
+
+## Success Criteria
+
+The primary goal is to unlock adding new high-cardinality fields to the trace in
+the future without too much effort and without having to worry about constantly
+deteriorating performance with every added field.
+
+A secondary goal is to achieve the primary goal without adding even more
+complexity. In fact, preferably we use the opportunity to shed some.
+
+The changes further cannot put the host agent over the 1% CPU usage barrier.
+
+## Scope
+
+In scope:
+
+ - trace processing in BPF, tracer, and tracehandler
+
+Not in scope:
+
+ - any breaking changes to the reporter or the network protocol
+ - changing the user mode hash (tackled in a separate document)
+ - experiments with per-frame (instead of per-trace) caching
+
+## Proposed Solution
+
+Remove BPF hashing and aggregation entirely and just resend the whole trace in a
+perf event or BPF ring buffer for every sample that we unwind.
+
+Because with this approach all information is directly present in the event,
+`tracehandler` no longer needs to keep a mapping from BPF hashes to trace meta
+information. Worries about whether traces or counts arrive first become
+irrelevant. `tracehandler`'s job is essentially reduced to just keeping track of
+which traces were already previously reported to the collector and splitting
+counts and traces.
+
+The caches that still need to be kept are reduced to:
+
+- `bpfTraceCache: BPFHash → UMHash`
+  - Suppresses unnecessary trace conversion
+  - Roughly comparable to the previous `hashMapperCache` (and `known_traces`)
+  - Pure performance optimization: not actually strictly required with this approach
+- `umTraceCache: UMHash → Void`
+  - Suppresses trace resends to the collector
+  - Equivalent to the previous `traceCache`
+
+The following diagram shows the portion of the trace aggregation pipeline that
+was previously introduced in the ["Context"](#context) section, but with the
+suggested changes applied.
+
+
+
+### Kernel memory napkin math
+
+Some basic math on this suggests that we'll actually need significantly less
+kernel memory to be able to keep the same number of traces in memory as with the
+current approach.
+
+The following uses the map sizes that the host agent dynamically allocated on my
+10-core VM. The buffers in both implementations should scale linearly with the
+number of cores, so the concrete core counts do not impact the ratio between
+both approaches.
+
+``
+
+Perf events are essentially per-CPU ring buffers. The events don't have to be
+copied into user-mode memory: the ring buffer is mmap'ed into the UM
+process[^1].
+
+Since we're currently only suppressing about 45% of trace resends anyway, we
+don't expect the overhead of this to be massive. Hashing and splitting the trace
+across multiple hash maps is definitely more expensive than just sending a perf
+event. With the existing system we are paying the price of caching itself plus
+the price of resending the trace for the 55% of cases where the cache is missed.
+
+[^1]: [`perf_event_open`] manual page, CTRL+F for "MMAP layout" headline
+
+[`perf_event_open`]: https://www.man7.org/linux/man-pages/man2/perf_event_open.2.html
+
+I was originally worried that copying the events to the Go heap and the
+corresponding GC pressure would be an issue, but testing with the prototype
+shows that this was unwarranted: the relevant code barely shows up in the
+profiler (see ["Cost of memcpy" section](#cost-of-memcpy)).
+
+We're also doing 45% more syscalls to retrieve the kernel half of our traces
+with the new approach. Just like the `memcpy`, this also largely turned out to
+be a non-issue: in the prototype we only spend about 0.92% of CPU cycles on this
+([pprof profile](./prototype-option-2-ghidra-workload.pprof)). This is likely at
+least in part because only a fraction of traces actually have kernel traces
+associated with them.
+
+What did turn out to be an issue, however, is that we're now waking user-mode a
+lot more frequently than we did with the old approach. The old HA only forces
+an immediate read in UM when:
+
+- the trace is not previously known. This eliminates ~45% of wakes.
+- the unwinder sets `ha_symbolization_needed`. This eliminates 100% of wakes
+ that aren't due to PID events in a pure native workload and a noticeable chunk
+ in mixed workloads.
+
+This essentially means that we're waking from our `epoll` call that waits for
+perf events a lot more frequently than before. The kernel itself seems to be
+pretty good at handling this kind of wake efficiently, but the Go runtime isn't.
+With all other optimizations suggested in this document (minus the ones in the
+following "options" sections) applied, this almost doubles the CPU time that the
+host agent is consuming in the average case. For interpreter workloads the
+difference is not so significant. In the profile we're observing a slight (but
+by no means deal-breaking) increase of time spent in `epoll` and a massive
+increase of time spent in the goroutine scheduler.
+
+There are a number of possible routes that we can pursue to reduce the overhead
+from constant waking. We'll discuss each one in the following sub-sections.
+
+> [!IMPORTANT]
+>
+> We recently learned that there is a bug with event inhibition that causes
+> the "symbolize this now" events to not actually be delivered as intended.
+>
+> This likely explains why the previous approach was more performant even in
+> heavy interpreter workloads despite the fact that the new approach should
+> intuitively perform better here: constantly reading and clearing multiple maps
+> really shouldn't be faster than reading events from shared memory and bumping
+> atomics. Under a heavy interpreter load the amount of waking should be about
+> the same between both variants if inhibition is released correctly.
+>
+> We did some experiments and confirmed that fixing the issue to restore the
+> code's intended behavior indeed increases CPU usage significantly. However, we
+> also noticed that interpreter symbolization clearly worked pretty well
+> **despite this being broken for years now**, so we converged towards simply
+> keeping the inhibition bug in place for the moment.
+
+#### Option 1: perf event buffer with event counting watermarks
+
+Perf event buffers support a feature called "watermarks". This essentially
+allows us to specify a minimum number of either events or bytes in the buffer
+before `epoll` will wake our process. Cilium defaults this to "1 byte". Setting
+the watermark to a size where it only wakes us about 4 times per second per
+core when all cores are maxed out eliminates the overhead entirely.
+
+However, there are some issues with this approach that we'd need to get around:
+
+1) **The frequency at which we read events becomes unpredictable.**\
+ On a system that is almost completely idle, it might take minutes to ever
+ reach the watermark. If a user opts out of fair scheduling by pinning their
+ application to a fixed list of cores, the frequency will also vary from core
+ to core.
+2) **Cilium's BPF library only supports the byte-based watermarks.**\
+ The length of our trace event can vary wildly depending on workload, so a
+ byte-size based watermark intensifies problem 1 even more: the reporting
+ frequency will additionally depend on the stack trace length.
+
+We can get around this by:
+
+1) **teaching Cilium to use the watermark variant that counts events.**\
+   I have already implemented this to make sure that the kernel actually accepts
+   it with the BPF event kind. It's only a few lines of diff and works as
+   expected. It should be easy to get this merged upstream if we want (a sketch
+   of the resulting setup follows below this list).
+2) **sending empty dummy traces when we interrupted the idle process.**\
+ It's easy to forget that we have that option because we generally tend to
+ think of our BPF code as only being invoked when the CPU is under load, but
+ that is not actually the case. We get called at a constant frequency
+ regardless of the workload; we just immediately bail out when we [detect that
+ we're in the idle process][idle-check]. We could change this code path to
+ always send an empty dummy record and thus ensure that our watermark is
+ reached in a predictable and timely manner.
+
+[idle-check]: #
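+
+For illustration, with such a patch in place, setting up the reader could look
+roughly like this (`WakeupEvents` stands in for the hypothetical new
+event-count option; upstream currently only exposes the byte-based
+`Watermark`):
+
+```go
+import (
+    "os"
+
+    "github.com/cilium/ebpf"
+    "github.com/cilium/ebpf/perf"
+)
+
+func newTraceReader(traceEvents *ebpf.Map) (*perf.Reader, error) {
+    return perf.NewReaderWithOptions(traceEvents, 8*os.Getpagesize(),
+        perf.ReaderOptions{
+            // Only wake user-mode once 64 events accumulated in a per-CPU
+            // buffer, instead of waking on every single event.
+            WakeupEvents: 64,
+        })
+}
+```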
+
+#### Option 2: polling the perf event buffer
+
+The Cilium BPF library currently [always does an `epoll`][cilium-epoll] to get
+notified about whether a watermark has been reached before attempting to read
+from the ring. However, this is not actually required to safely read from the
+buffer: it's also perfectly sound to just try to read and see if it contains
+some data. Data access is synchronized with atomics. The Rust BPF library, for
+example, supports this (and in fact offers it as the default).
+
+Cilium currently doesn't provide a function to do that, but we can quite easily
+add one. All the error-prone logic for correctly reading and updating the
+atomics is in place already; we mostly just need to call their existing
+functions without doing an epoll first. I also implemented that variant in the
+prototype and it works great. This in turn allows us to just read from our perf
+event buffers at a constant frequency.
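+
+A minimal sketch of such a polling loop, here approximated with the
+`SetDeadline` API that the library offers today (`handleTraceEvent` is a
+placeholder; the prototype instead adds a dedicated non-blocking read
+function):
+
+```go
+import (
+    "errors"
+    "os"
+    "time"
+
+    "github.com/cilium/ebpf/perf"
+)
+
+func pollTraces(rd *perf.Reader, interval time.Duration) {
+    for range time.Tick(interval) {
+        for {
+            // Make Read return immediately instead of blocking in epoll.
+            rd.SetDeadline(time.Now())
+            record, err := rd.Read()
+            if errors.Is(err, os.ErrDeadlineExceeded) {
+                break // buffer drained: sleep until the next tick
+            }
+            if err != nil {
+                return
+            }
+            handleTraceEvent(record.RawSample)
+        }
+    }
+}
+```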
+
+The drawback here is that we're paying a certain constant overhead for polling.
+However, in my experiments with a 250ms polling interval, this is essentially
+negligible. On the positive side, the HA CPU usage will be very consistent
+regardless of the workload.
+
+I measured this to be ever so slightly slower than the current regular host agent
+on a completely idle system (~32 minutes wall time):
+
+
+
+And a bit more efficient on a system under a heavy mixed HotSpot/native workload
+(~5 minutes wall time):
+
+
+
+`pf-host-agent` is the prototype HA, while `pf-normal` is a HA built from the
+latest `main` as of writing. Ignore the CPU percentage and look at the core
+seconds under "TIME+". One additional core second in 32 minutes of wall time in
+the idle case seems perfectly acceptable to me.
+
+[cilium-epoll]: https://github.com/cilium/ebpf/blob/da24e832beede03ec73966c7b92748d370c1937d/perf/reader.go#L354C7-L354C7
+
+#### Option 3: option 2 + 80% watermarks
+
+We could extend option 2 by additionally setting a very high byte-based
+watermark (e.g. 80% of the buffer size) and then `epoll`ing on that to
+ensure that our buffer doesn't run full while we are sleeping between timer
+ticks. This would potentially allow us to use a much smaller perf event buffer
+while still having relatively high confidence that it never runs over while we
+are sleeping.
+
+#### Option 4: BPF ring buffers
+
+While perf event buffers have one ring buffer per core, BPF ring buffers are
+shared between all cores. This has the obvious benefit that user-mode will
+always just have to manually poll a single mmap'ed ring or `epoll` on a single
+file descriptor (instead of `num_cpu` FDs). This is beneficial for both
+approaches: the overhead is constant and doesn't linearly scale with core count
+as it does with perf event buffers.
+
+BPF ring buffers further allow the BPF code to decide whether it wants to wake
+user-land on each message sent via `BPF_RB_NO_WAKEUP` and `BPF_RB_FORCE_WAKEUP`
+flags. It can further query the buffer fill status via `bpf_ringbuf_query`. This
+essentially allows us to implement whatever custom waking scheme we want,
+and later switching or refining the approach would be quite easy.
+
+The drawback here is that BPF ring buffers were only added in kernel 5.8. I ran
+a few queries on our HA metrics cluster and we still have about 3% of customers
+running kernels that are older than that. It's further possible that certain
+features (like waking and inspecting the buffer state from BPF) were added in
+even newer kernel versions. If there is reviewer preference towards exploring
+this approach further, I'd be happy to research this in more detail.
+
+#### Option 5: option 2 or 4 + early wakes for interpreter frames
+
+Options 2 and 4 can both optionally be extended to allow for early wake-ups that
+ensure timely user-mode symbolization of interpreter frames. Examples of factors
+that might warrant an early read could be (simply quoting Timo's suggestions):
+
+- if it's long living process, and we know the VM is unlikely to overwrite
+ memory structures, we could just throttle to send this event once every few
+ seconds per PID at maximum
+- if it's new process, or the VM is known to be potentially overwriting things,
+ wakeup every frame as needed
+- have per-VM type, or per-{PID,framedata} LRU to throttle sending the wakeup
+
+**With option 2** we'd simply have to leave the old "send an event to the PID
+event perf buffer for early flush" mechanism in place. Instead of flushing the
+old BPF hash maps it'd cause an early read of the perf event buffers.
+
+**With option 4** we could set the flag argument on `bpf_ringbuf_output` to
+`BPF_RB_FORCE_WAKEUP` to force a wake-up in user-mode.
+
+#### Author's preference
+
+My preference is option 2. Option 3 and 5 can be considered optional
+improvements to option 2 that we may or may not apply later.
+
+Option 1 seems like a bit of a hack that introduces mental complexity while
+providing no upside at all compared to option 2.
+
+Option 4 is also excellent and what I'd like to see us using in the future, but
+I'm not sure whether we're willing to kick out 3% of customers. I suspect that
+we'll discover that particularly the big corporate customers that are waiting
+for on-prem might be running older kernels.
+
+Starting with option 2 now and migrating to option 4 maybe a year down the line
+should not present a huge amount of work, so this is my preference.
+
+#### Decision
+
+Reviewers agreed with the author's preference.
+
+### Replace `hashmapperlru` with `freelru`
+
+The hand-rolled `hashmapperlru` mostly exists because we had to persistently
+store various string fields (comm, pod name, container name) of the trace that
+need to be transferred over to the count updates while maintaining high memory
+efficiency with string interning.
+
+The remaining LRUs in `tracehandler` no longer need to store any strings or
+other heap allocated objects: both keys and values are either just hashes or
+`libpf.Void`. Since we always just resend everything in the new approach,
+there's also no chance that we'll ever have to add back any such field in the
+LRU when we add more fields later.
+
+Pod and container name are currently still part of the UM hash. They are not
+sent from the kernel directly, but derived from the PID. Since the names cannot
+change after process start, there is no need to worry about them changing
+and having to invalidate caches because of that.
+
+We can thus retire our complicated custom LRU implementation and replace it with
+the generic `freelru` without sacrificing memory efficiency.
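+
+A sketch of what this could look like (key types simplified to `uint64`, sizes
+are placeholders):
+
+```go
+import "github.com/elastic/go-freelru"
+
+// The keys are hashes already, so a cheap identity-style hash function does.
+func hashHash(h uint64) uint32 { return uint32(h) }
+
+func newTraceCaches() (bpf *freelru.LRU[uint64, uint64],
+    um *freelru.LRU[uint64, libpf.Void], err error) {
+    // bpfTraceCache: BPFHash → UMHash
+    if bpf, err = freelru.New[uint64, uint64](16384, hashHash); err != nil {
+        return nil, nil, err
+    }
+    // umTraceCache: UMHash → Void
+    um, err = freelru.New[uint64, libpf.Void](16384, hashHash)
+    return bpf, um, err
+}
+```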
+
+**Validated in prototype:** yes\
+**Independent issue:** no
+
+### Frame list in `per_cpu_record`
+
+During unwinding the frame list is currently chunked and temporarily stored in
+the dedicated BPF map `per_cpu_frame_list`:
+
+``
+
+I honestly have no idea why we are doing this even in the current approach: it
+adds quite a bit of complexity to both the "push" function and the trace hashing
+code. As far as I can tell there doesn't seem to be any technical limitation
+that would require this chunking even on old kernels.
+
+I suggest that we simply add the frame list to our existing [`Trace`] struct, in
+struct-of-array columnar representation:
+
+```c
+typedef struct Trace {
+ // [...] (existing meta-data fields omitted)
+
+ struct {
+ FileID file_ids[MAX_FRAME_UNWINDS];
+ u64 addr_or_lines[MAX_FRAME_UNWINDS];
+ u8 kinds[MAX_FRAME_UNWINDS];
+ } frames;
+} Trace;
+```
+
+Columnar representation is chosen to reduce alignment bytes. The alternative of
+an `__attribute__((packed))` `Frame` type and array-of-struct was considered,
+but turned out to not be supported by CGo: it ignores the attribute and ends up
+accessing memory incorrectly.
+
+The `Trace` instance lives in [`PerCPURecord`] and was previously already used
+to store trace meta-information. The proposed change thus simply consolidates
+the two maps into one.
+
+**Validated in prototype:** yes\
+**Independent issue:** yes, but simpler to do in main PR
+
+[`Trace`]: #
+[`PerCPURecord`]: #
+
+### Move container- and pod name lookup into `tracehandler`
+
+The lookup for container- and pod names currently happens in
+`ProcessManager.ConvertTrace`:
+
+``
+
+The BPF trace only contains the PID, not the actual container- and pod name. To
+have both of these available when sending out the count, we thus need to move
+the lookup to `tracehandler`. `tracehandler` can then pass the names that it
+looked up to `ConvertTrace` for new traces and skip trace conversion for known
+ones.
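+
+In rough pseudo-Go, with illustrative names, the per-trace flow in
+`tracehandler` then becomes:
+
+```go
+func (h *traceHandler) handleTrace(bpfTrace *host.Trace) {
+    // Always look up the names so that they're available for the count.
+    meta := h.containerMetadata.Lookup(bpfTrace.PID)
+
+    umHash, known := h.bpfTraceCache.Get(bpfTrace.Hash)
+    if !known {
+        // Convert (and thus symbolize) the trace only if we haven't seen it.
+        umTrace := h.processManager.ConvertTrace(bpfTrace, meta)
+        umHash = umTrace.Hash
+        h.bpfTraceCache.Add(bpfTrace.Hash, umHash)
+        h.reportTraceIfNew(umTrace)
+    }
+
+    // Counts always go out, annotated with the looked-up names.
+    h.reporter.ReportCount(umHash, meta, 1)
+}
+```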
+
+**Validated in prototype:** yes\
+**Independent issue:** no
+
+### Extended container- and pod name caching
+
+Moving the call to `tracehandler` means that we'll now be doing these lookups
+more often than before. Previously we'd only call into `containermetadata` when
+a trace wasn't present in `hashMapperCache` and needed conversion; now we'll be
+doing it for every trace.
+
+`containermetadata` has caching in place already:
+
+``
+
+But that cache is only hit after looking up the file identifier. The `Sprintf`
+and the `stat` syscall originating in `libpf.GetOnDiskFileIdentifier` eat a
+noticeable amount of CPU time.
+
+It wasn't immediately obvious to me why this indirection via the file identifier
+is taken instead of simply using the PID as the key, but the [comments on the PR
+that added the cache][pid-reuse] shed light on that: it's meant to provide
+resilience against PID reuse.
+
+We essentially have three options here:
+
+1) Accept a slight increase in cycles burned: about 3-5%.
+2) Add a second, short-lived `PID → (PodName, ContainerName)` cache in
+   `tracehandler`. This is the route that I pursued in the prototype. A
+   256-entry cache with a lifetime of 1 minute did the job (sketched below).
+3) Reduce resilience against PID reuse and change the key of the existing cache
+   to be PID-based, but add a timeout on the LRU.
+
+**Author's preference:** I'd like to give option 3 a try: I suspect that it
+should do the job nicely without the need for added complexity from a new layer
+of caching. No strong opinion on this, though.
+
+**Validated in prototype:** yes (option 1 and 2)\
+**Independent issue:** yes
+
+[pid-reuse]: #
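+
+For illustration, the short-lived cache from option 2 could be set up roughly
+like this with `freelru` (option 3 would look similar, just with the timeout
+applied to the existing cache in `containermetadata`):
+
+```go
+import (
+    "time"
+
+    "github.com/elastic/go-freelru"
+)
+
+type containerNames struct {
+    PodName       string
+    ContainerName string
+}
+
+func hashPID(pid util.PID) uint32 { return uint32(pid) }
+
+func newContainerNameCache() (*freelru.LRU[util.PID, containerNames], error) {
+    // 256 entries with a 1-minute lifetime bound both the memory usage and
+    // the window in which PID reuse can serve stale names.
+    cache, err := freelru.New[util.PID, containerNames](256, hashPID)
+    if err != nil {
+        return nil, err
+    }
+    cache.SetLifetime(time.Minute)
+    return cache, nil
+}
+```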
+
+#### Improvements on newer kernels
+
+Kernel versions v4.18+ actually have a function `bpf_get_current_cgroup_id` that
+we could use to obtain a unique ID that is guaranteed not to be re-used:
+
+https://github.com/torvalds/linux/blob/9b6de136b5f0158c60844f85286a593cb70fb364/include/linux/kernfs.h#L217-L221
+
+I believe the inode ID corresponds to the directory node in `/sys/fs/cgroup`. To
+my knowledge there is no way to open a file descriptor from an inode ID, so we'll
+still have to keep our existing logic based on PIDs to read the container info.
+The ID could only serve as a better cache key.
+
+### BPF trace hashing in UM
+
+#### Hash inputs
+
+We're no longer hashing traces within BPF, but we still need to construct a hash
+for the LRU that suppresses repeated trace conversion. However, if we continue
+to just feed all fields of the BPF trace into this hash, we'll suffer the same
+diminishing returns of caching efficiency with every high-cardinality field that
+we add, as discussed in the ["Problem"](#problem) section.
+
+Since the purpose of the hash is now only to avoid repeatedly doing user-mode
+symbolization for the same trace (and no longer to keep a unique mapping to
+other existing and upcoming meta-fields like NS timestamp or APM info), we can
+reduce the hash input to just the frame list, COMM, kernel stack ID, and PID.
+The PID still needs to be included because the frame list can contain per-PID
+identifiers.
+
+**Validated in prototype:** no\
+**Independent issue:** no (unless we start without a trace translation cache)
+
+#### Hash algorithm
+
+BPF is currently using a hand-rolled, custom hashing scheme. The scheme was
+chosen primarily for simplicity and is likely neither the fastest nor the most
+collision-resistant function imaginable. We already have XXH3 as a dependency
+anyway, so I suggest that we just use that. It's very fast and provides a pretty
+reasonable output distribution.
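+
+A sketch of the reduced hash, with assumed field names and the row-based frame
+representation proposed in the next section:
+
+```go
+import (
+    "encoding/binary"
+
+    "github.com/zeebo/xxh3"
+)
+
+func hashBPFTrace(trace *host.Trace) uint64 {
+    h := xxh3.New()
+    for _, frame := range trace.Frames {
+        binary.Write(h, binary.LittleEndian, frame.File)
+        binary.Write(h, binary.LittleEndian, frame.Lineno)
+        binary.Write(h, binary.LittleEndian, uint8(frame.Type))
+    }
+    h.Write([]byte(trace.Comm))
+    binary.Write(h, binary.LittleEndian, trace.KernelStackID)
+    binary.Write(h, binary.LittleEndian, trace.PID)
+    return h.Sum64()
+}
+```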
+
+**Validated in prototype:** yes\
+**Independent issue:** no (unless we start without a trace translation cache)
+
+### Row-based trace record in UM
+
+We're currently using a columnar (struct of array) representation for the frames
+in `host.Trace`:
+
+``
+
+As far as I can tell, there's no upside to this at all. Allocating space for a
+trace will in fact take 4 allocs, whereas with a row-based (array of struct)
+implementation we'd only need 2. Initializing and accessing it is more
+cumbersome and requires more bounds checks.
+
+I suggest that we switch the frames to array-of-struct representation.
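+
+A sketch of the row-based representation (field types assumed):
+
+```go
+// Frame is a single entry of the frame list in host.Trace.
+type Frame struct {
+    File   host.FileID
+    Lineno libpf.AddressOrLineno
+    Type   libpf.FrameType
+}
+
+type Trace struct {
+    // [...] (existing meta-data fields omitted)
+
+    // Frames requires one allocation for the whole list instead of one
+    // allocation per column.
+    Frames []Frame
+}
+```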
+
+This suggestion is largely independent of the goals in this document, except
+that with the new approach we're allocating more `host.Trace` instances: that's
+why I'm mentioning it here.
+
+**Validated in prototype:** yes\
+**Independent issue:** yes
+
+## Alternatives considered
+
+### Perf buffer with just `(hash, meta_field_a, ...)`
+
+This would keep most of the existing maps in place and only replace
+`hash_to_count` with a perf event buffer that transfers e.g. `(hash,
+nano_timestamp, apm_span_id, ...)` tuples. We'd continue hashing the
+low-cardinality portion of the trace and send the high-cardinality fields with
+the trace events. The intuition for this approach is that we could avoid
+unnecessary `memcpy`s at the cost of keeping most of the complexity.
+
+The idea was abandoned because we realized that the price of the memcpy is more
+or less negligible (see ["Cost of `memcpy`"](#cost-of-memcpy) section), so
+paying the extra complexity and maintenance effort of the more complex solution
+seemed like a bad trade-off. This variant would also have required dealing with
+the same complications as the [proposed solution](#proposed-solution) in
+addition to that.
+
+## Misc remarks
+
+### Cost of `memcpy`
+
+Since the new approach resends the whole trace all the time, intuitively we tend
+to assume that the suggested approach must come with a hefty amount of CPU
+cycles spent in `memcpy`. The wish to avoid `memcpy` was brought up by multiple
+reviewers, and I myself also assumed this to be a bottleneck when starting this
+experiment.
+
+Fortunately we can remedy these worries quite easily by profiling the prototype
+host agent. The following profile is the result of a heavy mixed HotSpot/native
+workload, letting Ghidra analyze the entire `devfiler` executable (29 MiB):
+
+
+
+The [`parseRawTrace`] function is responsible for hashing, allocating and
+copying the BPF trace into the Go heap. We can see that the whole function only
+accounts for roughly 1.61% of execution time. 0.23% of that is spent in XXH3
+hashing, 1.15% in memory allocation.
+
+Funnily (or sadly) enough, the internal allocs in the timer that is used for
+polling the perf buffer are actually about as expensive as copying and
+hashing the whole trace.
+
+[`parseRawTrace`]: #
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png
new file mode 100644
index 00000000..5d5b16d5
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/known-trace-metrics.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png
new file mode 100644
index 00000000..9a04aa09
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/memcpy-cost.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png
new file mode 100644
index 00000000..798c8b90
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-ghidra-workload.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png
new file mode 100644
index 00000000..cc3d3b40
Binary files /dev/null and b/design-docs/00000-example-docs/04359-remove-bpf-hashing/option2-idle-workload.png differ
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg
new file mode 100644
index 00000000..64aa54b6
--- /dev/null
+++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe-reworked.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
\ No newline at end of file
diff --git a/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg
new file mode 100644
index 00000000..7803dde4
--- /dev/null
+++ b/design-docs/00000-example-docs/04359-remove-bpf-hashing/trace-pipe.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
\ No newline at end of file
diff --git a/design-docs/00000-example-docs/README.md b/design-docs/00000-example-docs/README.md
new file mode 100644
index 00000000..e29a0808
--- /dev/null
+++ b/design-docs/00000-example-docs/README.md
@@ -0,0 +1,3 @@
+This directory contains two old design documents from times before the agent
+was donated to OTel and open-sourced in the process. The documents are provided
+primarily as examples of how a design document can be structured.
diff --git a/design-docs/00000-templates/long-form-design-doc-template.md b/design-docs/00000-templates/long-form-design-doc-template.md
new file mode 100644
index 00000000..41d3e6ca
--- /dev/null
+++ b/design-docs/00000-templates/long-form-design-doc-template.md
@@ -0,0 +1,163 @@
+Long-Form Design Doc Template
+=============================
+
+> [!NOTE]
+>
+> This template is intended for use when a long-form design document is
+> necessary. A document using this template should provide sufficient high level
+> context to understand the problem as well as detailed proposals with supporting
+> data.
+
+# Meta
+
+- **Author(s)**: Name of the author(s)
+- **Start Date**: Start date of the design doc
+- **Goal End Date**: Intended end date for the design process
+- **Primary Reviewers**: A list of at least two primary reviewers
+
+# Abstract
+
+A reasonably concise abstract so that a reader can get an idea of what
+the document is about.
+
+# Introduction
+
+A good introduction to the problem that we are trying to solve. This
+introduction should not assume much background knowledge from the
+reader, but also not re-hash everything from the ground up; when any
+external documentation can be referenced, this should be given
+preference over re-hashing things. The introduction should have the
+following sections:
+
+## Context
+
+Present ideas, dependencies, systems and general context referred to in
+the doc.
+
+### Constraints
+
+Document any relevant constraints that any/all solutions must adhere to.
+Try to be as explicit as possible here, as constraints that are obvious
+to you may be less so to another reader.
+
+As an example, when working on a task that relates to the eBPF tracers
+there are a number of constraints that come into play depending on the
+kernel version one wants to support. In this case it would be
+appropriate to list each of these constraints, and the kernel versions
+they are relevant to.
+
+### Related (Sub-)Problems
+
+If there is a set of sub-problems that need to be explained, and
+possibly solved, as part of one or more of your solutions it can make
+sense to address them up-front and independently. Include solutions to
+such problems here as well, if appropriate.
+
+## Problem Statement
+
+What the problem is, why the problem exists, and why we need to solve it.
+
+**For particularly significant tasks where it is important to solicit
+feedback as early as possible it often makes sense to write this section
+and the success criteria first. This allows for feedback to be solicited
+before writing the rest of the document, in order to ensure everyone is
+on the same page before major time investment.**
+
+## Success Criteria
+
+By far the main pitfall in many design docs is not clearly defining what
+"success" means. This sometimes can lead the author to meander into
+less-important areas, or ignore an important aspect completely. This can
+also lead the reviewers to be on a completely different line of thought,
+orthogonal to the doc.
+
+This section is more or less what defines the outline and then sets the
+stage for the sections/subsections of the document. It's a reality
+check that allows the authors and the reviewers to agree on the basis
+for the design, and how it should be reviewed. It should help to make
+sure everybody is focused on the same thing.
+
+Usually this should be presented as a concise series of bullet points.
+
+## Scope
+
+This expands on the "success criteria", but helps to clear up possible
+confusion about what the design doc is about and is not about. It's
+usually extremely short, a few bullet points. The most useful info here
+is usually "XYZ is not in scope" to avoid ambiguity. This ensures we
+don't make incorrect assumptions when reading the doc.
+
+# Proposed Solution(s)
+
+Ideally, there should be more than one proposed solution. The document
+should list the various solutions, and then go into some depth about
+their drawbacks and advantages.
+
+Something to watch out for here is a situation where a problem has N
+sub-problems, each with their own M alternative solutions. In that case
+it is usually unnecessary, and unreadable, to enumerate all $M^N$
+possible combinations. Assuming it is feasible, outline the possible
+solutions to each sub-problem separately, and present a very limited
+number of combined solutions, if any.
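+
+To make the counting concrete (the numbers are purely illustrative),
+consider $N = 3$ sub-problems with $M = 4$ alternatives each:
+
+$$
+\underbrace{M^N}_{\text{all combinations}} = 4^3 = 64
+\qquad\text{vs.}\qquad
+\underbrace{N \cdot M}_{\text{separate write-ups}} = 3 \cdot 4 = 12
+$$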
+
+## Author's Preferred Solution
+
+After presenting the merits of each solution, you should ideally give a
+hint as to what your preferred option is, and why. This achieves two
+goals:
+
+- It ensures solutions are evaluated against each other, despite
+ having pros and cons that are not easily comparable. Discussing this
+ can provide helpful context to the reader.
+- Writing a justification is a good reality-check, ensuring the design
+ doc achieves its goal and provides enough information to make a
+ decision.
+
+In case you do not have a preferred solution, you can help the reader
+by describing an appropriate thought process for comparing solutions:
+
+- Which pros/cons are more important than others?
+- What features or constraints are more important?
+
+We have a finite time/energy budget for assessing designs, and providing
+this information will help you and the reviewers to prioritize.
+
+# Testing Strategy
+
+Document here how the solution itself will be tested, as well as how the
+solution may impact the testability of other components that make use of
+it.
+
+The testability of a component itself, as well as its impact on the
+testability of other components that make use of it, are important
+factors to weigh when choosing between alternate solutions. If it is the
+case that your proposed solutions differ in terms of how testable they
+are then it may make sense to lift this section into the proposed
+solutions, and repeat it for each.
+
+## Testing of Proposed Solution Itself
+
+Outline how the proposed solution will be unit tested and integration
+tested. For many implementations this is straightforward and obvious,
+but not always. If there are likely to be any difficulties in testing
+the solution then outline them here, as well as solutions, if any.
+
+## Impact on Testing of Other Systems/Components
+
+Discuss whether or not the proposed solution will impact the testability
+of systems that make use of it, and how. For example, if the solution
+involves producing a component that would be difficult to mock out for
+the purposes of testing something that makes use of it, then explain
+that here.
+
+# Plan to Acquire Missing Data (Optional)
+
+Often, some of the data needed to properly evaluate the advantages and
+disadvantages of the proposed solutions is missing. This section details
+what data needs to be obtained and how.
+
+# Decision
+
+When all the data is there, this section details which solution was
+decided on.
diff --git a/design-docs/00000-templates/short-form-design-doc-template.md b/design-docs/00000-templates/short-form-design-doc-template.md
new file mode 100644
index 00000000..6f4f576e
--- /dev/null
+++ b/design-docs/00000-templates/short-form-design-doc-template.md
@@ -0,0 +1,50 @@
+Short-Form Design Doc Template
+==============================
+
+> [!NOTE]
+>
+> This template is intended to propose problems and solutions for which a
+> fully fledged design document is likely not necessary. A document made from
+> this template will have less high level context and detail than a long-form
+> document.
+
+# Meta
+
+- **Author(s)**: Name of the author(s)
+- **Start Date**: Start date of the design doc
+- **Goal End Date**: Intended end date for the design process
+- **Primary Reviewers**: A list of at least two primary reviewers
+
+# Problem
+
+Clearly describe the problem.
+
+# Success Criteria
+
+By far the main pitfall in many design docs is not clearly defining what
+"success" means. This sometimes can lead the author to meander into
+less-important areas, or ignore an important aspect completely. This can
+also lead the reviewers to be on a completely different line of thought,
+orthogonal to the doc.
+
+This section is more or less what defines the outline and then sets the
+stage for the sections/subsections of the document. It's a reality
+check that allows the authors and the reviewers to agree on the basis
+for the design, and how it should be reviewed. It should help to make
+sure everybody is focused on the same thing.
+
+Usually this should be presented as a concise series of bullet points.
+
+## Scope
+
+This expands on the "success criteria", but helps to clear up possible
+confusion about what the design doc is about and is not about. It's
+usually extremely short, a few bullet points. The most useful info here
+is usually "XYZ is not in scope" to avoid ambiguity. This ensures we
+don't make incorrect assumptions when reading the doc.
+
+# Proposed Solution
+
+A proposed solution for the problem. Even though this is a short-form
+design document, you must still provide enough information to allow the
+reviewers to understand and assess what you are proposing.
diff --git a/design-docs/README.md b/design-docs/README.md
new file mode 100644
index 00000000..1b47661c
--- /dev/null
+++ b/design-docs/README.md
@@ -0,0 +1,71 @@
+Profiling agent design documents
+================================
+
+This directory contains design documents that discuss major additions or changes
+to the profiling agent's architecture.
+
+### When to Write a Design Document?
+
+Invoking the design document process can be beneficial if
+
+- you are proposing changes that have significant impact on multiple
+ sub-components of the profiling agent.
+- you are proposing a major addition and want to make sure that it will
+ actually be accepted upstream the way that you plan to implement it without
+  writing the code first (and thus risking the need for major reworks later).
+- your proposed changes require a significant amount of context for reviewers
+  to understand. In these cases, a short design document can help to clarify
+  the current state and the state that you'd like to move towards.
+- you'd like to incrementally apply reworks over the course of multiple
+ PRs, to provide extra context for reviewers to understand what the end
+ goal is supposed to look like. In simpler cases that can also be achieved
+ with a tracking issue.
+
+The above are guidelines: there is no hard rule on when a design document is
+necessary. When in doubt, please feel free to create an issue and quickly outline
+what you want to do in order to clarify whether a document is needed or not!
+
+### Creating a New Document
+
+- Create a new git branch forking from latest `main`
+- Create a directory in the `design-docs` folder
+ - The directory name should follow a format like `00000-my-doc-title`
+ - The intention for creating a directory per document is to allow bundling
+ media (images, diagrams, drawings) with the document
+  - The 5-digit number is included to make it easy to spot recent documents
+    and to order them by recency when viewed in the GitHub UI
+ - When initially writing the document, the ID is set to `00000`
+- Copy one of the design doc templates from the [`00000-templates`](./00000-templates)
+  directory into the newly created directory, naming it `README.md`
+- Write the design document by following the instructions in the template
+- Once reaching draft quality, create a draft PR
+ - Add the `design-document` label
+ - Include a link to the rendered document. GitHub hides away the ability to
+ properly view rendered documents behind multiple clicks, so adding a direct
+ link spares others some work. After creating the PR, this link can be
+    obtained by clicking "Files changed", searching for the `README.md` file,
+ clicking the three dots on the right, selecting "View file" and then
+ switching the current view to track your branch via the drop-down to the
+ left of the file path. The last step ensures that the link automatically
+ tracks updates when you push fixup commits later.
+- Rename the directory according to the PR ID that was allocated, e.g. if the
+  allocated PR ID is 1234 then `00000-my-doc-title` becomes `01234-my-doc-title`
+  (see the example layout after this list)
+- Once you're happy with the state of the document:
+ - Mark the PR as ready for review and tag at least 2 maintainers for review
+ - Additional people who should be aware of the document but whose review
+ isn't mandatory can be notified by mentioning their username in the PR's
+ description as `CC @username`
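+
+For illustration (assuming a document titled `my-doc-title` and an allocated
+PR ID of 1234, as in the example above), the `design-docs` tree would look
+like this after the rename:
+
+```
+design-docs/
+├── 00000-templates/
+│   ├── long-form-design-doc-template.md
+│   └── short-form-design-doc-template.md
+└── 01234-my-doc-title/
+    └── README.md
+```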