Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reporter: don't expire actively used executables #247

Merged
merged 4 commits into from
Nov 21, 2024

Conversation

Gandem
Copy link
Contributor

@Gandem Gandem commented Nov 20, 2024

Ensure that actively used executable info is not expired from the cache, by extending its lifetime using GetAndRefresh()

Fixes #244

@Gandem Gandem requested review from a team as code owners November 20, 2024 10:31
@Gandem Gandem changed the title reporter: fix executable cache expiry reporter: don't expire actively used executables Nov 20, 2024
Copy link
Contributor

@rockdaboot rockdaboot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -147,7 +151,7 @@ func NewOTLP(cfg *Config) (*OTLPReporter, error) {
if err != nil {
return nil, err
}
executables.SetLifetime(1 * time.Hour) // Allow GC to clean stale items.
executables.SetLifetime(executableCacheLifetime) // Allow GC to clean stale items.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also increase the size of this cache as 4096 seems small given we still haven't solved the core issue here, which is that items may be dropped from this cache (whether through time-based expiration or due to the LRU being full) and there's no control or guarantee as to when they'll be re-inserted.

I'd set the cache size to 16384 until we really solve the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion on the cache size (4096 seems quite large to me).
Ideally, we should have metrics for expiration and eviction, then get numbers from production systems.

We can also make the cache sizes and lifetimes configurable.

And we can create an LRU wrapper that automatically resizes the LRUs, as suggested at #248 (comment)

Maybe better continue at #244 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also not forget that (currently), the reporter implementation in this repository is just for demo/example purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion on my side either - went ahead and bumped it to 16384 in bf83645

To make it less likely that LRU-full eviction evicts executable metadata for a valuable executable, see open-telemetry#247 (comment)
@rockdaboot rockdaboot merged commit 0dac5dd into open-telemetry:main Nov 21, 2024
23 checks passed
Comment on lines -577 to +583
execInfo, exists := r.executables.Get(traceInfo.files[i])
// Ensure that actively used executables do not expire.
execInfo, exists := r.executables.GetAndRefresh(traceInfo.files[i],
executableCacheLifetime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference here? Does not Get automatically reset the TTL? If not, the other usage of the lru should be reviewed and this same change probably applies to (potential multiple) other locations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get updates the recentness of an item, which covers eviction when the LRU is full.
The lifetime is only updated/set with Add and is not touched by Get.
The lifetime is a concept that allows purging unused items from an LRU (call to PurgeExpired) to avoid long-term resource leaks.

GetAndRefresh, updates both, recentness and lifetime of an item.

FreeLRU does not actively purge expired items. For the reporter caches, we recently added a ticker to call to PurgeExpired regularly. For other caches we don't have an active purging.

ltrk2 added a commit to instana/opentelemetry-ebpf-profiler that referenced this pull request Dec 4, 2024
* Fix unwinding at syscall on aarch64 (open-telemetry#218)

* Add PID as an attribute in each sample (open-telemetry#212)

* ebpf: increase number of stack delta buckets (open-telemetry#231)

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

* reporter: use htlhash attribute for profiling specific hash (open-telemetry#236)

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

* reporter: drop fifo (open-telemetry#239)

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

* Drop more unused code (open-telemetry#240)

* reporter: do not add empty attributes (open-telemetry#233)

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

* controller: fix reporter interval mix up with monitor interval (open-telemetry#242)

* Extract reporter runloop (open-telemetry#228)

Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>

* Extract lookupcgroupv2 out of the otlp reporter (open-telemetry#227)

* Add CPU id to trace and trace metadata (open-telemetry#249)

* reporter: don't expire actively used executables (open-telemetry#247)

* Remove legacy code from libpf.UnixTime64 (open-telemetry#252)

* Turn kernel module file parsing errors into warnings (open-telemetry#255)

Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

* readatbuf: add missing check when reading from tail chunk (open-telemetry#259)

* metrics: Don't send counters with 0 values (open-telemetry#246)

---------

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Co-authored-by: umanwizard <brennan@umanwizard.com>
Co-authored-by: Bhavna Jindal <bhavna.jindal@yahoo.com>
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>
Co-authored-by: Tim Rühsen <tim.ruhsen@elastic.co>
Co-authored-by: Nayef Ghattas <nayef.ghattas@datadoghq.com>
Co-authored-by: Damien Mathieu <42@dmathieu.com>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Tommy Reilly <gnurizen@gmail.com>
Co-authored-by: Joel Höner <joel@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Native frames: file name and gnu build id missing after one hour of run time
4 participants