-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Native frames: file name and gnu build id missing after one hour of run time #244
Comments
I don't think so. My tests only ran a few minutes, which is way before any lifetime kicks in. |
The lifetimes of individual cache entries can be "refreshed" with But before going that route, iirc the idea of having lifetimes is to allow wiping unused entries when the cache isn't full, so that the GC can do its work. If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code). |
Gotcha, thanks for the context 👍 If that's the intent, then the issue likely lies in when In processinfo, it's called once every 6 hours, due to the opentelemetry-ebpf-profiler/processmanager/processinfo.go Lines 249 to 253 in e40ccae
In dotnet, it's called only when mappings are synchronized:
So right now, nothing guarantees that entries are being reinserted, AFAICT. Design wise though, I don't see how we could enforce that all callers of |
Sorry, now I get your point of the elfInfoCache and the executables cache being out of sync. processinfo.go In the dotnet code we have a similar pattern (globalPeCache with 6h TTL). What other options do we have that won't add too much of complexity? |
Another slightly different variant to synchronize caches:
|
I took a deeper look as well, and it looks like opentelemetry-ebpf-profiler/processmanager/processinfo.go Lines 473 to 476 in 6ad74a7
So even if we synchronized In the short term, it sounds like the easiest approach would be to drop the lifetime on In the long term though, I'm not sure we can enforce that all callers of ExecutableMetadata are frequently inserting the data again in the cache. One solution would be to have a cache structure that doesn't expire actively used data. In the case of the executables cache, since it's |
Opened an issue for discussion: elastic/go-freelru#63 Once we agree, a LRU API can be implemented relatively quickly. Should we go that route? |
PR elastic/go-freelru#64 |
Thanks a lot! 🙏 As a side note, there's a slightly similar issue in the
I believe the fix is slightly less straightforward there, since this is a LRU cache of maps, and ideally we'd want to expire individual addresses IDs in the inner maps. Wondering if it would be feasible to turn it into a LRU cache of LRU caches instead? (Let me know if you'd prefer we open a separate issue for this one!) |
Great spots, @Gandem! Btw, just released v0.16.0 of go-freelru with the new function. |
Thanks! Opened #248 to track the issue for the |
Focusing more on the core issue (desync between item insertion into the cache and item removal from the cache), some food for thought and some possibilities:
For me 3. seems like the logically consistent option from a separation of concerns pov, as it performs operations that enrich the trace before the trace makes it to the Also for posterity, we have three kinds of executable metadata belonging to:
We also have subsystems in place that are a natural fit for caching executable metadata for all these frames: |
We should keep in mind the future public API (mostly the reporter API). Having implementation-defined caches inside the reporter allows implementers to decide for whatever they prefer and what is optimal for their use-case. At the same time, the reporter implementation is independent of changes to the subsystems, so developers don't have to think about possible reporter implementations when they change something in the subsystems (e.g caching details like size or lifetime). |
It seems to me that eviction due to the LRU being full is less impactful than time-based expiration (which was fixed in #247), since the time-based expiration caused eviction for all executables at once, whereas the LRU being full will only evict the least recently used executables (which would presumably be less important / predominant in the flame graphs). Granted that LRU-full eviction might still be problematic on very large machines running 100s of containers / executables at once (which was sort of mitigated by bumping the executables cache size to 16384). Ideally, I agree that it would be better if we were able to guarantee that metadata is always present / can be re-fetched if needed. Should we keep this issue open / rename it to track this? (Looks like it auto-closed since I mentioned it in #247 PR description that it fixes this issue and I don't have the rights to re-open it 🙏 ) |
Better create a new one and reference to here. |
Opened #250, thank you! 🙏 |
What happened?
After the profiler has seen an executable for one hour, mappings for native frames no longer contain the file name field and gnu build id attribute.
Expected behavior: This might happen during startup before the ELF files are parsed/indexed for an executable, but should not happen after multiple minutes of run time.
Reproducing
On latest
main
(9c8ddce):sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls -reporter-interval=1m
ebpf-profiler
itself)Root cause
In the OTLP reporter, executables have a one hour lifetime (since #215):
opentelemetry-ebpf-profiler/reporter/otlp_reporter.go
Line 158 in e40ccae
As a consequence, in
elastic/go-freelru
, theexpire
field is set to an hour from the first time the executable is seen:https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L55-L57
In the current implementation, this means that the cache entry will get cleaned up after one hour, regardless of the latest time it was used or seen, since this
expire
field is only set when the entry is added to the cache in https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L338, but never updated ulteriorly (unless the cache entry is updated explicitly).The cache is populated in:
opentelemetry-ebpf-profiler/processmanager/processinfo.go
Line 315 in e40ccae
This function is only called if the ELF file is not in the elfInfoCache:
opentelemetry-ebpf-profiler/processmanager/processinfo.go
Lines 249 to 253 in e40ccae
The TTL of that cache is 6 hours:
opentelemetry-ebpf-profiler/processmanager/manager.go
Line 45 in e40ccae
As a consequence, after 1h of continuous runtime for an executable, it is removed from the reporter's
executables
cache, but not re-added until 5h later, leading to file name and gnu build id fields missing.Alternative reproduction
It's possible to reproduce this by lowering the lifetime of the executable cache to 1 min. You can find below a helper grpc python server that showcases the issue:
Python gRPC server script to showcase the issue
Sample output when running after 1min
Potential fixes
A couple potential options/questions that come to mind:
go-freelru
doesn't take into account whether the cache entry is actively being used before expiring it? We could, for example, add an option ingo-freelru
to have the expiry lifetime updated when the cache entry is used, so that an actively used entry is not forcefully removed.executables
cache's lifetime inprocessinfo
(and elsewhere) to ensure that the cache entry is updated. This will have the side effect of extending its lifetime in the cache, and not removing it if it's actively used.Let me know whether there's any preference as to how we should fix this. Happy to open a PR in the relevant repository 🙇 !
Notes
It's unclear to me whether this is related to #235. I suspect (though I would need to dig further to prove this), that #235 could be related to the expiry of active fields in the
frames
cache in the reporter.The text was updated successfully, but these errors were encountered: