Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native frames: file name and gnu build id missing after one hour of run time #244

Closed
Gandem opened this issue Nov 19, 2024 · 16 comments · Fixed by #247
Closed

Native frames: file name and gnu build id missing after one hour of run time #244

Gandem opened this issue Nov 19, 2024 · 16 comments · Fixed by #247

Comments

@Gandem
Copy link
Contributor

Gandem commented Nov 19, 2024

What happened?

After the profiler has seen an executable for one hour, mappings for native frames no longer contain the file name field and gnu build id attribute.

Expected behavior: This might happen during startup before the ELF files are parsed/indexed for an executable, but should not happen after multiple minutes of run time.

Reproducing

On latest main (9c8ddce):

  • Run the profiler via sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls -reporter-interval=1m
  • Wait for one hour (1h) after native frames are seen for an executable (for example the ebpf-profiler itself)
  • Mappings for native frames for that executable will no longer container file name / gnu build id attribute

Root cause

In the OTLP reporter, executables have a one hour lifetime (since #215):

executables.SetLifetime(1 * time.Hour) // Allow GC to clean stale items.

As a consequence, in elastic/go-freelru, the expire field is set to an hour from the first time the executable is seen:
https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L55-L57

In the current implementation, this means that the cache entry will get cleaned up after one hour, regardless of the latest time it was used or seen, since this expire field is only set when the entry is added to the cache in https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L338, but never updated ulteriorly (unless the cache entry is updated explicitly).

The cache is populated in:

pm.reporter.ExecutableMetadata(&reporter.ExecutableMetadataArgs{

This function is only called if the ELF file is not in the elfInfoCache:

if info, ok := pm.elfInfoCache.Get(key); ok && info.lastModified == lastModified {
// Cached data ok
pm.elfInfoCacheHit.Add(1)
return info
}

The TTL of that cache is 6 hours:

elfInfoCacheTTL = 6 * time.Hour

As a consequence, after 1h of continuous runtime for an executable, it is removed from the reporter's executables cache, but not re-added until 5h later, leading to file name and gnu build id fields missing.

Alternative reproduction

It's possible to reproduce this by lowering the lifetime of the executable cache to 1 min. You can find below a helper grpc python server that showcases the issue:

Python gRPC server script to showcase the issue
from concurrent import futures
import logging
import datetime

import grpc
import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2 as otelprotodef
import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2_grpc as otelproto


class Exporter(otelproto.ProfilesService):
    hash_to_exec_info = {}

    def Export(self, request: otelprotodef.ExportProfilesServiceRequest, target):
        print(datetime.datetime.now(), "Received message!")
        for resource in request.resource_profiles:
            for scope in resource.scope_profiles:
                for container in scope.profiles:
                    profile = container.profile
                    string_table = profile.string_table
                    attribute_table = profile.attribute_table
                    mappings = profile.mapping
                    for mapping in mappings:
                        htl_hash, build_id = "", ""
                        filename = string_table[mapping.filename]
                        for attribute in mapping.attributes:
                            if attribute_table[attribute].key == "process.executable.build_id.htlhash":
                                htl_hash = attribute_table[attribute].value.string_value
                            if attribute_table[attribute].key == "process.executable.build_id.gnu":
                                build_id = attribute_table[attribute].value.string_value
                        if build_id != "" or (filename != "UNKNOWN" and filename != ""):
                            self.hash_to_exec_info[htl_hash] = (build_id, filename)
                        if self.hash_to_exec_info.get(htl_hash, None) is not None and build_id == "" and (filename == "" or filename == "UNKNOWN"):
                            print(datetime.datetime.now(), "Lost exec info data for executable", self.hash_to_exec_info[htl_hash], build_id, filename)

        return otelprotodef.ExportProfilesServiceResponse()


def serve():
    port = "8000"
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    otelproto.add_ProfilesServiceServicer_to_server(Exporter(), server)
    server.add_insecure_port("[::]:" + port)
    server.start()
    print("Server started, listening on " + port)
    server.wait_for_termination()


if __name__ == "__main__":
    logging.basicConfig()
    serve()
Sample output when running after 1min
2024-11-19 08:58:29.933295 Lost exec info data for executable ('ca3ef4796503671b33573ac8621ffd9c4c237ec7', 'process-agent')  UNKNOWN
2024-11-19 08:58:29.933319 Lost exec info data for executable ('490fef8403240c91833978d494d39e537409b92e', 'libc.so.6')  UNKNOWN
2024-11-19 08:58:29.933332 Lost exec info data for executable ('80541b54620ad5e5bb1011b3cf4573c3e4428c9c', 'libjvm.so')  UNKNOWN
2024-11-19 08:58:29.933435 Lost exec info data for executable ('d1f40ab9db7631ede9d147f878b1c8e3f515c730', 'agent')  UNKNOWN
2024-11-19 08:58:29.933448 Lost exec info data for executable ('912362c73f5bc409aad92bd06447d1642a3e7b0e', 'trace-agent')  UNKNOWN
2024-11-19 08:58:29.933459 Lost exec info data for executable ('b1169cdfa04f518b407de97242d9e07568ff743b', 'ebpf-profiler')  UNKNOWN
2024-11-19 08:58:29.933470 Lost exec info data for executable ('30840b79ac329ecbf1dec0bb60180eed256d319f', 'libz.so.1.2.11')  UNKNOWN
2024-11-19 08:58:29.933482 Lost exec info data for executable ('a38e3be1b89000e883290cc72b54b5635fb97043', 'sshd')  UNKNOWN

Potential fixes

A couple potential options/questions that come to mind:

  • Is it expected that go-freelru doesn't take into account whether the cache entry is actively being used before expiring it? We could, for example, add an option in go-freelru to have the expiry lifetime updated when the cache entry is used, so that an actively used entry is not forcefully removed.
  • Otherwise, it would also be possible to resend executable metadata more frequently that the executables cache's lifetime in processinfo (and elsewhere) to ensure that the cache entry is updated. This will have the side effect of extending its lifetime in the cache, and not removing it if it's actively used.

Let me know whether there's any preference as to how we should fix this. Happy to open a PR in the relevant repository 🙇 !

Notes

It's unclear to me whether this is related to #235. I suspect (though I would need to dig further to prove this), that #235 could be related to the expiry of active fields in the frames cache in the reporter.

@rockdaboot
Copy link
Contributor

#235 could be related to the expiry of active fields

I don't think so. My tests only ran a few minutes, which is way before any lifetime kicks in.

@rockdaboot
Copy link
Contributor

The lifetimes of individual cache entries can be "refreshed" with AddWithLifetime.

But before going that route, iirc the idea of having lifetimes is to allow wiping unused entries when the cache isn't full, so that the GC can do its work.

If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).

@Gandem
Copy link
Contributor Author

Gandem commented Nov 19, 2024

If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).

Gotcha, thanks for the context 👍 If that's the intent, then the issue likely lies in when ExecutableMetadata is called.

In processinfo, it's called once every 6 hours, due to the elfInfoCache:

if info, ok := pm.elfInfoCache.Get(key); ok && info.lastModified == lastModified {
// Cached data ok
pm.elfInfoCacheHit.Add(1)
return info
}

In dotnet, it's called only when mappings are synchronized:

symbolReporter.ExecutableMetadata(

So right now, nothing guarantees that entries are being reinserted, AFAICT.

Design wise though, I don't see how we could enforce that all callers of ExecutableMetadata are frequently inserting the data again in the cache?

@rockdaboot
Copy link
Contributor

Sorry, now I get your point of the elfInfoCache and the executables cache being out of sync.

processinfo.go
Both caches are for different purposes, but need to be synced.
The simplest options might be to drop calling reporter.ExecutableKnown() plus using the same lifetime for both caches. Or instead of syncing the lifetimes, we drop the lifetime completely from the executables cache and live with the small amounts of stale items.

In the dotnet code we have a similar pattern (globalPeCache with 6h TTL).

What other options do we have that won't add too much of complexity?

@rockdaboot
Copy link
Contributor

Another slightly different variant to synchronize caches:

  • drop ExecutableKnown()
  • do not use a lifetime for executables cache
  • extend ExecutableKnown() to take a lifetime and use lru.AddWithLifetime()
  • callers of ExecutableKnown() pass the lifetime (6h)
    That way we have caches synced and still are able to wipe stale items.

@Gandem
Copy link
Contributor Author

Gandem commented Nov 19, 2024

I took a deeper look as well, and it looks like getElfInfo is called by processNewExecMapping, which in turn is called by synchronizeMappings, but only for new mappings:

// Add the new ELF mappings
for _, mapping := range mpAdd {
pm.processNewExecMapping(pr, mapping)
}

So even if we synchronized elfInfoCache with the executables cache, we'd still have the issue. Is my understanding correct?

In the short term, it sounds like the easiest approach would be to drop the lifetime on executables, what do you think?

In the long term though, I'm not sure we can enforce that all callers of ExecutableMetadata are frequently inserting the data again in the cache. One solution would be to have a cache structure that doesn't expire actively used data.

In the case of the executables cache, since it's file ID -> (file name, build ID), taking the assumptions that file IDs and build IDs are unique, we'd probably be okay with storing this data indefinitely in the cache (without needing to refresh it) as long as it's actively being used.

@rockdaboot
Copy link
Contributor

One solution would be to have a cache structure that doesn't expire actively used data.

Opened an issue for discussion: elastic/go-freelru#63

Once we agree, a LRU API can be implemented relatively quickly. Should we go that route?

@rockdaboot
Copy link
Contributor

PR elastic/go-freelru#64
I'll create a new release when merged. Then we have the tooling to fix this issue properly :)

@Gandem
Copy link
Contributor Author

Gandem commented Nov 20, 2024

Thanks a lot! 🙏

As a side note, there's a slightly similar issue in the frames cache - I ran the profiler next to a Java process and noticed two things:

  • (not very critical) Once per hour, in a single reporting cycle, some of the frames would have as function name UNREPORTED, since the expiry expired an actively used file ID. The next reporting cycle fixes this, so this only affects one profile every hour.
  • After one hour of runtime, stub frames are UNRESOLVED, since their metadata is submitted only once since the addrToStubNameID never expires (see link below).

https://github.com/DataDog/opentelemetry-ebpf-profiler/blob/0599c7c681ee15f6c394c9b3acdf2dd45c7d8279/interpreter/hotspot/instance.go#L240

I believe the fix is slightly less straightforward there, since this is a LRU cache of maps, and ideally we'd want to expire individual addresses IDs in the inner maps. Wondering if it would be feasible to turn it into a LRU cache of LRU caches instead?

(Let me know if you'd prefer we open a separate issue for this one!)

@rockdaboot
Copy link
Contributor

Great spots, @Gandem!
Yeah, let's open another issue and fix things case by case. Otherwise, longer discussions on one topic would block the progress on other topics.

Btw, just released v0.16.0 of go-freelru with the new function.
I assume you are going to create a PR for the executables cache issue, else let me know.

@Gandem
Copy link
Contributor Author

Gandem commented Nov 20, 2024

Thanks! Opened #248 to track the issue for the frames cache.

@christos68k
Copy link
Member

christos68k commented Nov 21, 2024

Focusing more on the core issue (desync between item insertion into the cache and item removal from the cache), some food for thought and some possibilities:

  1. For items that we can't ensure are always present in the reporter cache (such as executable metadata), keep them always cached in the reporter.
  2. Instead of keeping them always cached in the reporter, allow the reporter to query a different subsystem (e.g. processmanager / executableinfomanager) when it can't find them. That subsystem may decide to keep metadata always cached in memory or retrieve it and cache it on demand in an LRU. The key point is that the metadata is always made available to the reporter, either quickly through a cache or slower by looking them up in the filesystem or in a mapping.
  3. Same as 2. but instead of having enrichment take place in the reporter, do it earlier (e.g. in tracer) and send metadata with the trace via ReportTraceEvent.

For me 3. seems like the logically consistent option from a separation of concerns pov, as it performs operations that enrich the trace before the trace makes it to the reporter. The agent subsystems involved in those operations (tracer, tracehandler) already have access to processmanager. Still need to check whether it's practical however.

Also for posterity, we have three kinds of executable metadata belonging to:

  1. Kernel frames
  2. Native frames
  3. .NET frames

We also have subsystems in place that are a natural fit for caching executable metadata for all these frames: tracer for kernel frames and processmanager/execinfomanager for native frames.

@rockdaboot
Copy link
Contributor

rockdaboot commented Nov 21, 2024

We should keep in mind the future public API (mostly the reporter API).
Keeping the API surface low reduces maintenance burden and increases "development freedom".
If the public reporter API calls into some the subsystems, the subsystems API needs to be public as well. It also is a circular dependency (we discussed this before and decided for what we have now).

Having implementation-defined caches inside the reporter allows implementers to decide for whatever they prefer and what is optimal for their use-case. At the same time, the reporter implementation is independent of changes to the subsystems, so developers don't have to think about possible reporter implementations when they change something in the subsystems (e.g caching details like size or lifetime).

@Gandem
Copy link
Contributor Author

Gandem commented Nov 21, 2024

It seems to me that eviction due to the LRU being full is less impactful than time-based expiration (which was fixed in #247), since the time-based expiration caused eviction for all executables at once, whereas the LRU being full will only evict the least recently used executables (which would presumably be less important / predominant in the flame graphs). Granted that LRU-full eviction might still be problematic on very large machines running 100s of containers / executables at once (which was sort of mitigated by bumping the executables cache size to 16384).

Ideally, I agree that it would be better if we were able to guarantee that metadata is always present / can be re-fetched if needed.

Should we keep this issue open / rename it to track this? (Looks like it auto-closed since I mentioned it in #247 PR description that it fixes this issue and I don't have the rights to re-open it 🙏 )

@rockdaboot
Copy link
Contributor

Should we keep this issue open / rename it to track this?

Better create a new one and reference to here.

@Gandem
Copy link
Contributor Author

Gandem commented Nov 21, 2024

Opened #250, thank you! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants