Add prototype for async_metric_source [SMAGENT-1324] #1283
Conversation
This is a prototype for an abstract asynchronous metric source. The idea is that for each metric source (e.g., docker), we'll have a concrete subclass. The concrete subclasses will implement the run_impl() method, which will get invoked in the context of a separate thread. Metrics can be returned to the client code in one of three ways:

(1) If we can fetch the metrics within a specified time window, we'll return them synchronously.

(2) If we don't fetch the metrics fast enough, and if the client code provided a callback function, then we'll invoke the callback as soon as the metrics are available.

(3) If we don't fetch the metrics fast enough, and if the client code did not provide a callback function, then we'll save the metrics, and the next time the client code tries to fetch metrics for an entity with the same key, we'll return the previously fetched metric.

Option (3) leaves some room for leaking entries -- if we collect metrics for some entity and store them, but the client code never asks for them again, the entry persists. If we can eliminate option (3) and support only options (1) and (2), then this problem goes away. Otherwise, I'll need to build in some way to trim "old" metrics.
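To make the three paths concrete, here's a minimal, self-contained sketch (C++14). It's a toy stand-in, not the prototype's actual API; the class name, members, and timings are all invented for illustration:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Toy stand-in for the interface described above; not the prototype's API.
class toy_metric_source
{
public:
	using callback_t =
	    std::function<void(const std::string&, const std::string&)>;

	// Path (1): returns true and fills 'out' if the metric arrives within
	// max_wait.  Otherwise returns false and either invokes 'cb' when the
	// metric is ready (path 2) or caches it for a later lookup (path 3).
	bool lookup(const std::string& key,
	            std::string& out,
	            std::chrono::milliseconds max_wait,
	            callback_t cb = nullptr)
	{
		{
			std::lock_guard<std::mutex> guard(m_mutex);
			auto cached = m_cache.find(key);
			if(cached != m_cache.end())   // path (3), second call: cache hit
			{
				out = cached->second;
				m_cache.erase(cached);
				return true;
			}
		}

		auto fut = std::async(std::launch::async,
		                      [key] { return run_impl(key); });

		if(fut.wait_for(max_wait) == std::future_status::ready)
		{
			out = fut.get();              // path (1): fast enough
			return true;
		}

		// Too slow: hand the pending result to a helper thread.  (A real
		// implementation must join its worker before destruction.)
		std::thread([this, key, cb, fut = std::move(fut)]() mutable
		{
			const std::string value = fut.get();
			if(cb)
			{
				cb(key, value);           // path (2): notify via callback
			}
			else
			{
				std::lock_guard<std::mutex> guard(m_mutex);
				m_cache[key] = value;     // path (3): save for next lookup()
			}
		}).detach();

		return false;
	}

private:
	static std::string run_impl(const std::string& key)
	{
		// Stand-in for a subclass's slow fetch (e.g., querying docker).
		std::this_thread::sleep_for(std::chrono::milliseconds(50));
		return "metrics-for-" + key;
	}

	std::mutex m_mutex;
	std::map<std::string, std::string> m_cache;
};
```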
I've been working on some UTs for this prototype. While doing that, I thought of a condition or two that this currently doesn't handle: multiple clients, and mixing callback/no-callback asynchronous behaviour. If we decide to move forward with this prototype, then I'll work on making the semantics for these cases clear and update the prototype as needed.
This change starts adding basic unit tests for async_metric_source that demonstrate how a client might use the component.
This change extends the original framework with the addition of async Docker metadata collection. Part of this change is in the same spirit as a change from Greg -- I've been trying to split out the Linux/Windows/Noop implementations rather than having macro guards throughout the implementation. My approach differs a little in the sense that I introduced a class hierarchy to deal with it. Note that this is just a prototype; there's a good bit of work that isn't here. That includes:
* Need to implement the Windows version of the collector
* Need to implement unit tests. That will include the need to implement something that will "act like" Docker.
This change adds a concrete implementation of the Docker metadata collector for Windows (as yet untested). I also renamed the new files, since I realized that this is collecting metadata and not metrics.
In the earlier versions of the Linux implementation, I overlooked the fact that we were using UNIX domain sockets and not TCP. I updated it to parameterize the path to the socket, and I removed the bits related to the TCP port. This change also adds UT for that.
This change pulls the URL fetching logic out of the implementation of async_linux_docker_metadata_source and into a separate class. The rationale for this change is:
* The details of URL fetching are a separate concern
* The URL fetching logic might be reusable by other components

The URL fetcher is separated into an interface and a concrete realization of that interface, with factory methods for creating concretions. The rationale for this design is:
* It avoids class-to-class coupling between async_linux_docker_metadata_source and a concrete url fetcher
* It will enable us to tweak the factory methods in a unit test environment to return UT-specific realizations of the interface that don't actually have to talk to web servers.
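The shape of the split might look something like this (illustrative sketch; the names approximate the real ones, but the signatures are invented):

```cpp
#include <memory>
#include <string>

// Abstract fetcher: the metadata source holds only this interface,
// avoiding class-to-class coupling with any concrete fetcher.
class url_fetcher
{
public:
	using ptr = std::unique_ptr<url_fetcher>;

	virtual ~url_fetcher() = default;

	// Fetch the given URL; returns an HTTP status code and fills body.
	virtual int fetch(const std::string& url, std::string& body) = 0;

	// Factory: production returns a real fetcher; a unit-test build can
	// tweak this to return a canned-response fake that never touches a
	// web server.
	static ptr new_fetcher();
};
```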
The async_docker_metadata_source will cache metadata lookups if the lookup takes longer than the client is willing to wait. A subsequent call to lookup with the same key will return the cached result. The risk is that if nothing ever makes that subsequent call, the cached result will sit in the cache forever. Over time, the cache will continue to leak such metadata. This change introduces a time-to-live for cache entries. When a cache entry is created, a timestamp is recorded. The async thread examines the entries in the cache and prunes ones that are too old, as specified by the client code.
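A minimal sketch of the pruning idea (hypothetical names; the real logic lives in the async thread's loop):

```cpp
#include <chrono>
#include <map>
#include <string>

// Each cache entry records when it was created...
struct cache_entry
{
	std::chrono::steady_clock::time_point m_created;
	std::string m_metadata;
};

// ...and the async thread periodically drops entries older than the
// client-specified TTL.
void prune_old_entries(std::map<std::string, cache_entry>& cache,
                       const std::chrono::milliseconds ttl)
{
	const auto now = std::chrono::steady_clock::now();

	for(auto itr = cache.begin(); itr != cache.end(); )
	{
		if(now - itr->second.m_created > ttl)
		{
			itr = cache.erase(itr);   // expired: nobody ever came back
		}
		else
		{
			++itr;
		}
	}
}
```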
This change just fills in passing the query_image_info bool into the async lookup objects, and keeps that in sync with the corresponding value in the sinsp_container_engine_docker.
```cpp
bool async_docker_metadata_source::parse_docker(sinsp_container_manager* const manager,
```
Reviewers: Please note that I didn't write this; I just pulled it out of sinsp_container_engine_docker.
```cpp
/**
 * Base class for classes that need to collect metadata asynchronously from some
```
Should we specify asynchronously? IMO it should be used by everyone accessing metadata, and we support that with an "infinite" timeout on the sync call. This makes it easier in the future to make it async when it inevitably is too slow.
I think refactoring existing sync code to use this framework in a synchronous way is doable, but I'm not sure that I see the value. I'd rather just leave existing sync code alone and introduce this if/when it's needed.
I think keeping it the way it is now provides a clearer mental model of what the component does.
```cpp
template<typename key_type, typename metadata_type>
void async_metadata_source<key_type, metadata_type>::stop()
{
	bool join_needed = false;
```
I'm assuming this is only not part of the destructor so that the implementation can invoke it as well. Given that this is a standalone function, we either need some indication that this can only be called once, or it needs some thread safety. I'd recommend just a check after we lock saying that we already stopped it (or someone is already stopping it).
The only use for stop() is in the destructor to ensure that the async thread is cleaned up before the object is destroyed. I'll add such a note in the docs for stop.
until someone wants to pause the thread for some reason and decides calling "stop" is the right way to do that :)
If the intended workflow of the object is to be started and stopped only once, it should be pretty easy to enforce via a couple of well-placed asserts.
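Something like this check-after-lock guard would do it (self-contained sketch; the names are invented, not the prototype's actual members):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class stoppable
{
public:
	stoppable() : m_thread(&stoppable::run, this) {}

	~stoppable() { stop(); }

	void stop()
	{
		{
			std::lock_guard<std::mutex> guard(m_mutex);
			if(m_terminate)   // already stopped (or being stopped)
			{
				return;
			}
			m_terminate = true;
		}
		m_cv.notify_one();
		if(m_thread.joinable())
		{
			m_thread.join();
		}
	}

private:
	void run()
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		m_cv.wait(lock, [this] { return m_terminate; });
	}

	std::mutex m_mutex;
	std::condition_variable m_cv;
	bool m_terminate = false;
	std::thread m_thread;   // declared last so it starts after the state
};
```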
```cpp
template<typename key_type, typename metadata_type>
async_metadata_source<key_type, metadata_type>::~async_metadata_source()
{
	try
```
stop() isn't throwing any exceptions
It isn't throwing any user-specified exceptions, that's true, but it could be calling something that transitively throws an exception. If an exception is thrown from the destructor, it'll crash the program (in C++11), so I'm being defensive against that.
In C++ do you not need to rethrow the exception at each layer of the stack?
In either case, AFAIK we don't catch exceptions in most places, so I'm not sure that, without specific reasons, we should be catching exceptions without a good understanding of what they are and why they might occur... pretty much, why do we have a requirement that THIS particular part of the code needs to be safer than the entire rest of the codebase?
No. Destructors must not throw exceptions.
Far from authoritative, but:
https://akrzemi1.wordpress.com/2013/08/20/noexcept-destructors/
I'm not in favor of using the existing code base as the high-bar for new code.
If we want the agent to crash if an exception is thrown here, I can remove the try/catch. If we do not, then I'm in favor of leaving it.
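To illustrate the failure mode (standalone sketch, not agent code):

```cpp
#include <stdexcept>

// In C++11, destructors are noexcept(true) by default, so an exception
// escaping a destructor calls std::terminate() and the process dies:
struct throws_in_dtor
{
	~throws_in_dtor()   // implicitly noexcept(true)
	{
		// Letting an instance of this destruct aborts the program.
		throw std::runtime_error("boom");
	}
};

// Hence the defensive shape used here:
struct defensive_dtor
{
	~defensive_dtor()
	{
		try
		{
			// stop();  // anything called here could transitively throw
		}
		catch(...)
		{
			// Swallow: a leaked resource beats crashing the agent.
		}
	}
};
```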
```cpp
	if(!m_terminate)
	{
		run_impl();
```
You've released the lock before invoking this function. I see that you reacquire when doing queue operations, but it makes this particular function appear suspicious. We do all the checks against the CV in a way that makes it seem that tight synchronization is essential, then we release the lock before making this call.
If we don't actually need it, I think we need to simplify the above half of the function:
- don't bother with the double-checked locking; termination is something we expect to happen once, ever
- don't bother checking for the empty queue around the CV wait
So pretty much lock -> check for terminate -> CV wait is all we need if we're not holding the lock across the call to run_impl.
The run_impl() method is "outside" of this class -- it's in the subclass. The mutex protects the state of this class, not the state of the subclass, so the lock ought not be held when running the subclass' method. If/when the subclass needs data from this class, it calls an API that acquires the lock.
The CV check is a common pattern for CVs. What benefit do you see from doing something different?
If m_terminate is set, then stop() has been called, likely from a destructor. I don't want to block destruction until a new event comes in to signal on the CV.
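For reference, the pattern I mean, sketched standalone with file-scope stand-ins for the members:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

namespace {

std::mutex m_mutex;
std::condition_variable m_queue_cv;
std::queue<std::string> m_request_queue;
bool m_terminate = false;

void run_impl(const std::string& key)
{
	// Subclass work happens here, outside the base class's lock.
	(void)key;
}

void run()
{
	for(;;)
	{
		std::string key;

		{
			std::unique_lock<std::mutex> lock(m_mutex);

			// The predicate is re-evaluated on every wakeup, so this both
			// handles spurious wakeups and re-checks m_terminate.
			m_queue_cv.wait(lock, [] {
				return m_terminate || !m_request_queue.empty();
			});

			if(m_terminate)
			{
				return;   // don't block destruction waiting for new work
			}

			key = m_request_queue.front();
			m_request_queue.pop();
		} // lock released here...

		// ...so the subclass's run_impl() never runs while this class's
		// mutex is held; it reacquires via accessor APIs if it needs state.
		run_impl(key);
	}
}

} // namespace
```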
```cpp
typename metadata_map::const_iterator itr = m_metadata_map.find(key);
bool request_complete = (itr != m_metadata_map.end()) && itr->second.m_available;
```
Some questions on the semantics:
- if we have a timeout set, should we not be trying to fetch the new data for at least timeout seconds before falling back to the cached data?
- if we have the data, should we not fetch again? Since that's going to cause us to oscillate between having data and not having data.
The semantics are: you call lookup(), then we wait as long as you're willing. If we don't get an answer within that time, we let you know, and we keep working on getting what you asked for. When we eventually get it, we save it for you, and on the next call to lookup() we give you that value. We explicitly do not want to re-query in this case.
In the metadata use case --- the case we're focused on here --- we don't expect that the data changes, so "new" data is as good as cached data.
```cpp
key_type async_metadata_source<key_type, metadata_type>::dequeue_next_key()
{
	std::lock_guard<std::mutex> guard(m_mutex);
	key_type key = m_request_queue.front();
```
this whitespace is kinda wonky
Wonky in what way? I've verified that the indentation is all tabs.
```cpp
 * remain cached waiting for a subsequent call to lookup() before
 * being discarded.
 */
const uint64_t MAX_TTL_MS = (30 * 1000);
```
config?
Meh, I'm not a huge fan of adding a bunch of knobs -- it adds extra complexity for not a lot of gain. I'd rather just specify a value that is "big enough".
If you think that we can't reasonably guess a value that's "big enough", then I can add a knob.
```
@@ -0,0 +1,26 @@
/*
```
why is this file showing up in the diff?
Because it's new in this change :)
Why? I ran into non-self-sufficient headers when I was adding this code, and I did this to resolve the issue.
Eliminated the queue_size().
I wanted to put a struct type that is used as a template parameter inside the class in which it was used, but the compiler wasn't happy when I tried that.
```cpp
/**
 * @file
 *
 * Fill in a short overview of the file's content
```
Need to fill in the blank.
Overall the change looks good and looks pretty complete. The only feedback is really about whether the implementation should perform both roles of being a key/value store and doing lookups. If it's managing the objects, you'll also need to think about things like garbage collection, aging, etc. That being said, you've done almost all of the work already so I won't hold anything back.
```cpp
 * @returns true if this method was able to lookup and return the
 * value synchronously; false otherwise.
 */
bool lookup(const key_type& key,
```
Did you think about separating the acts of pushing the key into the object and getting the value back? That makes the client interface more of a simple queue. And if the queue read were non-blocking you wouldn't need the deadline/max_wait_ms stuff. The part where the callback is called in the context of the async spawned thread might limit its usefulness.
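Sketching the suggestion (all names hypothetical):

```cpp
#include <string>

// The client pushes keys in and polls results out; no deadline or
// max_wait_ms machinery, and no callback running on the async thread.
template<typename key_type, typename value_type>
class metadata_queue
{
public:
	// Enqueue a key for the async thread to resolve.
	void push(const key_type& key);

	// Non-blocking: returns true and fills 'key'/'value' if some
	// previously pushed request has completed; false if nothing is
	// ready yet.
	bool try_pop(key_type& key, value_type& value);
};
```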
```cpp
 *
 * @returns the value associated with the given key.
 */
value_type get_value(const key_type& key);
```
Does this class need to keep track of previously fetched values, or can it be more of a work item fetcher, where once it satisfies a request it can forget all about it? This affects whether you need get_value()/store_value().
Aside from the specific implementation, the fact that the container information will be available shortly after the exec of the first pid in the container instead of immediately will require some secure rules changes. Our plan was to make the (currently invisible) container event visible and use that as a notice that a container had started, and adapt falco rules to work both on the first process exec as well as the container event. We haven't fully developed that plan yet, but we will work on that in the next week.
@anoop-sysd Yes, let's close it, but let's not delete it. There are a number of enhancements here that didn't make it in that would be nice to eventually have.
This is a prototype for an abstract asynchronous metric source. The
idea is that for each metric source (e.g., docker), we'll have a
concrete subclass. The concrete subclasses will implement the
run_impl() method, which will get invoked in the context of a separate
thread.
Metrics can be returned to the client code in one of three ways:
(1) If we can fetch the metrics within a specified time window, we'll
return them synchronously.
(2) If we don't fetch the metrics fast enough, and if the client code
provided a callback function, then we'll invoke the callback as
soon as the metrics are available.
(3) If we don't fetch the metrics fast enough, and if the client code
did not provide a callback function, then we'll save the metrics and
the next time the client code tries to fetch metrics for an entity
with the same key, we'll return the previously fetched metric.
Option (3) leaves some room for leaking entries -- if we collect metrics
for some entity and store them, but the client code never asks for them
again, the entry would linger. To address this, we prune "old" entries
if they aren't retrieved within a given time window.