Add prototype for async_metric_source [SMAGENT-1324] #1283
Conversation
This is a prototype for an abstract asynchronous metric source. The idea is that for each metric source (e.g., docker), we'll have a concrete subclass. The concrete subclasses will implement the run_impl() method, which will get invoked in the context of a separate thread. Metrics can be returned to the client code in one of three ways:

(1) If we can fetch the metrics within a specified time window, we'll return them synchronously.

(2) If we don't fetch the metrics fast enough, and if the client code provided a callback function, then we'll invoke the callback as soon as the metrics are available.

(3) If we don't fetch the metrics fast enough, and if the client code did not provide a callback function, then we'll save the metrics, and the next time the client code tries to fetch metrics for an entity with the same key, we'll return the previously fetched metric.

Option (3) leaves some room for leaking entries -- if we collect metrics for some entity and store them, but the client code never asks for them again, the entry persists. If we can eliminate option (3) and support only options (1) and (2), then this problem goes away. Otherwise, I'll need to build in some way to trim "old" metrics.
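To make the three paths concrete, here's a minimal, self-contained sketch (C++14). It's a toy stand-in, not the prototype's actual API; the class name, members, and timings are all invented for illustration:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Toy stand-in for the interface described above; not the prototype's API.
class toy_metric_source
{
public:
	using callback_t =
	    std::function<void(const std::string&, const std::string&)>;

	// Path (1): returns true and fills 'out' if the metric arrives within
	// max_wait.  Otherwise returns false and either invokes 'cb' when the
	// metric is ready (path 2) or caches it for a later lookup (path 3).
	bool lookup(const std::string& key,
	            std::string& out,
	            std::chrono::milliseconds max_wait,
	            callback_t cb = nullptr)
	{
		{
			std::lock_guard<std::mutex> guard(m_mutex);
			auto cached = m_cache.find(key);
			if(cached != m_cache.end())   // path (3), second call: cache hit
			{
				out = cached->second;
				m_cache.erase(cached);
				return true;
			}
		}

		auto fut = std::async(std::launch::async,
		                      [key] { return run_impl(key); });

		if(fut.wait_for(max_wait) == std::future_status::ready)
		{
			out = fut.get();              // path (1): fast enough
			return true;
		}

		// Too slow: hand the pending result to a helper thread.  (A real
		// implementation must join its worker before destruction.)
		std::thread([this, key, cb, fut = std::move(fut)]() mutable
		{
			const std::string value = fut.get();
			if(cb)
			{
				cb(key, value);           // path (2): notify via callback
			}
			else
			{
				std::lock_guard<std::mutex> guard(m_mutex);
				m_cache[key] = value;     // path (3): save for next lookup()
			}
		}).detach();

		return false;
	}

private:
	static std::string run_impl(const std::string& key)
	{
		// Stand-in for a subclass's slow fetch (e.g., querying docker).
		std::this_thread::sleep_for(std::chrono::milliseconds(50));
		return "metrics-for-" + key;
	}

	std::mutex m_mutex;
	std::map<std::string, std::string> m_cache;
};
```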
I've been working on some UTs for this prototype. While doing that, I thought of a condition or two that this currently doesn't handle: multiple clients, and mixing callback/no-callback asynchronous behaviour. If we decide to move forward with this prototype, then I'll work on making the semantics for these cases clear and update the prototype as needed.
This change starts adding basic unit tests for async_metric_source that demonstrate how a client might use the component.
This change extends the original framework with the addition of async Docker metadata collection. Part of this change is in the same spirit as a change from Greg -- I've been trying to split out the Linux/Windows/Noop implementations rather than having macro guards throughout the implementation. My approach differs a little in the sense that I introduced a class hierarchy to deal with it. Note that this is just a prototype; there's a good bit of work that isn't here. That includes:
* Need to implement the Windows version of the collector
* Need to implement unit tests. That will include the need to implement something that will "act like" Docker.
This change adds a concrete implementation of the Docker metadata collector for Windows (as yet untested). I also renamed the new files, since I realized that this is collecting metadata and not metrics.
In the earlier versions of the Linux implementation, I overlooked the fact that we were using UNIX domain sockets and not TCP. I updated it to parameterize the path to the socket, and I removed the bits related to the TCP port. This change also adds UT for that.
This change pulls the URL fetching logic out of the implementation of async_linux_docker_metadata_source and into a separate class. The rationale for this change is:
* The details of URL fetching are a separate concern
* The URL fetching logic might be reusable by other components

The URL fetcher is separated into an interface and a concrete realization of that interface, with factory methods for creating concretions. The rationale for this design is:
* It avoids class-to-class coupling between async_linux_docker_metadata_source and a concrete url fetcher
* It will enable us to tweak the factory methods in a unit test environment to return UT-specific realizations of the interface that don't actually have to talk to web servers.
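The shape of the split might look something like this (illustrative sketch; the names approximate the real ones, but the signatures are invented):

```cpp
#include <memory>
#include <string>

// Abstract fetcher: the metadata source holds only this interface,
// avoiding class-to-class coupling with any concrete fetcher.
class url_fetcher
{
public:
	using ptr = std::unique_ptr<url_fetcher>;

	virtual ~url_fetcher() = default;

	// Fetch the given URL; returns an HTTP status code and fills body.
	virtual int fetch(const std::string& url, std::string& body) = 0;

	// Factory: production returns a real fetcher; a unit-test build can
	// tweak this to return a canned-response fake that never touches a
	// web server.
	static ptr new_fetcher();
};
```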
The async_docker_metadata_source will cache metadata lookups if the lookup takes longer than the client is willing to wait. A subsequent call to lookup with the same key will return the cached result. The risk is that if nothing ever makes that subsequent call, the cached result will sit in the cache forever. Over time, the cache will continue to leak such metadata. This change introduces a time-to-live for cache entries. When a cache entry is created, a timestamp is recorded. The async thread examines the entries in the cache and prunes ones that are too old, as specified by the client code.
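A minimal sketch of the pruning idea (hypothetical names; the real logic lives in the async thread's loop):

```cpp
#include <chrono>
#include <map>
#include <string>

// Each cache entry records when it was created...
struct cache_entry
{
	std::chrono::steady_clock::time_point m_created;
	std::string m_metadata;
};

// ...and the async thread periodically drops entries older than the
// client-specified TTL.
void prune_old_entries(std::map<std::string, cache_entry>& cache,
                       const std::chrono::milliseconds ttl)
{
	const auto now = std::chrono::steady_clock::now();

	for(auto itr = cache.begin(); itr != cache.end(); )
	{
		if(now - itr->second.m_created > ttl)
		{
			itr = cache.erase(itr);   // expired: nobody ever came back
		}
		else
		{
			++itr;
		}
	}
}
```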
This change just fills in passing the query_image_info bool into the async lookup objects, and keeps that in sync with the corresponding value in the sinsp_container_engine_docker.
```cpp
bool async_docker_metadata_source::parse_docker(sinsp_container_manager* const manager,
```
Reviewers: Please note that I didn't write this; I just pulled it out of sinsp_container_engine_docker.
```cpp
/**
 * Base class for classes that need to collect metadata asynchronously from some
```
Should we specify asynchronously? IMO it should be used by everyone accessing metadata, and we support that with an "infinite" timeout on the sync call. This makes it easier in the future to make it async when it inevitably is too slow.
I think refactoring existing sync code to use this framework in a synchronous way is doable, but I'm not sure that I see the value. I'd rather just leave existing sync code alone and introduce this if/when it's needed.
I think keeping it the way it is now provides a clearer mental model of what the component does.
```cpp
template<typename key_type, typename metadata_type>
void async_metadata_source<key_type, metadata_type>::stop()
{
	bool join_needed = false;
```
I'm assuming this is only not part of the destructor so that the implementation can invoke it as well. Given that this is a standalone function, we either need some indication that this can only be called once, or it needs some thread safety. I'd recommend just a check after we lock saying that we already stopped it (or someone is already stopping it).
The only use for stop() is in the destructor to ensure that the async thread is cleaned up before the object is destroyed. I'll add such a note in the docs for stop.
until someone wants to pause the thread for some reason and decides calling "stop" is the right way to do that :)
If the intended workflow of the object is to be started and stopped only once, it should be pretty easy to enforce via a couple of well-placed asserts.
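Something like this check-after-lock guard would do it (self-contained sketch; the names are invented, not the prototype's actual members):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class stoppable
{
public:
	stoppable() : m_thread(&stoppable::run, this) {}

	~stoppable() { stop(); }

	void stop()
	{
		{
			std::lock_guard<std::mutex> guard(m_mutex);
			if(m_terminate)   // already stopped (or being stopped)
			{
				return;
			}
			m_terminate = true;
		}
		m_cv.notify_one();
		if(m_thread.joinable())
		{
			m_thread.join();
		}
	}

private:
	void run()
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		m_cv.wait(lock, [this] { return m_terminate; });
	}

	std::mutex m_mutex;
	std::condition_variable m_cv;
	bool m_terminate = false;
	std::thread m_thread;   // declared last so it starts after the state
};
```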
```cpp
template<typename key_type, typename metadata_type>
async_metadata_source<key_type, metadata_type>::~async_metadata_source()
{
	try
```
stop() isn't throwing any exceptions
It isn't throwing any user-specified exceptions, that's true, but it could be calling something that transitively throws an exception. If an exception is thrown from the destructor, it'll crash the program (in C++11), so I'm being defensive against that.
In C++ do you not need to rethrow the exception at each layer of the stack?
In either case, AFAIK we don't catch exceptions in most places, so I'm not sure that, without specific reasons, we should be catching exceptions without a good understanding of what they are and why they might occur... pretty much, why do we have a requirement that THIS particular part of the code needs to be safer than the entire rest of the codebase?
No. Destructors must not throw exceptions.
Far from authoritative, but:
https://akrzemi1.wordpress.com/2013/08/20/noexcept-destructors/
I'm not in favor of using the existing code base as the high-bar for new code.
If we want the agent to crash if an exception is thrown here, I can remove the try/catch. If we do not, then I'm in favor of leaving it.
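To illustrate the failure mode (standalone sketch, not agent code):

```cpp
#include <stdexcept>

// In C++11, destructors are noexcept(true) by default, so an exception
// escaping a destructor calls std::terminate() and the process dies:
struct throws_in_dtor
{
	~throws_in_dtor()   // implicitly noexcept(true)
	{
		// Letting an instance of this destruct aborts the program.
		throw std::runtime_error("boom");
	}
};

// Hence the defensive shape used here:
struct defensive_dtor
{
	~defensive_dtor()
	{
		try
		{
			// stop();  // anything called here could transitively throw
		}
		catch(...)
		{
			// Swallow: a leaked resource beats crashing the agent.
		}
	}
};
```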
```cpp
	if(!m_terminate)
	{
		run_impl();
```
You've released the lock before invoking this function. I see that you reacquire when doing queue operations, but it makes this particular function appear suspicious. We do all the checks against the CV in a way that makes it seem that tight synchronization is essential, then we release the lock before making this call.
If we don't actually need it, I think we need to simplify the above half of the function:
- don't bother with the double-checked locking; termination is something we expect to happen once, ever
- don't bother checking for the empty queue around the CV wait
So pretty much lock -> check for terminate -> CV wait is all we need if we're not holding the lock across the call to run_impl.
The run_impl() method is "outside" of this class -- it's in the subclass. The mutex protects the state of this class, not the state of the subclass, so the lock ought not be held when running the subclass' method. If/when the subclass needs data from this class, it calls an API that acquires the lock.
The CV check is a common pattern for CVs. What benefit do you see from doing something different?
If m_terminate is set, then stop() has been called, likely from a destructor. I don't want to block destruction until a new event comes in to signal on the CV.
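For reference, the pattern I mean, sketched standalone with file-scope stand-ins for the members:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

namespace {

std::mutex m_mutex;
std::condition_variable m_queue_cv;
std::queue<std::string> m_request_queue;
bool m_terminate = false;

void run_impl(const std::string& key)
{
	// Subclass work happens here, outside the base class's lock.
	(void)key;
}

void run()
{
	for(;;)
	{
		std::string key;

		{
			std::unique_lock<std::mutex> lock(m_mutex);

			// The predicate is re-evaluated on every wakeup, so this both
			// handles spurious wakeups and re-checks m_terminate.
			m_queue_cv.wait(lock, [] {
				return m_terminate || !m_request_queue.empty();
			});

			if(m_terminate)
			{
				return;   // don't block destruction waiting for new work
			}

			key = m_request_queue.front();
			m_request_queue.pop();
		} // lock released here...

		// ...so the subclass's run_impl() never runs while this class's
		// mutex is held; it reacquires via accessor APIs if it needs state.
		run_impl(key);
	}
}

} // namespace
```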
```cpp
typename metadata_map::const_iterator itr = m_metadata_map.find(key);
bool request_complete = (itr != m_metadata_map.end()) && itr->second.m_available;
```
Some questions on the semantics:
- if we have a timeout set, should we not be trying to fetch the new data for at least timeout seconds before falling back to the cached data?
- if we have the data, should we not fetch again? Since that's going to cause us to oscillate between having data and not having data.
The semantics are: you call lookup(), then we wait as long as you're willing. If we don't get an answer within that time, we let you know, and we keep working on getting what you asked for. When we eventually get it, we save it for you, and on the next call to lookup() we give you that value. We explicitly do not want to re-query in this case.
In the metadata use case --- the case we're focused on here --- we don't expect that the data changes, so "new" data is as good as cached data.
```cpp
key_type async_metadata_source<key_type, metadata_type>::dequeue_next_key()
{
	std::lock_guard<std::mutex> guard(m_mutex);
	key_type key = m_request_queue.front();
```
this whitespace is kinda wonky
Wonky in what way? I've verified that the indentation is all tabs.
```cpp
 * remain cached waiting for a subsequent call to lookup() before
 * being discarded.
 */
const uint64_t MAX_TTL_MS = (30 * 1000);
```
config?
Meh, I'm not a huge fan of adding a bunch of knobs -- it adds extra complexity for not a lot of gain. I'd rather just specify a value that is "big enough".
If you think that we can't reasonably guess a value that's "big enough", then I can add a knob.
```
@@ -0,0 +1,26 @@
/*
```
why is this file showing up in the diff?
Because it's new in this change :)
Why? I ran into non-self-sufficient headers when I was adding this code, and I did this to resolve the issue.
Eliminated the queue_size().
I wanted to put a struct type that is used as a template parameter inside the class in which it was used, but the compiler wasn't happy when I tried that.
```cpp
/**
 * @file
 *
 * Fill in a short overview of the file's content
```
Need to fill in the blank.
Overall the change looks good and looks pretty complete. The only feedback is really about whether the implementation should perform both roles of being a key/value store and doing lookups. If it's managing the objects, you'll also need to think about things like garbage collection, aging, etc. That being said, you've done almost all of the work already so I won't hold anything back.
```cpp
 * @returns true if this method was able to lookup and return the
 * value synchronously; false otherwise.
 */
bool lookup(const key_type& key,
```
Did you think about separating the acts of pushing the key into the object and getting the value back? That makes the client interface more of a simple queue. And if the queue read were non-blocking you wouldn't need the deadline/max_wait_ms stuff. The part where the callback is called in the context of the async spawned thread might limit its usefulness.
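Sketching the suggestion (all names hypothetical):

```cpp
#include <string>

// The client pushes keys in and polls results out; no deadline or
// max_wait_ms machinery, and no callback running on the async thread.
template<typename key_type, typename value_type>
class metadata_queue
{
public:
	// Enqueue a key for the async thread to resolve.
	void push(const key_type& key);

	// Non-blocking: returns true and fills 'key'/'value' if some
	// previously pushed request has completed; false if nothing is
	// ready yet.
	bool try_pop(key_type& key, value_type& value);
};
```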
```cpp
 *
 * @returns the value associated with the given key.
 */
value_type get_value(const key_type& key);
```
Does this class need to keep track of previously fetched values, or can it be more of a work item fetcher, where once it satisfies a request it can forget all about it? This affects whether you need get_value()/store_value().
Aside from the specific implementation, the fact that the container information will be available shortly after the exec of the first pid in the container instead of immediately will require some secure rules changes. Our plan was to make the (currently invisible) container event visible and use that as a notice that a container had started, and adapt falco rules to work both on the first process exec as well as the container event. We haven't fully developed that plan yet, but we will work on that in the next week.
@anoop-sysd Yes, let's close it, but let's not delete it. There are a number of enhancements here that didn't make it in that would be nice to eventually have.
This is a prototype for an abstract asynchronous metric source. The
idea is that for each metric source (e.g., docker), we'll have a
concrete subclass. The concrete subclasses will implement the
run_impl() method, which will get invoked in the context of a separate
thread.
Metrics can be returned to the client code in one of three ways:
(1) If we can fetch the metrics within a specified time window, we'll
return them synchronously.
(2) If we don't fetch the metrics fast enough, and if the client code
provided a callback function, then we'll invoke the callback as
soon as the metrics are available.
(3) If we don't fetch the metrics fast enough, and if the client code
did not provide a callback function, then we'll save the metrics and
the next time the client code tries to fetch metrics for an entity
with the same key, we'll return the previously fetched metric.
Option (3) leaves some room for leaking entries -- if we collect metrics
for some entity and store them, but the client code never asks for them
again, the entry would linger. To address this, we prune "old" entries
if they aren't retrieved within a given time window.