
Add prototype for async_metric_source [SMAGENT-1324] #1283

Closed
wants to merge 14 commits

Conversation

adalton (Contributor) commented Dec 21, 2018

This is a prototype for an abstract asynchronous metric source. The
idea is that for each metric source (e.g., docker), we'll have a
concrete subclass. The concrete subclasses will implement the
run_impl() method, which will get invoked in the context of a separate
thread.

Metrics can be returned to the client code in one of three ways:

(1) If we can fetch the metrics within a specified time window, we'll
return them synchronously.
(2) If we don't fetch the metrics fast enough, and if the client code
provided a callback function, then we'll invoke the callback as
soon as the metrics are available.
(3) If we don't fetch the metrics fast enough, and if the client code
did not provide a callback function, then we'll save the metrics and
the next time the client code tries to fetch metrics for an entity
with the same key, we'll return the previously fetched metric.

Option (3) leaves some room for leaking entries: if we collect metrics
for some entity and store them, the client code may never ask for them
again. To guard against that, we prune "old" entries that aren't retrieved within a given time window.
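
A rough sketch of that contract, with illustrative names and signatures (not the exact interface in this PR):

    #include <chrono>
    #include <functional>

    template<typename key_type, typename metric_type>
    class async_metric_source
    {
    public:
    	using callback_handler =
    		std::function<void(const key_type&, const metric_type&)>;

    	virtual ~async_metric_source() = default;

    	/**
    	 * Attempts the lookup, waiting up to max_wait for a result.
    	 *
    	 * @returns true (and fills in metric) if the value was available
    	 *          within max_wait.  Otherwise returns false, and the
    	 *          result is delivered later via handler (case 2) or
    	 *          cached for a subsequent lookup() with the same key
    	 *          (case 3).
    	 */
    	bool lookup(const key_type& key,
    	            metric_type& metric,
    	            std::chrono::milliseconds max_wait,
    	            const callback_handler& handler = nullptr);

    protected:
    	/**
    	 * Implemented by concrete sources (e.g., docker); runs in the
    	 * context of the separate thread.
    	 */
    	virtual void run_impl() = 0;
    };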

adalton (Contributor Author) commented Dec 21, 2018

I've been working on some UTs for this prototype. While doing that, I thought of a condition or two that this currently doesn't handle: multiple clients, and mixing callback/no-callback asynchronous behaviour. If we decide to move forward with this prototype, then I'll work on making the semantics for these cases clear and update the prototype as needed.

This change starts adding basic unit tests for async_metric_source that
demonstrate how a client might use the component.
Andy Dalton added 6 commits January 3, 2019 17:10
This change extends the original framework with the addition of async
Docker metadata collection.  Part of this change is in the same spirit as
a change from Greg -- I'm trying to split out the Linux/Windows/Noop
implementations rather than having macro guards throughout the
implementation.  My approach differs a little in that I introduced a
class hierarchy to deal with it.

Note that this is just a prototype; there's a good bit of work that
isn't here.  That includes:
    * Need to implement the Windows version of the collector
    * Need to implement unit tests.  That will include the need to
      implement something that will "act like" Docker.
This change adds a concrete implementation of the Docker metadata
collector for Windows (as yet untested).

I also renamed the new files since I realized that this is collecting
metadata and not metrics.
In the earlier versions of the Linux implementation, I overlooked the
fact that we were using UNIX domain sockets and not TCP.  I updated it
to parameterize the path to the socket, and I removed the bits related
to the TCP port.

This change also adds UT for that.
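
For reference, a minimal sketch of talking to the Docker API over the UNIX domain socket with libcurl (requires libcurl >= 7.40; the socket path and endpoint here are the conventional defaults, not values taken from this change):

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    static size_t collect(char* data, size_t size, size_t nmemb, void* out)
    {
    	static_cast<std::string*>(out)->append(data, size * nmemb);
    	return size * nmemb;
    }

    int main()
    {
    	CURL* const curl = curl_easy_init();

    	if(curl == nullptr)
    	{
    		return 1;
    	}

    	std::string body;

    	// HTTP over the local UNIX domain socket instead of a TCP port.
    	curl_easy_setopt(curl, CURLOPT_UNIX_SOCKET_PATH, "/var/run/docker.sock");
    	curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/containers/json");
    	curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    	curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    	const CURLcode rc = curl_easy_perform(curl);
    	curl_easy_cleanup(curl);

    	if(rc != CURLE_OK)
    	{
    		return 1;
    	}

    	std::cout << body << std::endl;
    	return 0;
    }
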
This change pulls the URL fetching logic out of the implementation of
async_linux_docker_metadata_source and into a separate class.  The
rationale for this change is:

    * The details of URL fetching are a separate concern
    * The URL fetching logic might be reusable by other components

The URL fetcher is separated into an interface and a concrete
realization of that interface, with factory methods for creating
concretions.  The rationale for this design is:

    * It avoids class-to-class coupling between
      async_linux_docker_metadata_source and a concrete url fetcher
    * It will enable us to tweak the factory methods in a unit-test
      environment to return UT-specific realizations of the interface
      that don't actually have to talk to web servers.
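
A sketch of what that interface/factory split might look like (names are illustrative); a test build can have the factory return a canned-response fake instead of the libcurl-backed concretion:

    #include <memory>
    #include <string>

    class url_fetcher
    {
    public:
    	using ptr = std::unique_ptr<url_fetcher>;

    	virtual ~url_fetcher() = default;

    	/**
    	 * Fetches the given URL; returns the HTTP status code and fills
    	 * in body with the response.
    	 */
    	virtual int fetch(const std::string& url, std::string& body) = 0;

    	/**
    	 * Factory: returns the real (e.g., libcurl-based) concretion.  A
    	 * unit-test build can swap this to return a fake that serves
    	 * pre-canned responses without talking to a web server.
    	 */
    	static ptr new_fetcher();
    };

Client code depends only on url_fetcher, never on a concrete fetcher class, which is what avoids the class-to-class coupling mentioned above.
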
The async_docker_metadata_source will cache metadata lookups if the
lookup takes longer than the client is willing to wait.  A subsequent
call to lookup with the same key will return the cached result.  The
risk is that if nothing ever makes that subsequent call, the cached
result will sit in the cache forever.  Over time, the cache will
continue to leak such metadata.

This change introduces a time to live for cache entries.  When a cache
entry is created, a timestamp is recorded.  The async thread examines
the entries in the cache and prunes ones that are too old, as specified
by the client code.
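
A sketch of that pruning pass, assuming each cache entry records its creation time (names are illustrative):

    #include <chrono>
    #include <map>
    #include <string>

    struct cache_entry
    {
    	std::string m_metadata;
    	std::chrono::steady_clock::time_point m_create_time;
    };

    /** Removes entries older than max_ttl; run from the async thread. */
    void prune_old_entries(std::map<std::string, cache_entry>& cache,
                           const std::chrono::milliseconds max_ttl)
    {
    	const auto now = std::chrono::steady_clock::now();

    	for(auto itr = cache.begin(); itr != cache.end(); )
    	{
    		if((now - itr->second.m_create_time) > max_ttl)
    		{
    			itr = cache.erase(itr); // too old; nothing ever claimed it
    		}
    		else
    		{
    			++itr;
    		}
    	}
    }
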
This change just fills in passing the query_image_info bool into the
async lookup objects, and keeps that in sync with the corresponding
value in the sinsp_container_engine_docker.
}
}

bool async_docker_metadata_source::parse_docker(sinsp_container_manager* const manager,
Contributor Author:

Reviewers: Please note that I didn't write this; I just pulled it out of sinsp_container_engine_docker.

{

/**
* Base class for classes that need to collect metadata asynchronously from some
Contributor:

Should we specify "asynchronously"? IMO it should be used by everyone accessing metadata, and we support that with an "infinite" timeout on the sync call. This makes it easier in the future to make it async when it inevitably is too slow.

Contributor Author:

I think refactoring existing sync code to use this framework in a synchronous way is doable, but I'm not sure that I see the value. I'd rather just leave existing sync code alone, and introduce this if/when it's needed.

I think keeping it the way it is now provides a clearer mental model of what the component does.

template<typename key_type, typename metadata_type>
void async_metadata_source<key_type, metadata_type>::stop()
{
bool join_needed = false;
Contributor:

I'm assuming this is only not part of the destructor so that the implementation can invoke it as well. Given that this is a standalone function, we either need some indication that it can only be called once, or it needs some thread safety. I'd recommend just a check after we take the lock that says we already stopped it (or that someone is already stopping it).

Contributor Author:

The only use for stop() is in the destructor to ensure that the async thread is cleaned up before the object is destroyed. I'll add such a note in the docs for stop.

Contributor:

until someone wants to pause the thread for some reason and decides calling "stop" is the right way to do that :)

If the intended workflow of the object is to be started and stopped only once, it should be pretty easy to enforce via a couple of well-placed asserts.
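
A self-contained sketch of the guarded, idempotent stop() being suggested here (class and member names are illustrative):

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    class async_worker
    {
    public:
    	async_worker():
    		m_terminate(false),
    		m_thread(&async_worker::run, this)
    	{ }

    	~async_worker() { stop(); }

    	void stop()
    	{
    		{
    			std::lock_guard<std::mutex> guard(m_mutex);

    			if(m_terminate)
    			{
    				return; // already stopped (or being stopped); no-op
    			}
    			m_terminate = true;
    		}
    		m_condition.notify_one();

    		if(m_thread.joinable())
    		{
    			m_thread.join();
    		}
    	}

    private:
    	void run()
    	{
    		std::unique_lock<std::mutex> lock(m_mutex);
    		m_condition.wait(lock, [this] { return m_terminate; });
    	}

    	std::mutex m_mutex;
    	std::condition_variable m_condition;
    	bool m_terminate;
    	std::thread m_thread; // started last, after the flag is initialized
    };

    int main()
    {
    	async_worker worker;
    	worker.stop(); // explicit stop
    	return 0;      // the destructor's second stop() is a harmless no-op
    }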

template<typename key_type, typename metadata_type>
async_metadata_source<key_type, metadata_type>::~async_metadata_source()
{
try
Contributor:

stop() isn't throwing any exceptions

Contributor Author:

It isn't throwing any user-specified exceptions, that's true, but it could be calling something that transitively throws an exception. If an exception is thrown from the destructor, it'll crash the program (in C++11), so I'm being defensive against that.

Contributor:

In C++, do you not need to rethrow the exception at each layer of the stack?

In either case, AFAIK we don't catch exceptions in most places, so I'm not sure we should be catching exceptions here without a good understanding of what they are and why they might occur. Pretty much: why does THIS particular part of the code need to be safer than the entire rest of the codebase?

Contributor Author:

No. Destructors must not throw exceptions.

Far from authoritative, but:
https://akrzemi1.wordpress.com/2013/08/20/noexcept-destructors/

I'm not in favor of using the existing code base as the high bar for new code.

If we want the agent to crash if an exception is thrown here, I can remove the try/catch. If we do not, then I'm in favor of leaving it.
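
For context, the destructor body under discussion presumably looks something like this sketch (illustrative, not the exact code from the PR):

    template<typename key_type, typename metadata_type>
    async_metadata_source<key_type, metadata_type>::~async_metadata_source()
    {
    	try
    	{
    		stop(); // join the async thread before members are destroyed
    	}
    	catch(...)
    	{
    		// Destructors are implicitly noexcept in C++11; an exception
    		// escaping here would call std::terminate(), so anything that
    		// stop() transitively throws is absorbed.
    	}
    }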


if(!m_terminate)
{
run_impl();
Contributor:

You've released the lock before invoking this function. I see that you reacquire it when doing queue operations, but it makes this particular function appear suspicious. We do all the checks against the CV in a way that makes it seem that tight synchronization is essential, then we release the lock before making this call.

If we don't actually need it, I think we need to simplify the top half of the function:

  • don't bother with the double-checked locking; termination is something we expect to happen at most once, ever
  • don't bother checking for the empty queue around the CV wait

So pretty much lock -> check for terminate -> CV wait is all we need if we're not holding the lock across the call to run_impl.

Contributor Author:

The run_impl() method is "outside" of this class -- it's in the subclass. The mutex protects the state of this class, not the state of the subclass, so the lock ought not be held when running the subclass' method. If/when the subclass needs data from this class, it calls an API that acquires the lock.

The CV check is a common pattern for CVs. What benefit do you see from doing something different?

If m_terminate is set, then stop() has been called, likely from a destructor. I don't want to block destruction until a new event comes in to signal on the CV.
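
A self-contained sketch of the loop shape being described: the CV wait happens under the lock, and the lock is released before calling into the subclass (member names other than m_mutex, m_terminate, and m_request_queue are illustrative):

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <string>

    class example_source
    {
    public:
    	virtual ~example_source() = default;

    	/** The async thread's main loop. */
    	void run()
    	{
    		for(;;)
    		{
    			{
    				std::unique_lock<std::mutex> lock(m_mutex);

    				// Wake on a new request or on termination.
    				m_queue_not_empty.wait(lock, [this]
    				{
    					return m_terminate || !m_request_queue.empty();
    				});

    				if(m_terminate)
    				{
    					return;
    				}
    			} // lock released here, before calling into the subclass

    			// The subclass's fetch logic runs without the lock held;
    			// it reacquires the lock through this class's API when it
    			// touches shared state (e.g., dequeue_next_key()).
    			run_impl();
    		}
    	}

    protected:
    	virtual void run_impl() = 0;

    private:
    	std::mutex m_mutex;
    	std::condition_variable m_queue_not_empty;
    	std::deque<std::string> m_request_queue;
    	bool m_terminate = false;
    };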

}

typename metadata_map::const_iterator itr = m_metadata_map.find(key);
bool request_complete = (itr != m_metadata_map.end()) && itr->second.m_available;
Contributor:

Some questions on the semantics:

  1. If we have a timeout set, should we not be trying to fetch the new data for at least the timeout before falling back to the cached data?
  2. If we have the data, should we not fetch again? Otherwise that's going to cause us to oscillate between having data and not having data.

Contributor Author:

The semantics are: you call lookup(), then we wait as long as you're willing to wait. If we don't get an answer within that time, we let you know, and we keep working on getting what you asked for. When we eventually get it, we save it for you, and on the next call to lookup() we give you that value. We explicitly do not want to re-query in this case.

In the metadata use case --- the case we're focused on here --- we don't expect that the data changes, so "new" data is as good as cached data.
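
A compact sketch of that hand-the-saved-value-back path; whether the entry is consumed on delivery is an assumption here, and the names are illustrative:

    #include <map>

    // Returns true and fills in 'out' if a previously fetched value for
    // 'key' was saved; the entry is consumed so it is handed back once.
    template<typename key_type, typename metadata_type>
    bool lookup_cached(std::map<key_type, metadata_type>& cache,
                       const key_type& key,
                       metadata_type& out)
    {
    	auto itr = cache.find(key);

    	if(itr == cache.end())
    	{
    		return false; // nothing saved yet; the caller enqueues a request
    	}

    	out = itr->second;
    	cache.erase(itr); // deliver once; do not re-query afterward
    	return true;
    }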

key_type async_metadata_source<key_type, metadata_type>::dequeue_next_key()
{
std::lock_guard<std::mutex> guard(m_mutex);
key_type key = m_request_queue.front();
Contributor:

this whitespace is kinda wonky

Contributor Author:

Wonky in what way? I've verified that the indentation is all tabs.

* remain cached waiting for a subsequent call to lookup() before
* being discarded.
*/
const uint64_t MAX_TTL_MS = (30 * 1000);
Contributor:

config?

Contributor Author:

Meh, I'm not a huge fan of adding a bunch of knobs -- it adds extra complexity for not a lot of gain. I'd rather just specify a value that is "big enough".

If you think that we can't reasonably guess a value that's "big enough", then I can add a knob.

@@ -0,0 +1,26 @@
/*
Contributor:

why is this file showing up in the diff?

Contributor Author:

Because it's new in this change :)

Why? I ran into non-self-sufficient headers when I was adding this code, and I did this to resolve the issue.

Andy Dalton added 5 commits January 14, 2019 14:02
Eliminated the queue_size().
I wanted to put a struct type that is used as a template parameter
inside the class in which it was used, but the compiler wasn't happy
when I tried that.
/**
* @file
*
* Fill in a short overview of the file's content
Contributor Author:

Need to fill in the blank.

mstemm (Contributor) left a comment:

Overall the change looks good and looks pretty complete. The only feedback is really about whether the implementation should perform both roles of being a key/value store and doing lookups. If it's managing the objects, you'll also need to think about things like garbage collection, aging, etc. That being said, you've done almost all of the work already so I won't hold anything back.

* @returns true if this method was able to lookup and return the
* value synchronously; false otherwise.
*/
bool lookup(const key_type& key,
Contributor:

Did you think about separating the acts of pushing the key into the object and getting the value back? That makes the client interface more of a simple queue. And if the queue read were non-blocking you wouldn't need the deadline/max_wait_ms stuff. The part where the callback is called in the context of the async spawned thread might limit its usefulness.
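
One reading of the suggested split, as an illustrative sketch (not an interface from this PR): the client pushes keys and polls for completed results, with no deadline plumbing:

    template<typename key_type, typename value_type>
    class async_fetch_queue
    {
    public:
    	virtual ~async_fetch_queue() = default;

    	/** Enqueue a key for the async thread to fetch. */
    	virtual void push(const key_type& key) = 0;

    	/**
    	 * Non-blocking read: returns true and fills in key/value if some
    	 * completed result is ready, false if nothing is ready yet.
    	 */
    	virtual bool try_pop(key_type& key, value_type& value) = 0;
    };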

*
* @returns the value associated with the given key.
*/
value_type get_value(const key_type& key);
Contributor:

Does this class need to keep track of previously fetched values, or can it be more of a work item fetcher, where once it satisfies a request it can forget all about it? This affects whether you need get_value()/store_value().

mstemm (Contributor) commented Jan 31, 2019

Aside from the specific implementation, the fact that the container information will be available shortly after the exec of the first pid in the container instead of immediately will require some secure rules changes. Our plan was to make the (currently invisible) container event visible and use that as a notice that a container had started, and adapt falco rules to work both on the first process exec as well as the container event. We haven't fully developed that plan yet, but we will work on that in the next week.

anoop-sysd changed the title from "Add prototype for async_metric_source" to "Add prototype for async_metric_source [SMAGENT-1324]" on Feb 13, 2019
anoop-sysd (Contributor):
@adalton , shall we close this PR in preference to #1326?

adalton (Contributor Author) commented Mar 15, 2019

@adalton , shall we close this PR in preference to #1326?

@anoop-sysd Yes, let's close it, but let's not delete it. There are a number of enhancements here that didn't make it in that would be nice to eventually have:

  • Decoupling the URL fetcher from the metadata fetcher
  • Unit testing for the metadata fetcher
  • Unit testing helper class for the URL fetcher that will enable us to return pre-canned results

adalton closed this Mar 15, 2019