
Proposal: Cache metadata to local disk/db #19511

Closed
jsoriano opened this issue Jun 30, 2020 · 15 comments · Fixed by #20775
Labels
discuss Issue needs further discussion. int-goal Internal goal of an iteration Team:Platforms Label for the Integrations - Platforms team

Comments

@jsoriano
Member

To enrich events collected from Cloud Foundry, Beats need to query the apps API. This is currently done by add_cloudfoundry_metadata, which uses an in-memory cache to avoid querying again for the metadata of apps it has already seen. Metadata is cached by default for 2 minutes, but this time can be increased with the cache_duration setting.
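For readers less familiar with the processor, this is conceptually what it keeps in memory; a minimal sketch of an expiring metadata cache (simplified and illustrative, not the actual common.Cache or processor code):

```go
package metacache

import (
	"sync"
	"time"
)

// entry pairs cached app metadata with its expiration time.
type entry struct {
	metadata map[string]string
	expires  time.Time
}

// Cache is a simplified in-memory cache with a per-entry TTL, loosely
// modelling what the processor keeps per application GUID.
type Cache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]entry
}

// New creates a cache whose entries expire after ttl, the equivalent of the
// cache_duration setting.
func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: map[string]entry{}}
}

// Get returns cached metadata if present and not yet expired.
func (c *Cache) Get(appGUID string) (map[string]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[appGUID]
	if !ok || time.Now().After(e.expires) {
		return nil, false
	}
	return e.metadata, true
}

// Put stores metadata fetched from the apps API and stamps it with the TTL.
func (c *Cache) Put(appGUID string, md map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[appGUID] = entry{metadata: md, expires: time.Now().Add(c.ttl)}
}
```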

Big Cloud Foundry deployments can have several thousands of applications running at the same time, and in these cases add_cloudfoundry_metadata needs to cache a lot of data. Increasing cache_duration can help in these cases to reduce the number of requests made. But on restarts this data is lost and needs to be requested again on startup, causing thousands of requests that could be avoided if this data were persisted somewhere else. Other implementations of Cloud Foundry event consumers (nozzles) persist the app metadata on disk to prevent problems with this.

Beats have several features that use internal in-memory caches. In general they don't hold that much data, but it could also happen for example with Kubernetes, since we support monitoring at the cluster scope (see #14738). So other features could also benefit from some kind of persistence between restarts.

If persisted to disk, we would need to make sure that the data directory also persists between restarts. This is relevant when running Beats on containers/Kubernetes.

Depending on how this is implemented, the persistent cache could be shared between beats, so for example Filebeat and Metricbeat monitoring the same Cloud Foundry deployment from the same host could share their local caches if they are in some local database, or if they have access to a common data directory.

The enrich processor of Elasticsearch could help here, but at the moment only Filebeat supports pipelines, and even with support for pipelines it wouldn't be so clear how to add this step to existing pipelines.

@exekias @urso I would like to have your thoughts on this.

@jsoriano jsoriano added discuss Issue needs further discussion. Team:Integrations Label for the Integrations team [zube]: Investigate labels Jun 30, 2020
@jsoriano jsoriano self-assigned this Jun 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@urso

urso commented Jun 30, 2020

Persistent cache with entry TTL might be interesting for some use cases, indeed.

Although not optimal, I would remove the need to share data between 2 active beats instances for now. Sharing might complicate the implementation, plus this might be a feature that we don't need anymore in the future (e.g. if we merge beats into one).

In Beats we just introduced interfaces for a persistent key/value store (implementation is still under review). See package libbeat/statestore. This might be a good base for the persistence layer. Possible implementations I've tried are our memlog (mostly in memory k/v store writing updates to an append only log), badger, and sqlite. Badger seems to scale worst with increased concurrency. sqlite actually supports multiple concurrent access from multiple processes and scales quite well (although the CGO layer and full transactions for every operation slow it down a little).
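For context, the abstraction is roughly a small key/value store behind which memlog, badger, or sqlite could sit. A hedged sketch of what such an interface could look like (not the actual libbeat/statestore API; names are illustrative):

```go
package persistence

// Store is an illustrative persistent key/value abstraction in the spirit of
// what a statestore-like package could expose; a backend could be memlog
// (an in-memory map plus an append-only log), badger, or sqlite.
type Store interface {
	// Has reports whether key exists in the store.
	Has(key string) (bool, error)
	// Get decodes the value stored under key into to.
	Get(key string, to interface{}) error
	// Set encodes value and stores it under key.
	Set(key string, value interface{}) error
	// Remove deletes key if it exists.
	Remove(key string) error
	// Close flushes pending state and releases resources.
	Close() error
}
```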

@exekias
Contributor

exekias commented Jul 1, 2020

This sounds fair

@jsoriano do you have any idea of the cost of retrieving this metadata? Did we see it causing any issue? From what I understand the client will request Apps metadata as they appear in the stream. For huge deployments this will be done by several beats (sharding), which distributes the load.

@jsoriano
Member Author

jsoriano commented Jul 1, 2020

Thanks for your comments. It looks like we agree that it may be interesting to implement something like this. We could do it as a new API-compatible alternative to common.Cache, and start using it in cloudfoundry behind some feature flag. Then we can see how it goes and whether there are other places that can benefit from this implementation.

Although not optimal, I would remove the need to share data between 2 active beats instances for now. Sharing might complicate the implementation, plus this might be a feature that we don't need anymore in the future (e.g. if we merge beats into one).

Agree, I see this as a nice-to-have if the implementation can easily offer concurrent reads and writes from different processes, but not as a requirement.

In Beats we just introduced interfaces for a persistent key/value store (implementation is still under review). See package libbeat/statestore. This might be a good base for the persistence layer. Possible implementations I've tried are our memlog (mostly in memory k/v store writing updates to an append only log), badger, and sqlite. Badger seems to scale worst with increased concurrency. sqlite actually supports multiple concurrent access from multiple processes and scales quite well (although the CGO layer and full transactions for every operation slow it down a little).

Yeah, statestore looks good. I wonder if we could also add an interface for backends that natively support entries with TTLs, like Badger. Though maybe it is not needed if we keep the cache in memory and only dump to disk from time to time; the strategy to follow will be something to think about.

@jsoriano do you have any idea of the cost of retrieving this metadata? Did we see it causing any issue?

It seems to be a potential issue on very big deployments, especially on startup when collecting metrics. With logs maybe this is not such an issue because not everything is going to be continuously logging, but metrics arrive as a continuous stream, so it is quite likely that already during the first seconds the metadata of many apps will need to be requested.
Other nozzles implement caches on disk, so it seems to be a common thing in these deployments.

From what I understand the client will request Apps metadata as they appear in the stream. For huge deployments this will be done by several beats (sharding), which distributes the load.

Yes, but there is actually little control over how this sharding is done; it is possible to receive events from the same App on several beats, so in the end every beat still needs to know about lots of Apps. In that sense it is different from our usual deployment model in Kubernetes, where each beat is only interested in the pods running on one node.

@jsoriano
Member Author

jsoriano commented Jul 1, 2020

By the way, auditbeat also uses a datastore on disk, based on bbolt: https://github.com/elastic/beats/blob/v7.8.0/auditbeat/datastore/
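For reference, using bbolt directly looks roughly like this (a minimal sketch with illustrative file and bucket names, not the auditbeat datastore API):

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a database file; the path would live under the Beat's
	// data directory in practice.
	db, err := bolt.Open("metadata.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Store a key/value pair inside a bucket in a single write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("app_metadata"))
		if err != nil {
			return err
		}
		return b.Put([]byte("app-guid-1234"), []byte(`{"name":"my-app"}`))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```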

@urso

urso commented Jul 2, 2020

It seems to be a potential issue on very big deployments, especially on startup when collecting metrics. With logs maybe this is not such an issue because not everything is going to be continuously logging, but metrics arrive as a continuous stream, so it is quite likely that already during the first seconds the metadata of many apps will need to be requested.

This is a more general problem, not only with metadata. My proposal would be to introduce support for randomizing the initial run of a module. Given a time period T, if we uniformly randomize the first collection of a module within the range [0, T), we end up with a more uniform and constant load in metricbeat (and the network), instead of collecting data in bursts when many endpoints or modules have the same (or a similar) period configured.
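A minimal sketch of that idea, uniformly delaying only the first run of a periodic collector within [0, T) (function names are illustrative):

```go
package collector

import (
	"math/rand"
	"time"
)

// RunPeriodically delays the first collection by a uniformly random amount in
// [0, period) and then collects on every period. With many modules configured
// with the same period, this spreads the initial burst into a constant load.
func RunPeriodically(period time.Duration, collect func()) {
	time.Sleep(time.Duration(rand.Int63n(int64(period))))

	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		collect()
		<-ticker.C
	}
}
```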

Yeah, statestore looks good. I wonder if we could also add an interface for backends that natively support entries with TTLs, like Badger. Though maybe it is not needed if we keep the cache in memory and only dump to disk from time to time; the strategy to follow will be something to think about.

I originally had TTL in the interface, but decided on a more minimal interface at some point. The reason for removing TTL was (besides simplifying the interface) also that it was not what I wanted. We have a TTL implemented on top, but for removal I figured I need a more sophisticated predicate like (TTL + resource actually not in use).

Same for the cache: I would consider not having the TTL provided by the store, but allowing for a custom cache that only uses a key-value store as backend. E.g. do you want to be able to 'pin' some state in the cache? When to renew? Do we need a hard TTL only, or an LRU cache? You know, naming and caching are hard ... and time... OMG is time hard :P
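To make that separation concrete, a sketch of keeping the TTL in a cache layer that only uses a plain key/value store as its backend (this reuses the illustrative Store interface sketched above, not a real Beats API):

```go
package persistence

import "time"

// cachedValue wraps the payload with the expiry chosen by the cache layer;
// the backing Store knows nothing about TTLs.
type cachedValue struct {
	Payload map[string]string `json:"payload"`
	Expires time.Time         `json:"expires"`
}

// TTLCache implements expiration on top of any plain Store backend.
type TTLCache struct {
	store Store
	ttl   time.Duration
}

func NewTTLCache(store Store, ttl time.Duration) *TTLCache {
	return &TTLCache{store: store, ttl: ttl}
}

// Put stores the payload together with its expiration time.
func (c *TTLCache) Put(key string, payload map[string]string) error {
	return c.store.Set(key, cachedValue{Payload: payload, Expires: time.Now().Add(c.ttl)})
}

// Get returns the payload if present and not expired. Expired entries are
// removed lazily here; a background GC would be another possible strategy.
func (c *TTLCache) Get(key string) (map[string]string, bool) {
	var v cachedValue
	if err := c.store.Get(key, &v); err != nil {
		return nil, false // treat missing or undecodable entries as a miss
	}
	if time.Now().After(v.Expires) {
		_ = c.store.Remove(key)
		return nil, false
	}
	return v.Payload, true
}
```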

If we could spread the load better, would we even need a cache? The problem seems to be resource usage/contention. Randomization is one of the simplest mitigation strategies.

@jsoriano
Member Author

jsoriano commented Jul 2, 2020

It seems to be a potential issue on very big deployments, especially on startup when collecting metrics. With logs maybe this is not such an issue because not everything is going to be continuously logging, but metrics arrive as a continuous stream, so it is quite likely that already during the first seconds the metadata of many apps will need to be requested.

This is a more general problem, not only with metadata. My proposal would be to introduce support for randomizing the initial run of a module. Given a time period T, if we uniformly randomize the first collection of a module within the range [0, T), we end up with a more uniform and constant load in metricbeat (and the network), instead of collecting data in bursts when many endpoints or modules have the same (or a similar) period configured.

Metricbeat modules already have a random delay for initialization. The cloudfoundry module is a push reporter: it doesn't have control over when metrics are received, it keeps a connection to the cloudfoundry endpoint, and cloudfoundry pushes the metrics in a continuous stream. We could do something like dropping metrics that are received more frequently than a period, but this would force us to keep track of all the metrics we are receiving (which could require a cache of its own 🙂 ).

We could also introduce some throttling to avoid hitting the metadata APIs too much. Then initial events wouldn't be enriched (or they would be delayed or dropped) until the cache warms up.
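A sketch of what that throttling could look like, using golang.org/x/time/rate to cap metadata lookups (the lookup function and the limits are hypothetical):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// throttledLookup wraps a metadata lookup with a rate limiter, so a cold cache
// cannot translate into an unbounded burst of requests against the apps API.
func throttledLookup(ctx context.Context, limiter *rate.Limiter, appGUID string,
	lookup func(string) (map[string]string, error)) (map[string]string, error) {

	// Wait blocks until the limiter allows another request or the context is
	// cancelled; events handled in the meantime would miss the enrichment
	// (or be delayed/dropped, as discussed above).
	if err := limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return lookup(appGUID)
}

func main() {
	// At most 10 metadata requests per second with a burst of 20 (illustrative numbers).
	limiter := rate.NewLimiter(rate.Limit(10), 20)

	md, err := throttledLookup(context.Background(), limiter, "app-guid-1234",
		func(guid string) (map[string]string, error) {
			// Stand-in for the real apps API call.
			return map[string]string{"name": "my-app"}, nil
		})
	fmt.Println(md, err)
}
```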

I would consider not having the TTL provided by the store, but allowing for a custom cache that only uses a key-value store as backend. E.g. do you want to be able to 'pin' some state in the cache? When to renew? Do we need a hard TTL only, or an LRU cache?

An LRU cache with enough capacity could cover many cases where we are only using metadata that doesn't change over time, but in other cases we need to renew entries at some point, and in general we are using the TTL for that. We also rely on the TTL to clean up entries that are not needed anymore (but there could be other strategies for GC).
Being more specific, for the current cloudfoundry implementation an LRU cache with enough capacity would be ok, because the values we use don't change. But there are some other state-related values that may change, and we may want to use them in the future.

If we keep needing caches with expiration, and the backend we use provides it, it would be nice to use it; otherwise we need to maintain our own implementation on top.

If we could spread the load better, would we even need a cache? The problem seems to be resource usage/contention. Randomization is one of the simplest mitigation strategies.

At least with cloudfoundry metrics (and it can be the same with the collection of any other event, or logs in general too), the fact is that we don't have control over when messages are received. We could add some additional logic to retain events so we spread the load caused by the collection of metadata better, but I am not sure the potential complexity of this logic would be worth it.

@andresrc
Contributor

For the case of PCF in particular, it seems that most deployments also contain redis, operators are familiar with it, and we already have the client library in Beats. Could this be an option instead of taking care of the cache ourselves?

@jsoriano
Member Author

For the case of PCF in particular, it seems that most deployments also contain redis, operators are familiar with it, and we already have the client library in Beats. Could this be an option instead of taking care of the cache ourselves?

I think so, this is the kind of implementation I was thinking of when I mentioned that the cache could be shared between beats, but I didn't want to get into the specific implementation to use at this point. I would still consider having caches that don't require an external service, for scenarios where none is available.

Also take into account that if we use something like bbolt we don't have to take care of the cache implementation ourselves, and we are already using it in Auditbeat. The main advantage I see in using redis for a scenario like big PCF deployments is that the cache could be more easily shared between multiple beats instances monitoring the same deployment.
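For illustration, a redis-backed shared cache with TTL could look roughly like this sketch (using go-redis; the address and key scheme are hypothetical and not tied to any existing Beats redis code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer client.Close()

	// Several beats instances monitoring the same deployment could read and
	// write the same keys, sharing the warmed cache. Redis expires the entry
	// natively after the TTL passed to Set.
	key := "cf:app:app-guid-1234"
	if err := client.Set(ctx, key, `{"name":"my-app"}`, 2*time.Minute).Err(); err != nil {
		fmt.Println("set failed:", err)
		return
	}
	val, err := client.Get(ctx, key).Result()
	fmt.Println(val, err)
}
```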

@andresrc andresrc added Team:Platforms Label for the Integrations - Platforms team and removed Team:Integrations Label for the Integrations team labels Jul 31, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@jsoriano
Member Author

@urso @exekias an update on this. I tried some options in #20707, #20739 and #20775, which more or less covered these approaches:

  • Using auditbeat's datastore based on bbolt
  • Using statestore with memlog backend
  • Using statestore with a new naive bbolt backend
  • Adding a new cache-specific implementation based on badger

Summarizing, the option that worked best for me is the last one, a new cache implementation based on badger; I will clean up the PR for review (#20775). Let me know if you see any blocker for this approach.

Some observations during my tests:

  • The bbolt implementation in auditbeat would require some fixes before being used in more general terms; it also lacks TTL, and bbolt is actually more oriented to sequential access, so it performs worse than the other options on random reads/writes, especially with a lot of data, which is actually the use case we want to solve with this cache. So it would require some effort and it wouldn't really fit our use case so well.
  • statestore with the memlog backend actually performs really well in general, but we would need to implement our own TTL over the existing implementation. I also found some performance issues when loading a lot of data, which is the main case we want to cover; I think these issues are related to checkpoint writing, which seems to lock the whole store. With a lot of data it also seems to use more memory than the other options, I think because it stores all the data in memory, and it is slower to start, probably for the same reason.
  • statestore with a bbolt backend has similar issues to reusing the bbolt implementation in auditbeat: it would require some effort, and it performs worse than the other options. I think we would find similar problems with any implementation based on bbolt.
  • badger is in general heavier than the other options, requiring more resources for small amounts of data, but its performance is always good enough, doesn't degrade with a lot of data, and it starts fast with any quantity of data. Compared to the memlog-based statestore it performs slightly worse, but it loads faster and doesn't have problems with lots of data. It is easy to tune, and after tuning some simple options the performance improves a lot for our use case. This may also be true for other stores, but the defaults in badger work better, and the many options it has are more straightforward to configure. It also supports TTL natively, which simplifies the implementation for our use case (see the sketch right after this list). A point against badger is that it is the only option considered that introduces a new dependency, but I don't see that as very problematic.
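A minimal sketch of the badger usage described in the last point, storing entries with a native TTL (the import path targets badger v2 and the paths are illustrative; the actual PR may differ):

```go
package main

import (
	"fmt"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Open a badger database under the Beat's data path (illustrative location).
	db, err := badger.Open(badger.DefaultOptions("data/cloudfoundry-cache"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write an entry with a native TTL; badger drops it after expiration,
	// so no extra TTL layer needs to be maintained on top.
	err = db.Update(func(txn *badger.Txn) error {
		e := badger.NewEntry([]byte("app-guid-1234"), []byte(`{"name":"my-app"}`)).WithTTL(2 * time.Minute)
		return txn.SetEntry(e)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back; expired or missing keys return badger.ErrKeyNotFound.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("app-guid-1234"))
		if err != nil {
			return err
		}
		val, err := item.ValueCopy(nil)
		fmt.Println(string(val))
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```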

More general thoughts affecting all options:

  • In all the implementations, JSON serialization and deserialization significantly increase CPU usage. To mitigate that, the best option I have found is to reduce the size of the stored objects, so they contain only the info we need. This is something I think would be good to do in any case, and I am doing it for cloudfoundry as part of Add a persistent cache for cloudfoundry metadata based on badger #20775 (a sketch of this trimming follows after this list). We could also add an LRU-cache layer of decoded objects, but this would add complexity and we can add it later; reducing the size of the objects would also help with this option.
  • We could make some optimizations in the memlog backend to mitigate the problems with checkpoints (if they are the problem with lots of data), but I find it better to use an existing implementation that has this problem already solved.
  • All these caches need to be properly closed, and we are not properly closing processors at the moment (see Fix leaks with metadata processors #16349). This can cause some problems in this case: memory leaks, data loss if data is not written to disk on shutdown, and data corruption on shutdown. Memory leaks with these processors are not something new, we also suffer them with add_kubernetes_metadata, and because of that we recommend using these processors only in the global configuration. The problem with data loss is not so important in this use case, as the metadata can always be recovered from the original APIs. I have seen data corruption happening with the badger implementation, but it recovers nicely on startup.
    In any case, we should address this issue in general (continuing with Fix leaks with metadata processors #16349 or similar), and then these problems would disappear.
  • The redis implementation is pending, but I would like to implement it after having the cache on disk. Regarding this, one advantage of using badger is that its semantics are more similar to something like redis, including TTLs. One reason to delay the implementation is that we need to discuss where to place the connection settings for redis; we would need to add some additional global configuration for these caches.
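As an illustration of the first point in the list above, trimming what gets serialized could look like this (the struct and its fields are hypothetical, not the actual cloudfoundry client types):

```go
package metadata

import "encoding/json"

// AppMetadata keeps only the fields the processor actually adds to events,
// instead of serializing the whole application object returned by the API.
// The fields shown here are illustrative.
type AppMetadata struct {
	GUID      string `json:"guid"`
	Name      string `json:"name"`
	SpaceName string `json:"space_name"`
	OrgName   string `json:"org_name"`
}

// Encode serializes the trimmed object; smaller payloads mean less CPU spent
// on (de)serialization and less data held in the persistent cache.
func (m AppMetadata) Encode() ([]byte, error) {
	return json.Marshal(m)
}

// Decode restores a trimmed object previously stored in the cache.
func Decode(data []byte) (AppMetadata, error) {
	var m AppMetadata
	err := json.Unmarshal(data, &m)
	return m, err
}
```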

// cc @masci

@exekias
Contributor

exekias commented Sep 16, 2020

Thank you for such detailed research! Going for badger sounds good to me.

Being able to close processors sounds like a requirement (?), let's keep discussing in #16349

@urso

urso commented Sep 16, 2020

statestore with the memlog backend actually performs really well in general, but we would need to implement our own TTL over the existing implementation. I also found some performance issues when loading a lot of data, which is the main case we want to cover; I think these issues are related to checkpoint writing, which seems to lock the whole store. With a lot of data it also seems to use more memory than the other options, I think because it stores all the data in memory, and it is slower to start, probably for the same reason.

Correct, statestore is an in-memory kv-store. The disk is only used to guarantee we can start from the old state.

The cursor input manager in filebeat has a collector with TTL built in. We could extract the GC implementation. Just mentioning :) , badger or sqlite would be fine with me.

badger is in general heavier than the other options, requiring more resources for small amounts of data, but its performance is always good enough,

I did test the filebeat registry with statestore, badger, and sqlite as backends. Single-writer throughput was pretty good for all of them (sqlite had a slight overhead). Unfortunately badger seems to scale pretty badly. With 100 concurrent files being collected, badger had a clear contention problem. If you keep the number of routines and updates small, all are fine. But sqlite seems to scale much better under concurrent load.

In all the implementations, JSON serialization and deserialization significantly increase CPU usage. To mitigate that, the best option I have found is to reduce the size of the stored objects, so they contain only the info we need. This is something I think would be good to do in any case, and I am doing it for cloudfoundry as part of #20775. We could also add an LRU-cache layer of decoded objects, but this would add complexity and we can add it later; reducing the size of the objects would also help with this option.

We've got CBOR support in Beats. CBOR encoding/decoding is much faster and requires less CPU. Especially if many strings are involved (JSON requires strings to be scanned for UTF-8 compliance). Memlog unfortunately uses JSON to be somewhat backwards compatible with the old filebeat registry format.
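As a rough comparison point, swapping the codec is mostly a matter of changing the marshal/unmarshal calls; this sketch uses the fxamacker/cbor library for illustration rather than the codec Beats actually ships:

```go
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

type appMetadata struct {
	GUID string
	Name string
}

func main() {
	// CBOR is a binary format, so encoding avoids the per-string scanning and
	// quoting work that JSON does, which is where much of the CPU goes.
	in := appMetadata{GUID: "app-guid-1234", Name: "my-app"}
	data, err := cbor.Marshal(in)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}

	var out appMetadata
	if err := cbor.Unmarshal(data, &out); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%d bytes, decoded: %+v\n", len(data), out)
}
```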

@jsoriano
Member Author

I did test the filebeat registry with statestore, badger, and sqlite as backends. Single-writer throughput was pretty good for all of them (sqlite had a slight overhead). Unfortunately badger seems to scale pretty badly. With 100 concurrent files being collected, badger had a clear contention problem. If you keep the number of routines and updates small, all are fine. But sqlite seems to scale much better under concurrent load.

Good point, I didn't test with high concurrency. But I also wouldn't expect very high concurrency on metadata enrichment, at least for the cloudfoundry case, where in general there is only going to be one event consumer.

We've got CBOR support in Beats. CBOR encoding/decoding is much faster and requires less CPU. Especially if many strings are involved (JSON requires strings to be scanned for UTF-8 compliance). Memlog unfortunately uses JSON to be somewhat backwards compatible with the old filebeat registry format.

I will give CBOR a try 👍 thanks for the suggestion.

@urso

urso commented Sep 18, 2020

Good point, I didn't test with high concurrency. But I also wouldn't expect very high concurrency on metadata enrichment, at least for the cloudfoundry case, where in general there is only going to be one event consumer.

Maybe. But keep in mind that processors normally run in the goroutine of the input. On the other hand, it is just a cache; if we have to change the backend in the future because requirements changed, it is a no-brainer.

I will give CBOR a try 👍 thanks for the suggestion.

Check out codec.go from the spool file support. It supports JSON, CBOR, and UBJSON. For write loads CBOR had a small advantage over UBJSON. UBJSON is slightly faster for reads, though.
