
[System] Make network_summary namespace/container aware #605

Closed
sorantis opened this issue Feb 2, 2021 · 20 comments
Labels
enhancement New feature or request Integration:system System Stalled Team:Integrations Label for the Integrations team

Comments

@sorantis

sorantis commented Feb 2, 2021

Elastic Cloud SREs want to be able to export network summary metrics for all namespaces/docker containers.

Right now network_summary only covers the host (/namespace) metricbeat is running on. For debugging networking issues it would be really useful to have the same kind of network statistics from all namespaces/containers running on a host, without the need to run a separate metricbeat inside each container.

These network metrics are accessible per pid, e.g. in /proc/XXXX/net/netstat or /proc/XXXX/net/snmp.
Since we know which cgroup each pid belongs to from /proc/XXXX/cgroup, it should be feasible to grab these metrics only once per cgroup rather than for every pid on each metricbeat poll.

These metrics should then have an additional field with the container/namespace id.
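The per-pid approach described above can be sketched roughly as follows. This is a minimal illustration, not the metricbeat implementation: it assumes the standard `/proc/<pid>/net/snmp` layout, where each protocol contributes a pair of lines (a header line naming the counters, then a value line), and the function name is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNetSNMP parses the /proc/<pid>/net/snmp format, where counters
// come in pairs of lines prefixed by the protocol name: a header line
// listing field names, then a value line with the numbers.
func parseNetSNMP(text string) (map[string]map[string]int64, error) {
	out := make(map[string]map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(text))
	var headers []string
	var proto string
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		name := strings.TrimSuffix(fields[0], ":")
		if headers == nil || name != proto {
			// header line: remember the field names for this protocol
			proto, headers = name, fields[1:]
			continue
		}
		// value line: zip the field names with the parsed counters
		m := make(map[string]int64, len(headers))
		for i, v := range fields[1:] {
			if i >= len(headers) {
				break
			}
			n, err := strconv.ParseInt(v, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("parsing %s.%s: %w", proto, headers[i], err)
			}
			m[headers[i]] = n
		}
		out[proto] = m
		headers = nil
	}
	return out, sc.Err()
}

func main() {
	sample := "Tcp: RtoAlgorithm MaxConn ActiveOpens\nTcp: 1 -1 42\n"
	stats, err := parseNetSNMP(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(stats["Tcp"]["ActiveOpens"]) // 42
}
```

In the real metricset this parse would be done once per cgroup (picking one representative pid from each), since every pid in a network namespace sees the same counters.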

@sorantis sorantis added enhancement New feature or request Team:Integrations Label for the Integrations team Integration:system System labels Feb 2, 2021
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@fearful-symmetry
Contributor

fearful-symmetry commented Feb 2, 2021

This is definitely a cool idea, and it won't be difficult. I assume we'll want to keep this as an addition to the network_summary metricset, since it's basically the same data. Does Cloud have any ideas as far as config? Since we probably don't want to turn this into a firehose, I wonder if we can have it match PIDs from some kind of glob, or map container activity to a given PID. If the latter is possible, I wonder if this would work better as some kind of processor somewhere else?

@sorantis
Author

sorantis commented Feb 2, 2021

@elkargig, @matschaffer would love your input here as it will greatly help us in solving the use case.

@cargious

cargious commented Feb 2, 2021

@ydubreuil may also be interested in this conversation around PID/mapping.

@elkargig

elkargig commented Feb 2, 2021

Is it possible to configure a list of docker container names (which map to docker container ids -> cgroup IDs) instead of process names? The docker container name -> docker container id map doesn't sound too expensive to be happening on every metricbeat run. If it is, could we set a caching TTL (and suffer all the consequences of cache invalidation errors in the future)?

At least for Elastic Cloud that would be preferred I believe, as docker container names are static, but I fully understand why someone would prefer to run this for all processes matching a name, eg 'nginx', 'haproxy', etc.
A process name of java or python though could be too generic for our needs.

@fearful-symmetry
Contributor

Yah, I assume that mapping container names to cgroup and process info is something we're already doing somewhere else. Again, the issue then becomes if this kind of data is better suited to a different metricset, or perhaps a processor of some kind.

@unixsurfer

I think we could potentially follow the same logic as https://github.com/google/cadvisor.
I couldn't find in the code how they build the map of container names to PIDs, but I do remember that it fetched/inspected container names from docker to build a map and then traversed the appropriate tree under /proc. They used to have a cache layer, but I last worked with it 4+ years ago, so many things may have changed since then.

@ydubreuil
Member

At the very least, we would need the cgroup id as it is reported in system.process.cgroup.id to identify the docker container id. But as @elkargig mentioned, having the docker container name would be ideal.
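Extracting that cgroup id from `/proc/<pid>/cgroup` can be sketched like this. The function name is illustrative; it assumes the plain docker cgroup driver, where the container ID is the last path element (systemd-style layouts like `docker-<id>.scope` would need extra trimming, which is skipped here).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// cgroupID extracts the final path element from a /proc/<pid>/cgroup
// line (format "hierarchy-ID:controllers:path"). With the plain docker
// cgroup driver this last element is the container ID.
func cgroupID(line string) (string, bool) {
	parts := strings.SplitN(strings.TrimSpace(line), ":", 3)
	if len(parts) != 3 || parts[2] == "/" {
		return "", false
	}
	return path.Base(parts[2]), true
}

func main() {
	id, ok := cgroupID("12:memory:/docker/0123abcd")
	fmt.Println(id, ok) // 0123abcd true
}
```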

@fearful-symmetry
Contributor

So, I'm leaning towards adding this to docker/network, since in theory we should be able to map everything to the docker container by name.

@exekias

exekias commented Apr 6, 2021

add_docker_metadata is able to match and enrich documents based on their cgroup id. See the example from the docs: #match_fields: ["system.process.cgroup.id"].

So we could just report cgroups network_summary, making sure the id is included. Then enable the add_docker_metadata processor to do the rest 🙂

I found this issue from @andrewkroh that may help here: elastic/beats#2483
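Uncommented, the processor configuration exekias quotes would look something like this in the beats YAML config (the `match_fields` value comes straight from the add_docker_metadata docs quoted above):

```yaml
processors:
  - add_docker_metadata:
      match_fields: ["system.process.cgroup.id"]
```

With this in place, any event carrying `system.process.cgroup.id` gets enriched with the matching container's metadata, including its name.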

@fearful-symmetry
Contributor

So we could just report cgroups network_summary, making sure the id is included.

@exekias what exactly are you imagining here? network_summary reports a separate event for each cgroup ID?

@exekias

exekias commented Apr 8, 2021

tbh I don't have a clear picture of how this could look 😇.

Right now network_summary reports totals for the host, and we probably want to keep things backwards compatible. That means that we cannot really report cgroup specific summaries under the same fields, right?

If we follow the pattern that we have for other cgroup metrics this would go under the process metricset, adding a system.process.cgroup.network field.

My biggest concern is that reporting processes with cgroup data is becoming expensive in terms of storage. I also see how @elkargig is more interested in having these metrics per cgroup (/container) than per process. I wonder if you folks would prefer a different organization, something like having a cgroups metricset that reports summaries per cgroup, instead of having cgroup data reported per process. Thoughts?

We can definitely enrich both cases with docker metadata (including container name).
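Under the process-metricset option sketched above, a per-cgroup event might look roughly like the following. This shape is hypothetical: only `system.process.cgroup.id` and the proposed `system.process.cgroup.network` field are named in the thread; the nested counter names are borrowed from /proc/net/snmp for illustration.

```json
{
  "system": {
    "process": {
      "cgroup": {
        "id": "0123abcd",
        "network": {
          "ip": { "InReceives": 1000 },
          "tcp": { "ActiveOpens": 12 }
        }
      }
    }
  },
  "container": { "name": "nginx" }
}
```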

@fearful-symmetry
Contributor

My biggest concern is that reporting processes with cgroup data is becoming expensive in terms of storage. I also see how @elkargig is more interested in having these metrics per cgroup (/container) than per process.

Yah, that's why I was thinking of adding it to docker/network, since system.process seems to be ballooning in complexity, and it already hogs more space than any other metricset. Cgroup info is usually most useful when it's attached to something else, so I'm not sure it makes sense to have a separate cgroup metricset, but I could be wrong.

@elkargig

The end goal for us is to be able to graph "various network metrics per (docker) container name".

I'm not sure I fully understand the proposal here, but capturing the same network metrics per process pid seems a little wasteful in terms of storage, as multiple processes in the same cgroup will report identical network metrics.

Hope it helps!

@exekias

exekias commented Apr 15, 2021

Thanks, I think that helps answer the question! You are more interested in metrics at the container level than in fine-grained process details.

@fearful-symmetry I'm actually wondering if docker offers these metrics already and we are just not retrieving them?

@fearful-symmetry
Contributor

I'm actually wondering if docker offers these metrics already and we are just not retrieving them?

@exekias if they're somewhere in the stats APIs, I don't see them. I started experimenting with this last week, I think the best way to do this is to grab the PIDs associated with the container via the docker API, and then query the network stats for each of those PIDs.

@fearful-symmetry
Contributor

PR has been merged here: elastic/beats#25354

Unfortunately this probably won't make it into the next release unless we want to backport it late, as I had a perfect storm of complexity in the PR itself and of CI issues that resulted in it just sitting in review for a week.

@elkargig

Thanks a lot @fearful-symmetry! Is there a beats build we can use to test this before it gets released?

@fearful-symmetry
Contributor

@elkargig It'll be in the 7.14 snapshot builds, which should be available by Jun 29th-ish.

@botelastic

botelastic bot commented Sep 19, 2022

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Sep 19, 2022
@botelastic botelastic bot closed this as completed Mar 18, 2023

8 participants