Report sizes seem to increase over days after a restart #3576

Closed
bboreham opened this issue Feb 28, 2019 · 6 comments · Fixed by #3623 or #3627
Labels
help-wanted: An issue that would be good for non-maintainers to attempt
performance: Excessive resource usage and latency; usually a bug or chore

Comments

@bboreham
Collaborator

Noticed in data from a lot of probes - this is the average report size over the last 30 days:

[graph: average report size over the last 30 days]

The two drops are at points where all probes were restarted.
The pattern seems to suggest that the probes are accumulating data in some structures which does not come back after a restart, and is therefore probably unnecessary.

@bboreham added the performance label Feb 28, 2019
@fbarl
Contributor

fbarl commented Mar 14, 2019

Two things I noticed while trying to download some older reports in Weave Cloud with the help of #3581:

  • When downloading raw reports with a timestamp such as 2019-02-12T17:52:00Z, I noticed that their JSON contained a lot of entries with the current timestamp, i.e. 2019-03-14T15:03:55Z. That looks like a bug in the AWS reporter, since these historic raw reports clearly differ from what would have been downloaded at that time in the past (a quick way to check this is sketched after this list).
  • Furthermore, when downloading a few raw reports with the fixed timestamp 2019-02-12T17:52:00Z, I noticed that their size oscillates up and down by about 5% within a few seconds, suggesting that more than just the timestamp entries are taken from the current state rather than the historic one.
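A quick way to check the first point (the sketch referred to above) would be a small standalone tool, hypothetical and not part of Scope, that walks a downloaded raw report JSON and counts string values parsing as RFC3339 timestamps well after the report's nominal timestamp; a large count would mean the report was at least partly generated from current state.

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"time"
)

// walk recursively visits every value in the decoded JSON and calls fn on each string.
func walk(v interface{}, fn func(string)) {
	switch x := v.(type) {
	case map[string]interface{}:
		for _, child := range x {
			walk(child, fn)
		}
	case []interface{}:
		for _, child := range x {
			walk(child, fn)
		}
	case string:
		fn(x)
	}
}

func main() {
	// Usage: go run checkreport.go report.json 2019-02-12T17:52:00Z
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	nominal, err := time.Parse(time.RFC3339, os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	var report interface{}
	if err := json.Unmarshal(data, &report); err != nil {
		log.Fatal(err)
	}
	newer := 0
	walk(report, func(s string) {
		// Count any timestamp more than an hour after the report's nominal time.
		if t, err := time.Parse(time.RFC3339, s); err == nil && t.After(nominal.Add(time.Hour)) {
			newer++
		}
	})
	fmt.Printf("found %d timestamps more than an hour after %s\n", newer, os.Args[2])
}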

With these observations in mind, it seems quite hard and unreliable to analyze historic reports through Scope, as there may be a bug or two in the historic lookup itself.

@bboreham Am I missing something? Are any of the above issues already known? If not, I'd just file new issues.

I imagine it would be hard for me to tackle this issue in the meantime, as my understanding is that S3 only stores parts of reports, and collecting and merging them manually at a given timestamp would be practically impossible.

@bboreham
Collaborator Author

For the second point, are you sure you're looking at reports from the same node?

@bboreham
Collaborator Author

Maybe you are looking at the merged report for the whole instance?

What we store in S3 is the exact data that came in from each probe; it's a part of the whole instance, but it's a better place to start because it separates any issues in the probe from issues in the merging process.

@bboreham
Collaborator Author

bboreham commented Apr 2, 2019

Another idea would be to use -probe.publish.stdout so you get the whole report in the log, leave the probe running for a while, and then see if you can spot the reports growing.
Maybe modify the code of StdoutPublisher to only print one in 100 or one in 1000 reports, to save on disk space.
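For reference, a minimal sketch of the kind of sampling meant here (hypothetical type, not the actual StdoutPublisher code): drop all but one in every N reports before printing.

package main

import (
	"encoding/json"
	"fmt"
)

// samplingStdoutPublisher prints only one in every `interval` reports,
// keeping log volume manageable while still showing growth over time.
type samplingStdoutPublisher struct {
	interval int
	count    int
}

func (p *samplingStdoutPublisher) Publish(report interface{}) error {
	p.count++
	if p.count%p.interval != 0 {
		return nil // skip this report
	}
	buf, err := json.Marshal(report)
	if err != nil {
		return err
	}
	fmt.Println(string(buf))
	return nil
}

func main() {
	p := &samplingStdoutPublisher{interval: 100}
	for i := 0; i < 300; i++ {
		_ = p.Publish(map[string]int{"report": i}) // only every 100th report is printed
	}
}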

@fbarl removed their assignment Apr 4, 2019
@dholbach added the help-wanted label Apr 16, 2019
@bboreham
Collaborator Author

We got another drop in report sizes after the release of 1.11, three days ago.
This time we also have per-user stats, and I was seeing a lot of dead containers in the large reports from one particular user, so that is probably the next place to look.

@bboreham
Collaborator Author

bboreham commented Jun 4, 2019

So that didn't fix it.

I added a log line to probe/docker/registry.go to print out events as they come in, then ran a command to create and destroy a container, several times:

for x in {1..100} ; do docker run --rm bash ; done
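The added logging is essentially a one-line print of each event's status and container ID; standalone, it would look roughly like this (a sketch assuming the github.com/fsouza/go-dockerclient library that the probe uses, not the actual registry.go change):

package main

import (
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	events := make(chan *docker.APIEvents)
	if err := client.AddEventListener(events); err != nil {
		log.Fatal(err)
	}
	// Print each container event as it arrives, in the same shape as the probe log lines below.
	for event := range events {
		log.Printf("event %s: %s", event.Status, event.ID)
	}
}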

Here is the normal sequence of events:

<probe> INFO: 2019/06/04 18:04:34.004180 event create: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.010732 event attach: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.034367 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:04:34.517102 event start: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.657756 event die: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.738521 event destroy: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0

But sometimes we miss the destroy event:

<probe> INFO: 2019/06/04 18:04:34.907947 event create: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:34.926912 event attach: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:34.937667 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:04:35.254530 event start: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:35.338868 event die: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5

Sometimes we lose both die and destroy:

<probe> INFO: 2019/06/04 18:05:51.305000 event create: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a
<probe> INFO: 2019/06/04 18:05:51.315773 event attach: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a
<probe> INFO: 2019/06/04 18:05:51.332189 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:05:51.944405 event start: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a

and here's a normal one again:

<probe> INFO: 2019/06/04 18:06:22.699808 event create: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:22.716019 event attach: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:22.737031 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:06:23.070987 event start: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:23.206659 event die: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:23.300760 event destroy: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
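As a minimal model of why the lost events matter (hypothetical types, not the probe's actual registry code): the container map is only pruned when a destroy event arrives, so a dropped event leaves the entry behind for the lifetime of the probe, and every subsequent report carries it.

package main

import "fmt"

// registry is a toy stand-in for the probe's container registry, keyed by container ID.
type registry struct {
	containers map[string]struct{}
}

func (r *registry) handleEvent(status, id string) {
	switch status {
	case "create":
		r.containers[id] = struct{}{}
	case "destroy":
		delete(r.containers, id) // never runs if the destroy event is lost
	}
}

func main() {
	r := &registry{containers: map[string]struct{}{}}
	r.handleEvent("create", "5050066d1a14")
	r.handleEvent("destroy", "5050066d1a14") // normal case: entry removed
	r.handleEvent("create", "08b61f48790c")  // destroy never observed: entry leaks
	fmt.Println(len(r.containers))           // 1 stale container left in every future report
}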
