Report sizes seem to increase over days after a restart #3576

Closed
bboreham opened this issue Feb 28, 2019 · 6 comments · Fixed by #3623 or #3627
Labels
help-wanted: An issue that would be good for non-maintainers to attempt
performance: Excessive resource usage and latency; usually a bug or chore

Comments

@bboreham
Collaborator

Noticed in data from a lot of probes - this is the average report size over the last 30 days:

[graph: average report size over the last 30 days]

The two drops are at points where all probes were restarted.
The pattern seems to suggest that the probes are accumulating data in some structures which does not come back after a restart, and is therefore probably unnecessary.

@bboreham added the performance label Feb 28, 2019
@fbarl
Contributor

fbarl commented Mar 14, 2019

Two things I noticed while trying to download some older reports in Weave Cloud with the help of #3581:

  • When downloading raw reports with a timestamp such as 2019-02-12T17:52:00Z, I noticed that their JSON contained a lot of entries with the current timestamp, i.e. 2019-03-14T15:03:55Z. That looks like a bug in the AWS reporter, since these historic raw reports clearly differ from what would have been downloaded at that time in the past (a quick way to check this is sketched after this list).
  • Furthermore, when downloading a few raw reports with the fixed timestamp 2019-02-12T17:52:00Z, I noticed that their size oscillates up and down by about 5% within a few seconds, suggesting that more than just the timestamp entries are taken from the current state rather than the historic one.
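A quick way to check the first point (the sketch referred to above) would be a small standalone tool, hypothetical and not part of Scope, that walks a downloaded raw report JSON and counts string values parsing as RFC3339 timestamps well after the report's nominal timestamp; a large count would mean the report was at least partly generated from current state.

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"time"
)

// walk recursively visits every value in the decoded JSON and calls fn on each string.
func walk(v interface{}, fn func(string)) {
	switch x := v.(type) {
	case map[string]interface{}:
		for _, child := range x {
			walk(child, fn)
		}
	case []interface{}:
		for _, child := range x {
			walk(child, fn)
		}
	case string:
		fn(x)
	}
}

func main() {
	// Usage: go run checkreport.go report.json 2019-02-12T17:52:00Z
	data, err := ioutil.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	nominal, err := time.Parse(time.RFC3339, os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	var report interface{}
	if err := json.Unmarshal(data, &report); err != nil {
		log.Fatal(err)
	}
	newer := 0
	walk(report, func(s string) {
		// Count any timestamp more than an hour after the report's nominal time.
		if t, err := time.Parse(time.RFC3339, s); err == nil && t.After(nominal.Add(time.Hour)) {
			newer++
		}
	})
	fmt.Printf("found %d timestamps more than an hour after %s\n", newer, os.Args[2])
}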

With these observations in mind, it seems quite hard and unreliable to analyze historic reports through Scope, as there may be a bug or two in the historic lookup itself.

@bboreham Am I missing something? Are any of the above issues already known? If not, I'd just file new issues.

I imagine it would be hard for me to tackle this issue in the meantime, as my understanding is that S3 only stores parts of reports, and collecting and merging them manually at a given timestamp would be practically impossible.

@bboreham
Collaborator Author

For the second point, are you sure you're looking at reports from the same node?

@bboreham
Collaborator Author

Maybe you are looking at the merged report for the whole instance?

What we store in S3 is the exact data that came in from each probe; it's a part of the whole instance, but it's a better place to start because it separates any issues in the probe from issues in the merging process.

@bboreham
Collaborator Author

bboreham commented Apr 2, 2019

Another idea would be to use -probe.publish.stdout so you get the whole report in the log, leave the probe running for a while, and then see if you can spot the reports growing.
Maybe modify the code of StdoutPublisher to only print one in 100 or one in 1000 reports, to save on disk space.
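For reference, a minimal sketch of the kind of sampling meant here (hypothetical type, not the actual StdoutPublisher code): drop all but one in every N reports before printing.

package main

import (
	"encoding/json"
	"fmt"
)

// samplingStdoutPublisher prints only one in every `interval` reports,
// keeping log volume manageable while still showing growth over time.
type samplingStdoutPublisher struct {
	interval int
	count    int
}

func (p *samplingStdoutPublisher) Publish(report interface{}) error {
	p.count++
	if p.count%p.interval != 0 {
		return nil // skip this report
	}
	buf, err := json.Marshal(report)
	if err != nil {
		return err
	}
	fmt.Println(string(buf))
	return nil
}

func main() {
	p := &samplingStdoutPublisher{interval: 100}
	for i := 0; i < 300; i++ {
		_ = p.Publish(map[string]int{"report": i}) // only every 100th report is printed
	}
}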

@fbarl removed their assignment Apr 4, 2019
@dholbach added the help-wanted label Apr 16, 2019
@bboreham
Collaborator Author

We got another drop in report sizes after the release of 1.11, three days ago.
This time we also have per-user stats, and I was seeing a lot of dead containers in the large reports from one particular user, so that is probably the next place to look.

@bboreham
Collaborator Author

bboreham commented Jun 4, 2019

So that didn't fix it.

I added a log line to probe/docker/registry.go to print out events as they come in, then ran a command to create and destroy a container, several times:

for x in {1..100} ; do docker run --rm bash ; done
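The added logging is essentially a one-line print of each event's status and container ID; standalone, it would look roughly like this (a sketch assuming the github.com/fsouza/go-dockerclient library that the probe uses, not the actual registry.go change):

package main

import (
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	events := make(chan *docker.APIEvents)
	if err := client.AddEventListener(events); err != nil {
		log.Fatal(err)
	}
	// Print each container event as it arrives, in the same shape as the probe log lines below.
	for event := range events {
		log.Printf("event %s: %s", event.Status, event.ID)
	}
}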

Here is the normal sequence of events:

<probe> INFO: 2019/06/04 18:04:34.004180 event create: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.010732 event attach: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.034367 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:04:34.517102 event start: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.657756 event die: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0
<probe> INFO: 2019/06/04 18:04:34.738521 event destroy: 5050066d1a1417138e8367858abfa6c32b24ba9c26aba88c0130e0905e66fac0

But sometimes we miss the destroy event:

<probe> INFO: 2019/06/04 18:04:34.907947 event create: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:34.926912 event attach: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:34.937667 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:04:35.254530 event start: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5
<probe> INFO: 2019/06/04 18:04:35.338868 event die: 08b61f48790c7a6ae4ea1b084d112f1a1b224033bbb4960a7ae58cdf1fb581c5

Sometimes we lose both die and destroy:

<probe> INFO: 2019/06/04 18:05:51.305000 event create: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a
<probe> INFO: 2019/06/04 18:05:51.315773 event attach: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a
<probe> INFO: 2019/06/04 18:05:51.332189 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:05:51.944405 event start: 5d8afb01b1345a193fcf944a7b4b9062849109ec4e3925f248195fe8b749930a

and here's a normal one again:

<probe> INFO: 2019/06/04 18:06:22.699808 event create: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:22.716019 event attach: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:22.737031 event network:connect: 227fe2d486e77f56b7409f9ca15996aab446df3f687aabed902ced8768b44766
<probe> INFO: 2019/06/04 18:06:23.070987 event start: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:23.206659 event die: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
<probe> INFO: 2019/06/04 18:06:23.300760 event destroy: acb3eccbf48fb486785229fed3ebe98f257392aa646c2e2e21430cefce72af12
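As a minimal model of why the lost events matter (hypothetical types, not the probe's actual registry code): the container map is only pruned when a destroy event arrives, so a dropped event leaves the entry behind for the lifetime of the probe, and every subsequent report carries it.

package main

import "fmt"

// registry is a toy stand-in for the probe's container registry, keyed by container ID.
type registry struct {
	containers map[string]struct{}
}

func (r *registry) handleEvent(status, id string) {
	switch status {
	case "create":
		r.containers[id] = struct{}{}
	case "destroy":
		delete(r.containers, id) // never runs if the destroy event is lost
	}
}

func main() {
	r := &registry{containers: map[string]struct{}{}}
	r.handleEvent("create", "5050066d1a14")
	r.handleEvent("destroy", "5050066d1a14") // normal case: entry removed
	r.handleEvent("create", "08b61f48790c")  // destroy never observed: entry leaks
	fmt.Println(len(r.containers))           // 1 stale container left in every future report
}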
