Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/debug: allow the user to refine which files get included #64094

Merged
merged 3 commits into from
Apr 26, 2021

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 22, 2021

First commit from #64071.
Fixes #54654.

We have observed in practice that the majority of the size of
generated debug zip outputs is attributable to the discrete files
retrieved from the server:

  • log files
  • goroutine dumps
  • heap profiles

As clusters get larger and older, these files increase in number, and
the size of the debug zip output grows accordingly.

Meanwhile, in practice, there are many support cases where we know
ahead of running the debug zip command that only certain files are
going to be useful. Therefore, we wish to be able to restrict
which files get included / excluded from the generated archive.

Hence the functional change described below.

(Note: there is another expressed wish to limit the size of generated
archives based on a time or date range, not just file names /
types. This will be handled in a separate commit.)

Release note (cli change): It is now possible to fine tune which files
get retrieved from the server nodes by the debug zip command, using
the new flag --include-files and --exclude-files. These flags
take glob patterns that are applied to the file names server-side. For
example, to include only log files, use --include-files='*.log'.

The command cockroach debug list-files also accepts these flags and
can thus be used to experiment with the new flags before running the
debug zip command.

@knz knz requested a review from stevendanna April 22, 2021 18:59
@knz knz requested a review from a team as a code owner April 22, 2021 18:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Apr 22, 2021

cc @thtruo

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @stevendanna)

a discussion (no related file):
First commit reviewed over at #64071, apologies for not doing it all here.

Thanks for adding the missing tests!

This looks like a useful addition. I think the only blocker is the function name changes I mentioned in #64071 to get this building again.



pkg/cli/zip_per_node.go, line 388 at r3 (raw file):

				if err := zc.z.createRaw(nodePrinter.start("writing dump"), name, file.Contents); err != nil {
					return err
				}

It feels like there might be some room for a helper function or two in this file that would reduce some of the code duplication. But, that looks true of the existing code as well.

@knz knz force-pushed the 20210422-zip-filter-files branch from f1abc6c to 79c68c8 Compare April 23, 2021 09:59
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)

a discussion (no related file):

Previously, stevendanna (Steven Danna) wrote…

First commit reviewed over at #64071, apologies for not doing it all here.

Thanks for adding the missing tests!

This looks like a useful addition. I think the only blocker is the function name changes I mentioned in #64071 to get this building again.

Thanks this seems to build now.



pkg/cli/zip_per_node.go, line 388 at r3 (raw file):

Previously, stevendanna (Steven Danna) wrote…

It feels like there might be some room for a helper function or two in this file that would reduce some of the code duplication. But, that looks true of the existing code as well.

This is going to be complicated until go provides us with generics. Every one of these blocks has a different mix of types and function signatures.

@knz knz force-pushed the 20210422-zip-filter-files branch 3 times, most recently from 348284f to 085bd52 Compare April 23, 2021 11:08
@knz knz force-pushed the 20210422-zip-filter-files branch from 085bd52 to 0f55cc6 Compare April 23, 2021 12:17
craig bot pushed a commit that referenced this pull request Apr 23, 2021
64071: cli/debug: new sub-command `list-files` r=stevendanna a=knz

Prerequisite to #64094.

This is meant to assist the human operator select appropriate
flags for the `debug zip` command.

Release note (cli change): The new command `debug list-files`
show the list of files that can be retrieved via the `debug zip`
command. It supports the `--nodes` and `--exclude-nodes` parameters
in the same way as `debug zip`.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz knz force-pushed the 20210422-zip-filter-files branch 3 times, most recently from 8493677 to 3368cc2 Compare April 23, 2021 16:52
craig bot pushed a commit that referenced this pull request Apr 23, 2021
63767: bazel: tweak logic for staging test artifacts in bazci r=rail a=rickystewart

Because Bazel aggressively caches build/test artifacts, if we're not
careful, bazci can copy OLD artifacts from previous build/test runs into
the artifacts directory. This is particularly an issue because TeamCity
watches that directory for `test.xml` reports, and if an old `test.xml`
reports a test failure, TeamCity will notice that and report the failure
in the UI -- and importantly, even if we replace that `test.xml` with a
completely different one that reports that the test succeeed, TC will
not amend what's displayed in the UI accordingly. So this can manifest
as reported test failures from unrelated PR's showing up in TC in an
apparently unpredictable (though uncommmon) manner.

We fix this by making bazci a little smarter about when we choose to
stage artifacts:

1. The first time the watcher loops over all the test artifacts, never
   stage anything (the artifacts are probably cached -- not enough time
   has passed for any legitimate artifacts to appear, probably).
2. Only stage artifacts incrementally if their stats have changed since
   the initial round of caching.
3. During the final loop, stage ALL artifacts (if they haven't been
   staged yet), just to make sure we don't miss anything.

Also add lots of comments to make these design decisions and their
motivations a little clearer.

Resolves #63740.

Release note: None

64071: cli/debug: new sub-command `list-files` r=stevendanna a=knz

Prerequisite to #64094.

This is meant to assist the human operator select appropriate
flags for the `debug zip` command.

Release note (cli change): The new command `debug list-files`
show the list of files that can be retrieved via the `debug zip`
command. It supports the `--nodes` and `--exclude-nodes` parameters
in the same way as `debug zip`.

Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
knz added 3 commits April 26, 2021 12:56
The node selection logic was untested. This commit adds the missing
test.

Release note: None
We have observed in practice that the majority of the size of
generated `debug zip` outputs is attributable to the discrete files
retrieved from the server:

- log files
- goroutine dumps
- heap profiles

As clusters get larger and older, these files increase in number, and
the size of the `debug zip` output grows accordingly.

Meanwhile, in practice, there are many support cases where we know
ahead of running the `debug zip` command that only certain files are
going to be useful. Therefore, we wish to be able to restrict
which files get included / excluded from the generated archive.

Hence the functional change described below.

(Note: there is another expressed wish to limit the size of generated
archives based on a *time or date range*, not just file names /
types. This will be handled in a separate commit.)

Release note (cli change): It is now possible to fine tune which files
get retrieved from the server nodes by the `debug zip` command, using
the new flag `--include-files` and `--exclude-files`. These flags
take glob patterns that are applied to the file names server-side. For
example, to include only log files, use `--include-files='*.log'`.

The command `cockroach debug list-files` also accepts these flags and
can thus be used to experiment with the new flags before running the
`debug zip` command.
Release note: None
@knz knz force-pushed the 20210422-zip-filter-files branch from 3368cc2 to d13fda9 Compare April 26, 2021 11:12
@knz
Copy link
Contributor Author

knz commented Apr 26, 2021

@tbg reviewed this in #64128 (review)

Addressing these comments here: #64128 (review)

craig bot pushed a commit that referenced this pull request Apr 26, 2021
64128: cli/zip: clamp file retrieval by date, default to last 48 hours r=tbg,stevendanna a=knz

All commits but the last from #64094 and prior.
Fixes #57795

The size of generated `debug zip` outputs is mostly attributable to
the discrete files retrieved from the server:

- log files
- goroutine dumps
- heap profiles

Even though we have various garbage collection techniques applied to
these files, they still yield very large `debug zip` outputs:

- the default retained cap for log files is 100MB per group,
  so with the default of up to 5 log groups per node, we are retaining
  up to 500MB of log files per node.
- the default retained cap for goroutine dumps, heap profiles and
  memory stats is 128MB for each, so we are retaining up to 384MB of
  these files per node.

For example, for a 10-node cluster we are looking at up to 7GB of data
retained in these files in the default configuration.

While this amount of data is relatively innocuous while it remains
in the data directories server-side, the `debug zip` behavior
which was previously to retrieve *all* of it was detrimental
to the process of troubleshooting anomalies: this is a lot
of data to move around, it cannot be e-mailed effectively, etc.

In comparison, the other items retrieved by the `debug zip`
command (range descriptors, SQL system tables) are typically just
kilobytes large. Even a large cluster with tens of thousands of ranges
and system tables typically incurs `zip` payloads no greater than a dozen
megabytes.

Hence the change described in the release note below.

Release note (cli change): The `cockroach debug zip` command now
retrieves only the log files, goroutine dumps and heap profiles
pertaining to the last 48 hours prior to the command invocation.

This behavior is supported entirely client-side, i.e. it is not
necessary to upgrade the server nodes to effect these newly
configurable limits.

This behavior can be customized by the two new flags `--files-from`
and `--files-until`. Both are optional. See the command-line
help text for details.

The other data items retrieved by `debug zip` are not affected
by this time limit.

The two new flags are also supported by `cockroach debug list-files`.
It is advised to experiment with `list-files` prior to issuing
a `debug zip` command that may retrieve a large amount of data.

64211: lint: check `make bazel-generate` earlier in the `teamcity-check` script r=rail a=rickystewart

We used to do this after `make buildshort` and `make generate`, but
those can take a while. Do it before so obvious Bazel linter errors
break the build sooner.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@craig craig bot merged commit d13fda9 into cockroachdb:master Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

debug.zip an option to limit what logs files are included
3 participants