Merge #64128 #64211
64128: cli/zip: clamp file retrieval by date, default to last 48 hours r=tbg,stevendanna a=knz

All commits but the last from #64094 and prior.
Fixes #57795

The size of generated `debug zip` outputs is mostly attributable to
the discrete files retrieved from the server:

- log files
- goroutine dumps
- heap profiles

Even though we have various garbage collection techniques applied to
these files, they still yield very large `debug zip` outputs:

- the default retained cap for log files is 100MB per group,
  so with the default of up to 5 log groups per node, we are retaining
  up to 500MB of log files per node.
- the default retained cap for goroutine dumps, heap profiles and
  memory stats is 128MB for each, so we are retaining up to 384MB of
  these files per node.

For example, for a 10-node cluster we are looking at nearly 9GB of data
(10 × 884MB) retained in these files in the default configuration.
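
The arithmetic behind that estimate can be checked with a short sketch. It uses only the caps quoted above; the constant and function names are hypothetical and do not come from the CockroachDB code base.

```go
package main

import "fmt"

// Default retention caps as quoted in the commit message, in megabytes.
// The identifiers are illustrative only.
const (
	logCapPerGroupMB = 100 // retained log files per log group
	logGroupsPerNode = 5   // default number of log groups per node
	dumpCapMB        = 128 // cap for each of: goroutine dumps, heap profiles, memory stats
	dumpKinds        = 3
)

// perNodeMB returns the worst-case size of retained files on one node.
func perNodeMB() int {
	return logCapPerGroupMB*logGroupsPerNode + dumpCapMB*dumpKinds
}

// clusterMB returns the worst-case size of retained files for a cluster.
func clusterMB(nodes int) int {
	return nodes * perNodeMB()
}

func main() {
	fmt.Printf("per node: %dMB\n", perNodeMB())   // per node: 884MB
	fmt.Printf("10 nodes: %dMB\n", clusterMB(10)) // 10 nodes: 8840MB
}
```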

While this amount of data is relatively innocuous as long as it stays
in the data directories server-side, the previous `debug zip` behavior
of retrieving *all* of it was detrimental to the process of
troubleshooting anomalies: this is a lot of data to move around, it
cannot be e-mailed effectively, etc.

In comparison, the other items retrieved by the `debug zip`
command (range descriptors, SQL system tables) are typically just
kilobytes large. Even a large cluster with tens of thousands of ranges
and system tables typically incurs `zip` payloads no greater than a dozen
megabytes.

Hence the change described in the release note below.

Release note (cli change): The `cockroach debug zip` command now
retrieves only the log files, goroutine dumps and heap profiles
pertaining to the last 48 hours prior to the command invocation.

This behavior is supported entirely client-side, i.e. it is not
necessary to upgrade the server nodes to effect these newly
configurable limits.

This behavior can be customized by the two new flags `--files-from`
and `--files-until`. Both are optional. See the command-line
help text for details.

The other data items retrieved by `debug zip` are not affected
by this time limit.

The two new flags are also supported by `cockroach debug list-files`.
It is advised to experiment with `list-files` prior to issuing
a `debug zip` command that may retrieve a large amount of data.

64211: lint: check `make bazel-generate` earlier in the `teamcity-check` script r=rail a=rickystewart

We used to do this after `make buildshort` and `make generate`, but
those can take a while. Do it before so obvious Bazel linter errors
break the build sooner.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
3 people committed Apr 26, 2021
3 parents da685bb + 1e9840f + 6a2b422 commit 17e605b
Showing 17 changed files with 643 additions and 98 deletions.
8 changes: 4 additions & 4 deletions build/teamcity-check.sh
@@ -38,16 +38,16 @@ fi

tc_start_block "Ensure generated code is up-to-date"
# Buffer noisy output and only print it on failure.
+TEAMCITY_BAZEL_SUPPORT_LINT=1 # See teamcity-bazel-support.sh.
+run run_bazel build/bazelutil/bazel-generate.sh &> artifacts/buildshort.log || (cat artifacts/buildshort.log && false)
+rm artifacts/buildshort.log
+check_clean "Run \`make bazel-generate\` to automatically regenerate these."
run build/builder.sh make generate &> artifacts/generate.log || (cat artifacts/generate.log && false)
rm artifacts/generate.log
check_clean "Run \`make generate\` to automatically regenerate these."
run build/builder.sh make buildshort &> artifacts/buildshort.log || (cat artifacts/buildshort.log && false)
rm artifacts/buildshort.log
check_clean "Run \`make buildshort\` to automatically regenerate these."
-TEAMCITY_BAZEL_SUPPORT_LINT=1 # See teamcity-bazel-support.sh.
-run run_bazel build/bazelutil/bazel-generate.sh &> artifacts/buildshort.log || (cat artifacts/buildshort.log && false)
-rm artifacts/buildshort.log
-check_clean "Run \`make bazel-generate\` to automatically regenerate these."
tc_end_block "Ensure generated code is up-to-date"

# generated code can generate new dependencies; check dependencies after generated code.
1 change: 1 addition & 0 deletions pkg/cli/BUILD.bazel
@@ -284,6 +284,7 @@ go_test(
"start_test.go",
"statement_diag_test.go",
"userfiletable_test.go",
+"zip_helpers_test.go",
"zip_test.go",
],
data = glob(["testdata/**"]),
90 changes: 90 additions & 0 deletions pkg/cli/cliflags/flags.go
@@ -1288,6 +1288,96 @@ list of node IDs or ranges of node IDs, for example: 5,10-20,23.
The default is to not exclude any node.`,
}

+ZipIncludedFiles = FlagInfo{
+Name: "include-files",
+Description: `
+List of glob patterns that determine files that can be included
+in the output. The list can be specified as a comma-delimited
+list of patterns, or by using the flag multiple times.
+The patterns apply to the base name of the file, without
+a path prefix.
+The default is to include all files.
+<PRE>
+</PRE>
+This flag is applied before --exclude-files; for example,
+including '*.log' and then excluding '*foo*.log' will
+exclude 'barfoos.log'.
+<PRE>
+</PRE>
+You can use the 'debug list-files' command to explore how
+this flag is applied.`,
+}
+
+ZipExcludedFiles = FlagInfo{
+Name: "exclude-files",
+Description: `
+List of glob patterns that determine files that are to
+be excluded from the output. The list can be specified
+as a comma-delimited list of patterns, or by using the
+flag multiple times.
+The patterns apply to the base name of the file, without
+a path prefix.
+<PRE>
+</PRE>
+This flag is applied after --include-files; for example,
+including '*.log' and then excluding '*foo*.log' will
+exclude 'barfoos.log'.
+<PRE>
+</PRE>
+You can use the 'debug list-files' command to explore how
+this flag is applied.`,
+}
+
+ZipFilesFrom = FlagInfo{
+Name: "files-from",
+Description: `
+Limit file collection to those files modified after the
+specified timestamp, inclusive.
+The timestamp can be expressed as YYYY-MM-DD,
+YYYY-MM-DD HH:MM or YYYY-MM-DD HH:MM:SS and is interpreted
+in the UTC time zone.
+The default value for this flag is 48 hours before now.
+<PRE>
+</PRE>
+When customizing this flag to capture a narrow range
+of time, consider adding extra seconds/minutes
+to the range to accommodate clock drift and uncertainties.
+<PRE>
+</PRE>
+You can use the 'debug list-files' command to explore how
+this flag is applied.`,
+}
+
+ZipFilesUntil = FlagInfo{
+Name: "files-until",
+Description: `
+Limit file collection to those files created before the
+specified timestamp, inclusive.
+The timestamp can be expressed as YYYY-MM-DD,
+YYYY-MM-DD HH:MM or YYYY-MM-DD HH:MM:SS and is interpreted
+in the UTC time zone.
+The default value for this flag is some time beyond
+the current time, to ensure files created during
+the collection are also included.
+<PRE>
+</PRE>
+When customizing this flag to capture a narrow range
+of time, consider adding extra seconds/minutes
+to the range to accommodate clock drift and uncertainties.
+<PRE>
+</PRE>
+You can use the 'debug list-files' command to explore how
+this flag is applied.`,
+}
+
ZipRedactLogs = FlagInfo{
Name: "redact-logs",
Description: `
13 changes: 13 additions & 0 deletions pkg/cli/context.go
@@ -25,6 +25,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/sql/sem/tree"
"github.com/cockroachdb/cockroach/pkg/storage"
"github.com/cockroachdb/cockroach/pkg/util/log/logconfig"
+"github.com/cockroachdb/cockroach/pkg/util/timeutil"
"github.com/mattn/go-isatty"
"github.com/spf13/cobra"
"github.com/spf13/pflag"
@@ -330,16 +331,28 @@ type zipContext struct {
// How much concurrency to use during the collection. The code
// attempts to access multiple nodes concurrently by default.
concurrency int
+
+// The log/heap/etc files to include.
+files fileSelection
}

// setZipContextDefaults set the default values in zipCtx. This
// function is called by initCLIDefaults() and thus re-called in every
// test that exercises command-line parsing.
func setZipContextDefaults() {
zipCtx.nodes = nodeSelection{}
+zipCtx.files = fileSelection{}
zipCtx.redactLogs = false
zipCtx.cpuProfDuration = 5 * time.Second
zipCtx.concurrency = 15
+
+// File selection covers the last 48 hours by default.
+// We add 24 hours to now for the end timestamp to ensure
+// that files created during the zip operation are
+// also included.
+now := timeutil.Now()
+zipCtx.files.startTimestamp = timestampValue(now.Add(-48 * time.Hour))
+zipCtx.files.endTimestamp = timestampValue(now.Add(24 * time.Hour))
}

// dumpCtx captures the command-line parameters of the `dump` command.
26 changes: 19 additions & 7 deletions pkg/cli/debug_list_files.go
@@ -36,6 +36,10 @@ var debugListFilesCmd = &cobra.Command{
}

func runDebugListFiles(cmd *cobra.Command, _ []string) error {
+if err := zipCtx.files.validate(); err != nil {
+return err
+}
+
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

@@ -110,7 +114,7 @@ func runDebugListFiles(cmd *cobra.Command, _ []string) error {
NodeId: nodeIDs,
ListOnly: true,
Type: serverpb.FileType(fileType),
-Patterns: []string{"*"},
+Patterns: zipCtx.files.retrievalPatterns(),
})
if err != nil {
log.Warningf(ctx, "cannot retrieve %s file list from node %d: %v", serverpb.FileType_name[fileType], nodeID, err)
@@ -132,18 +136,26 @@ func runDebugListFiles(cmd *cobra.Command, _ []string) error {
for _, nodeID := range nodeList {
nodeIDs := fmt.Sprintf("%d", nodeID)
for _, logFile := range logFiles[nodeID] {
+ctime := extractTimeFromFileName(logFile.Name)
+mtime := timeutil.Unix(0, logFile.ModTimeNanos)
+if !zipCtx.files.isIncluded(logFile.Name, ctime, mtime) {
+continue
+}
totalSize += logFile.SizeBytes
-ctime := formatTimeSimple(extractTimeFromFileName(logFile.Name))
-mtime := formatTimeSimple(timeutil.Unix(0, logFile.ModTimeNanos))
-rows = append(rows, []string{nodeIDs, "log", logFile.Name, ctime, mtime, fmt.Sprintf("%d", logFile.SizeBytes)})
+ctimes := formatTimeSimple(ctime)
+mtimes := formatTimeSimple(mtime)
+rows = append(rows, []string{nodeIDs, "log", logFile.Name, ctimes, mtimes, fmt.Sprintf("%d", logFile.SizeBytes)})
}
for _, ft := range fileTypes {
fileType := int32(ft)
for _, other := range otherFiles[nodeID][fileType] {
+ctime := extractTimeFromFileName(other.Name)
+if !zipCtx.files.isIncluded(other.Name, ctime, ctime) {
+continue
+}
totalSize += other.FileSize
-ctime := formatTimeSimple(extractTimeFromFileName(other.Name))
-mtime := ctime
-rows = append(rows, []string{nodeIDs, fileTypeNames[fileType], other.Name, ctime, mtime, fmt.Sprintf("%d", other.FileSize)})
+ctimes := formatTimeSimple(ctime)
+rows = append(rows, []string{nodeIDs, fileTypeNames[fileType], other.Name, ctimes, ctimes, fmt.Sprintf("%d", other.FileSize)})
}
}
}
6 changes: 5 additions & 1 deletion pkg/cli/flags.go
@@ -626,11 +626,15 @@ func init() {
durationFlag(f, &zipCtx.cpuProfDuration, cliflags.ZipCPUProfileDuration)
intFlag(f, &zipCtx.concurrency, cliflags.ZipConcurrency)
}
-// List-nodes + Zip commands.
+// List-files + Zip commands.
for _, cmd := range []*cobra.Command{debugZipCmd, debugListFilesCmd} {
f := cmd.Flags()
varFlag(f, &zipCtx.nodes.inclusive, cliflags.ZipNodes)
varFlag(f, &zipCtx.nodes.exclusive, cliflags.ZipExcludeNodes)
+stringSliceFlag(f, &zipCtx.files.includePatterns, cliflags.ZipIncludedFiles)
+stringSliceFlag(f, &zipCtx.files.excludePatterns, cliflags.ZipExcludedFiles)
+varFlag(f, &zipCtx.files.startTimestamp, cliflags.ZipFilesFrom)
+varFlag(f, &zipCtx.files.endTimestamp, cliflags.ZipFilesUntil)
}

// Decommission command.
35 changes: 35 additions & 0 deletions pkg/cli/interactive_tests/test_zip_filter.tcl
@@ -0,0 +1,35 @@
+#! /usr/bin/env expect -f
+
+source [file join [file dirname $argv0] common.tcl]
+
+spawn /bin/bash
+send "PS1=':''/# '\r"
+eexpect ":/# "
+
+start_server $argv
+
+# A server start populates a log directory with multiple files. We expect
+# at least cockroach-stderr.*, cockroach.* and cockroach-pebble.*.
+
+start_test "Check that --include-files excludes files that do not match the pattern"
+send "$argv debug zip --cpu-profile-duration=0 --include-files='cockroach.*' /dev/null\r"
+
+eexpect "log files found"
+eexpect "skipping excluded log file: cockroach-stderr."
+eexpect "skipping excluded log file: cockroach-pebble."
+eexpect "\[log file: cockroach."
+eexpect ":/# "
+end_test
+
+start_test "Check that --exclude-files excludes files that match the pattern"
+send "$argv debug zip --cpu-profile-duration=0 --exclude-files='cockroach.*' /dev/null\r"
+
+eexpect "log files found"
+eexpect "\[log file: cockroach-stderr."
+eexpect "\[log file: cockroach-pebble."
+eexpect "skipping excluded log file: cockroach."
+eexpect ":/# "
+end_test
+
+
+stop_server $argv
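
The include-then-exclude behavior exercised by this test can be sketched as follows, assuming glob matching on the file's base name as the flag help text describes. Go's `path.Match` is used here as a stand-in for whatever matcher the CLI actually uses; `matchAny`, `selected` and the file names are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// matchAny reports whether base matches at least one glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchAny(patterns []string, base string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

// selected applies the include patterns first, then the exclude patterns.
// An empty include list means "include all files".
func selected(include, exclude []string, name string) bool {
	base := path.Base(name) // patterns apply to the base name only
	if len(include) > 0 && !matchAny(include, base) {
		return false
	}
	return !matchAny(exclude, base)
}

func main() {
	include := []string{"*.log"}
	exclude := []string{"*foo*.log"}
	for _, name := range []string{"logs/cockroach.log", "logs/barfoos.log", "heap.pprof"} {
		fmt.Printf("%s: %v\n", name, selected(include, exclude, name))
	}
	// 'cockroach.*' requires a dot right after "cockroach", so it does
	// not match the cockroach-stderr.* group.
	fmt.Println(selected([]string{"cockroach.*"}, nil, "cockroach-stderr.log")) // false
}
```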
36 changes: 18 additions & 18 deletions pkg/cli/testdata/zip/partial1
@@ -54,12 +54,12 @@ debug zip --concurrency=1 --cpu-profile-duration=0s /dev/null
[node 1] requesting stacks... received response... writing binary output: debug/nodes/1/stacks.txt... done
[node 1] requesting threads... received response... writing binary output: debug/nodes/1/threads.txt... done
[node 1] requesting heap profile... received response... writing binary output: debug/nodes/1/heap.pprof... done
-[node 1] requesting heap files... received response...
-[node 1] requesting heap files: last request failed: rpc error: ...
-[node 1] requesting heap files: creating error output: debug/nodes/1/heapprof.err.txt... done
-[node 1] requesting goroutine dumps... received response...
-[node 1] requesting goroutine dumps: last request failed: rpc error: ...
-[node 1] requesting goroutine dumps: creating error output: debug/nodes/1/goroutines.err.txt... done
+[node 1] requesting heap file list... received response...
+[node 1] requesting heap file list: last request failed: rpc error: ...
+[node 1] requesting heap file list: creating error output: debug/nodes/1/heapprof.err.txt... done
+[node 1] requesting goroutine dump list... received response...
+[node 1] requesting goroutine dump list: last request failed: rpc error: ...
+[node 1] requesting goroutine dump list: creating error output: debug/nodes/1/goroutines.err.txt... done
[node 1] requesting log file ...
[node 1] 1 log file ...
[node 1] [log file ...
@@ -173,12 +173,12 @@ debug zip --concurrency=1 --cpu-profile-duration=0s /dev/null
[node 2] requesting heap profile... received response...
[node 2] requesting heap profile: last request failed: rpc error: ...
[node 2] requesting heap profile: creating error output: debug/nodes/2/heap.pprof.err.txt... done
-[node 2] requesting heap files... received response...
-[node 2] requesting heap files: last request failed: rpc error: ...
-[node 2] requesting heap files: creating error output: debug/nodes/2/heapprof.err.txt... done
-[node 2] requesting goroutine dumps... received response...
-[node 2] requesting goroutine dumps: last request failed: rpc error: ...
-[node 2] requesting goroutine dumps: creating error output: debug/nodes/2/goroutines.err.txt... done
+[node 2] requesting heap file list... received response...
+[node 2] requesting heap file list: last request failed: rpc error: ...
+[node 2] requesting heap file list: creating error output: debug/nodes/2/heapprof.err.txt... done
+[node 2] requesting goroutine dump list... received response...
+[node 2] requesting goroutine dump list: last request failed: rpc error: ...
+[node 2] requesting goroutine dump list: creating error output: debug/nodes/2/goroutines.err.txt... done
[node 2] requesting log file ...
[node 2] requesting log file ...
[node 2] requesting log file ...
@@ -210,12 +210,12 @@ debug zip --concurrency=1 --cpu-profile-duration=0s /dev/null
[node 3] requesting stacks... received response... writing binary output: debug/nodes/3/stacks.txt... done
[node 3] requesting threads... received response... writing binary output: debug/nodes/3/threads.txt... done
[node 3] requesting heap profile... received response... writing binary output: debug/nodes/3/heap.pprof... done
-[node 3] requesting heap files... received response...
-[node 3] requesting heap files: last request failed: rpc error: ...
-[node 3] requesting heap files: creating error output: debug/nodes/3/heapprof.err.txt... done
-[node 3] requesting goroutine dumps... received response...
-[node 3] requesting goroutine dumps: last request failed: rpc error: ...
-[node 3] requesting goroutine dumps: creating error output: debug/nodes/3/goroutines.err.txt... done
+[node 3] requesting heap file list... received response...
+[node 3] requesting heap file list: last request failed: rpc error: ...
+[node 3] requesting heap file list: creating error output: debug/nodes/3/heapprof.err.txt... done
+[node 3] requesting goroutine dump list... received response...
+[node 3] requesting goroutine dump list: last request failed: rpc error: ...
+[node 3] requesting goroutine dump list: creating error output: debug/nodes/3/goroutines.err.txt... done
[node 3] requesting log file ...
[node 3] 1 log file ...
[node 3] [log file ...
24 changes: 12 additions & 12 deletions pkg/cli/testdata/zip/partial1_excluded
@@ -54,12 +54,12 @@ debug zip /dev/null --concurrency=1 --exclude-nodes=2 --cpu-profile-duration=0
[node 1] requesting stacks... received response... writing binary output: debug/nodes/1/stacks.txt... done
[node 1] requesting threads... received response... writing binary output: debug/nodes/1/threads.txt... done
[node 1] requesting heap profile... received response... writing binary output: debug/nodes/1/heap.pprof... done
-[node 1] requesting heap files... received response...
-[node 1] requesting heap files: last request failed: rpc error: ...
-[node 1] requesting heap files: creating error output: debug/nodes/1/heapprof.err.txt... done
-[node 1] requesting goroutine dumps... received response...
-[node 1] requesting goroutine dumps: last request failed: rpc error: ...
-[node 1] requesting goroutine dumps: creating error output: debug/nodes/1/goroutines.err.txt... done
+[node 1] requesting heap file list... received response...
+[node 1] requesting heap file list: last request failed: rpc error: ...
+[node 1] requesting heap file list: creating error output: debug/nodes/1/heapprof.err.txt... done
+[node 1] requesting goroutine dump list... received response...
+[node 1] requesting goroutine dump list: last request failed: rpc error: ...
+[node 1] requesting goroutine dump list: creating error output: debug/nodes/1/goroutines.err.txt... done
[node 1] requesting log file ...
[node 1] 1 log file ...
[node 1] [log file ...
@@ -128,12 +128,12 @@ debug zip /dev/null --concurrency=1 --exclude-nodes=2 --cpu-profile-duration=0
[node 3] requesting stacks... received response... writing binary output: debug/nodes/3/stacks.txt... done
[node 3] requesting threads... received response... writing binary output: debug/nodes/3/threads.txt... done
[node 3] requesting heap profile... received response... writing binary output: debug/nodes/3/heap.pprof... done
-[node 3] requesting heap files... received response...
-[node 3] requesting heap files: last request failed: rpc error: ...
-[node 3] requesting heap files: creating error output: debug/nodes/3/heapprof.err.txt... done
-[node 3] requesting goroutine dumps... received response...
-[node 3] requesting goroutine dumps: last request failed: rpc error: ...
-[node 3] requesting goroutine dumps: creating error output: debug/nodes/3/goroutines.err.txt... done
+[node 3] requesting heap file list... received response...
+[node 3] requesting heap file list: last request failed: rpc error: ...
+[node 3] requesting heap file list: creating error output: debug/nodes/3/heapprof.err.txt... done
+[node 3] requesting goroutine dump list... received response...
+[node 3] requesting goroutine dump list: last request failed: rpc error: ...
+[node 3] requesting goroutine dump list: creating error output: debug/nodes/3/goroutines.err.txt... done
[node 3] requesting log file ...
[node 3] 1 log file ...
[node 3] [log file ...