63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach

This PR adds renovated Grafana dashboards that aim for
feature parity with the DB Console metrics page.

Dashboards:
- Overview
- Hardware
- Runtime
- SQL
- Storage
- Replication
- Distributed
- Queues
- Slow Requests
- Changefeeds

These dashboards can be previewed by following the instructions in
the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick Grafana instance.

Release note (ops change): The Grafana dashboards have been updated to
more closely resemble the DB Console metrics page.

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting
all the root spans on the node's inflight registry, and placing
`tracingpb.RecordedSpans` into a flat slice. This caused loss of
information about which spans belonged to a chain
rooted at a fixed root span. Such a chain is referred to as a
`tracing.Recording`. Every node can have multiple `tracing.Recording`s
with the same `trace_id`, and they each represent a traced remote
operation.

This change maintains the `tracing.Recording` grouping of spans
by getting the collector to return a `[]tracing.Recording` for each
node. The collector's unit of iteration consequently becomes a
`tracing.Recording`. This makes more sense when you think about
how we want to consume these traces. Every `tracing.Recording` is
a new traced remote operation, and should be visualized as such in
Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID
of the node that the current `tracing.Recording` belongs to.
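
A consumer of such an iterator might look roughly like the sketch below. It is a self-contained mock rather than the real `collector.Iterator`: the `nodeID`, `recording`, and `iterator` types, the `fetch` callback (standing in for the per-node RPC), and `groupByNode` are all assumptions made for illustration. Only the `Valid`/`Next`/`Value` shape follows the description above.

```go
package main

import "fmt"

// nodeID and recording are hypothetical stand-ins for roachpb.NodeID and
// tracing.Recording. A recording here is just a list of span names.
type nodeID int32
type recording []string

// iterator is a mock that buffers the recordings of one node at a time.
type iterator struct {
	nodes   []nodeID
	fetch   func(nodeID) []recording // stands in for the per-node RPC
	nodeIdx int
	curNode nodeID
	recs    []recording
	recIdx  int
}

func newIterator(nodes []nodeID, fetch func(nodeID) []recording) *iterator {
	it := &iterator{nodes: nodes, fetch: fetch}
	it.fill()
	return it
}

// fill resets the buffer and advances through the remaining nodes until one
// of them returns at least one recording, or no nodes are left.
func (it *iterator) fill() {
	it.recs, it.recIdx = nil, 0
	for len(it.recs) == 0 && it.nodeIdx < len(it.nodes) {
		it.curNode = it.nodes[it.nodeIdx]
		it.recs = it.fetch(it.curNode)
		it.nodeIdx++
	}
}

// Valid reports whether Value can be called.
func (it *iterator) Valid() bool { return it.recIdx < len(it.recs) }

// Next moves to the next recording, refilling from the next node if needed.
func (it *iterator) Next() {
	it.recIdx++
	if it.recIdx >= len(it.recs) {
		it.fill()
	}
}

// Value returns the current recording together with the node it came from.
func (it *iterator) Value() (nodeID, recording) {
	return it.curNode, it.recs[it.recIdx]
}

// groupByNode drains the iterator into a per-node map of recordings.
func groupByNode(it *iterator) map[nodeID][]recording {
	res := make(map[nodeID][]recording)
	for ; it.Valid(); it.Next() {
		id, rec := it.Value()
		res[id] = append(res[id], rec)
	}
	return res
}

func main() {
	data := map[nodeID][]recording{
		1: {{"root", "root.child"}},
		2: {{"root.child.remotechild"}},
	}
	it := newIterator([]nodeID{1, 2}, func(id nodeID) []recording { return data[id] })
	for id, recs := range groupByNode(it) {
		fmt.Printf("n%d: %d recording(s)\n", id, len(recs))
	}
}
```

Returning the node ID alongside each recording means the caller does not need to inspect the spans to work out where a recording was collected.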

Informs: #64992

Release note: None

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

Fixes #66683

Release note: None

Co-authored-by: sai-roach <sai@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
4 people committed Jun 22, 2021
4 parents 2cafdfb + 70bff82 + 804ec07 + bfdb157 commit 3255e7c
Showing 20 changed files with 9,366 additions and 4,798 deletions.
681 changes: 681 additions & 0 deletions monitoring/grafana-dashboards/changefeeds.json

Large diffs are not rendered by default.

1,156 changes: 1,156 additions & 0 deletions monitoring/grafana-dashboards/distributed.json

Large diffs are not rendered by default.

1,157 changes: 1,157 additions & 0 deletions monitoring/grafana-dashboards/hardware.json

Large diffs are not rendered by default.

614 changes: 614 additions & 0 deletions monitoring/grafana-dashboards/overview.json

Large diffs are not rendered by default.

1,479 changes: 1,479 additions & 0 deletions monitoring/grafana-dashboards/queues.json

Large diffs are not rendered by default.

1,841 changes: 0 additions & 1,841 deletions monitoring/grafana-dashboards/replicas.json

This file was deleted.

1,085 changes: 1,085 additions & 0 deletions monitoring/grafana-dashboards/replication.json

Large diffs are not rendered by default.

1,658 changes: 337 additions & 1,321 deletions monitoring/grafana-dashboards/runtime.json

Large diffs are not rendered by default.

571 changes: 571 additions & 0 deletions monitoring/grafana-dashboards/slow_request.json

Large diffs are not rendered by default.

2,104 changes: 1,412 additions & 692 deletions monitoring/grafana-dashboards/sql.json

Large diffs are not rendered by default.

1,263 changes: 482 additions & 781 deletions monitoring/grafana-dashboards/storage.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pkg/util/tracing/collector/BUILD.bazel
@@ -15,7 +15,6 @@ go_library(
"//pkg/rpc/nodedialer",
"//pkg/util/log",
"//pkg/util/tracing",
"//pkg/util/tracing/tracingpb",
"//pkg/util/tracing/tracingservicepb:tracingservicepb_go_proto",
],
)
@@ -31,6 +30,7 @@ go_test(
"//pkg/base",
"//pkg/ccl/utilccl",
"//pkg/kv/kvserver/liveness",
"//pkg/roachpb",
"//pkg/rpc/nodedialer",
"//pkg/security",
"//pkg/security/securitytest",
78 changes: 46 additions & 32 deletions pkg/util/tracing/collector/collector.go
@@ -20,7 +20,6 @@ import (
"github.com/cockroachdb/cockroach/pkg/rpc/nodedialer"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/tracing"
"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingpb"
"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingservicepb"
)

@@ -50,9 +49,9 @@ func New(
}
}

// Iterator can be used to return RecordedSpans from all live
// nodes in the cluster, in a streaming manner. The iterator buffers the
// RecordedSpans of one node at a time.
// Iterator can be used to return tracing.Recordings from all live nodes in the
// cluster, in a streaming manner. The iterator buffers the tracing.Recordings
// of one node at a time.
type Iterator struct {
collector *TraceCollector

@@ -64,19 +63,24 @@ type Iterator struct {
// and will be contacted for inflight trace spans by the iterator.
liveNodes []roachpb.NodeID

// curNodeIndex maintains the node from which the iterator has pulled inflight
// span recordings and buffered them in `recordedSpans` for consumption via
// the iterator.
// curNodeIndex maintains the index in liveNodes from which the iterator has
// pulled inflight span recordings and buffered them in `recordedSpans` for
// consumption via the iterator.
curNodeIndex int

// recordedSpanIndex maintains the current position of of the iterator in the
// list of recorded spans. The recorded spans that the iterator points to are
// buffered in `recordedSpans`.
recordedSpanIndex int
// curNode maintains the node from which the iterator has pulled inflight span
// recordings and buffered them in `recordings` for consumption via the
// iterator.
curNode roachpb.NodeID

// recordedSpans represents all recorded spans for a given node currently
// recordingIndex maintains the current position of the iterator in the list
// of tracing.Recordings. The tracing.Recording that the iterator points to is
// buffered in `recordings`.
recordingIndex int

// recordings represent all the tracing.Recordings for a given node currently
// accessed by the iterator.
recordedSpans []tracingpb.RecordedSpan
recordings []tracing.Recording

iterErr error
}
@@ -104,9 +108,9 @@ func (i *Iterator) Valid() bool {
return false
}

// If recordedSpanIndex is within recordedSpans and there are some buffered
// recordedSpans, it is valid to return from the buffer.
if i.recordedSpans != nil && i.recordedSpanIndex < len(i.recordedSpans) {
// If recordingIndex is within recordings and there are some buffered
// recordings, it is valid to return from the buffer.
if i.recordings != nil && i.recordingIndex < len(i.recordings) {
return true
}

@@ -117,29 +121,29 @@ func (i *Iterator) Valid() bool {

// Next sets the Iterator to point to the next value to be returned.
func (i *Iterator) Next() {
i.recordedSpanIndex++
i.recordingIndex++

// If recordedSpanIndex is within recordedSpans and there are some buffered
// recordedSpans, then we can return them when Value() is called.
if i.recordedSpans != nil && i.recordedSpanIndex < len(i.recordedSpans) {
// If recordingIndex is within recordings and there are some buffered
// recordings, it is valid to return from the buffer.
if i.recordings != nil && i.recordingIndex < len(i.recordings) {
return
}

// Reset buffer variables.
i.recordedSpans = nil
i.recordedSpanIndex = 0
i.recordings = nil
i.recordingIndex = 0

// Either there are no more spans or we have exhausted the recordings from the
// current node, and we need to pull the inflight recordings from another
// node.
// Keep searching for recordings from all live nodes in the cluster.
for i.recordedSpans == nil {
for i.recordings == nil {
// No more spans to return from any of the live nodes in the cluster.
if !(i.curNodeIndex < len(i.liveNodes)) {
return
}
i.recordedSpans, i.iterErr = i.collector.getTraceSpanRecordingsForNode(i.ctx, i.traceID,
i.liveNodes[i.curNodeIndex])
i.curNode = i.liveNodes[i.curNodeIndex]
i.recordings, i.iterErr = i.collector.getTraceSpanRecordingsForNode(i.ctx, i.traceID, i.curNode)
// TODO(adityamaru): We might want to consider not failing if a single node
// fails to return span recordings.
if i.iterErr != nil {
@@ -150,8 +154,8 @@ func (i *Iterator) Next() {
}

// Value returns the current value pointed to by the Iterator.
func (i *Iterator) Value() tracingpb.RecordedSpan {
return i.recordedSpans[i.recordedSpanIndex]
func (i *Iterator) Value() (roachpb.NodeID, tracing.Recording) {
return i.curNode, i.recordings[i.recordingIndex]
}

// Error returns the error encountered by the Iterator during iteration.
@@ -166,22 +170,32 @@ func (i *Iterator) Error() error {
// inflight spans, and relies on gRPC short circuiting local requests.
func (t *TraceCollector) getTraceSpanRecordingsForNode(
ctx context.Context, traceID uint64, nodeID roachpb.NodeID,
) ([]tracingpb.RecordedSpan, error) {
) ([]tracing.Recording, error) {
log.Infof(ctx, "getting span recordings from node %s", nodeID.String())
conn, err := t.dialer.Dial(ctx, nodeID, rpc.DefaultClass)
if err != nil {
return nil, err
}
traceClient := tracingservicepb.NewTracingClient(conn)
resp, err := traceClient.GetSpanRecordings(ctx,
&tracingservicepb.SpanRecordingRequest{TraceID: traceID})
&tracingservicepb.GetSpanRecordingsRequest{TraceID: traceID})
if err != nil {
return nil, err
}

sort.SliceStable(resp.SpanRecordings, func(i, j int) bool {
return resp.SpanRecordings[i].StartTime.Before(resp.SpanRecordings[j].StartTime)
var res []tracing.Recording
for _, recording := range resp.Recordings {
if recording.RecordedSpans == nil {
continue
}
res = append(res, recording.RecordedSpans)
}

// This sort ensures that if a node has multiple trace.Recordings then they
// are ordered relative to each other by StartTime.
sort.SliceStable(res, func(i, j int) bool {
return res[i][0].StartTime.Before(res[j][0].StartTime)
})

return resp.SpanRecordings, nil
return res, nil
}
70 changes: 41 additions & 29 deletions pkg/util/tracing/collector/collector_test.go
@@ -13,12 +13,12 @@ package collector_test
import (
"context"
"fmt"
"sort"
"testing"
"time"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/rpc/nodedialer"
"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
"github.com/cockroachdb/cockroach/pkg/util/leaktest"
@@ -129,51 +129,63 @@ func TestTracingCollectorGetSpanRecordings(t *testing.T) {
localTraceID, remoteTraceID, cleanup := setupTraces(localTracer, remoteTracer)
defer cleanup()

getSpansFromAllNodes := func(traceID uint64) tracing.Recording {
res := make(tracing.Recording, 0)
getSpansFromAllNodes := func(traceID uint64) map[roachpb.NodeID][]tracing.Recording {
res := make(map[roachpb.NodeID][]tracing.Recording)

var iter *collector.Iterator
for iter = traceCollector.StartIter(ctx, traceID); iter.Valid(); iter.Next() {
res = append(res, iter.Value())
nodeID, recording := iter.Value()
res[nodeID] = append(res[nodeID], recording)
}
require.NoError(t, iter.Error())

sort.SliceStable(res, func(i, j int) bool {
return res[i].StartTime.Before(res[j].StartTime)
})
return res
}

t.Run("fetch-local-recordings", func(t *testing.T) {
recordedSpan := getSpansFromAllNodes(localTraceID)
require.NoError(t, tracing.TestingCheckRecordedSpans(recordedSpan, `
span: root
tags: _unfinished=1 _verbose=1
event: structured=root
span: root.child
nodeRecordings := getSpansFromAllNodes(localTraceID)
node1Recordings := nodeRecordings[roachpb.NodeID(1)]
require.Equal(t, 1, len(node1Recordings))
require.NoError(t, tracing.TestingCheckRecordedSpans(node1Recordings[0], `
span: root
tags: _unfinished=1 _verbose=1
span: root.child.remotechild
event: structured=root
span: root.child
tags: _unfinished=1 _verbose=1
event: structured=root.child.remotechild
span: root.child.remotechilddone
tags: _verbose=1
`))
span: root.child.remotechilddone
tags: _verbose=1
`))
node2Recordings := nodeRecordings[roachpb.NodeID(2)]
require.Equal(t, 1, len(node2Recordings))
require.NoError(t, tracing.TestingCheckRecordedSpans(node2Recordings[0], `
span: root.child.remotechild
tags: _unfinished=1 _verbose=1
event: structured=root.child.remotechild
`))
})

// The traceCollector is running on node 1, so most of the recordings for this
// subtest will be passed back by node 2 over RPC.
t.Run("fetch-remote-recordings", func(t *testing.T) {
recordedSpan := getSpansFromAllNodes(remoteTraceID)
require.NoError(t, tracing.TestingCheckRecordedSpans(recordedSpan, `
span: root2
tags: _unfinished=1 _verbose=1
event: structured=root2
span: root2.child
nodeRecordings := getSpansFromAllNodes(remoteTraceID)
node1Recordings := nodeRecordings[roachpb.NodeID(1)]
require.Equal(t, 2, len(node1Recordings))
require.NoError(t, tracing.TestingCheckRecordedSpans(node1Recordings[0], `
span: root2.child.remotechild
tags: _unfinished=1 _verbose=1
span: root2.child.remotechild
tags: _unfinished=1 _verbose=1
span: root2.child.remotechild2
`))
require.NoError(t, tracing.TestingCheckRecordedSpans(node1Recordings[1], `
span: root2.child.remotechild2
tags: _unfinished=1 _verbose=1
`))

node2Recordings := nodeRecordings[roachpb.NodeID(2)]
require.Equal(t, 1, len(node2Recordings))
require.NoError(t, tracing.TestingCheckRecordedSpans(node2Recordings[0], `
span: root2
tags: _unfinished=1 _verbose=1
event: structured=root2
span: root2.child
tags: _unfinished=1 _verbose=1
`))
`))
})
}
4 changes: 2 additions & 2 deletions pkg/util/tracing/grpc_interceptor.go
@@ -421,6 +421,6 @@ func (cs *tracingClientStream) CloseSend() error {
return err
}

// Recording represents a group of RecordedSpans, as returned by GetRecording.
// Spans are sorted by StartTime.
// Recording represents a group of RecordedSpans rooted at a fixed root span, as
// returned by GetRecording. Spans are sorted by StartTime.
type Recording []tracingpb.RecordedSpan
16 changes: 11 additions & 5 deletions pkg/util/tracing/service/service.go
@@ -41,16 +41,22 @@ func New(tracer *tracing.Tracer) *Service {
}

// GetSpanRecordings implements the tracingpb.TraceServer interface.
//
// This method iterates over all active root spans registered with the nodes'
// local inflight span registry, and returns a tracing.Recording for each root
// span with a matching trace_id.
func (s *Service) GetSpanRecordings(
_ context.Context, request *tracingservicepb.SpanRecordingRequest,
) (*tracingservicepb.SpanRecordingResponse, error) {
var resp tracingservicepb.SpanRecordingResponse
_ context.Context, request *tracingservicepb.GetSpanRecordingsRequest,
) (*tracingservicepb.GetSpanRecordingsResponse, error) {
var resp tracingservicepb.GetSpanRecordingsResponse
err := s.tracer.VisitSpans(func(span *tracing.Span) error {
if span.TraceID() != request.TraceID {
return nil
}
for _, rec := range span.GetRecording() {
resp.SpanRecordings = append(resp.SpanRecordings, rec)
recording := span.GetRecording()
if recording != nil {
resp.Recordings = append(resp.Recordings,
tracingservicepb.GetSpanRecordingsResponse_Recording{RecordedSpans: recording})
}
return nil
})
20 changes: 14 additions & 6 deletions pkg/util/tracing/service/service_test.go
@@ -56,17 +56,25 @@ func TestTracingServiceGetSpanRecordings(t *testing.T) {

ctx := context.Background()
s := New(tracer1)
resp, err := s.GetSpanRecordings(ctx, &tracingservicepb.SpanRecordingRequest{TraceID: traceID1})
resp, err := s.GetSpanRecordings(ctx, &tracingservicepb.GetSpanRecordingsRequest{TraceID: traceID1})
require.NoError(t, err)
sort.SliceStable(resp.SpanRecordings, func(i, j int) bool {
return resp.SpanRecordings[i].StartTime.Before(resp.SpanRecordings[j].StartTime)
// We expect two Recordings.
// 1. root1, root1.child
// 2. fork1
require.Equal(t, 2, len(resp.Recordings))
// Sort the response based on the start time of the root spans in the
// recordings.
sort.SliceStable(resp.Recordings, func(i, j int) bool {
return resp.Recordings[i].RecordedSpans[0].StartTime.Before(resp.Recordings[j].RecordedSpans[0].StartTime)
})
require.NoError(t, tracing.TestingCheckRecordedSpans(resp.SpanRecordings, `
require.NoError(t, tracing.TestingCheckRecordedSpans(resp.Recordings[0].RecordedSpans, `
span: root1
tags: _unfinished=1 _verbose=1
span: root1.child
tags: _unfinished=1 _verbose=1
span: fork1
tags: _unfinished=1 _verbose=1
`))
require.NoError(t, tracing.TestingCheckRecordedSpans(resp.Recordings[1].RecordedSpans, `
span: fork1
tags: _unfinished=1 _verbose=1
`))
}