Fully-implemented, resource-aware `conduit stat` #627

olix0r · 2018-03-26T22:01:48Z

Tracking ticket for a fully-implemented, resource-aware conduit stat command.

Tasks

RFC

conduit help stat

Display traffic stats about one or many resources.

Valid resource types include: 

  * all
  * daemonsets (aka 'ds')
  * deployments (aka 'deploy')
  * jobs
  * namespaces (aka 'ns')
  * nodes (aka 'no')
  * pods (aka 'po')
  * replicasets (aka 'rs')
  * replicationcontrollers (aka 'rc')
  * services (aka 'svc')
  * statefulsets

This command will hide resources that have completed, such as pods that are in the Succeeded or Failed phases.

Examples:
  # Stat all pods.
  conduit stat pods

  # Stat a single replication controller with specified NAME.
  conduit stat replicationcontroller web

  # Stat a single pod in the "walrus" namespace.
  conduit -n walrus stat web-pod-13je7

  # Stat a pod identified by type and name specified in "pod.yaml".
  conduit stat -f pod.yaml

  # Stat all replication controllers and services together.
  conduit stat rc,services

  # Stat one or more resources by their type and names.
  conduit stat rc/web service/frontend pods/web-pod-13je7

  # Stat all resources with different types.
  conduit stat all

  # Show stats .
  conduit stat all

Options:
      --all-namespaces=false: If present, list the requested object(s) across all namespaces. Namespace in current context is ignored even if specified with --namespace.
  -E, --end-time='': If present, sets the end of the measured window to the specified time.
  -f, --filename=[]: Filename, directory, or URL to files identifying the resource to get from a server.
      --from='': If present, restricts stats to outbound traffic originating from the specified resource.
      --from-namespace='': Sets the namespace used to lookup the '--from' resource. By default the current '--namespace' is used.
      --from-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
      --to='': If present, restricts stats to outbound traffic destined for the specified resource.
      --to-namespace='': Sets the namespace used to lookup the '--to' resource. By default the current '--namespace' is used.
      --to-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
      --no-headers=false: When using the default or custom-column output format, don't print headers (default print headers).
  -l, --selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
  -w, --window='10s': Time period over which stats are measured.

Usage:
  conduit stat (TYPE [NAME | -l label] | TYPE/NAME ...) [flags] [options]

Use "conduit options" for a list of global command-line options (applies to all commands).
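As a rough illustration of the resource syntax in the help text above, the aliases and the `TYPE/NAME` form could be resolved along these lines. This is only a sketch, not Conduit's parser: the function names `canonical_type` and `split_resource` are hypothetical, and accepting singular type names is an assumption inferred from the examples (`replicationcontroller web`, `service/frontend`).

```shell
# canonical_type TYPE: print the plural resource type for a name or alias.
# Aliases come from the help text; singular names are an assumption based on
# examples such as "replicationcontroller web" and "service/frontend".
canonical_type() {
  case $1 in
    all)                            echo all ;;
    ds|daemonset|daemonsets)        echo daemonsets ;;
    deploy|deployment|deployments)  echo deployments ;;
    job|jobs)                       echo jobs ;;
    ns|namespace|namespaces)        echo namespaces ;;
    no|node|nodes)                  echo nodes ;;
    po|pod|pods)                    echo pods ;;
    rs|replicaset|replicasets)      echo replicasets ;;
    rc|replicationcontroller|replicationcontrollers)
                                    echo replicationcontrollers ;;
    svc|service|services)           echo services ;;
    statefulset|statefulsets)       echo statefulsets ;;
    *) echo "unknown resource type: $1" >&2; return 1 ;;
  esac
}

# split_resource ARG: print "TYPE" or "TYPE NAME" for "TYPE" or "TYPE/NAME".
# The body runs in a subshell so its variables do not leak into the caller.
split_resource() (
  arg=$1
  case $arg in
    */*) kind=${arg%%/*}; name=${arg#*/} ;;
    *)   kind=$arg; name= ;;
  esac
  kind=$(canonical_type "$kind") || exit 1
  printf '%s%s\n' "$kind" "${name:+ $name}"
)

split_resource rc/web               # prints "replicationcontrollers web"
split_resource pods/web-pod-13je7   # prints "pods web-pod-13je7"
split_resource svc                  # prints "services"
```

A comma-separated argument such as `rc,services` would need to be split on commas before each piece is passed to `split_resource`.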

Example output:

:; conduit stat -n emojivoto all
NAME           MESHED      SUCCESS      RPS      LATENCY_P50      LATENCY_P95
deploy/emoji      1/1      100.00%      3.8              0ms             26ms
deploy/voting     1/1       70.00%      2.0              0ms              1ms
deploy/web        1/1       89.47%      3.8              6ms             38ms

NAME                     MESHED      SUCCESS     RPS     LATENCY_P50      LATENCY_P95
rs/emoji-59985cd65d         0/0            -       -               -                -
rs/emoji-64fdc8cd98         0/0            -       -               -                -
rs/emoji-6d9cf85cbb         1/1      100.00%     3.8             0ms             26ms
rs/vote-bot-6b5748dfbb      0/0            -       -               -                -
rs/vote-bot-6cc86f778d      0/0            -       -               -                -
rs/vote-bot-76cd4898c4      1/1      100.00%     0.0             0ms              0ms
rs/voting-678b88c5c6        0/0            -       -               -                -
rs/voting-6d4c959d48        0/0            -       -               -                -
rs/voting-76f6445c6         1/1       70.00%     2.0             0ms              1ms
rs/web-585f896bdd           0/0            -       -               -                -
rs/web-75c487dbc6           1/1       89.47%     3.8             6ms             38ms
rs/web-88df8cb76            0/0            -       -               -                -

NAME                              SUCCESS      RPS      LATENCY_P50      LATENCY_P95
po/emoji-6d9cf85cbb-m9sdm         100.00%      3.8              0ms             26ms
po/vote-bot-76cd4898c4-7wx2p      100.00%      0.0              0ms              9ms
po/voting-76f6445c6-xrqxv          70.00%      2.0              0ms              1ms
po/web-75c487dbc6-78bmm            89.47%      3.8              6ms             38ms

NAME               SUCCESS      RPS      LATENCY_P50      LATENCY_P95
svc/emoji-svc      100.00%      3.8              0ms             26ms
svc/voting-svc      70.00%      2.0              0ms              1ms
svc/web-svc         89.47%      3.8              6ms             38ms

:; conduit stat -n emojivoto deploy/voting --from=deploy
NAME            FROM             SUCCESS       RPS       LATENCY_P50       LATENCY_P95
deploy/voting   deploy/web        70.00%       2.0               0ms               1ms

:; conduit stat -n emojivoto deploy/web --to=deploy
NAME         TO                  SUCCESS       RPS       LATENCY_P50       LATENCY_P95
deploy/web   deploy/emoji        100.00%       3.8               0ms              26ms
deploy/web   deploy/voting        89.47%       3.8               6ms              38ms

@adleong @siggy


siggy · 2018-03-27T20:04:29Z

Some examples to validate the query patterns described above:

conduit stat -n emojivoto deploy/emoji

# Deployment => ReplicaSet selector
selector=$(kubectl -n emojivoto get deploy/emoji -o json | jq -r '.spec.selector.matchLabels | to_entries[] | .key+"="+.value')

# ReplicaSet => pod-template-hash
# select ReplicaSet from Deployment selector, verify ownerReference back to Deployment, parse out pod-template-hash
podtemplatehash=$(kubectl -n emojivoto get rs -l $selector -o json | jq -r '.items[] | select(.metadata.ownerReferences[].controller == true and .metadata.ownerReferences[].kind == "Deployment" and .metadata.ownerReferences[].name == "emoji") | .spec.selector.matchLabels."pod-template-hash"')

# pod-template-hash => Prometheus request volume query
curl "http://localhost:9090/api/v1/query?query=sum(request_total%7Bpod_template_hash%3D%22$podtemplatehash%22%7D)"
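The query string in the `curl` call above is percent-encoded by hand. A minimal sketch of building the same URL programmatically is shown below. The helper names `urlencode` and `request_total_url` are hypothetical, the encoder assumes ASCII input, and it also encodes the parentheses, which Prometheus should accept equally.

```shell
# urlencode STRING: percent-encode everything except RFC 3986 unreserved
# characters. ASCII input is assumed. The body runs in a subshell so its
# variables do not leak into the caller.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# request_total_url HASH: build the request-volume query URL for one
# pod-template-hash. localhost:9090 is taken from the curl example above.
request_total_url() (
  query="sum(request_total{pod_template_hash=\"$1\"})"
  printf 'http://localhost:9090/api/v1/query?query=%s\n' "$(urlencode "$query")"
)

request_total_url 6d9cf85cbb
# prints http://localhost:9090/api/v1/query?query=sum%28request_total%7Bpod_template_hash%3D%226d9cf85cbb%22%7D%29
```

Alternatively, `curl -G --data-urlencode "query=..."` leaves the encoding to curl itself.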

pcalcado · 2018-03-29T21:16:07Z

@olix0r Overall LGTM, the only question I have is about the different resource types. So far we have only dealt with Deployments, but this has a whole lot of different types. Is this meant to be forward-looking? What does MESHED mean for a Service or Job?

* Define a new telemetry Stat API: proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627. StatSummary will replace Stat once implemented, and the original Stat will be deleted.

klingerf · 2018-04-04T17:33:37Z

This is a follow-up from my comment on #663.

@olix0r Can you provide a bit more clarity on the --out-* set of flags?

The description for --out-from says "If present, restricts inbound stats to the traffic originating from the specified resource." But as far as I can tell from this example, setting the --out-from flag causes us to return outbound stats, not inbound stats:

:; conduit stat -n emojivoto deploy/voting --out-from=deploy
NAME            OUT_FROM     OUT_SUCCESS   OUT_RPS   OUT_LATENCY_P50   OUT_LATENCY_P95
deploy/voting   deploy/web        70.00%       2.0               0ms               1ms

The description for --out-from-namespace says "Sets the namespace used to lookup the '--in-from' resource. By default the current '--namespace' is used.". But there is no --in-from flag listed in the help output. Maybe all of the --out-from* flags should be renamed --in-from*?

Based on the examples, it looks like the stat command returns inbound stats by default, but setting any of the --out-* flags causes us to return outbound stats instead, with some type of filtering applied? Is that a correct understanding? Is it possible to filter inbound stats in any way?

I'm not super familiar with the intended use cases, but whether or not we return inbound or outbound stats feels like a binary flag to me, whereas I would expect for us to be able to add filtering regardless of whether or not we're returning inbound or outbound stats.

olix0r · 2018-04-04T23:19:28Z

@klingerf

> Can you provide a bit more clarity on the --out-* set of flags?

The intention is that these are both outbound. --out-from should look at the outbound stats from the specified resource where the dst is the main resource type. --out-to should look at the outbound stats from the main resource to the specified dst resource.

> The description for --out-from says "If present, restricts inbound stats to the traffic originating from the specified resource." But as far as I can tell from this example, setting the --out-from flag causes us to return outbound stats, not inbound stats:

That's probably copypasta

> The description for --out-from-namespace says "Sets the namespace used to lookup the '--in-from' resource. By default the current '--namespace' is used.". But there is no --in-from flag listed in the help output. Maybe all of the --out-from* flags should be renamed --in-from*?

Definitely copypasta

> Based on the examples, it looks like the stat command returns inbound stats by default, but setting any of the --out-* flags causes us to return outbound stats instead, with some type of filtering applied? Is that a correct understanding? Is it possible to filter inbound stats in any way?

Correct. The main resource -- with no outs specified -- selects inbound stats. Because we don't have source annotations on inbound stats, no further filtering is supported on inbound.
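To make the three cases concrete, here is a minimal sketch of how the main resource and the from/to filters might translate into a Prometheus label selector for deployments. The label names `direction`, `deployment` and `dst_deployment` are assumptions for illustration, as is the function name `stat_selector`; the thread does not spell them out.

```shell
# stat_selector MAIN [from|to OTHER]
# Prints a Prometheus label selector for deployment-level stats.
# ASSUMPTION: the label names direction, deployment and dst_deployment are
# illustrative only; they are not named anywhere in this thread.
stat_selector() (
  main=$1
  mode=${2:-}
  other=${3:-}
  case $mode in
    # No filter: inbound stats for the main resource.
    '')   printf '{direction="inbound",deployment="%s"}\n' "$main" ;;
    # from: outbound stats of OTHER whose destination is the main resource.
    from) printf '{direction="outbound",deployment="%s",dst_deployment="%s"}\n' "$other" "$main" ;;
    # to: outbound stats of the main resource whose destination is OTHER.
    to)   printf '{direction="outbound",deployment="%s",dst_deployment="%s"}\n' "$main" "$other" ;;
    *)    echo "unknown filter: $mode" >&2; exit 1 ;;
  esac
)

stat_selector voting            # prints {direction="inbound",deployment="voting"}
stat_selector voting from web   # prints {direction="outbound",deployment="web",dst_deployment="voting"}
stat_selector web to emoji      # prints {direction="outbound",deployment="web",dst_deployment="emoji"}
```

Both filtered cases read outbound metrics, which is why no further filtering is possible on inbound stats without source annotations.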

klingerf · 2018-04-04T23:55:03Z

That all makes sense, thanks. With that in mind, I'd like to recommend that we revise the inbound filtering flags as follows:

      --from='': If present, restricts stats to the outbound traffic originating from the specified resource.
      --from-namespace='': Sets the namespace used to lookup the '--from' resource. By default the current '--namespace' is used.
      --from-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)

And the outbound filtering as follows:

      --to='': If present, restricts stats to the outbound traffic destined for the specified resource.
      --to-namespace='': Sets the namespace used to lookup the '--to' resource. By default the current '--namespace' is used.
      --to-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)

Personally I'd also drop all of the IN_ and OUT_ prefixes from the column headers as well, since I think that makes things more confusing in that context. What do you think?

olix0r · 2018-04-04T23:58:51Z

> Personally I'd also drop all of the IN_ and OUT_ prefixes from the column headers as well, since I think that makes things more confusing in that context. What do you think?

This seems like a fine simplification for now. We may find a need to add it later, but simple is good.

klingerf · 2018-04-05T01:23:31Z

Ok, have updated accordingly.


Start implementing new conduit stat summary endpoint.

Changes the public-api to call Prometheus directly instead of the telemetry service. Wired through to `api/stat` on the web server, as well as `conduit statsummary` on the CLI. Works for deployments only. Current implementation just retrieves requests and mesh/total pod count (so latency stats are always 0).

Uses the API defined in #663. Example queries the stat endpoint will eventually satisfy are in #627.

This branch includes commits from @klingerf.

* run ./bin/dep ensure
* run ./bin/update-go-deps-shas

* Add namespace as a resource type in public-api

  The cli and public-api only supported deployments as a resource type. This change adds support for namespace as a resource type in the cli and public-api.

  This change also includes:
  - cli statsummary now prints `-`'s when objects are not in the mesh
  - cli statsummary prints `No resources found.` when applicable
  - removed `out-` from cli statsummary flags, and analogous proto changes
  - switched public-api to use native prometheus label types
  - misc error handling and logging fixes

  Part of #627

  Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* Refactor filter and groupby label formulation

  Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Rename stat_summary.go to stat.go in cli

  Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Update rbac privileges for namespace stats

  Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

grampelberg · 2018-06-25T18:02:45Z

@klingerf what's still missing from this issue?

klingerf · 2018-06-26T17:00:26Z

@grampelberg I'd be fine with closing this, but I don't think we have issues tracking the remaining resources from the description for which we don't have stats. Specifically, we aren't exposing stats yet for:

  * daemonsets (aka 'ds')
  * jobs
  * nodes (aka 'no')
  * replicasets (aka 'rs')
  * statefulsets

grampelberg · 2018-06-26T17:04:39Z

I really like how @siggy has been tracking progress for #420. Maybe we should just do a task list? I was mostly interested in what was still outstanding.

klingerf · 2018-06-26T17:34:12Z

@grampelberg Works for me -- have updated the title and description accordingly and included a task list.

grampelberg · 2018-06-26T18:21:02Z

@klingerf you're a gentleman and a scholar. Thank you!

stale · 2018-10-07T22:47:08Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Filtering by Kubernetes job was not supported. Also filtering by any unknown type caused a panic.

Add filtering support by Kubernetes job, with special case mapping `job` to `k8s_job`, to not conflict with Prometheus' job label.

Fix panic when unknown type specified as a `--from` or `--to` flag.

Fix `job` label from `linkerd-proxy` overwriting Prometheus `job` label at collection time. This caused all metrics collected by proxy sidecars in Kubernetes jobs to be collected into an incorrect Prometheus job, rather than the expected `linkerd-proxy` Prometheus job.

Fix `unsupported resource type` tap error message incorrectly printing the target resource rather than the destination.

Set `--controller-log-level debug` in `install_test.go` for easier debugging.

Expose `slow-cooker`'s metrics via a k8s service in the tap integration test, to validate proxy requests with a job as destination.

Fixes #1872
Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
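The `job` special case and the error-instead-of-panic behaviour described in this commit message can be sketched as follows. Only the `job` to `k8s_job` mapping comes from the message; the function name `prom_label` and the other label names are assumptions for illustration, and the real change lives in Go code not shown here.

```shell
# prom_label TYPE: print the Prometheus label used to filter by a resource type.
# Only the job -> k8s_job mapping is taken from the commit message; the other
# label names are assumed for illustration.
prom_label() {
  case $1 in
    # "job" is already a built-in Prometheus label, so Kubernetes jobs get
    # their own label name to avoid the collision.
    job) echo k8s_job ;;
    deployment|namespace|pod) echo "$1" ;;
    # Unknown types produce an error the caller can handle, not a crash.
    *) echo "unsupported resource type: $1" >&2; return 1 ;;
  esac
}

prom_label job          # prints "k8s_job"
prom_label deployment   # prints "deployment"
prom_label widget || echo "handled unknown type"
```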

olix0r self-assigned this Mar 26, 2018

olix0r added this to the 0.4.0 milestone Mar 26, 2018

olix0r added area/cli priority/P1 Planned for Release area/telemetry labels Mar 26, 2018

This was referenced Mar 27, 2018

Clean up Prometheus labels scraped from proxy #633

Merged

Figure out success rate classification in telemetry #634

Closed

rmars mentioned this issue Apr 2, 2018

Define a new telemetry Stat API #663

Merged

rmars mentioned this issue Apr 4, 2018

Start implementing conduit stat summary endpoint #671

Merged

rmars mentioned this issue Apr 5, 2018

Add kubectl get-style resource parsing to conduit stat CLI command #683

Closed

klingerf mentioned this issue Apr 11, 2018

Add latency stats in new stat summary endpoint #737

Merged

siggy mentioned this issue Apr 13, 2018

Add namespace as a resource type in public-api #760

Merged

olix0r removed this from the 0.4.0 milestone Apr 16, 2018

olix0r added this to the 0.4.1 milestone Apr 16, 2018

This was referenced Apr 18, 2018

cli: standardize kubernetes resource parsing #792

Closed

service support in stat command #805

Closed

siggy modified the milestones: 0.4.1, 0.5.0 Apr 25, 2018

siggy mentioned this issue May 8, 2018

Adding statefulsets to inject #910

Merged

olix0r removed their assignment May 14, 2018

klingerf removed this from the 0.5.0 milestone May 29, 2018

klingerf changed the title ~~rfc: resource-aware conduit stat~~ Fully-implemented, resource-aware conduit stat command Jun 26, 2018

klingerf changed the title ~~Fully-implemented, resource-aware conduit stat command~~ Fully-implemented, resource-aware conduit stat Jun 26, 2018

stale bot added the wontfix label Oct 7, 2018

stale bot closed this as completed Oct 23, 2018

siggy mentioned this issue Dec 1, 2018

Add filtering by job in stat, tap, top; fix panic #1904

Merged

klingerf mentioned this issue Dec 17, 2018

Wire up stats and dashboards for StatefulSets #1983

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021
