Fully-implemented, resource-aware conduit stat #627

Closed
7 of 12 tasks
olix0r opened this issue Mar 26, 2018 · 13 comments

Comments

@olix0r
Member

olix0r commented Mar 26, 2018

Tracking ticket for a fully-implemented, resource-aware conduit stat command.

Tasks

RFC

conduit help stat

Display traffic stats about one or many resources.

Valid resource types include: 

  * all
  * daemonsets (aka 'ds')
  * deployments (aka 'deploy')
  * jobs
  * namespaces (aka 'ns')
  * nodes (aka 'no')
  * pods (aka 'po')
  * replicasets (aka 'rs')
  * replicationcontrollers (aka 'rc')
  * services (aka 'svc')
  * statefulsets

This command will hide resources that have completed, such as pods that are in the Succeeded or Failed phases.

Examples:
  # Stat all pods.
  conduit stat pods

  # Stat a single replication controller with specified NAME.
  conduit stat replicationcontroller web

  # Stat a single pod in the "walrus" namespace.
  conduit -n walrus stat web-pod-13je7

  # Stat a pod identified by type and name specified in "pod.yaml".
  conduit stat -f pod.yaml

  # Stat all replication controllers and services together.
  conduit stat rc,services

  # Stat one or more resources by their type and names.
  conduit stat rc/web service/frontend pods/web-pod-13je7

  # Stat all resources with different types.
  conduit stat all

  # Show stats .
  conduit stat all

Options:
      --all-namespaces=false: If present, list the requested object(s) across all namespaces. Namespace in current context is ignored even if specified with --namespace.
  -E, --end-time='': If present, sets the end of the measured window to the specified time.
  -f, --filename=[]: Filename, directory, or URL to files identifying the resource to get from a server.
      --from='': If present, restricts stats to outbound traffic originating from the specified resource.
      --from-namespace='': Sets the namespace used to lookup the '--from' resource. By default the current '--namespace' is used.
      --from-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
      --to='': If present, restricts stats to outbound traffic destined for the specified resource.
      --to-namespace='': Sets the namespace used to lookup the '--to' resource. By default the current '--namespace' is used.
      --to-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
      --no-headers=false: When using the default or custom-column output format, don't print headers (default print headers).
  -l, --selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)
  -w, --window='10s': Time period over which stats are measured.

Usage:
  conduit stat (TYPE [NAME | -l label] | TYPE/NAME ...) [flags] [options]

Use "conduit options" for a list of global command-line options (applies to all commands).
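
The type aliases and the TYPE/NAME form above could be normalized along these lines. This is only a sketch: `canonical_type` and `parse_resource` are hypothetical helper names, not part of the conduit CLI, and accepting singular forms is an assumption inferred from the examples (`replicationcontroller web`, `service/frontend`).

```shell
# Hypothetical helper: map a resource type or alias to the canonical
# plural name listed in the help text. Singular forms are an assumption.
canonical_type() {
  case "$1" in
    all) echo all ;;
    ds|daemonset|daemonsets) echo daemonsets ;;
    deploy|deployment|deployments) echo deployments ;;
    job|jobs) echo jobs ;;
    ns|namespace|namespaces) echo namespaces ;;
    no|node|nodes) echo nodes ;;
    po|pod|pods) echo pods ;;
    rs|replicaset|replicasets) echo replicasets ;;
    rc|replicationcontroller|replicationcontrollers) echo replicationcontrollers ;;
    svc|service|services) echo services ;;
    statefulset|statefulsets) echo statefulsets ;;
    *) echo "unknown resource type: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: normalize a TYPE or TYPE/NAME argument.
# The body runs in a subshell so its variables stay local.
parse_resource() (
  arg="$1"
  rtype="${arg%%/*}"
  name=""
  if [ "$rtype" != "$arg" ]; then
    name="${arg#*/}"
  fi
  rtype="$(canonical_type "$rtype")" || exit 1
  if [ -n "$name" ]; then
    printf '%s/%s\n' "$rtype" "$name"
  else
    printf '%s\n' "$rtype"
  fi
)
```

For example, `parse_resource rc/web` prints `replicationcontrollers/web`, and an unknown type returns a non-zero status instead of being passed through.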

Example output:

:; conduit stat -n emojivoto all
NAME           MESHED      SUCCESS      RPS      LATENCY_P50      LATENCY_P95
deploy/emoji      1/1      100.00%      3.8              0ms             26ms
deploy/voting     1/1       70.00%      2.0              0ms              1ms
deploy/web        1/1       89.47%      3.8              6ms             38ms

NAME                     MESHED      SUCCESS     RPS     LATENCY_P50      LATENCY_P95
rs/emoji-59985cd65d         0/0            -       -               -                -
rs/emoji-64fdc8cd98         0/0            -       -               -                -
rs/emoji-6d9cf85cbb         1/1      100.00%     3.8             0ms             26ms
rs/vote-bot-6b5748dfbb      0/0            -       -               -                -
rs/vote-bot-6cc86f778d      0/0            -       -               -                -
rs/vote-bot-76cd4898c4      1/1       100.0%     0.0             0ms              0ms
rs/voting-678b88c5c6        0/0            -       -               -                -
rs/voting-6d4c959d48        0/0            -       -               -                -
rs/voting-76f6445c6         1/1       70.00%     2.0             0ms              1ms
rs/web-585f896bdd           0/0            -       -               -                -
rs/web-75c487dbc6           1/1       89.47%     3.8             6ms             38ms
rs/web-88df8cb76            0/0            -       -               -                -

NAME                              SUCCESS      RPS      LATENCY_P50      LATENCY_P95
po/emoji-6d9cf85cbb-m9sdm         100.00%      3.8              0ms             26ms
po/vote-bot-76cd4898c4-7wx2p      100.00%      0.0              0ms              9ms
po/voting-76f6445c6-xrqxv          70.00%      2.0              0ms              1ms
po/web-75c487dbc6-78bmm            89.47%      3.8              6ms             38ms

NAME               SUCCESS      RPS      LATENCY_P50      LATENCY_P95
svc/emoji-svc      100.00%      3.8              0ms             26ms
svc/voting-svc      70.00%      2.0              0ms              1ms
svc/web-svc         89.47%      3.8              6ms             38ms
:; conduit stat -n emojivoto deploy/voting --from=deploy
NAME            FROM             SUCCESS       RPS       LATENCY_P50       LATENCY_P95
deploy/voting   deploy/web        70.00%       2.0               0ms               1ms
:; conduit stat -n emojivoto deploy/web --to=deploy
NAME         TO                  SUCCESS       RPS       LATENCY_P50       LATENCY_P95
deploy/web   deploy/emoji        100.00%       3.8               0ms              26ms
deploy/web   deploy/voting        89.47%       3.8               6ms              38ms
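
In the output above, the `rs/` rows with `0/0` under MESHED show `-` in every stat column. A minimal sketch of that formatting rule follows. `format_row` is a hypothetical name, and treating "zero meshed pods" as "not in the mesh" is an assumption, not the confirmed conduit logic.

```shell
# Hypothetical row formatter, not the conduit implementation.
# Arguments: name meshed total success rps p50 p95
format_row() {
  if [ "$2" -eq 0 ]; then
    # Assumption: zero meshed pods means the resource is not in the mesh,
    # so every stat column is printed as "-".
    printf '%s %s/%s - - - -\n' "$1" "$2" "$3"
  else
    printf '%s %s/%s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
  fi
}
```

Column alignment is left out to keep the sketch short; the real CLI pads each column.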

@adleong @siggy

@olix0r olix0r self-assigned this Mar 26, 2018
@olix0r olix0r added this to the 0.4.0 milestone Mar 26, 2018
@siggy
Member

siggy commented Mar 27, 2018

Some examples to validate the query patterns described above:

conduit stat -n emojivoto deploy/emoji

# Deployment => ReplicaSet selector
selector=$(kubectl -n emojivoto get deploy/emoji -o json | jq -r '.spec.selector.matchLabels | to_entries | map(.key+"="+.value) | join(",")')

# ReplicaSet => pod-template-hash
# select ReplicaSet from Deployment selector, verify ownerReference back to Deployment, parse out pod-template-hash
podtemplatehash=$(kubectl -n emojivoto get rs -l "$selector" -o json | jq -r '.items[] | select(any(.metadata.ownerReferences[]?; .controller == true and .kind == "Deployment" and .name == "emoji")) | .spec.selector.matchLabels."pod-template-hash"')

# pod-template-hash => Prometheus request volume query
curl "http://localhost:9090/api/v1/query?query=sum(request_total%7Bpod_template_hash%3D%22$podtemplatehash%22%7D)"
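
The hand-encoded URL in the last step can be avoided by letting curl do the percent-encoding. A sketch, assuming the same `request_total` metric, `pod_template_hash` label, and `localhost:9090` address as the example above; `request_volume_query` and `query_prometheus` are hypothetical helper names.

```shell
# Build the PromQL expression for the request volume of one pod-template-hash.
request_volume_query() {
  printf 'sum(request_total{pod_template_hash="%s"})\n' "$1"
}

# Run the query against Prometheus. With -G, curl appends the
# --data-urlencode value to the URL as an encoded query string.
query_prometheus() {
  curl -sG "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=$(request_volume_query "$1")"
}
```

Usage would be `query_prometheus "$podtemplatehash"`, which sends the same query as the curl line above without the manual `%7B`/`%22` escapes.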

@pcalcado
Contributor

@olix0r Overall LGTM, the only question I have is about the different resource types. So far we have only dealt with Deployments, but this has a whole lot of different types. Is this meant to be forward-looking? What does MESHED mean for a Service or Job?

rmars added a commit that referenced this issue Apr 3, 2018
* Define a new telemetry Stat API

Proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627.
StatSummary will replace Stat once implemented and the original Stat deleted.
@klingerf
Contributor

klingerf commented Apr 4, 2018

This is a follow-up from my comment on #663.

@olix0r Can you provide a bit more clarity on the --out-* set of flags?

The description for --out-from says "If present, restricts inbound stats to the the traffic originating from the specified resource." But as far as I can tell from this example, setting the --out-from flag causes us to return outbound stats, not inbound stats:

:; conduit stat -n emojivoto deploy/voting --out-from=deploy
NAME            OUT_FROM     OUT_SUCCESS   OUT_RPS   OUT_LATENCY_P50   OUT_LATENCY_P95
deploy/voting   deploy/web        70.00%       2.0               0ms               1ms

The description for --out-from-namespace says "Sets the namespace used to lookup the '--in-from' resource. By default the current '--namespace' is used.". But there is no --in-from flag listed in the help output. Maybe all of the --out-from* flags should be renamed --in-from*?

Based on the examples, it looks like the stat command returns inbound stats by default, but setting any of the --out-* flags causes us to return outbound stats instead, with some type of filtering applied? Is that a correct understanding? Is it possible to filter inbound stats in any way?

I'm not super familiar with the intended use cases, but whether or not we return inbound or outbound stats feels like a binary flag to me, whereas I would expect for us to be able to add filtering regardless of whether or not we're returning inbound or outbound stats.

@olix0r
Member Author

olix0r commented Apr 4, 2018

@klingerf

Can you provide a bit more clarity on the --out-* set of flags?

The intention is that both of these select outbound stats. --out-from should look at the outbound stats from the specified resource where the dst is the main resource type. --out-to should look at the outbound stats from the main resource to the specified dst resource.

The description for --out-from says "If present, restricts inbound stats to the the traffic originating from the specified resource." But as far as I can tell from this example, setting the --out-from flag causes us to return outbound stats, not inbound stats:

That's probably copypasta :hurtrealbad:

The description for --out-from-namespace says "Sets the namespace used to lookup the '--in-from' resource. By default the current '--namespace' is used.". But there is no --in-from flag listed in the help output. Maybe all of the --out-from* flags should be renamed --in-from*?

Definitely copypasta :finnadie:

Based on the examples, it looks like the stat command returns inbound stats by default, but setting any of the --out-* flags causes us to return outbound stats instead, with some type of filtering applied? Is that a correct understanding? Is it possible to filter inbound stats in any way?

Correct. The main resource -- with no outs specified -- selects inbound stats. Because we don't have source annotations on inbound stats, no further filtering is supported on inbound.
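
The selection rules described in this exchange can be summarized in a small sketch. `stat_direction` is a hypothetical helper for illustration only. Its second and third arguments stand for the `--out-from` and `--out-to` values discussed here, and rejecting both flags together is an assumption, since the thread does not cover that case.

```shell
# Hypothetical illustration of which stats a query selects.
# Arguments: main resource, "from" value (may be empty), "to" value (may be empty).
stat_direction() {
  if [ -n "$2" ] && [ -n "$3" ]; then
    # Assumption: combining both filters is not covered in this thread.
    echo "error: from and to filters given together" >&2
    return 1
  elif [ -n "$2" ]; then
    # Outbound stats reported by the "from" resource, destined for the main resource.
    echo "outbound src=$2 dst=$1"
  elif [ -n "$3" ]; then
    # Outbound stats reported by the main resource, destined for the "to" resource.
    echo "outbound src=$1 dst=$3"
  else
    # No filter: inbound stats of the main resource, with no source filtering.
    echo "inbound dst=$1"
  fi
}
```

For instance, `stat_direction deploy/voting deploy ""` prints `outbound src=deploy dst=deploy/voting`, matching the `deploy/voting --out-from=deploy` example quoted earlier.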

@klingerf
Contributor

klingerf commented Apr 4, 2018

That all makes sense, thanks. With that in mind, I'd like to recommend that we revise the inbound filtering flags as follows:

      --from='': If present, restricts stats to the outbound traffic originating from the specified resource.
      --from-namespace='': Sets the namespace used to lookup the '--from' resource. By default the current '--namespace' is used.
      --from-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)

And the outbound filtering as follows:

      --to='': If present, restricts stats to the outbound traffic destined for the specified resource.
      --to-namespace='': Sets the namespace used to lookup the '--to' resource. By default the current '--namespace' is used.
      --to-selector='': Selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)

Personally I'd also drop all of the IN_ and OUT_ prefixes from the column headers as well, since I think that makes things more confusing in that context. What do you think?

@olix0r
Member Author

olix0r commented Apr 4, 2018

Personally I'd also drop all of the IN_ and OUT_ prefixes from the column headers as well, since I think that makes things more confusing in that context. What do you think?

This seems like a fine simplification for now. We may find a need to add it later, but simple is good.

@klingerf
Contributor

klingerf commented Apr 5, 2018

Ok, have updated accordingly.

rmars added a commit that referenced this issue Apr 6, 2018
Start implementing new conduit stat summary endpoint. 
Changes the public-api to call prometheus directly instead of the
telemetry service. Wired through to `api/stat` on the web server,
as well as `conduit statsummary` on the CLI. Works for deployments only.

Current implementation just retrieves requests and mesh/total pod count 
(so latency stats are always 0). 

Uses API defined in #663
Example queries the stat endpoint will eventually satisfy in #627

This branch includes commits from @klingerf 

* run ./bin/dep ensure
* run ./bin/update-go-deps-shas
siggy added a commit that referenced this issue Apr 13, 2018
The cli and public-api only supported deployments as a resource type.

This change adds support for namespace as a resource type in the cli and
public-api. This change also includes:
- the cli statsummary command now prints `-`'s when objects are not in
  the mesh
- removed `out-` from cli statsummary flags, and analogous proto changes
- switched public-api to use native prometheus label types
- misc error handling and logging fixes

Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
siggy added a commit that referenced this issue Apr 13, 2018
The cli and public-api only supported deployments as a resource type.

This change adds support for namespace as a resource type in the cli and
public-api. This change also includes:
- cli statsummary now prints `-`'s when objects are not in the mesh
- cli statsummary prints `No resources found.` when applicable
- removed `out-` from cli statsummary flags, and analogous proto changes
- switched public-api to use native prometheus label types
- misc error handling and logging fixes

Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
klingerf pushed a commit that referenced this issue Apr 13, 2018
The cli and public-api only supported deployments as a resource type.

This change adds support for namespace as a resource type in the cli and
public-api. This change also includes:
- cli statsummary now prints `-`'s when objects are not in the mesh
- cli statsummary prints `No resources found.` when applicable
- removed `out-` from cli statsummary flags, and analogous proto changes
- switched public-api to use native prometheus label types
- misc error handling and logging fixes

Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
siggy added a commit that referenced this issue Apr 13, 2018
* Add namespace as a resource type in public-api

The cli and public-api only supported deployments as a resource type.

This change adds support for namespace as a resource type in the cli and
public-api. This change also includes:
- cli statsummary now prints `-`'s when objects are not in the mesh
- cli statsummary prints `No resources found.` when applicable
- removed `out-` from cli statsummary flags, and analogous proto changes
- switched public-api to use native prometheus label types
- misc error handling and logging fixes

Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* Refactor filter and groupby label formulation

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Rename stat_summary.go to stat.go in cli

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Update rbac privileges for namespace stats

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
@olix0r olix0r removed this from the 0.4.0 milestone Apr 16, 2018
@olix0r olix0r added this to the 0.4.1 milestone Apr 16, 2018
@siggy siggy modified the milestones: 0.4.1, 0.5.0 Apr 25, 2018
@olix0r olix0r removed their assignment May 14, 2018
@klingerf klingerf removed this from the 0.5.0 milestone May 29, 2018
@grampelberg
Contributor

@klingerf what's still missing from this issue?

@klingerf
Contributor

@grampelberg I'd be fine with closing this, but I don't think we have issues tracking the remaining resources from the description for which we don't have stats. Specifically, we aren't exposing stats yet for:

  * daemonsets (aka 'ds')
  * jobs
  * nodes (aka 'no')
  * replicasets (aka 'rs')
  * statefulsets

@grampelberg
Contributor

I really like how @siggy has been tracking progress for #420. Maybe we should just do a task list? I was mostly interested in what was still outstanding.

@klingerf klingerf changed the title rfc: resource-aware conduit stat Fully-implemented, resource-aware conduit stat command Jun 26, 2018
@klingerf klingerf changed the title Fully-implemented, resource-aware conduit stat command Fully-implemented, resource-aware conduit stat Jun 26, 2018
@klingerf
Contributor

@grampelberg Works for me -- have updated the title and description accordingly and included a task list.

@grampelberg
Contributor

@klingerf you're a gentleman and a scholar. Thank you!

@stale

stale bot commented Oct 7, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 7, 2018
@stale stale bot closed this as completed Oct 23, 2018
siggy added a commit that referenced this issue Dec 1, 2018
Filtering by Kubernetes job was not supported. Also filtering by any unknown
type caused a panic.

Add filtering support by Kubernetes job, with special case mapping `job` to
`k8s_job`, to not conflict with Prometheus' job label.

Fix panic when unknown type specified as a `--from` or `--to` flag.

Fix `job` label from `linkerd-proxy` overwriting Prometheus `job` label at
collection time. This caused all metrics collected by proxy sidecars in
Kubernetes jobs to be collected into an incorrect Prometheus job, rather than
the expected `linkerd-proxy` Prometheus job.

Fix `unsupported resource type` tap error message incorrectly printing the
target resource rather than the destination.

Set `--controller-log-level debug` in `install_test.go` for easier debugging.

Expose `slow-cooker`'s metrics via a k8s service in the tap integration test, to
validate proxy requests with a job as destination.

Fixes #1872
Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
siggy added a commit that referenced this issue Dec 3, 2018
Filtering by Kubernetes job was not supported. Also filtering by any unknown
type caused a panic.

Add filtering support by Kubernetes job, with special case mapping `job` to
`k8s_job`, to not conflict with Prometheus' job label.

Fix panic when unknown type specified as a `--from` or `--to` flag.

Fix `job` label from `linkerd-proxy` overwriting Prometheus `job` label at
collection time. This caused all metrics collected by proxy sidecars in
Kubernetes jobs to be collected into an incorrect Prometheus job, rather than
the expected `linkerd-proxy` Prometheus job.

Fix `unsupported resource type` tap error message incorrectly printing the
target resource rather than the destination.

Set `--controller-log-level debug` in `install_test.go` for easier debugging.

Expose `slow-cooker`'s metrics via a k8s service in the tap integration test, to
validate proxy requests with a job as destination.

Fixes #1872
Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
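
The `job` to `k8s_job` special case from the commit message above could look roughly like this. `prom_label_for_type` is a hypothetical name. Only the `job` mapping comes from the commit message; the other accepted types and their pass-through label names are assumptions made for illustration.

```shell
# Hypothetical mapping from a resource type to a Prometheus label name.
prom_label_for_type() {
  case "$1" in
    # From the commit message: avoid clashing with Prometheus' own job label.
    job) echo "k8s_job" ;;
    # Assumption: these types use a label named after the type itself.
    deployment|namespace|pod) echo "$1" ;;
    # Unknown types produce an error rather than crashing the caller.
    *) echo "unsupported resource type: $1" >&2; return 1 ;;
  esac
}
```

Returning an error for unknown types mirrors the intent of the panic fix for `--from` and `--to`, without claiming to reproduce it.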
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021