Consider returning empty usage data for /v3/processes/:guid/stats #2227

tcdowney · 2021-04-22T17:29:01Z

Thanks for submitting an issue to cloud_controller_ng. We are always trying to improve! To help us, please fill out the following template.

Issue

When log-cache (or metric-proxy on cf-for-k8s) is unavailable or returns a bad response for a particular process the entire /v3/processes/:guid/stats endpoint will exit with an error.

Context

This behavior is unfortunate for clients who do not care about process metrics and instead are using the endpoint for other information (e.g. is my process running? what ports is my process listening on?). Ideally the endpoint could continue to respond with the information it is able to provide when the platform is degraded.

Steps to Reproduce

On cf-for-k8s you can easily reproduce this behavior by running cf restart on an app in a loops. It fails roughly every 10 restarts since metric-proxy has difficulty fetching metrics from the Kubernetes API while pod containers are being churned.

More easily you can reproduce this by explicitly raising an error in the log-cache client in this method:

cloud_controller_ng/lib/logcache/client.rb

Line 32 in 637fe7d

    
           def container_metrics(source_guid:, envelope_limit: DEFAULT_LIMIT, start_time:, end_time:)

Expected result

Request:

curl "https://api.example.org/v3/processes/[guid]/stats" \
  -X GET \
  -H "Authorization: bearer [token]"

Response:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "resources": [
    {
      "type": "web",
      "index": 0,
      "state": "RUNNING",
      "usage": { },
      "host": "10.244.16.10",
      "instance_ports": [
        {
          "external": 64546,
          "internal": 8080,
          "external_tls_proxy_port": 61002,
          "internal_tls_proxy_port": 61003
        }
      ],
      "uptime": 9042,
      "mem_quota": 268435456,
      "disk_quota": 1073741824,
      "fds_quota": 16384,
      "isolation_segment": "example_iso_segment",
      "details": null
    }
  ]
}

or

HTTP/1.1 200 OK
Content-Type: application/json

{
  "resources": [
    {
      "type": "web",
      "index": 0,
      "state": "RUNNING",
      "usage": {
        "time": "2016-03-23T23:17:30.476314154Z",
        "cpu": null,
        "mem": null,
        "disk": null
      },
      "host": "10.244.16.10",
      "instance_ports": [
        {
          "external": 64546,
          "internal": 8080,
          "external_tls_proxy_port": 61002,
          "internal_tls_proxy_port": 61003
        }
      ],
      "uptime": 9042,
      "mem_quota": 268435456,
      "disk_quota": 1073741824,
      "fds_quota": 16384,
      "isolation_segment": "example_iso_segment",
      "details": null
    }
  ]
}

Current result

An error is returned from the CF API. Clients typically do not handle this well and it results in failed cf push and cf restart commands for users.

Possible Fix

One potential fix is to return 0s for the values since this is what we do for empty envelopes. However, in a recent story (https://www.pivotaltracker.com/n/projects/966314/stories/177066349) it was pointed out that his is not ideal since 0 is a valid value for these metrics by @jenspinney. I agree that that's not ideal. Alternatively you could use a sentinel value integer like -1, but I imagine clients that are making decisions based on this data might not handle that well and it may result in odd bugs.

I think the safest and most semantic fix here is to return an empty usage array. Some clients may not handle that correctly, but ideally it would manifest as a response parsing error rather than a mathematical error.

Alternatively the onus could be put on the clients to retry when they get errors back from this endpoint. However, if log-cache is unavailable for long periods of time, this will not be much help. Since this endpoint is in the critical path for cf push and cf (re)start I think it makes the most sense to try and improve the experience in Cloud Controller.

For good measure, I did create an issue with the CLI, though.

cloudfoundry/cli#2160

The text was updated successfully, but these errors were encountered:

tcdowney · 2021-04-27T19:04:31Z

We had a discussion around this on Slack, there were concerns about a 200 response masking the fact that errors were occurring and I mentioned potentially including warnings or errors to surface the issue (in addition to logs server side).

Now that I read through the API docs, I see we've only added these to the singular Job resource right now and there aren't any examples of a list endpoint having them. I'm imagining this could look something like the following in this case:

{
  "warnings": [
    {
      "detail": "Stats server temporarily unavailable."
    }
  ],
  "resources": [
    {
      "type": "web",
      "index": 0,
      "state": "RUNNING",
      "usage": {},
      "host": "10.244.16.10",
      "instance_ports": [
        {
          "external": 64546,
          "internal": 8080,
          "external_tls_proxy_port": 61002,
          "internal_tls_proxy_port": 61003
        }
      ],
      "uptime": 9042,
      "fds_quota": 16384,
      "isolation_segment": "example_iso_segment",
      "details": null
    }
  ]
}

One other thing I realized was that the mem_quota and disk_quota fields are also backed by log-cache now. Assuming we go with "usage": {}, what should we do with these other two fields? Omit the fields entirely? Set them to null? Have them fall back to the desired state for these values?

I think omitting them would align more closely with the existing behavior of the endpoint which provides a very limited set of fields when instances are down.

cloud_controller_ng/app/presenters/v3/process_stats_presenter.rb

Lines 54 to 63 in 24a27be

    
           def down_instance_stats_hash(index, stats) 
        
             { 
        
               type:   @type, 
        
               index:  index, 
        
               state:  stats[:state], 
        
               uptime: stats[:uptime], 
        
               isolation_segment: stats[:isolation_segment], 
        
               details: stats[:details] 
        
             } 
        
           end

cc @sethboyles @Gerg

tcdowney · 2021-04-27T20:12:12Z

Actually, regarding the mem_quota and disk_quota, I see they already have the potential to be nil. I'll just maintain that behavior.

cloud_controller_ng/lib/cloud_controller/diego/reporters/instances_stats_reporter.rb

Lines 54 to 55 in 24a27be

    
           mem_quota:  quota_stats[index]&.memoryBytesQuota, 
        
           disk_quota: quota_stats[index]&.diskBytesQuota,

sethboyles · 2021-04-28T00:00:06Z

Hm, I also thought we had those warnings on more resources besides jobs.

Interestingly, we have this method add_warning_headers in our v3 application controller that adds a warning header to the response:

cloud_controller_ng/app/controllers/v3/application_controller.rb

Lines 144 to 151 in 24a27be

    
           def add_warning_headers(warnings) 
        
             return if warnings.nil? 
        
             raise ArgumentError.new('warnings should be an array') unless warnings.is_a?(Array) 
        
             warnings.each do |warning| 
        
               response.add_header('X-Cf-Warnings', CGI.escape(warning)) 
        
             end 
        
           end

It was introduced by this PR: https://github.com/cloudfoundry/cloud_controller_ng/pull/1359/files

and the only usage of it was removed shortly afterward when that endpoint was made to handle the service broker creation asynchronously:
8eeabf5#diff-374c7916a418e1daf84ad9223fb55dc0097dc1a9d28cdbdf779502af97ed9c08

So we have this method that is unused---BUT the CLI apparently still has code in it to print out these header warnings when it sees them.

on a bosh-lite I added a warning like this:

class AppsV3Controller < ApplicationController
  def index
    add_warning_headers(["IN APPS CONTROLLER"])
    message = AppsListMessage.from_params(query_params)

And when running a CLI command it gets printed:

± sb → cf apps
Getting apps in org org / space space as admin...

IN APPS CONTROLLER
name    requested state   processes             routes
dora    started           web:1/1, worker:0/0   dora.flint-duck.capi.land

I haven't had a chance to think too much about this pattern and whether or not its preferable to the above. But it is an option...

- this change rescues errors from log-cache and metric-proxy and passes them up as warnings to callers instead of exceptions so that the stats endpoint can continue to function on a best-effort basis - on /v3/processes/:guid/stats these warnings are included in a new "warnings" key on the response, on v2 they are ignored - addresses #2227 Authored-by: Tim Downey <tdowney@vmware.com>

tcdowney · 2021-04-28T03:51:14Z

@sethboyles I created a draft PR to get some initial feedback:
#2250

I had to jump through some hoops to get the "warning messages" passed up high enough for the controllers to act on them and ended up using Go style multi-value returns to do so. It's not the most idiomatic Ruby I've written so I'd appreciate y'all's thoughts on that. I'm also happy to use the 'X-Cf-Warnings' warning header in addition to or in place of the "warnings" key.

If this approach seems fine, though, I'll probably clean it up a bit further and add a commit to tweak the docs and remove the now unused temporary_ignore_server_unavailable_errors config option.

- this change rescues errors from log-cache and metric-proxy and passes them up as warnings to callers instead of exceptions so that the stats endpoint can continue to function on a best-effort basis - on /v3/processes/:guid/stats these warnings are included in the 'X-Cf-Warnings' header - addresses #2227 Authored-by: Tim Downey <tdowney@vmware.com>

- this temporary config flag was used to ignore server unavailable errors coming back from log-cache (or metric-proxy in cf-for-k8s) - given the change proposed in #2227 to rescue more errors and serve partial stats responses this config is no longer necessary Authored-by: Tim Downey <tdowney@vmware.com>

In #2227 we updated the Process Stats endpoint to be able to return partial responses (with a warning) when the stats server is unavailable. Authored-by: Tim Downey <tdowney@vmware.com>

tcdowney · 2021-04-30T21:32:04Z

@sethboyles one annoying thing about the warning I just discovered is that the CLI emits it to stdout (I had assumed it was going to stderr) -- even on cf curl.

So if you've got something trying to parse this as json it ends up choking on Stats server temporarily unavailable..

cf curl /v3/processes/e9c6facc-7c9b-4cf6-9a7f-e5a4c908157c/stats
{
   "resources": [
      {
         "type": "web",
         "index": 0,
         "state": "RUNNING",
         "host": "127.0.0.1",
         "uptime": 173624,
         "mem_quota": null,
         "disk_quota": null,
         "fds_quota": 16384,
         "isolation_segment": "placeholder",
         "details": null,
         "instance_ports": [
            {
               "external": 80,
               "internal": 8080,
               "external_tls_proxy_port": null,
               "internal_tls_proxy_port": null
            }
         ],
         "usage": {}
      }
   ]
}
Stats server temporarily unavailable.

This is probably more of a CLI issue, but it was a bit unexpected.

sethboyles · 2021-04-30T22:10:47Z

Ah, interesting. Is this just particular to the cf curl command?

± sb |main {4} U:1 ✗| → cf app dora 2>&1 > /dev/null
Stats server temporarily unavailable.

± sb |main {4} U:1 ✗| → cf curl v3/apps/$(cf app dora --guid)/processes/web/stats 2>&1 > /dev/null
(nothing)

this is using CF CLI v7

If so, might be worth opening an issue for the cf cli

- Since the `/v3/processes/:guid/stats endpoint now may set an `X-Cf-Warnings` header (see cloudfoundry/cloud_controller_ng#2227), the CLI may return the contents of this header along with the endpoint's regular JSON response which causes parsing errors in the BARAs - Logged cloudfoundry/cli#2164 with the CLI to make it so that header isn't printed to `stdout` for this command, but until then we can just switch to use a regular http client to fetch this endpoint [#177518468](https://www.pivotaltracker.com/story/show/177518468) Authored-by: Tim Downey <tdowney@vmware.com>

sethboyles · 2021-05-10T19:31:38Z

This was release in 1.110 (though it seems to have escaped the release notes)

tcdowney self-assigned this Apr 27, 2021

tcdowney mentioned this issue Apr 28, 2021

Support partial Process Stats responses when the stats service (log-cache or metric-proxy) is unavailable #2250

Merged

5 tasks

tcdowney mentioned this issue May 3, 2021

'X-Cf-Warnings' header should not print to stdout by default for cf curl cloudfoundry/cli#2164

Closed

4 tasks

sethboyles closed this as completed May 10, 2021

tcdowney mentioned this issue Feb 18, 2022

[Feature]: API can get process stats cloudfoundry/korifi#681

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider returning empty usage data for /v3/processes/:guid/stats #2227

Consider returning empty usage data for /v3/processes/:guid/stats #2227

tcdowney commented Apr 22, 2021 •

edited

Loading

tcdowney commented Apr 27, 2021 •

edited

Loading

tcdowney commented Apr 27, 2021

sethboyles commented Apr 28, 2021

tcdowney commented Apr 28, 2021

tcdowney commented Apr 30, 2021

sethboyles commented Apr 30, 2021 •

edited

Loading

sethboyles commented May 10, 2021

Consider returning empty usage data for /v3/processes/:guid/stats #2227

Consider returning empty usage data for /v3/processes/:guid/stats #2227

Comments

tcdowney commented Apr 22, 2021 • edited Loading

Issue

Context

Steps to Reproduce

Expected result

Current result

Possible Fix

tcdowney commented Apr 27, 2021 • edited Loading

tcdowney commented Apr 27, 2021

sethboyles commented Apr 28, 2021

tcdowney commented Apr 28, 2021

tcdowney commented Apr 30, 2021

sethboyles commented Apr 30, 2021 • edited Loading

sethboyles commented May 10, 2021

tcdowney commented Apr 22, 2021 •

edited

Loading

tcdowney commented Apr 27, 2021 •

edited

Loading

sethboyles commented Apr 30, 2021 •

edited

Loading