Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement /v1/agent/health/service/<service name> endpoint #3551

Merged
merged 5 commits into from
Jan 7, 2019

Conversation

kamaradclimber
Copy link
Contributor

This endpoint aggregate all checks related to on the agent
and return an appropriate http code + the string describing the worst
check.

This allows to cleanly expose service status to other component, hiding
complexity of multiple checks.
This is especially useful to use consul to feed a loadbalancer which
would deleguate healthchecking to consul agent.

Exposing this endpoint on the agent is necessary to avoid a hit on
consul servers and avoid decreasing resiliency (this endpoint will work
even if there is no consul leader in the cluster).

Fix #2488, relates to #802

Change-Id: Ib340c62bbbba46fd4256ed31474d8ffb1762d4df
Signed-off-by: Grégoire Seux g.seux@criteo.com

@kamaradclimber kamaradclimber force-pushed the agent_health_service branch 2 times, most recently from 8c15944 to 44f2488 Compare October 7, 2017 17:28
@kamaradclimber
Copy link
Contributor Author

test in master branch do not pass at the moment, but I'll rebase when they will be green again

Copy link

@pierrecdn pierrecdn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As requested, mostly copied what has been said in criteo-forks@ea589ed.

}
if len(serviceChecks) == 0 {
resp.WriteHeader(http.StatusBadRequest)
fmt.Fprintf(resp, "Invalid serviceID %s", serviceID)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad request or Not found (404) in that case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

404 is probably more relevant, thx

// TODO: should we filter using acls like in AgentChecks method?
serviceChecks := make(api.HealthChecks, 0)
for _, c := range checks {
if c.ServiceID == serviceID {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually these APIs are based on the service names.
It would be probably a good thing to also authorize calling the endpoint by passing the name. This would imply returning all the services matching the service name on the agent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we add another endpoint, I think it might be confusing for users to interpret parameter as service id or name depending on xxx (based on what btw?).

Any input from hashicorp member?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the endpoints could be different (/service vs /services for ex).
My point was to allow using svc names when using this API.

case api.HealthPassing:
resp.WriteHeader(http.StatusOK)
default:
resp.WriteHeader(http.StatusServiceUnavailable)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the "unavailable" semantic is limited, HTTP-speaking, to "503", I would use the body to add details regarding what is currently failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of information would you add? which healthcheck?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would at least add the name of the failing healthcheck(s).

@kamaradclimber
Copy link
Contributor Author

@slackpad would you have feedback regarding this PR?

@kamaradclimber
Copy link
Contributor Author

@magiconair would you have feedback on that PR?

@kamaradclimber kamaradclimber force-pushed the agent_health_service branch 2 times, most recently from 3469195 to 97809cc Compare November 19, 2017 12:01
@kamaradclimber
Copy link
Contributor Author

rebasing on latest commit on master passing the CI. I'm hoping this will allow to get some attention on this PR.

@kamaradclimber kamaradclimber force-pushed the agent_health_service branch 2 times, most recently from 9de4b16 to 58e43ee Compare January 23, 2018 17:56
@kamaradclimber
Copy link
Contributor Author

I've rebased and fixed conflicts. Is anyone interested by the feature? I'd gladly take feedback from hashicorp member.

@kamaradclimber
Copy link
Contributor Author

After discussion with @pierrecdn, he will update this PR to take the service name as input (instead of service id). This will better fit the use case of loadbalancer (such as haproxy or F5) for which healthcheck endpoint is usually not configurable apart from the hostname/post couple.

@pierrecdn
Copy link

So as discussed, I'm adding these changes on top.

@pierrecdn pierrecdn force-pushed the agent_health_service branch 3 times, most recently from e40056c to c567c78 Compare February 13, 2018 19:20
@kamaradclimber kamaradclimber changed the title Implement /v1/agent/health/service/<service id> endpoint Implement /v1/agent/health/service/<service name> endpoint Mar 5, 2018
@pierresouchay
Copy link
Contributor

Added more tests

We could also use this service to create aliases of services, see for instance https://groups.google.com/d/msg/consul-tool/iKJlYP-wGLw/HA4XcQERBQAJ

@kamaradclimber
Copy link
Contributor Author

@preetapan would you have feedback on that PR?
We are using it to improve our LB stack to leverage consul health checks (instead of asking the loadbalancers to have their own).

@pearkes
Copy link
Contributor

pearkes commented Apr 27, 2018

A couple things, sorry for the delay in feedback here @kamaradclimber.

  • Do you mind removing changes to the makefile?
  • This will need documentation (website/) given it is a new API endpoint
  • Importantly, we're having a bit of trouble getting to the bottom of your use case. We see some similar discussion in Local health endpoint #2488 in regards to ELBs, and think this looks like generally a reasonable change, but still don't see the specific use case. Can you explain in depth?

Thanks!

@pierresouchay
Copy link
Contributor

Hello @pearkes ,

We basically want to use this in order to use Consul as a Cache for all healthChecks for a given service.
The idea is to have a service with 3 healthChecks:

  • one very fast and dumb on service port (currently used by our LBs)
  • a 2nd one very complex run every 10s, more complex, opening a database connection for instance
  • another more complex run every minute on admin port (detecting deadlocks for instance)

HeathChecks 2 and 3 are on another port than the main service, but the first healthcheck is dumb. What we want to do is to have a combinaison of all those heath checks, thus the loadbalancer will take better decision. Thus, this endpoint output the result of healthchecks[1,2,3] without calling those healthchecks (so, for instance does not open a new DB connection).

What it means is:

  • every LB can call very quickly this new endpoint and have a very good quality heathcheck
  • as soon as we add new heath checks, we do not call heavily those heath checks that are called ONLY by Consul, but the aggregated result is sent to LB
  • it also means we can use ONLY one healthcheck per service on our load balancers (many of them cannot have several healthchecks)

Another option is to be able to rename services as described in this thread: https://groups.google.com/d/msg/consul-tool/iKJlYP-wGLw/HA4XcQERBQAJ => it gives us the ability to easily rename service without duplicating everything

I'll try to create the website documentation as soon as possible

Kind regards

@kadaan
Copy link

kadaan commented May 1, 2018

@kamaradclimber I had the same needs as you and solved it via consulate. The main difference is Consulate is a sidecar that can be installed on the same box as consul-agent.

@kamaradclimber
Copy link
Contributor Author

Thanks @kadaan. Indeed consulate solve the same need, with more configurability. However it requires one more agent running on all machines (or container depending on your model), to monitor, to update, to patch to tweak.
Given the simplicity of the proposed feature, I feel it makes sense to have it in consul itself.

Of course it depends on what consul maintainers think about this.

@kadaan
Copy link

kadaan commented May 1, 2018

@kamaradclimber Thanks! There are plusses and minuses both ways. For us it works well as all the concerns above are abstracted away via ansible. We use consul to verify consulate health, and prometheus scrape the metrics from consulate using consul discovery. That said, if it had already been part of consul, there is no way I would have spent time building it.

One other benefit of the sidecar approach for us is being able to get this built and out in a really short period of time. Trying to get even a "simple" feature merge into a open source project can take a long time. When things are baked into the core, they really need to be sure that the long term support costs are worth it, the feature is generic enough, etc.

So, if Hashicorp decides to take it as a core piece of the consul offering, I'll move to that as soon as I can upgrade.

On the other hand, if they don't take the change, I see Consulate growing in importance for our stack.

Like you we want to boil down all health checking to be backed by consul checks. That means the ELBs should verify consul health checks, ansible handlers should verify consul health checks, etc..

@kamaradclimber
Copy link
Contributor Author

@pearkes, is the use case clearer?
We want to leverage consul as a source for our load balancer while avoiding to duplicate health checks definition.
Doing so allow users to understand only one service definition format (consul) and benefit from healthchecks in all systems built around it:

  • pseudo loadbalancing with dns (using consul dns api)
  • regular loadbalancers (in our case haproxy and F5)
  • client side loadbalancers (loadbalancer running in client processes or as client companions)

As @pierresouchay explained, with thispatch, consul has the ability to hide the complexity of healthchecks (definition, frequency) and expose a dead simple api for all consumers.

One addition benefit for users is they can describe very heavy healthchecks if necessary, which will be executed only by the consul agent (instead of having it executed by all loadbalancers instances).

In criteo, we've built a free loadbalancer for all our internal users based on consul+haproxy (next paris HUG will give some details I think) where all consul services benefit from a loadbalancer without any additional configuration required thanks to this patch.

As explained in the first commit of the PR, having such endpoint of the agent (instead of having it on the consul server) allows to have a very good resiliency. In a large cluster, loosing consul leadership during incident is nearly inevitable, you don't want your loadbalancing stack to die whne this happens.

@pearkes
Copy link
Contributor

pearkes commented May 4, 2018

Thanks for all of the context @kamaradclimber and @pierrecdn -- we agree this seems like a good idea and we'll review in detail soon.

@pierresouchay
Copy link
Contributor

@pearkes Any news or suggestions regarding this PR ?

Kind regards

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay First off, sorry for the delay on this review. I have been heads down on an upcoming ACL overhaul.

In addition to the 3 minor comments I left, I would still like to see the API package updated to utilize the new endpoint in api/agent.go.

fmt.Fprint(resp, status)
return nil, nil
}
resp.Header().Add("Content-Type", "application/json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this as its done in http.go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is actually needed: since we return non HTTP 200 Error codes, If we don't do that, the Mime Type won't be properly sent (because HTTP Status Code has already been sent)

I did not find a way to both specify the mime-type, having errors properly specified (ie: having correct JSON format) and having the proper error code.

What do you suggest?

fmt.Fprintf(resp, "ServiceId %s not found", serviceID)
return nil, nil
}
resp.Header().Add("Content-Type", "application/json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

fmt.Fprint(resp, status)
return nil, nil
}
resp.Header().Add("Content-Type", "application/json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason. At the time of this patch as least, there were no way I could send any JSON data with non HTTP 200

@pierresouchay
Copy link
Contributor

@mkeeler Sorry, I did not see your previous old comment regarding api/agent, trying to add this ASAP

@pierresouchay
Copy link
Contributor

@mkeeler Are you happy with this latest patch?

@banks banks added this to the Upcoming milestone Nov 9, 2018
@banks
Copy link
Member

banks commented Nov 9, 2018

Thanks for pursuing this and rebasing to make it mergeable again, we are still interested!

Matt is away for a while on leave and we've juggling some other release work but it's still on the radar!

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple notes on bad merge and doc formatting. Will discuss as mentioned but as I saw this I thought I'd mention it now so it doesn't add extra time after!

```
=======

>>>>>>> origin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like bad merge junk

`Accept` starting with `text/plain`.

Those endpoints return the aggregated values of all healthchecks for the
service and will return the corresponding HTTP codes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "for the service instance(s)"


| Result | Meaning |
| ------ | ------------------------------------------------------ |
| `200` | All healthchecks of this service are passing |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest we drop "of this service" or change it to "of every matching service instance" which seems unnecessary.

the aggregated status of given service
* create aliases for a given service (thus, the healthcheck of alias uses
http://localhost:8500/v1/agent/service/id/aliased_service_id healthcheck)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs and examples are really great thanks!

Can we split the endpoints to make the consistent with all the other endpoint docs though please?

For example if we have other endpoints like /catalog/service/:name and /catalog/connect/:name that are almost identical except for one semantic detail we just document one and then make a stub with links in the other etc: https://www.consul.io/api/catalog.html#list-nodes-for-connect-capable-service

I think that patter would work fine here although the response format is subtly different so maybe only the description park would be shared/linked between.

Once, split, can we follow the same format as all other endpoints. i.e.:

## Title

Description of purpose

[Extended notes]

| Method | Path                         | Produces                   |
| ------ | ---------------------------- | -------------------------- |
| `GET`  | `/path`            | `application/json`         |

The table below shows this endpoint's support for
[blocking queries](/api/index.html#blocking-queries),
[consistency modes](/api/index.html#consistency-modes),
[agent caching](/api/index.html#agent-caching), and
[required ACLs](/api/index.html#acls).

| Blocking Queries | Consistency Modes | Agent Caching | ACL Required   |
| ---------------- | ----------------- | ------------- | -------------- |
| `NO`             | `none`            | `none`        | `service:read` |

### Sample Request

```text
$ curl \
    http://127.0.0.1:8500/v1/agent/services

Sample Response

{
  "redis": {
      "ID": "redis",
      "Service": "redis",
      "Tags": [],
      "Meta": {
          "redis_version": "4.0"
      },
      "Port": 8000,
      "Address": "",
      "EnableTagOverride": false,
      "Weights": {
          "Passing": 10,
          "Warning": 1
      }
  }
}

I suggest the format variants you make the table like we did for metrics (in this same file!):

image

Thanks!

@ShimmerGlass
Copy link
Contributor

@banks I've updated the docs and split the endpoints

@pierresouchay
Copy link
Contributor

@Aestek Thank you

pierresouchay and others added 4 commits November 9, 2018 17:02
This endpoint aggregate all checks related to <service id> on the agent
and return an appropriate http code + the string describing the worst
check.

This allows to cleanly expose service status to other component, hiding
complexity of multiple checks.
This is especially useful to use consul to feed a loadbalancer which
would deleguate healthchecking to consul agent.

Exposing this endpoint on the agent is necessary to avoid a hit on
consul servers and avoid decreasing resiliency (this endpoint will work
even if there is no consul leader in the cluster).

Fix hashicorp#2488, relates to hashicorp#802

Change-Id: Ib340c62bbbba46fd4256ed31474d8ffb1762d4df
Signed-off-by: Grégoire Seux <g.seux@criteo.com>
In order to make it work, had to perform changes in http to
return a payload while not returning HTTP 200
@ShimmerGlass
Copy link
Contributor

(rewrote git history so its simpler to rebase on our side)

@pierresouchay
Copy link
Contributor

@Peakles Possible to have a look, there were a kind of Consensus around this PR, and we spent quite some time rebasing it constantly for more than 1 year.

Furthermore, it does not seems that risky since it does not modify any existing endpoint

@banks
Copy link
Member

banks commented Nov 28, 2018

Thanks for patience here folks. We do appreciate the effort, just as always a balancing act of spending time thinking through all the future implications and docs etc. and other work.

I'm going to tentatively tag this for 1.4.1. so it's on our list. That's not a definite promise it will get in if we have to release early etc. but hopefully it just needs us to recall the context and give a final pass.

@banks banks modified the milestones: Upcoming, 1.4.1 Nov 28, 2018
@mkeeler
Copy link
Member

mkeeler commented Jan 4, 2019

@pierresouchay Now that I am back I am going to give this another look today. All your comments to my previous batch of requests sound good but its been a while so I am going to look it all over again.

@pierresouchay
Copy link
Contributor

@mkeeler thank you so much!

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good. There is just the one question I have regarding usage of the CodeWithPayloadError struct.

Once that is worked out I see no barriers to getting this merged.

api/api.go Outdated
@@ -775,7 +787,9 @@ func (c *Client) query(endpoint string, out interface{}, q *QueryOptions) (*Quer
r.setQueryOptions(q)
rtt, resp, err := requireOK(c.doRequest(r))
if err != nil {
return nil, err
if _, ok := err.(CodeWithPayloadError); !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am just missing it but how can the CodeWithPayloadError ever be returned by requireOk/doRequest.

Also if this is the only place CodeWithPayloadError is used here and its not actually needed then the struct should be moved into the main agent package and out of the api package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkeeler Yes, you are right (many refactoring were performed, it needed to be in api package in previous interations). Fixed in next patch

@kamaradclimber kamaradclimber requested a review from a team as a code owner January 7, 2019 09:14
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think it will still require someone from consul-docs to review this too.

@mkeeler mkeeler merged commit 4f62a3b into hashicorp:master Jan 7, 2019
@kamaradclimber
Copy link
Contributor Author

🎆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants