-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement /v1/agent/health/service/<service name> endpoint #3551
Conversation
8c15944
to
44f2488
Compare
test in master branch do not pass at the moment, but I'll rebase when they will be green again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As requested, mostly copied what has been said in criteo-forks@ea589ed.
agent/agent_endpoint.go
Outdated
} | ||
if len(serviceChecks) == 0 { | ||
resp.WriteHeader(http.StatusBadRequest) | ||
fmt.Fprintf(resp, "Invalid serviceID %s", serviceID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad request or Not found (404) in that case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
404 is probably more relevant, thx
agent/agent_endpoint.go
Outdated
// TODO: should we filter using acls like in AgentChecks method? | ||
serviceChecks := make(api.HealthChecks, 0) | ||
for _, c := range checks { | ||
if c.ServiceID == serviceID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually these APIs are based on the service names.
It would be probably a good thing to also authorize calling the endpoint by passing the name. This would imply returning all the services matching the service name on the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we add another endpoint, I think it might be confusing for users to interpret parameter as service id or name depending on xxx (based on what btw?).
Any input from hashicorp member?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the endpoints could be different (/service vs /services for ex).
My point was to allow using svc names when using this API.
agent/agent_endpoint.go
Outdated
case api.HealthPassing: | ||
resp.WriteHeader(http.StatusOK) | ||
default: | ||
resp.WriteHeader(http.StatusServiceUnavailable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the "unavailable" semantic is limited, HTTP-speaking, to "503", I would use the body to add details regarding what is currently failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of information would you add? which healthcheck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would at least add the name of the failing healthcheck(s).
@slackpad would you have feedback regarding this PR? |
@magiconair would you have feedback on that PR? |
3469195
to
97809cc
Compare
rebasing on latest commit on master passing the CI. I'm hoping this will allow to get some attention on this PR. |
9de4b16
to
58e43ee
Compare
I've rebased and fixed conflicts. Is anyone interested by the feature? I'd gladly take feedback from hashicorp member. |
After discussion with @pierrecdn, he will update this PR to take the service name as input (instead of service id). This will better fit the use case of loadbalancer (such as haproxy or F5) for which healthcheck endpoint is usually not configurable apart from the hostname/post couple. |
So as discussed, I'm adding these changes on top. |
e40056c
to
c567c78
Compare
Added more tests We could also use this service to create aliases of services, see for instance https://groups.google.com/d/msg/consul-tool/iKJlYP-wGLw/HA4XcQERBQAJ |
@preetapan would you have feedback on that PR? |
A couple things, sorry for the delay in feedback here @kamaradclimber.
Thanks! |
Hello @pearkes , We basically want to use this in order to use Consul as a Cache for all healthChecks for a given service.
HeathChecks 2 and 3 are on another port than the main service, but the first healthcheck is dumb. What we want to do is to have a combinaison of all those heath checks, thus the loadbalancer will take better decision. Thus, this endpoint output the result of healthchecks[1,2,3] without calling those healthchecks (so, for instance does not open a new DB connection). What it means is:
Another option is to be able to rename services as described in this thread: https://groups.google.com/d/msg/consul-tool/iKJlYP-wGLw/HA4XcQERBQAJ => it gives us the ability to easily rename service without duplicating everything I'll try to create the website documentation as soon as possible Kind regards |
@kamaradclimber I had the same needs as you and solved it via consulate. The main difference is Consulate is a sidecar that can be installed on the same box as consul-agent. |
Thanks @kadaan. Indeed consulate solve the same need, with more configurability. However it requires one more agent running on all machines (or container depending on your model), to monitor, to update, to patch to tweak. Of course it depends on what consul maintainers think about this. |
@kamaradclimber Thanks! There are plusses and minuses both ways. For us it works well as all the concerns above are abstracted away via ansible. We use consul to verify consulate health, and prometheus scrape the metrics from consulate using consul discovery. That said, if it had already been part of consul, there is no way I would have spent time building it. One other benefit of the sidecar approach for us is being able to get this built and out in a really short period of time. Trying to get even a "simple" feature merge into a open source project can take a long time. When things are baked into the core, they really need to be sure that the long term support costs are worth it, the feature is generic enough, etc. So, if Hashicorp decides to take it as a core piece of the consul offering, I'll move to that as soon as I can upgrade. On the other hand, if they don't take the change, I see Consulate growing in importance for our stack. Like you we want to boil down all health checking to be backed by consul checks. That means the ELBs should verify consul health checks, ansible handlers should verify consul health checks, etc.. |
@pearkes, is the use case clearer?
As @pierresouchay explained, with thispatch, consul has the ability to hide the complexity of healthchecks (definition, frequency) and expose a dead simple api for all consumers. One addition benefit for users is they can describe very heavy healthchecks if necessary, which will be executed only by the consul agent (instead of having it executed by all loadbalancers instances). In criteo, we've built a free loadbalancer for all our internal users based on consul+haproxy (next paris HUG will give some details I think) where all consul services benefit from a loadbalancer without any additional configuration required thanks to this patch. As explained in the first commit of the PR, having such endpoint of the agent (instead of having it on the consul server) allows to have a very good resiliency. In a large cluster, loosing consul leadership during incident is nearly inevitable, you don't want your loadbalancing stack to die whne this happens. |
Thanks for all of the context @kamaradclimber and @pierrecdn -- we agree this seems like a good idea and we'll review in detail soon. |
@pearkes Any news or suggestions regarding this PR ? Kind regards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierresouchay First off, sorry for the delay on this review. I have been heads down on an upcoming ACL overhaul.
In addition to the 3 minor comments I left, I would still like to see the API package updated to utilize the new endpoint in api/agent.go.
agent/agent_endpoint.go
Outdated
fmt.Fprint(resp, status) | ||
return nil, nil | ||
} | ||
resp.Header().Add("Content-Type", "application/json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this as its done in http.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is actually needed: since we return non HTTP 200 Error codes, If we don't do that, the Mime Type won't be properly sent (because HTTP Status Code has already been sent)
I did not find a way to both specify the mime-type, having errors properly specified (ie: having correct JSON format) and having the proper error code.
What do you suggest?
agent/agent_endpoint.go
Outdated
fmt.Fprintf(resp, "ServiceId %s not found", serviceID) | ||
return nil, nil | ||
} | ||
resp.Header().Add("Content-Type", "application/json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dito
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
agent/agent_endpoint.go
Outdated
fmt.Fprint(resp, status) | ||
return nil, nil | ||
} | ||
resp.Header().Add("Content-Type", "application/json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dito
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason. At the time of this patch as least, there were no way I could send any JSON data with non HTTP 200
@mkeeler Sorry, I did not see your previous old comment regarding api/agent, trying to add this ASAP |
@mkeeler Are you happy with this latest patch? |
211c21c
to
dd6f8dc
Compare
5595e69
to
5c7fd01
Compare
Thanks for pursuing this and rebasing to make it mergeable again, we are still interested! Matt is away for a while on leave and we've juggling some other release work but it's still on the radar! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple notes on bad merge and doc formatting. Will discuss as mentioned but as I saw this I thought I'd mention it now so it doesn't add extra time after!
``` | ||
======= | ||
|
||
>>>>>>> origin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like bad merge junk
`Accept` starting with `text/plain`. | ||
|
||
Those endpoints return the aggregated values of all healthchecks for the | ||
service and will return the corresponding HTTP codes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "for the service instance(s)"
|
||
| Result | Meaning | | ||
| ------ | ------------------------------------------------------ | | ||
| `200` | All healthchecks of this service are passing | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: suggest we drop "of this service" or change it to "of every matching service instance" which seems unnecessary.
the aggregated status of given service | ||
* create aliases for a given service (thus, the healthcheck of alias uses | ||
http://localhost:8500/v1/agent/service/id/aliased_service_id healthcheck) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs and examples are really great thanks!
Can we split the endpoints to make the consistent with all the other endpoint docs though please?
For example if we have other endpoints like /catalog/service/:name
and /catalog/connect/:name
that are almost identical except for one semantic detail we just document one and then make a stub with links in the other etc: https://www.consul.io/api/catalog.html#list-nodes-for-connect-capable-service
I think that patter would work fine here although the response format is subtly different so maybe only the description park would be shared/linked between.
Once, split, can we follow the same format as all other endpoints. i.e.:
## Title
Description of purpose
[Extended notes]
| Method | Path | Produces |
| ------ | ---------------------------- | -------------------------- |
| `GET` | `/path` | `application/json` |
The table below shows this endpoint's support for
[blocking queries](/api/index.html#blocking-queries),
[consistency modes](/api/index.html#consistency-modes),
[agent caching](/api/index.html#agent-caching), and
[required ACLs](/api/index.html#acls).
| Blocking Queries | Consistency Modes | Agent Caching | ACL Required |
| ---------------- | ----------------- | ------------- | -------------- |
| `NO` | `none` | `none` | `service:read` |
### Sample Request
```text
$ curl \
http://127.0.0.1:8500/v1/agent/services
Sample Response
{
"redis": {
"ID": "redis",
"Service": "redis",
"Tags": [],
"Meta": {
"redis_version": "4.0"
},
"Port": 8000,
"Address": "",
"EnableTagOverride": false,
"Weights": {
"Passing": 10,
"Warning": 1
}
}
}
I suggest the format variants you make the table like we did for metrics (in this same file!):
Thanks!
@banks I've updated the docs and split the endpoints |
@Aestek Thank you |
This endpoint aggregate all checks related to <service id> on the agent and return an appropriate http code + the string describing the worst check. This allows to cleanly expose service status to other component, hiding complexity of multiple checks. This is especially useful to use consul to feed a loadbalancer which would deleguate healthchecking to consul agent. Exposing this endpoint on the agent is necessary to avoid a hit on consul servers and avoid decreasing resiliency (this endpoint will work even if there is no consul leader in the cluster). Fix hashicorp#2488, relates to hashicorp#802 Change-Id: Ib340c62bbbba46fd4256ed31474d8ffb1762d4df Signed-off-by: Grégoire Seux <g.seux@criteo.com>
In order to make it work, had to perform changes in http to return a payload while not returning HTTP 200
74f2d93
to
c2b7743
Compare
(rewrote git history so its simpler to rebase on our side) |
@Peakles Possible to have a look, there were a kind of Consensus around this PR, and we spent quite some time rebasing it constantly for more than 1 year. Furthermore, it does not seems that risky since it does not modify any existing endpoint |
Thanks for patience here folks. We do appreciate the effort, just as always a balancing act of spending time thinking through all the future implications and docs etc. and other work. I'm going to tentatively tag this for 1.4.1. so it's on our list. That's not a definite promise it will get in if we have to release early etc. but hopefully it just needs us to recall the context and give a final pass. |
@pierresouchay Now that I am back I am going to give this another look today. All your comments to my previous batch of requests sound good but its been a while so I am going to look it all over again. |
@mkeeler thank you so much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good. There is just the one question I have regarding usage of the CodeWithPayloadError struct.
Once that is worked out I see no barriers to getting this merged.
api/api.go
Outdated
@@ -775,7 +787,9 @@ func (c *Client) query(endpoint string, out interface{}, q *QueryOptions) (*Quer | |||
r.setQueryOptions(q) | |||
rtt, resp, err := requireOK(c.doRequest(r)) | |||
if err != nil { | |||
return nil, err | |||
if _, ok := err.(CodeWithPayloadError); !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am just missing it but how can the CodeWithPayloadError ever be returned by requireOk/doRequest.
Also if this is the only place CodeWithPayloadError is used here and its not actually needed then the struct should be moved into the main agent package and out of the api package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkeeler Yes, you are right (many refactoring were performed, it needed to be in api package in previous interations). Fixed in next patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I think it will still require someone from consul-docs to review this too.
🎆 |
This endpoint aggregate all checks related to on the agent
and return an appropriate http code + the string describing the worst
check.
This allows to cleanly expose service status to other component, hiding
complexity of multiple checks.
This is especially useful to use consul to feed a loadbalancer which
would deleguate healthchecking to consul agent.
Exposing this endpoint on the agent is necessary to avoid a hit on
consul servers and avoid decreasing resiliency (this endpoint will work
even if there is no consul leader in the cluster).
Fix #2488, relates to #802
Change-Id: Ib340c62bbbba46fd4256ed31474d8ffb1762d4df
Signed-off-by: Grégoire Seux g.seux@criteo.com