BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

pierresouchay · 2020-03-04T14:43:34Z

In current implementation of Consul, check alias cannot determine if a service exists or not. Because a service without any check is semantically considered as passing, so when no healthchecks are found for an agent, the check was considered as passing.

But this make little sense as the current implementation does not make any difference between:

a non-existing service (passing)
a service without any check (passing as well)

In order to make it work, we have to ensure that when a check did not find any healthcheck, the service does indeed exists. If it does not, lets consider the check as failing.

It should fix #7374

oto313 · 2020-03-06T07:28:37Z

@pierresouchay I dont develop in so I dont understand it deeply. But I see checking for service existence so I think it would fix my problem.

pierresouchay · 2020-03-17T17:56:59Z

@ShimmerGlass review?

…sider it failing In current implementation of Consul, check alias cannot determine if a service exists or not. Because a service without any check is semantically considered as passing, so when no healthchecks are found for an agent, the check was considered as passing. But this make little sense as the current implementation does not make any difference between: * a non-existing service (passing) * a service without any check (passing as well) In order to make it work, we have to ensure that when a check did not find any healthcheck, the service does indeed exists. If it does not, lets consider the check as failing.

hanshasselberg · 2020-05-20T08:27:56Z

I thought about the problem and I think you are right. This feature was introduced with #4320:

The use case here is that the health of a service may in fact depend on the health of another service, the most basic example being a sidecar proxy. A sidecar proxy should not be healthy if the service that it is proxying to is also not healthy, because any connection to the sidecar will invariably fail connecting to the upstream service instance.

With that in mind, a service should also not be healthy if there is no sidecar proxy! Now that I have made up my mind, I will take time to review your PR soon.

pierresouchay · 2020-05-20T08:37:30Z

@i0rek Thank you. This feature is actually great because it would allow us renaming services easily (the old service name would be an alias to the new one). But Allowing to have a dangling link to a service is a showstopper for us.

hanshasselberg · 2020-05-20T08:57:30Z

The implementation makes extra requests to determine if a service exists. This shouldn't be necessary because there is already a request to the servers to Health.NodeChecks.
This endpoint is too unspecific, but if we would add Health.ServiceChecks and make it return 404 in case a service doesn't exist, that would solve our problem.

BUT Health.NodeChecks returns [] for a non-existing node. And if we would make Health.ServiceChecks follow the same pattern it wouldn't be a big help, because we need an indication that the service doesn't exist.

Do you have other ideas? I will also discuss this internally.

pierresouchay · 2020-05-20T10:06:36Z

@i0rek Yes, I know the implementation is suboptimal, but the new request is Done ONLY when no check is found on service.

Some time ago, we implemented this: #3551

It allows to check for a service from its name (returns a collection) or by its ID, so you can:

/v1/agent/health/service/id/<myServiceId> => Present or absent
or /v1/agent/health/service/name/<myServiceName> => Collection

This is almost Ok, but has a big drawback: the agent itself is responding to the request (we used this for our Load-Balancers for instance, but because we know that our agents are running everywhere and no ACL on Consul's port where present), so in some environments where the agent HTTP is not enabled (or network ACLs are present), it would not work.

We could, however, implement something similar in health.

Regarding the return code, it really depends of what you want:

if your query the service by its name, IMHO, to follow Consul usual patterns, we might return empty collection, but if we add an endpoint in querying by ID, 404 might be suitable.

Regarding Health.ServiceChecks => if it would return all HealthChecks for a given service, we already have this info in /v1/health/service/ => the big downside is that the output can be huge - especially on large services but it includes both node and service checks.
/v1/health/services-on-node/ => with a similar structure as /v1/health/service/<serviceName> might do the trick.

There is also an alternative was by using /v1/health/service/<serviceName> by using meta filtering nodeName=targetNode that would probably might work as well.

I already thought about those alternatives while writing the PR, but I wanted to keep it as small as possible

hanshasselberg · 2020-05-20T10:28:18Z

/v1/health/services-on-node/ is what I meant. /v1/health/service/<serviceName> returns [] for non-existing services as well, the filtering doesn't help a lot there.

pierresouchay · 2020-05-20T10:37:25Z

/v1/health/services-on-node/ is what I meant. /v1/health/service/ returns [] for non-existing services as well, the filtering doesn't help a lot there.

The filtering could help actually:
Consider this:

curl -G http://localhost:8500/v1/health/service/< ServiceName> \
 --data-urlencode 'filter=((Node.Node == "the-node-name-you-want) AND (Service.ID == <ServiceIDYouAreLookingFor>"))'

=> if [] is returned => Service does not exists
=> if the Service and its ID exist => we are good, service exists (it returns both Node.Checks and Service.Checks, so we are good

The big advantage is that the call returns only the node we target, not the 1500 others (we are in this case).

Only problem: I tried this at first when I didd the PR, but IIRC, we don't know the ServiceName we are looking for!

hanshasselberg · 2020-05-20T20:15:29Z

Yeah, we don't have the service name. :(

pierresouchay · 2020-05-20T20:34:53Z

The simpler would probably to expose /v1/health/services-by-node/ (maybe with filter on serviceID) and return same kind of output as /v1/health/service/

When I worked on PR, that's was my best solution I could think of, but it is far more work than just checking on existence of service when there is doubt as I did here (and to be fair, it is easier for me, because I enforce all services to have at least 1 health check, so I would not pay the price that much in my infrastructure).
Since now, sidecar have healthchecks, it might be reasonable to consider this PR as most usages would not suffer of the second RPC to check for service existence) and keep further optimization for later

pierresouchay · 2020-06-03T13:41:29Z

@i0rek Do you think this additional call is a blocker for now?
Because the feature as currently implemented looks like a dangerous one to me

hanshasselberg

LGTM. Ideally the extra call is not needed with a better RPC endpoint for this type of problem. But since I don't have time to implement it now, I think it is better to error on the side of correctness. Which is why I think this is a good change with further potential for improvement.

Thanks @pierresouchay!

…ng (#7384) In current implementation of Consul, check alias cannot determine if a service exists or not. Because a service without any check is semantically considered as passing, so when no healthchecks are found for an agent, the check was considered as passing. But this make little sense as the current implementation does not make any difference between: * a non-existing service (passing) * a service without any check (passing as well) In order to make it work, we have to ensure that when a check did not find any healthcheck, the service does indeed exists. If it does not, lets consider the check as failing.

Previously we were setting the alias_service field of alias checks to the service name instead of the service id. This check would then pass because Consul would consider an alias for a non-existent service id to be okay. With hashicorp/consul#7384, now the check will fail if the service id doesn't exist and so those connect services will be considered unhealthy and be unroutable.

This fixes a unit test failure over in enterprise due to #7384

…ta (#8025) This fixes a unit test failure over in enterprise due to #7384

pierresouchay force-pushed the check_alias_fails_on_missing_service branch 4 times, most recently from f42227c to c9b5759 Compare March 5, 2020 17:05

pierresouchay mentioned this pull request Mar 5, 2020

Alias check - referenced check not found behavior #7374

Closed

pierresouchay force-pushed the check_alias_fails_on_missing_service branch 5 times, most recently from b83aba2 to 20f0ebf Compare March 28, 2020 09:13

pierresouchay force-pushed the check_alias_fails_on_missing_service branch from 20f0ebf to 7591ca5 Compare April 2, 2020 08:14

pierresouchay added 4 commits April 3, 2020 09:57

Be sure to test new method agent.State.ServiceExists(serviceID)

c85c3d8

Test fully end-to-end with agent RPC calls for alias checks

2e785c1

Remove agentName to NewTestAgent().

87d5f73

pierresouchay force-pushed the check_alias_fails_on_missing_service branch from 6f52904 to 87d5f73 Compare April 3, 2020 07:57

hanshasselberg self-assigned this May 12, 2020

pierresouchay mentioned this pull request May 23, 2020

Get checks by state criteo/consul-templaterb#65

Closed

hanshasselberg approved these changes Jun 4, 2020

View reviewed changes

hanshasselberg added the backport/1.8 label Jun 4, 2020

hanshasselberg merged commit 9813ae5 into hashicorp:master Jun 4, 2020

lkysow mentioned this pull request Jun 4, 2020

Fix bug with defining connect healthcheck hashicorp/consul-k8s#272

Merged

rboyer added a commit that referenced this pull request Jun 4, 2020

Ensure that the ServiceExists helper function normalizes entmeta

1749a8d

This fixes a unit test failure over in enterprise due to #7384

rboyer mentioned this pull request Jun 4, 2020

Ensure that the ServiceExists helper function normalizes entmeta #8025

Merged

hanshasselberg pushed a commit that referenced this pull request Jun 5, 2020

tests: ensure that the ServiceExists helper function normalizes entme…

9cfa4a3

…ta (#8025) This fixes a unit test failure over in enterprise due to #7384

hashicorp-ci pushed a commit that referenced this pull request Jun 5, 2020

tests: ensure that the ServiceExists helper function normalizes entme…

89fc983

…ta (#8025) This fixes a unit test failure over in enterprise due to #7384

freddygv mentioned this pull request Aug 6, 2020

Notify alias checks when aliased service is [de]registered #8456

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

pierresouchay commented Mar 4, 2020 •

edited

Loading

oto313 commented Mar 6, 2020

pierresouchay commented Mar 17, 2020

hanshasselberg commented May 20, 2020 •

edited

Loading

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020 •

edited

Loading

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020

pierresouchay commented May 20, 2020

pierresouchay commented Jun 3, 2020

hanshasselberg left a comment

BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

Conversation

pierresouchay commented Mar 4, 2020 • edited Loading

oto313 commented Mar 6, 2020

pierresouchay commented Mar 17, 2020

hanshasselberg commented May 20, 2020 • edited Loading

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020 • edited Loading

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020

pierresouchay commented May 20, 2020

hanshasselberg commented May 20, 2020

pierresouchay commented May 20, 2020

pierresouchay commented Jun 3, 2020

hanshasselberg left a comment

Choose a reason for hiding this comment

pierresouchay commented Mar 4, 2020 •

edited

Loading

hanshasselberg commented May 20, 2020 •

edited

Loading

hanshasselberg commented May 20, 2020 •

edited

Loading