Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUGFIX #7374 When a service does not exists in an alias, consider it failing #7384

Conversation

pierresouchay
Copy link
Contributor

@pierresouchay pierresouchay commented Mar 4, 2020

In current implementation of Consul, check alias cannot determine if a service exists or not. Because a service without any check is semantically considered as passing, so when no healthchecks are found for an agent, the check was considered as passing.

But this make little sense as the current implementation does not make any difference between:

  • a non-existing service (passing)
  • a service without any check (passing as well)

In order to make it work, we have to ensure that when a check did not find any healthcheck, the service does indeed exists. If it does not, lets consider the check as failing.

It should fix #7374

@pierresouchay pierresouchay force-pushed the check_alias_fails_on_missing_service branch 4 times, most recently from f42227c to c9b5759 Compare March 5, 2020 17:05
@oto313
Copy link

oto313 commented Mar 6, 2020

@pierresouchay I dont develop in so I dont understand it deeply. But I see checking for service existence so I think it would fix my problem.

@pierresouchay
Copy link
Contributor Author

@ShimmerGlass review?

@pierresouchay pierresouchay force-pushed the check_alias_fails_on_missing_service branch 5 times, most recently from b83aba2 to 20f0ebf Compare March 28, 2020 09:13
@pierresouchay pierresouchay force-pushed the check_alias_fails_on_missing_service branch from 20f0ebf to 7591ca5 Compare April 2, 2020 08:14
…sider it failing

In current implementation of Consul, check alias cannot determine
if a service exists or not. Because a service without any check
is semantically considered as passing, so when no healthchecks
are found for an agent, the check was considered as passing.

But this make little sense as the current implementation does not
make any difference between:
 * a non-existing service (passing)
 * a service without any check (passing as well)

In order to make it work, we have to ensure that when a check did
not find any healthcheck, the service does indeed exists. If it
does not, lets consider the check as failing.
@pierresouchay pierresouchay force-pushed the check_alias_fails_on_missing_service branch from 6f52904 to 87d5f73 Compare April 3, 2020 07:57
@hanshasselberg hanshasselberg self-assigned this May 12, 2020
@hanshasselberg
Copy link
Member

hanshasselberg commented May 20, 2020

I thought about the problem and I think you are right. This feature was introduced with #4320:

The use case here is that the health of a service may in fact depend on the health of another service, the most basic example being a sidecar proxy. A sidecar proxy should not be healthy if the service that it is proxying to is also not healthy, because any connection to the sidecar will invariably fail connecting to the upstream service instance.

With that in mind, a service should also not be healthy if there is no sidecar proxy! Now that I have made up my mind, I will take time to review your PR soon.

@pierresouchay
Copy link
Contributor Author

@i0rek Thank you. This feature is actually great because it would allow us renaming services easily (the old service name would be an alias to the new one). But Allowing to have a dangling link to a service is a showstopper for us.

@hanshasselberg
Copy link
Member

hanshasselberg commented May 20, 2020

The implementation makes extra requests to determine if a service exists. This shouldn't be necessary because there is already a request to the servers to Health.NodeChecks.
This endpoint is too unspecific, but if we would add Health.ServiceChecks and make it return 404 in case a service doesn't exist, that would solve our problem.

BUT Health.NodeChecks returns [] for a non-existing node. And if we would make Health.ServiceChecks follow the same pattern it wouldn't be a big help, because we need an indication that the service doesn't exist.

Do you have other ideas? I will also discuss this internally.

@pierresouchay
Copy link
Contributor Author

@i0rek Yes, I know the implementation is suboptimal, but the new request is Done ONLY when no check is found on service.

Some time ago, we implemented this: #3551

It allows to check for a service from its name (returns a collection) or by its ID, so you can:

/v1/agent/health/service/id/<myServiceId> => Present or absent
or /v1/agent/health/service/name/<myServiceName> => Collection

This is almost Ok, but has a big drawback: the agent itself is responding to the request (we used this for our Load-Balancers for instance, but because we know that our agents are running everywhere and no ACL on Consul's port where present), so in some environments where the agent HTTP is not enabled (or network ACLs are present), it would not work.

We could, however, implement something similar in health.

Regarding the return code, it really depends of what you want:

if your query the service by its name, IMHO, to follow Consul usual patterns, we might return empty collection, but if we add an endpoint in querying by ID, 404 might be suitable.

Regarding Health.ServiceChecks => if it would return all HealthChecks for a given service, we already have this info in /v1/health/service/ => the big downside is that the output can be huge - especially on large services but it includes both node and service checks.
/v1/health/services-on-node/ => with a similar structure as /v1/health/service/<serviceName> might do the trick.

There is also an alternative was by using /v1/health/service/<serviceName> by using meta filtering nodeName=targetNode that would probably might work as well.

I already thought about those alternatives while writing the PR, but I wanted to keep it as small as possible

@hanshasselberg
Copy link
Member

/v1/health/services-on-node/ is what I meant. /v1/health/service/<serviceName> returns [] for non-existing services as well, the filtering doesn't help a lot there.

@pierresouchay
Copy link
Contributor Author

/v1/health/services-on-node/ is what I meant. /v1/health/service/ returns [] for non-existing services as well, the filtering doesn't help a lot there.

The filtering could help actually:
Consider this:

curl -G http://localhost:8500/v1/health/service/< ServiceName> \
 --data-urlencode 'filter=((Node.Node == "the-node-name-you-want) AND (Service.ID == <ServiceIDYouAreLookingFor>"))'

=> if [] is returned => Service does not exists
=> if the Service and its ID exist => we are good, service exists (it returns both Node.Checks and Service.Checks, so we are good

The big advantage is that the call returns only the node we target, not the 1500 others (we are in this case).

Only problem: I tried this at first when I didd the PR, but IIRC, we don't know the ServiceName we are looking for!

@hanshasselberg
Copy link
Member

Yeah, we don't have the service name. :(

@pierresouchay
Copy link
Contributor Author

The simpler would probably to expose /v1/health/services-by-node/ (maybe with filter on serviceID) and return same kind of output as /v1/health/service/

When I worked on PR, that's was my best solution I could think of, but it is far more work than just checking on existence of service when there is doubt as I did here (and to be fair, it is easier for me, because I enforce all services to have at least 1 health check, so I would not pay the price that much in my infrastructure).
Since now, sidecar have healthchecks, it might be reasonable to consider this PR as most usages would not suffer of the second RPC to check for service existence) and keep further optimization for later

@pierresouchay
Copy link
Contributor Author

@i0rek Do you think this additional call is a blocker for now?
Because the feature as currently implemented looks like a dangerous one to me

Copy link
Member

@hanshasselberg hanshasselberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Ideally the extra call is not needed with a better RPC endpoint for this type of problem. But since I don't have time to implement it now, I think it is better to error on the side of correctness. Which is why I think this is a good change with further potential for improvement.

Thanks @pierresouchay!

@hanshasselberg hanshasselberg merged commit 9813ae5 into hashicorp:master Jun 4, 2020
hashicorp-ci pushed a commit that referenced this pull request Jun 4, 2020
…ng (#7384)

In current implementation of Consul, check alias cannot determine
if a service exists or not. Because a service without any check
is semantically considered as passing, so when no healthchecks
are found for an agent, the check was considered as passing.

But this make little sense as the current implementation does not
make any difference between:
 * a non-existing service (passing)
 * a service without any check (passing as well)

In order to make it work, we have to ensure that when a check did
not find any healthcheck, the service does indeed exists. If it
does not, lets consider the check as failing.
lkysow added a commit to hashicorp/consul-k8s that referenced this pull request Jun 4, 2020
Previously we were setting the alias_service field of alias checks to
the service name instead of the service id. This check would then pass
because Consul would consider an alias for a non-existent service id to
be okay. With hashicorp/consul#7384, now the
check will fail if the service id doesn't exist and so those connect
services will be considered unhealthy and be unroutable.
rboyer added a commit that referenced this pull request Jun 4, 2020
This fixes a unit test failure over in enterprise due to #7384
hanshasselberg pushed a commit that referenced this pull request Jun 5, 2020
…ta (#8025)

This fixes a unit test failure over in enterprise due to #7384
hashicorp-ci pushed a commit that referenced this pull request Jun 5, 2020
…ta (#8025)

This fixes a unit test failure over in enterprise due to #7384
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alias check - referenced check not found behavior
3 participants