Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

watch: Unavailable many times #9725

Closed
MichaelGou1105 opened this issue May 15, 2018 · 3 comments
Closed

watch: Unavailable many times #9725

MichaelGou1105 opened this issue May 15, 2018 · 3 comments

Comments

@MichaelGou1105
Copy link

MichaelGou1105 commented May 15, 2018

the etcd cluster is use for the event etcd for kubernetes;
but my prometheus show:

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 50
@MichaelGou1105
Copy link
Author

Deadlock problem, who knows how to solve it?

etcdserver/api/v3rpc/watch.go

	go func() {
		if rerr := sws.recvLoop(); rerr != nil {
			if isClientCtxErr(stream.Context().Err(), rerr) {
				plog.Debugf("failed to receive watch request from gRPC stream (%q)", rerr.Error())
			} else {
				plog.Warningf("failed to receive watch request from gRPC stream (%q)", rerr.Error())
			}
			errc <- rerr
		}
	}()

Recurring problems

  • start etcd
./etcd
  • benchmark watch in etcd v3
benchmark  watch  --streams 20  --watch-per-stream 50  --watched-key-total 1   --put-total 1000   --put-total 1
  • benchmark result

benchmark_1.log

  • curl the metrics

cmd:

curl http://127.0.0.1:2379/metrics |grep  Unavailable

result:

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 17
  • etcd debug log
2018-05-15 20:24:43.842441 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842467 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842486 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842496 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842513 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842702 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842711 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842718 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842727 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842733 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842739 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842744 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842753 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842761 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842766 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842771 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842776 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842781 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842799 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:43.842814 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")
2018-05-15 20:24:52.124531 D | etcdserver/api/v2http: [GET] /metrics remote:127.0.0.1:33948

@hexfusion
Copy link
Contributor

2018-05-15 20:24:43.842441 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")

This warning generally happens when your watch client closes. Since you have 20 records above and benchmark watch --streams 20 I believe this is expected as I imagine benchmark cleans up after it runs. You can replicate this by setting a watch with etcdctl then Ctrl + C to kill and watch your logs.

@stale
Copy link

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Aug 3, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively
determine that a leader has actually been lost before propagating a
ErrGRPCNoLeader error.

2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn
regarding leader loss.

There remains an assumption that absence of leader loss evidence represents a
client cancellation, but in practice this seems less likely to break down
whereas client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Aug 3, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively
determine that a leader has actually been lost before propagating a
ErrGRPCNoLeader error.

2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn
regarding leader loss.

There remains an assumption that absence of evidence of leader loss means a
client cancelled, but in practice this seems less likely to break down whereas
client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Aug 5, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively
determine that a leader has actually been lost before propagating a
ErrGRPCNoLeader error.

2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn
regarding leader loss.

There remains an assumption that absence of evidence of leader loss means a
client cancelled, but in practice this seems less likely to break down whereas
client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Sep 29, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Sep 29, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Sep 30, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Sep 30, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Nov 18, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this issue Nov 18, 2020
Before this patch, a client which cancels the context for a watch results in the
server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So, the watch server has misidentified the error as a server error and then
propagates the mistake to metrics, leading to a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption is broken when the context was canceled
because of a client stream cancelation.

The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.

To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be taken separately as no matter which paths wins, we can detect a
generic cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants