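Overview of the Issue
There is a for loop in the vplayer, which applies binlog events streamed from the vstreamer, where we process events and, as we do, update the vreplication lag (vitess/go/vt/vttablet/tabletmanager/vreplication/vplayer.go, lines 485 to 574 in bf0c5f8).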
If the vplayer is throttled for some time, however, we are stuck at the top of that for loop and never reach the bottom of it, where we update the lag value based on the events just processed (vplayer.go, lines 485 to 493 in bf0c5f8).
Because we process no events for as long as we're fully throttled, which can be indefinitely, we don't update the vreplication lag. Say the lag was 0 seconds the last time we processed an event, and we are then fully throttled and unable to process any more events for the next 15 minutes. Neither the system nor the operator is aware of the growing vreplication lag, and then the value suddenly shoots up from 0 seconds to 900 seconds.
This is clearly wrong. It can mean you only become aware of the issue once it's a bigger problem (had you known immediately, you might have chosen to lessen the throttling altogether, for vreplication, or for the vplayer specifically), or it can cause unnecessary concern as the lag unexpectedly fluctuates wildly (perhaps you really do want vreplication to be deferred/throttled).
We currently have code in place that estimates the vreplication lag when we're not receiving any events from the vstreamer, perhaps because we can't communicate with it or because the sender/vstreamer is throttled (vplayer.go, lines 499 to 505 in bf0c5f8).
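The estimate in that branch reduces to simple arithmetic: the current time, minus the source timestamp of the last event seen, minus the clock offset between source and target. The following is a minimal standalone sketch of that formula only. The function name `estimateLagSeconds` and its signature are hypothetical and are not part of vplayer.go.

```go
package main

import (
	"fmt"
	"time"
)

// estimateLagSeconds applies the same arithmetic as the vplayer's
// "no events received" branch. All inputs are in nanoseconds.
// Hypothetical helper for illustration; only the formula comes from vplayer.go.
func estimateLagSeconds(nowNs, lastTimestampNs, timeOffsetNs int64) int64 {
	behind := nowNs - lastTimestampNs - timeOffsetNs
	return behind / 1e9
}

func main() {
	// The last event was applied at noon; 15 minutes pass with nothing applied.
	last := time.Date(2024, 8, 8, 12, 0, 0, 0, time.UTC)
	now := last.Add(15 * time.Minute)
	fmt.Println(estimateLagSeconds(now.UnixNano(), last.UnixNano(), 0)) // prints 900
}
```

This matches the 15 minute example above: with no clock offset, the estimate is 900 seconds.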
We also need to do that when we're throttled. It may be as simple as this:
diff --git a/go/vt/vttablet/tabletmanager/vreplication/vplayer.go b/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
index 31e26c30e8..0444038924 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/vplayer.go
@@ -476,6 +476,12 @@ func (vp *vplayer) recordHeartbeat() error {
 func (vp *vplayer) applyEvents(ctx context.Context, relay *relayLog) error {
 	defer vp.vr.dbClient.Rollback()
 
+	estimateLag := func() {
+		behind := time.Now().UnixNano() - vp.lastTimestampNs - vp.timeOffsetNs
+		vp.vr.stats.ReplicationLagSeconds.Store(behind / 1e9)
+		vp.vr.stats.VReplicationLags.Add(strconv.Itoa(int(vp.vr.id)), time.Duration(behind/1e9)*time.Second)
+	}
+
 	// If we're not running, set ReplicationLagSeconds to be very high.
 	// TODO(sougou): if we also stored the time of the last event, we
 	// can estimate this value more accurately.
@@ -489,6 +495,7 @@ func (vp *vplayer) applyEvents(ctx context.Context, relay *relayLog) error {
 		// Check throttler.
 		if checkResult, ok := vp.vr.vre.throttlerClient.ThrottleCheckOKOrWaitAppName(ctx, throttlerapp.Name(vp.throttlerAppName)); !ok {
 			_ = vp.vr.updateTimeThrottled(throttlerapp.VPlayerName, checkResult.Summary())
+			estimateLag()
 			continue
 		}
 
@@ -499,9 +506,7 @@ func (vp *vplayer) applyEvents(ctx context.Context, relay *relayLog) error {
 		// No events were received. This likely means that there's a network partition.
 		// So, we should assume we're falling behind.
 		if len(items) == 0 {
-			behind := time.Now().UnixNano() - vp.lastTimestampNs - vp.timeOffsetNs
-			vp.vr.stats.ReplicationLagSeconds.Store(behind / 1e9)
-			vp.vr.stats.VReplicationLags.Add(strconv.Itoa(int(vp.vr.id)), time.Duration(behind/1e9)*time.Second)
+			estimateLag()
 		}
 		// Empty transactions are saved at most once every idleTimeout.
 		// This covers two situations:
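To see the effect of calling the estimate on the throttled path in isolation, here is a rough, self-contained simulation of the loop shape. It is a sketch under stated assumptions, not the vplayer code: `lagGauge`, `simulate`, the one-iteration-per-second clock, and the "apply an event stamped now" shortcut are all hypothetical, and it only models what the gauge reports while throttled, not the catch-up afterwards.

```go
package main

import "fmt"

// lagGauge stands in for the ReplicationLagSeconds stat (hypothetical type).
type lagGauge struct{ seconds int64 }

// simulate runs one loop iteration per simulated second and returns the lag
// the gauge reports after each iteration. While throttled[i] is true the
// iteration exits early, as the vplayer loop does on a failed throttler check.
// When estimateWhileThrottled is true, the early exit refreshes the gauge first.
func simulate(throttled []bool, estimateWhileThrottled bool) []int64 {
	const second = int64(1e9)
	var (
		gauge           lagGauge
		lastTimestampNs int64 // source timestamp of the last applied event
		reported        []int64
	)
	for i, isThrottled := range throttled {
		nowNs := int64(i+1) * second
		if isThrottled {
			if estimateWhileThrottled {
				// Same formula as the diff's estimateLag, with a zero clock offset.
				gauge.seconds = (nowNs - lastTimestampNs) / second
			}
			reported = append(reported, gauge.seconds)
			continue
		}
		// Not throttled: assume we apply an event stamped "now", so lag is 0.
		lastTimestampNs = nowNs
		gauge.seconds = 0
		reported = append(reported, gauge.seconds)
	}
	return reported
}

func main() {
	throttled := []bool{false, true, true, true}
	fmt.Println(simulate(throttled, false)) // prints [0 0 0 0]: the gauge goes stale
	fmt.Println(simulate(throttled, true))  // prints [0 1 2 3]: the gauge keeps growing
}
```

Without the estimate the gauge stays at its last value for the whole throttled period; with it, the reported lag grows steadily, which is the behavior the patch is after.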
Binary Version
vtgate version Version: 21.0.0-SNAPSHOT (Git revision 3cfb08c45ec995f347b95cb91a56b36a3c5b6b56 branch 'ws_logger') built on Thu Aug 8 23:36:29 EDT 2024 by matt@pslord.local using go1.22.5 darwin/arm64
Operating System and Environment details
N/A
Log Fragments
No response