Slow heartbeats make the cluster crash, and it cannot recover by restart #45929
More logs below.

hello?
I'm sorry to hear that you're having a bad experience. Slow heartbeats and node liveness failures can occur when a cluster is overloaded. In 19.1 and earlier, CPU overload could result in this sort of node liveness failure. With #39172 we separated network traffic for these critical system ranges, which seems to mitigate the problem somewhat. IO saturation can also cause problems. It is generally bad practice to run a database at the point of resource saturation, though I'll admit that more graceful degradation would be great.

My concrete recommendation would be to lower the data rate until things are happy and then experiment to figure out the capacity of the cluster for your workload. In the face of overload, cockroach provides you the ability to scale out. We would love to continue to increase the robustness of the cluster to overload. Understanding your workload will help us to learn how to do that.

Can you provide more information about the hardware on which this cluster is deployed, the resource utilization on the servers while this is running, as well as the data rate and shape that you're pushing? What sort of storage device are you using? Would you be willing to collect screenshots of the admin UI dashboards while problems are occurring (Storage, Hardware, Runtime, Distributed, Replication)?
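As a concrete way to act on the "lower the data rate" suggestion, here is a minimal client-side throttling sketch using Go and the golang.org/x/time/rate package. It is only an illustration, not something from this thread: the connection string, the kv table, and the 500 writes/sec limit are placeholders to be tuned against your own workload while watching the dashboards mentioned above.

```go
package main

import (
    "context"
    "database/sql"
    "log"

    _ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the same protocol.
    "golang.org/x/time/rate"
)

func main() {
    // Placeholder DSN; point it at one of your own nodes or a load balancer.
    db, err := sql.Open("postgres", "postgresql://root@127.0.0.1:26257/defaultdb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Start well below the rate that caused liveness failures, then raise it
    // gradually while watching the admin UI dashboards.
    limiter := rate.NewLimiter(rate.Limit(500), 100) // ~500 writes/sec, burst of 100

    ctx := context.Background()
    for i := 0; ; i++ {
        if err := limiter.Wait(ctx); err != nil {
            log.Fatal(err)
        }
        // "kv" is a hypothetical table; substitute your own schema.
        if _, err := db.ExecContext(ctx, "UPSERT INTO kv (k, v) VALUES ($1, $2)", i, "payload"); err != nil {
            log.Printf("write failed: %v", err)
        }
    }
}
```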
@ajwerner |
Sorry, I think I missed something earlier. There's something weird going on in those logs with "duplicate addresses". Have you swapped nodes' storage and IP addresses in any way? Are the nodes being passed the correct
@ajwerner
and the last part of the log is as follows:
and the start shell command is as follows:
I restarted the 5 nodes in the cluster using the join IPs instead of DNS, and the problem still occurs.
I'm not sure what's going on and likely won't have a lot of time to investigate your situation in the coming days. I encourage you to look closely at the network addresses of all of the servers. Are you connecting anything through a load balancer? Consider mapping out all of the servers, their IP addresses, and the flags you are passing to each of them at startup.
@ajwerner
I have a question: in what situation can the r1 range not get a leaseholder?
This seems due to the nodes not being able to communicate at startup. Providing a load balancer as the join address can be dangerous. I'm sorry to hear that using the individual IPs still did not work. When you did this, were the logs the same about closing the connection? Perhaps consider 1) backing up the store data for each node just to be safe, and 2) using the same servers to start a new cluster but with a different store directory on each node to try to debug the networking situation.
@ajwerner We added debug logging, and we found that the problem may be in raft or the range.
Do you have any methods or knobs to make the cluster use only the raft leader (not the raft leaseholder)?
You need to get a quorum on the node liveness range to bootstrap leases to the other ranges. It seems like n4 is the leader and has the most up-to-date log for the range you showed. I'm not certain that this is the node liveness range. Try starting that node first. If you collect a debug zip and provide access to the source code (it seems that there are some modifications), it is possible that I can try to help better understand the situation, but no promises.
@ajwerner
I think the problem is,
When I restart the cluster, I hope it can recover by itself. But...
I'm not exactly sure what you mean by "node liveness raft lease depends on node liveness". Most ranges indeed use node-liveness-based leases, but the node liveness range itself uses a different, expiration-based leasing mechanism.
I mean the node liveness update needs to go through a txn, the txn needs to go through raft, and that needs a raft leaseholder.
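To make the exchange above concrete, here is a self-contained sketch of the two leasing mechanisms. The types and names are simplified illustrations, not CockroachDB's actual API (the real logic appears in the leaseStatus code quoted later in this thread): epoch-based leases are only usable while the holder's liveness record is live with a matching epoch, whereas the node liveness range itself uses an expiration-based lease that carries its own expiration, which is what avoids the circular dependency described above.

```go
package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins for CockroachDB's lease and liveness records.
type LeaseType int

const (
    LeaseExpiration LeaseType = iota // used by the node liveness range itself
    LeaseEpoch                       // used by ordinary ranges
)

type Liveness struct {
    Epoch      int64
    Expiration time.Time
}

type Lease struct {
    Type       LeaseType
    Expiration time.Time // only meaningful for expiration-based leases
    Epoch      int64     // only meaningful for epoch-based leases
}

// leaseValid mirrors the shape of the leaseStatus code below: expiration-based
// leases carry their own expiration, while epoch-based leases borrow the
// expiration from the holder's liveness record.
func leaseValid(l Lease, liveness *Liveness, now time.Time) bool {
    switch l.Type {
    case LeaseExpiration:
        // No liveness lookup needed, so the liveness range can re-acquire
        // its own lease even when no liveness records are readable yet.
        return now.Before(l.Expiration)
    case LeaseEpoch:
        // Needs a readable liveness record with a matching epoch, which in
        // turn needs the liveness range to be available.
        return liveness != nil && liveness.Epoch == l.Epoch && now.Before(liveness.Expiration)
    }
    return false
}

func main() {
    now := time.Now()
    livenessLease := Lease{Type: LeaseExpiration, Expiration: now.Add(9 * time.Second)}
    dataLease := Lease{Type: LeaseEpoch, Epoch: 3}

    // Even with no liveness record readable yet, the expiration-based lease works.
    fmt.Println("liveness range lease valid:", leaseValid(livenessLease, nil, now))
    // The epoch-based lease stays unusable until liveness is back.
    fmt.Println("data range lease valid (no liveness):", leaseValid(dataLease, nil, now))
    fmt.Println("data range lease valid (live record):",
        leaseValid(dataLease, &Liveness{Epoch: 3, Expiration: now.Add(4 * time.Second)}, now))
}
```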
@ajwerner I hacked some code; the main idea has 2 points.
Hacked code:
func (nl *NodeLiveness) heartbeatInternal(
    ctx context.Context, liveness *storagepb.Liveness, incrementEpoch bool,
) error {
    // ......
    {
        maxOffset := nl.clock.MaxOffset()
        if maxOffset == timeutil.ClocklessMaxOffset {
            maxOffset = 0
        }
        update.Expiration = hlc.LegacyTimestamp(
            nl.clock.Now().Add((nl.livenessThreshold + maxOffset).Nanoseconds(), 0))
        // This guards against the system clock moving backwards. As long
        // as the cockroach process is running, checks inside hlc.Clock
        // will ensure that the clock never moves backwards, but these
        // checks don't work across process restarts.
        if liveness != nil && update.Expiration.Less(liveness.Expiration) {
            // yxj hack
            // return errors.Errorf("proposed liveness update expires earlier than previous record")
        }
    }
}
func (nl *NodeLiveness) GetLiveness(nodeID roachpb.NodeID) (*storagepb.Liveness, error) {
    nl.mu.Lock()
    defer nl.mu.Unlock()
    // yxj hack
    // return nl.getLivenessLocked(nodeID)
    return &storagepb.Liveness{
        NodeID:     nodeID,
        Epoch:      500,
        Expiration: hlc.LegacyTimestamp(hlc.MaxTimestamp),
    }, nil
}

2. src/github.com/cockroachdb/cockroach/pkg/storage/store_pool.go

func MakeStorePoolNodeLivenessFunc(nodeLiveness *NodeLiveness) NodeLivenessFunc {
    return func(nodeID roachpb.NodeID, now time.Time, threshold time.Duration) storagepb.NodeLivenessStatus {
        // yxj hack
        //liveness, err := nodeLiveness.GetLiveness(nodeID)
        //if err != nil {
        //	return storagepb.NodeLivenessStatus_UNAVAILABLE
        //}
        //return liveness.LivenessStatus(now, threshold, nodeLiveness.clock.MaxOffset())
        return storagepb.NodeLivenessStatus_LIVE
    }
}

3. src/github.com/cockroachdb/cockroach/pkg/storage/storagepb/liveness.go

// yxj hack
// "github.com/cockroachdb/cockroach/pkg/util/timeutil"
func (l *Liveness) IsLive(now hlc.Timestamp, maxOffset time.Duration) bool {
// yxj hack
//if maxOffset == timeutil.ClocklessMaxOffset {
// // When using clockless reads, we're live without a buffer period.
// maxOffset = 0
//}
//expiration := hlc.Timestamp(l.Expiration).Add(-maxOffset.Nanoseconds(), 0)
//return now.Less(expiration)
return true
}
func (l *Liveness) IsDead(now hlc.Timestamp, threshold time.Duration) bool {
// yxj hack
//deadAsOf := hlc.Timestamp(l.Expiration).GoTime().Add(threshold)
//return !now.GoTime().Before(deadAsOf)
return false
}
func (l *Liveness) LivenessStatus(
now time.Time, threshold, maxOffset time.Duration,
) NodeLivenessStatus {
// yxj hack
//nowHlc := hlc.Timestamp{WallTime: now.UnixNano()}
//if l.IsDead(nowHlc, threshold) {
// if l.Decommissioning {
// return NodeLivenessStatus_DECOMMISSIONED
// }
// return NodeLivenessStatus_DEAD
//}
//if l.Decommissioning {
// return NodeLivenessStatus_DECOMMISSIONING
//}
//if l.Draining {
// return NodeLivenessStatus_UNAVAILABLE
//}
//if l.IsLive(nowHlc, maxOffset) {
// return NodeLivenessStatus_LIVE
//}
//return NodeLivenessStatus_UNAVAILABLE
return NodeLivenessStatus_LIVE
} 4.src/github.com/cockroachdb/cockroach/pkg/storage/replica_range_lease.go func (r *Replica) leaseStatus(
    lease roachpb.Lease, timestamp, minProposedTS hlc.Timestamp,
) storagepb.LeaseStatus {
    status := storagepb.LeaseStatus{Timestamp: timestamp, Lease: lease}
    var expiration hlc.Timestamp
    if lease.Type() == roachpb.LeaseExpiration {
        expiration = lease.GetExpiration()
        // yxj hack
        expiration = hlc.MaxTimestamp
    } else {
        var err error
        status.Liveness, err = r.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
        // yxj hack
        status.Liveness.Epoch = lease.Epoch
        if err != nil || status.Liveness.Epoch < lease.Epoch {
            // If lease validity can't be determined (e.g. gossip is down
            // and liveness info isn't available for owner), we can neither
            // use the lease nor do we want to attempt to acquire it.
            if err != nil {
                if leaseStatusLogLimiter.ShouldLog() {
                    log.Warningf(context.TODO(), "can't determine lease status due to node liveness error: %s", err)
                }
            }
            status.State = storagepb.LeaseState_ERROR
            return status
        }
        if status.Liveness.Epoch > lease.Epoch {
            status.State = storagepb.LeaseState_EXPIRED
            return status
        }
        expiration = hlc.Timestamp(status.Liveness.Expiration)
    }
    maxOffset := r.store.Clock().MaxOffset()
    if maxOffset == timeutil.ClocklessMaxOffset {
        // No stasis when using clockless reads.
        maxOffset = 0
    }
    stasis := expiration.Add(-int64(maxOffset), 0)
    if timestamp.Less(stasis) {
        status.State = storagepb.LeaseState_VALID
        // If the replica owns the lease, additionally verify that the lease's
        // proposed timestamp is not earlier than the min proposed timestamp.
        // yxj hack
        //if lease.Replica.StoreID == r.store.StoreID() &&
        //	lease.ProposedTS != nil && lease.ProposedTS.Less(minProposedTS) {
        //	status.State = storagepb.LeaseState_PROSCRIBED
        //}
    } else if timestamp.Less(expiration) {
        status.State = storagepb.LeaseState_STASIS
    } else {
        status.State = storagepb.LeaseState_EXPIRED
    }
    return status
}

Now, the world is quiet......
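A note of caution on the store_pool.go hack above, illustrated with a self-contained toy (simplified types, not CockroachDB code): the liveness function handed to the store pool is what lets the allocator notice dead stores, so hardcoding NodeLivenessStatus_LIVE means a node that has genuinely failed is never reported as dead and its replicas are never recovered elsewhere.

```go
package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins; the real types live in pkg/storage and storagepb.
type NodeLivenessStatus int

const (
    NodeLivenessStatus_LIVE NodeLivenessStatus = iota
    NodeLivenessStatus_DEAD
)

type NodeLivenessFunc func(nodeID int, now time.Time, threshold time.Duration) NodeLivenessStatus

// deadNodes is a toy version of what the store pool / allocator does with the
// liveness function: anything reported DEAD becomes a candidate for having its
// replicas moved elsewhere.
func deadNodes(nodes []int, fn NodeLivenessFunc, threshold time.Duration) []int {
    var dead []int
    now := time.Now()
    for _, id := range nodes {
        if fn(id, now, threshold) == NodeLivenessStatus_DEAD {
            dead = append(dead, id)
        }
    }
    return dead
}

func main() {
    nodes := []int{1, 2, 3, 4, 5}

    // Equivalent of the hacked MakeStorePoolNodeLivenessFunc: every node is LIVE.
    hacked := func(int, time.Time, time.Duration) NodeLivenessStatus {
        return NodeLivenessStatus_LIVE
    }

    // With the hack in place, a node that has actually died is never reported,
    // so the cluster would never up-replicate away from it.
    fmt.Println("dead nodes seen by allocator:", deadNodes(nodes, hacked, 5*time.Minute))
}
```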
@ajwerner
Happy to hear you got it recovered. Not sure what happened. I don't think this issue is readily actionable, so I'm going to close it. If you find an issue with an easier reproduction, please open a new issue.
@yangxuanjia How did you manage to bring the cluster back online? I'm experiencing the same issue as you.
Describe the problem
Slow heartbeats make the cluster crash, and it cannot recover by restart.
To Reproduce
Expected behavior
ok
Additional data / screenshots
Environment:
Additional context
We can't resolve it. We restarted the cluster, but the impact still exists.
The first range can't be found, gossip returns nothing, RPCs return errors, and the web UI can't fetch any data.