Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Upgrade failure in master from b250-> b332 due to regression #18266

Closed
1 task done
kripasreenivasan opened this issue Jul 17, 2023 · 0 comments
Closed
1 task done
Assignees
Labels
kind/bug This issue is a bug long_running_universe priority/highest Highest priority issue

Comments

@kripasreenivasan
Copy link
Contributor

kripasreenivasan commented Jul 17, 2023

Jira Link: DB-7266

Description

Master universe upgrade failed from b250-> b332
A regression perhaps caused by the commit- d6d0cf6

from the core file

 frame #1: 0x000055a955c97a78 yb-tserver`yb::tablet::TransactionStatusResolver::Impl::StatusReceived(yb::Status, yb::tserver::GetTransactionStatusResponsePB const&, int) [inlined] google::protobuf::RepeatedPtrField<yb::AppStatusPB>::Get(this=0x0000050c7505a4a0, index=0) const at repeated_field.h:1989:32
    frame #2: 0x000055a955c97a78 yb-tserver`yb::tablet::TransactionStatusResolver::Impl::StatusReceived(yb::Status, yb::tserver::GetTransactionStatusResponsePB const&, int) [inlined] yb::tserver::GetTransactionStatusResponsePB::deadlock_reason(this=0x0000050c7505a430, index=0) const at tserver_service.pb.h:7631:27
    frame #3: 0x000055a955c97a78 yb-tserver`yb::tablet::TransactionStatusResolver::Impl::StatusReceived(this=0x0000050c7c8a5300, status=<unavailable>, response=0x0000050c7505a430, request_size=<unavailable>) at transaction_status_resolver.cc:262:20

Code

if (response.deadlock_reason(i).code() != AppStatusPB::OK) {
        // response contains a deadlock specific error.
        status_info.expected_deadlock_status = StatusFromPB(response.deadlock_reason(i));
}

The code does not check the size of response.deadlock_reason and tries to access it by an index unconditionally. However the structure is empty for this case-

(lldb) print response
(const yb::tserver::GetTransactionStatusResponsePB) $1 = {
  _internal_metadata_ = {
    google::protobuf::internal::InternalMetadataWithArenaBase<google::protobuf::UnknownFieldSet, google::protobuf::internal::InternalMetadataWithArena> = (ptr_ = 0x0000000000000000)
  }
  _has_bits_ = {
    has_bits_ = ([0] = 2)
  }
  _cached_size_ = 0
  status_ = {
    current_size_ = 1
    total_size_ = 4
    rep_ = 0x0000050c74514be0
  }
  status_hybrid_time_ = {
    current_size_ = 1
    total_size_ = 4
    rep_ = 0x0000050c7ebfb0c0
  }
  num_replicated_batches_ = {
    current_size_ = 0
    total_size_ = 0
    rep_ = nullptr
  }
  coordinator_safe_time_ = {
    current_size_ = 1
    total_size_ = 4
    rep_ = 0x0000050c7eba9080
  }
  aborted_subtxn_set_ = {
    google::protobuf::internal::RepeatedPtrFieldBase = {
      arena_ = nullptr
      current_size_ = 1
      total_size_ = 4
      rep_ = 0x0000050c7e1ad4c0
    }
  }
  deadlock_reason_ = {
    google::protobuf::internal::RepeatedPtrFieldBase = {
      arena_ = nullptr
      current_size_ = 0
      total_size_ = 0
      rep_ = nullptr
    }
  }
  error_ = nullptr
  propagated_hybrid_time_ = 6920587650673926144
}

response.deadlock_reason can be empty when the response is received from a tablet server, which is not yet updated and that tablet server knows nothing about deadlock_reason.

Thanks, @arybochkin !

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@kripasreenivasan kripasreenivasan added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jul 17, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 17, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jul 18, 2023
basavaraj29 added a commit that referenced this issue Jul 18, 2023
…sponsePB proto message due to recent regression

Summary:
Commit d6d0cf6 introduced a new field `deadlock_reason` in the proto message `GetTransactionStatusResponsePB`. The commit didn't handle rolling upgrade scenarios gracefully. It expects that
`resp.deadlock_reason().size() == resp.status().size()` will always be true, which isn't the case. If the transaction coordinator is still on an older version, `deadlock_reason` field isn't populated. Accessing the field without checking
the size is leading to a segv fault.

This diff fixes the issue by explicitly check the size of `deadlock_reason` before accessing the field, ensuring that we handle upgrade scenarios gracefully.
Jira: DB-3597, DB-7266

Test Plan: Jenkins

Reviewers: rsami, arybochkin, sergei

Reviewed By: rsami, arybochkin

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27023
@yugabyte-ci yugabyte-ci added priority/highest Highest priority issue and removed priority/medium Medium priority issue labels Jul 31, 2023
@yugabyte-ci yugabyte-ci removed the area/docdb YugabyteDB core features label Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This issue is a bug long_running_universe priority/highest Highest priority issue
Projects
None yet
Development

No branches or pull requests

3 participants