roachtest: flake in acceptance/version-upgrade #55497

Closed
knz opened this issue Oct 13, 2020 · 2 comments · Fixed by #55524

@knz
Contributor

knz commented Oct 13, 2020

Found here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests/2360683?

Error message:

pq: internal error: StartableJob 598055747819634691 cannot be started without sqlliveness session

cc @ajwerner @lucy-zhang for triage.

@blathers-crl

blathers-crl bot commented Oct 13, 2020

Hi @knz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz knz added C-test-failure Broken test (automatically or manually discovered). branch-master Failures and bugs on the master branch. labels Oct 13, 2020
@ajwerner
Contributor

I'm inclined to consider it a release blocker. The issue is that we're not properly dealing with the mixed-version state. We blocked the start of the sqlliveness subsystem on the version bump, but this means that StartableJobs created in the mixed-version state may not work correctly, as they won't have a session claim when the version flips. For now we're hitting an assertion, which is really bad: the user gets an error, but the job is actually going to run. What we should do instead is kick off the sqlliveness subsystem (creation of the session) as soon as we finish the sqlmigration. That way, StartableJobs created on the new nodes will have a claim and will work fine when the version switches. Fortunately this is easy to accomplish.
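To make the ordering problem concrete, here is a minimal toy model in Go. It is only a sketch of the reasoning in the comment above, not CockroachDB code: the `node` and `job` types, their fields, and the `run` helper are all hypothetical, and the check in `startJob` is a simplified stand-in for the assertion that produced the error in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSession stands in for the assertion failure reported in this issue.
var errNoSession = errors.New("cannot be started without sqlliveness session")

// node is a toy stand-in for a SQL server during an upgrade.
// All names are hypothetical and do not mirror CockroachDB's real types.
type node struct {
	versionActive  bool   // the cluster version has been bumped
	sessionID      string // empty until the liveness subsystem has started
	startAfterBump bool   // true: gate liveness on the version bump; false: start it after migrations
}

// job records which session, if any, claimed it at creation time.
type job struct {
	id      int64
	claimBy string
}

// startLiveness creates the session if it does not exist yet.
func (n *node) startLiveness() {
	if n.sessionID == "" {
		n.sessionID = "session-1"
	}
}

// finishMigrations models the end of the migration that liveness relies on.
func (n *node) finishMigrations() {
	if !n.startAfterBump {
		n.startLiveness()
	}
}

// bumpVersion models the cluster version flipping to the new version.
func (n *node) bumpVersion() {
	n.versionActive = true
	if n.startAfterBump {
		n.startLiveness()
	}
}

// createJob claims the job with whatever session exists right now.
func (n *node) createJob(id int64) *job {
	return &job{id: id, claimBy: n.sessionID}
}

// startJob fails if the new version is active but the job has no claim.
func (n *node) startJob(j *job) error {
	if n.versionActive && j.claimBy == "" {
		return fmt.Errorf("StartableJob %d %w", j.id, errNoSession)
	}
	return nil
}

// run creates a job in the mixed-version window and starts it after the flip.
func run(startAfterBump bool) error {
	n := &node{startAfterBump: startAfterBump}
	n.finishMigrations()
	j := n.createJob(1) // created while the cluster is still mixed-version
	n.bumpVersion()
	return n.startJob(j)
}

func main() {
	fmt.Println("liveness gated on version bump:", run(true))
	fmt.Println("liveness started after migrations:", run(false))
}
```

In this model, a job created between the migration and the version flip only has a claim when the session is created right after the migration, which is the ordering proposed above.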

@knz knz added branch-release-20.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 13, 2020
@ajwerner ajwerner removed branch-release-20.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 13, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Oct 14, 2020
This commit reworks the SQL liveness subsystem to start on the local node
following the migrations that it relies upon. This is both safe and ensures
that jobs created on the new binary will have a session ID, thus avoiding
the issue that startable jobs created on the local node may not have had
a session ID and thus would not have had a proper claim to run the job
locally.

Fixes cockroachdb#55497.

Release note (bug fix): Fixes a bug which could cause IMPORT, BACKUP or
RESTORE to experience an error when they occur concurrently to when the
cluster sets its version to upgraded.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Oct 21, 2020
craig bot pushed a commit that referenced this issue Oct 21, 2020
55524: jobs,sqlliveness,server: start sqlliveness and pgserver after migrations r=spaskob a=ajwerner

This commit reworks the SQL liveness subsystem to start on the local node
following the migrations that it relies upon. This is both safe and ensures
that jobs created on the new binary will have a session ID, thus avoiding
the issue that startable jobs created on the local node may not have had
a session ID and thus would not have had a proper claim to run the job
locally.

Fixes #55497.

Release note (bug fix): Fixes a bug which could cause IMPORT, BACKUP or
RESTORE to experience an error when they occur concurrently to when the
cluster sets its version to upgraded.

55775: roachprod: add availability zone and network disk type support to azure r=miretskiy a=arulajmani

User facing change:
- New flag `--azure-availability-zone` lets the user supply which zone
a VM should be provisioned in.
- New flag `--azure-network-disk-type` lets the user choose which type
of network disk to use (ultra-disk, premium-ssd)

Previously, we didn't allow users to decide which availability zone a
VM was provisioned in. To achieve this, this patch adds support for
Network Security Groups. This was required as we had to update the
IP SKU from Basic (which is now deprecated) to Standard. By default,
this SKU blocks all inbound/outbound communication, which needed to be
overridden by creating network security rules that allow HTTPS/SSH/HTTP
connections. Network Security Groups are created under the resource
groups that manage VNets for a particular location the first time a
VNet is created.

Release note: None

55820: roachtest: update 20.2 predecessor version to 20.1.8 and create fixtures r=jlinder a=asubiotto

Release note: None

55824: cloud: update orchestrator configs to point to v20.1.8 r=jlinder a=asubiotto

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
@craig craig bot closed this as completed in e7bb9d0 Oct 21, 2020