jobs: TestJobInfoUpgradeRegressionTests failed #106347

Closed
cockroach-teamcity opened this issue Jul 6, 2023 · 0 comments · Fixed by #106378
Labels
A-disaster-recovery A-jobs branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-jobs

Comments


cockroach-teamcity commented Jul 6, 2023

jobs.TestJobInfoUpgradeRegressionTests failed with artifacts on master @ 818aec861357579eb3a3e987cf5887f3cf112be4:

I230706 22:11:07.395057 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1767  executing bump-cluster-version=1000022.2-77(fence) on nodes n{1}
I230706 22:11:07.404103 16892 server/migration.go:150  [T1,n1,bump-cluster-version] 1768  active cluster version setting is now 1000022.2-77(fence) (up from 1000022.2-76)
I230706 22:11:07.404575 1840 upgrade/upgrademanager/manager.go:657  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1769  executing operation validate-cluster-version=1000022.2-78
I230706 22:11:07.404985 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1770  executing validate-cluster-version=1000022.2-78 on nodes n{1}
I230706 22:11:07.406167 1840 upgrade/upgrademanager/manager.go:657  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1771  executing operation bump-cluster-version=1000022.2-78
I230706 22:11:07.406594 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1772  executing bump-cluster-version=1000022.2-78 on nodes n{1}
I230706 22:11:07.406999 16897 server/migration.go:150  [T1,n1,bump-cluster-version] 1773  active cluster version setting is now 1000022.2-78 (up from 1000022.2-77(fence))
I230706 22:11:07.421796 1840 upgrade/upgrademanager/manager.go:517  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1774  stepping through 1000022.2-80
I230706 22:11:07.421963 1840 upgrade/upgrademanager/manager.go:657  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1775  executing operation bump-cluster-version=1000022.2-79(fence)
I230706 22:11:07.422398 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1776  executing bump-cluster-version=1000022.2-79(fence) on nodes n{1}
I230706 22:11:07.423272 16939 server/migration.go:150  [T1,n1,bump-cluster-version] 1777  active cluster version setting is now 1000022.2-79(fence) (up from 1000022.2-78)
I230706 22:11:07.424778 1840 upgrade/upgrademanager/manager.go:657  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1778  executing operation validate-cluster-version=1000022.2-80
I230706 22:11:07.425030 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1779  executing validate-cluster-version=1000022.2-80 on nodes n{1}
I230706 22:11:07.448976 1840 upgrade/upgrademanager/manager.go:742  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1780  running Upgrade to 1000022.2-80: "backfill the system.job_info table with the payload and progress of each job in the system.jobs table"
I230706 22:11:07.450150 16863 jobs/adopt.go:261  [T1,n1] 1781  job 880184745739550721: resuming execution
I230706 22:11:07.457530 16865 jobs/registry.go:1606  [T1,n1] 1782  MIGRATION job 880184745739550721: stepping through state running
I230706 22:11:07.547964 16865 upgrade/upgrades/backfill_job_info_table_migration.go:81  [T1,n1,job=MIGRATION id=880184745739550721,upgrade=1000022.2-80] 1783  backfilling job_info, step0, batch0 done; resume after 0, done false
I230706 22:11:07.551043 16865 upgrade/upgrades/backfill_job_info_table_migration.go:81  [T1,n1,job=MIGRATION id=880184745739550721,upgrade=1000022.2-80] 1784  backfilling job_info, step0, batch1 done; resume after 880184745739550721, done true
I230706 22:11:07.632088 16865 upgrade/upgrades/backfill_job_info_table_migration.go:81  [T1,n1,job=MIGRATION id=880184745739550721,upgrade=1000022.2-80] 1785  backfilling job_info, step1, batch0 done; resume after 0, done false
I230706 22:11:07.649981 16865 upgrade/upgrades/backfill_job_info_table_migration.go:81  [T1,n1,job=MIGRATION id=880184745739550721,upgrade=1000022.2-80] 1786  backfilling job_info, step1, batch1 done; resume after 880184745739550721, done true
I230706 22:11:07.651325 16865 jobs/registry.go:1606  [T1,n1] 1787  MIGRATION job 880184745739550721: stepping through state succeeded
I230706 22:11:07.662184 1840 jobs/wait.go:145  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1788  waited for 1 [880184745739550721] queued jobs to complete 210.019003ms
I230706 22:11:07.662257 1840 upgrade/upgrademanager/manager.go:657  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1789  executing operation bump-cluster-version=1000022.2-80
I230706 22:11:07.662566 1840 upgrade/upgradecluster/cluster.go:121  [T1,n1,client=127.0.0.1:41314,hostssl,user=root,migration-mgr] 1790  executing bump-cluster-version=1000022.2-80 on nodes n{1}
I230706 22:11:07.662831 17135 server/migration.go:150  [T1,n1,bump-cluster-version] 1791  active cluster version setting is now 1000022.2-80 (up from 1000022.2-79(fence))
I230706 22:11:07.667852 1840 util/log/event_log.go:32  [T1,n1,client=127.0.0.1:41314,hostssl,user=root] 1792 ={"Timestamp":1688681461014926435,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING version = $1","Tag":"SET CLUSTER SETTING","User":"root","PlaceholderValues":["'1000022.2-80'"],"SettingName":"version","Value":"1000022.2-80"}
    job_info_storage_test.go:366: query 'SELECT count(*) FROM crdb_internal.system_jobs WHERE job_type = 'BACKUP'': expected:
        1
        
        got:
        0
        
W230706 22:11:07.756134 17097 kv/kvserver/intentresolver/intent_resolver.go:826  [-] 1793  failed to gc transaction record: could not GC completed transaction anchored at /Table/6/1/"version"/0: node unavailable; try another peer
I230706 22:11:07.756204 900 sql/stats/automatic_stats.go:572  [T1,n1] 1794  quiescing auto stats refresher
I230706 22:11:07.756382 10921 jobs/registry.go:1606  [T1,n1] 1795  KEY VISUALIZER job 100: stepping through state succeeded
W230706 22:11:07.758610 10921 jobs/adopt.go:531  [T1,n1] 1796  could not clear job claim: clear-job-claim: failed to send RPC: sending to all replicas failed; last error: ba: Scan [/Table/15/1/100,/Table/15/1/101), [txn: cac76053], [can-forward-ts] RPC error: node unavailable; try another peer
I230706 22:11:07.759080 901 sql/stats/automatic_stats.go:624  [T1,n1] 1797  quiescing stats garbage collector
I230706 22:11:07.759309 373 server/start_listen.go:103  [T1,n1] 1798  server shutting down: instructing cmux to stop accepting
I230706 22:11:07.762217 9363 jobs/registry.go:1606  [T1,n1] 1799  AUTO SPAN CONFIG RECONCILIATION job 880184732354183169: stepping through state succeeded
W230706 22:11:07.762427 11268 jobs/adopt.go:531  [T1,n1] 1800  could not clear job claim: clear-job-claim: node unavailable; try another peer
W230706 22:11:07.762529 650 sql/sqlliveness/slinstance/slinstance.go:334  [T1,n1] 1801  exiting heartbeat loop
I230706 22:11:07.762669 977 jobs/registry.go:1606  [T1,n1] 1802  AUTO SPAN CONFIG RECONCILIATION job 880184715132862465: stepping through state succeeded
W230706 22:11:07.764785 9363 jobs/adopt.go:531  [T1,n1] 1803  could not clear job claim: clear-job-claim: node unavailable; try another peer
W230706 22:11:07.764876 650 sql/sqlliveness/slinstance/slinstance.go:321  [T1,n1] 1804  exiting heartbeat loop with error: node unavailable; try another peer
E230706 22:11:07.765004 650 server/server_sql.go:514  [T1,n1] 1805  failed to run update of instance with new session ID: node unavailable; try another peer
E230706 22:11:07.765174 977 jobs/registry.go:1004  [T1,n1] 1806  error getting live session: node unavailable; try another peer
I230706 22:11:07.768845 58 server/server_controller_orchestration.go:263  [T1,n1] 1807  server controller shutting down ungracefully
I230706 22:11:07.769028 58 server/server_controller_orchestration.go:274  [T1,n1] 1808  waiting for tenant servers to report stopped
W230706 22:11:07.769212 58 server/server_sql.go:1712  [T1,n1] 1809  server shutdown without a prior graceful drain
--- FAIL: TestJobInfoUpgradeRegressionTests (9.81s)
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/jobs @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29520

@cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-disaster-recovery T-jobs labels Jul 6, 2023
@cockroach-teamcity added this to the 23.2 milestone Jul 6, 2023
@stevendanna added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 7, 2023
@stevendanna self-assigned this Jul 7, 2023
craig bot pushed a commit that referenced this issue Jul 7, 2023
106236: admission: avoid recursive grant chain r=irfansharif a=irfansharif

Fixes #105185.
Fixes #105613.

In #97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one request, during post-hoc token adjustments, if we observed that a previously exhausted granter was no longer exhausted, we'd try to grant again. This resulted in admitting work recursively, with a call stack as deep as the admit chain.

Not only is that undesirable design-wise, it also triggered panics in the granter, which wasn't expecting more than one request to be admitted. Each recursive grant incremented the grant chain index, which overflowed (it is an int8, so this happened readily with long enough admit chains), after which we panicked when using the now-negative index to access an array.
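
For illustration, here is a minimal, self-contained Go sketch of that failure mode; the names (`granter`, `grantChainIndex`, `tryGrant`) and the recursion structure are hypothetical stand-ins, not the actual admission-control code:
```go
package main

import "fmt"

// granter is a hypothetical stand-in for the real token granter. The key
// detail matches the bug description: the grant chain index is an int8.
type granter struct {
	grantChainIndex int8     // wraps to -128 after 127 increments
	perIndexGrants  [128]int // indexed by grantChainIndex
	queued          int      // number of waiting requests
}

// tryGrant admits one request, then (the bug) recursively grants again
// instead of admitting exactly one request per granted() call.
func (g *granter) tryGrant() {
	if g.queued == 0 {
		return
	}
	g.queued--
	g.grantChainIndex++                   // overflows on a long admit chain
	g.perIndexGrants[g.grantChainIndex]++ // panics once the index is negative
	g.tryGrant()                          // recursive re-grant
}

func main() {
	defer func() {
		// Prints: recovered: runtime error: index out of range [-128]
		fmt.Println("recovered:", recover())
	}()
	g := &granter{queued: 200}
	g.tryGrant()
}
```
The fix amounts to breaking this recursion so that each granted() call admits exactly one request; that is what the `canGrantAnother` flag in the diff below controls.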

We add a test that fails without the changes. The failure can also be reproduced by applying the diff below (which reverts to the older, buggy behavior) and then running:
```
dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs
```

```diff
diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go
index ba42213c375..7c526fbb3d8 100644
--- i/pkg/util/admission/granter.go
+++ w/pkg/util/admission/granter.go
@@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone(
 func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked(
        originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo,
 ) (additionalTokens int64) {
-       return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */)
+       return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */)
 }
```
Release note: None

106378: upgrades: fix txn retry bug in upgrade batching r=adityamaru a=stevendanna

In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated.

This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column.

Here, we apply the same fix that we applied in #104752 for the same type of bug.

Fixes #106347
Fixes #106246

Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result in some job records having no job_type value.
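
For illustration, a minimal Go sketch of this class of bug and the shape of the fix; `retryTxn`, `pickBatch`, and the surrounding names are hypothetical stand-ins, not the actual upgrade or transaction APIs. The essential point is that a transaction runner may invoke its closure several times before committing, so loop state such as resumeAfter must only advance after the transaction has actually committed:
```go
package main

import "fmt"

type jobID int64

// retryTxn simulates a retrying transaction runner: the first attempt is
// discarded, as if it hit a retryable error and rolled back, and only the
// second attempt "commits".
func retryTxn(fn func() error) error {
	_ = fn() // rolled-back attempt
	return fn()
}

// pickBatch returns up to n ids strictly greater than after, standing in
// for the batched scan of system.jobs.
func pickBatch(ids []jobID, after jobID, n int) []jobID {
	var out []jobID
	for _, id := range ids {
		if id > after && len(out) < n {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	ids := []jobID{10, 20, 30, 40}
	const batchSize = 2

	var resumeAfter jobID
	done := false
	for !done {
		// The buggy version assigned to resumeAfter inside the closure, so
		// a rolled-back attempt still advanced the cursor and the retry
		// skipped rows that attempt never committed. The fix: write to
		// locals inside the closure and publish them afterwards.
		var next jobID
		var nextDone bool
		_ = retryTxn(func() error {
			batch := pickBatch(ids, resumeAfter, batchSize)
			fmt.Println("attempt processed:", batch)
			if len(batch) > 0 {
				next = batch[len(batch)-1]
			}
			nextDone = len(batch) < batchSize
			return nil
		})
		// Advance the cursor only once the transaction has committed.
		resumeAfter, done = next, nextDone
	}
}
```
Under this sketch's assumptions, every retried attempt recomputes the same batch from the unchanged resumeAfter, so no rows are skipped.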

106408: ci: remove `lint` job from GitHub CI r=rail a=rickystewart

With `staticcheck` and `unused` now working identically under `lint` in Bazel and `make`, it's time! Delete this file so that the GitHub CI lint job stops running. This is the *last* GitHub CI job. :) Now only Bazel builds and tests will run on PRs.

Epic: CRDB-15060
Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
craig bot closed this as completed in a7356bc Jul 7, 2023
blathers-crl bot pushed a commit that referenced this issue Jul 7, 2023