Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpcdsvec failed #62520

Closed
cockroach-teamcity opened this issue Mar 24, 2021 · 13 comments · Fixed by #62785
Closed

roachtest: tpcdsvec failed #62520

cockroach-teamcity opened this issue Mar 24, 2021 · 13 comments · Fixed by #62785
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpcdsvec failed on release-21.1@23e7cb53bf5baede071832b59bd92ea8164531a6:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcdsvec/run_1
	tpcdsvec.go:165,tpcdsvec.go:175,test_runner.go:767: compare vectorize=OFF to vectorize=ON:
		(1) secondary error attachment
		  | compare vectorize=ON to vectorize=OFF:
		  | (1)
		  | Wraps: (2) compare vectorize=ON to vectorize=OFF:
		  |   | unexpected diff:
		  |   |   []interface{}(Inverse(func1, []interface{}{
		  |   |   	s"23298",
		  |   | - 	s"0.25",
		  |   | + 	s"0.49",
		  |   |   	s"36",
		  |   |   	s"5.29",
		  |   |   	s"3.44",
		  |   | - 	s"146",
		  |   | + 	s"73",
		  |   | - 	s"63.66",
		  |   | + 	s"31.83",
		  |   | - 	s"69.92",
		  |   | + 	s"34.96",
		  |   |   }))
		  | Error types: (1) *fmt.wrapError (2) *errors.errorString
		Wraps: (2)
		Wraps: (3) compare vectorize=OFF to vectorize=ON:
		  | unexpected diff:
		  |   []interface{}(Inverse(func1, []interface{}{
		  |   	s"23298",
		  | - 	s"0.49",
		  | + 	s"0.25",
		  |   	s"36",
		  |   	s"5.29",
		  |   	s"3.44",
		  | - 	s"73",
		  | + 	s"146",
		  | - 	s"31.83",
		  | + 	s"63.66",
		  | - 	s"34.96",
		  | + 	s"69.92",
		  |   }))
		Error types: (1) *secondary.withSecondaryError (2) *fmt.wrapError (3) *errors.errorString

More

Artifacts: /tpcdsvec
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2021
@yuzefovich
Copy link
Member

Query 78 returned incorrect results both with and without stats. I don't think it is a beta blocker, so I'll remove the corresponding label and put GA blocker on before further triage.

@yuzefovich yuzefovich added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 25, 2021
@cockroach-teamcity
Copy link
Member Author

(roachtest).tpcdsvec failed on release-21.1@da3c3dcccd18129bcd963eeca66bc063fc8b9756:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcdsvec/run_1
	tpcdsvec.go:165,tpcdsvec.go:175,test_runner.go:767: compare vectorize=OFF to vectorize=ON:
		(1) secondary error attachment
		  | compare vectorize=ON to vectorize=OFF:
		  | (1)
		  | Wraps: (2) compare vectorize=ON to vectorize=OFF:
		  |   | unexpected diff:
		  |   |   []interface{}(Inverse(func1, []interface{}{
		  |   |   	s"23298",
		  |   | - 	s"0.25",
		  |   | + 	s"0.49",
		  |   |   	s"36",
		  |   |   	s"5.29",
		  |   |   	s"3.44",
		  |   | - 	s"146",
		  |   | + 	s"73",
		  |   | - 	s"63.66",
		  |   | + 	s"31.83",
		  |   | - 	s"69.92",
		  |   | + 	s"34.96",
		  |   |   }))
		  | Error types: (1) *fmt.wrapError (2) *errors.errorString
		Wraps: (2)
		Wraps: (3) compare vectorize=OFF to vectorize=ON:
		  | unexpected diff:
		  |   []interface{}(Inverse(func1, []interface{}{
		  |   	s"23298",
		  | - 	s"0.49",
		  | + 	s"0.25",
		  |   	s"36",
		  |   	s"5.29",
		  |   	s"3.44",
		  | - 	s"73",
		  | + 	s"146",
		  | - 	s"31.83",
		  | + 	s"63.66",
		  | - 	s"34.96",
		  | + 	s"69.92",
		  |   }))
		Error types: (1) *secondary.withSecondaryError (2) *fmt.wrapError (3) *errors.errorString

More

Artifacts: /tpcdsvec
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Copy link
Member

Hm, this issue is a bit puzzling. The last two nightly runs failed with the same problem (for some reason vectorize=on returned the value for ws_qty doubled). I'm not able to repro on a single or a 3 node local cluster, starting up the 3 node roachprod cluster.

It appears as if the buggy commit is introduced on Mar 23, 2021 because multiple previous runs were successful.

Alright, I can repro it on the roachprod cluster, looking.

@yuzefovich
Copy link
Member

yuzefovich commented Mar 25, 2021

Hm, I'm relatively confident that somehow #62365 is to blame (I tried several times before and after this patch; scattering the tables, and "before" I didn't see an incorrect answer in multiple runs whereas while "after" I often saw the incorrect result). I'm quite puzzled how that change would have an effect on correctness, more likely we have a bug lurking somewhere that got exposed.

@yuzefovich
Copy link
Member

Hm, my current hypothesis is that it has something to do with nulls. Somewhat reduced query that reproduces the issue for me is

WITH
  ws
    AS (
      SELECT
        d_year AS ws_sold_year, ws_item_sk, ws_bill_customer_sk AS ws_customer_sk, sum(ws_quantity) AS ws_qty
      FROM
        web_sales
        LEFT JOIN web_returns ON wr_order_number = ws_order_number AND ws_item_sk = wr_item_sk
        JOIN date_dim ON ws_sold_date_sk = d_date_sk
      WHERE
        wr_order_number IS NULL
      GROUP BY
        d_year, ws_item_sk, ws_bill_customer_sk
    ),
  ss
    AS (
      SELECT
        d_year AS ss_sold_year, ss_item_sk, ss_customer_sk, sum(ss_quantity) AS ss_qty
      FROM
        store_sales
        LEFT JOIN store_returns ON sr_ticket_number = ss_ticket_number AND ss_item_sk = sr_item_sk
        JOIN date_dim ON ss_sold_date_sk = d_date_sk
      WHERE
        sr_ticket_number IS NULL
      GROUP BY
        d_year, ss_item_sk, ss_customer_sk
    )
SELECT
  ss_customer_sk, ws_qty
FROM
  ss LEFT JOIN ws ON ws_sold_year = ss_sold_year AND ws_item_sk = ss_item_sk AND ws_customer_sk = ss_customer_sk
WHERE
  (COALESCE(ws_qty, 0) > 0) AND ss_sold_year = 1998
ORDER BY
  ss_customer_sk, ss_qty DESC, ws_qty
LIMIT
  1
OFFSET
  14;

@yuzefovich
Copy link
Member

I wonder whether it is related to this failure. I'll switch my attention there.

@yuzefovich
Copy link
Member

#62642 doesn't fix it if #62534 is reverted.

@yuzefovich
Copy link
Member

yuzefovich commented Mar 26, 2021

#62644 doesn't seem to fix it either if #62534 is reverted :/

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpcdsvec failed on release-21.1@9caf4f8d29ed00aa01188a081465d9061c5fea64:

						JOIN date_dim ON ss_sold_date_sk = d_date_sk
					WHERE
						sr_ticket_number IS NULL
					GROUP BY
						d_year, ss_item_sk, ss_customer_sk
				)
		SELECT
			ss_customer_sk,
			round(
				ss_qty
				/ (COALESCE(ws_qty, 0) + COALESCE(cs_qty, 0)),
				2
			)
				AS ratio,
			ss_qty AS store_qty,
			ss_wc AS store_wholesale_cost,
			ss_sp AS store_sales_price,
			COALESCE(ws_qty, 0) + COALESCE(cs_qty, 0)
				AS other_chan_qty,
			COALESCE(ws_wc, 0) + COALESCE(cs_wc, 0)
				AS other_chan_wholesale_cost,
			COALESCE(ws_sp, 0) + COALESCE(cs_sp, 0)
				AS other_chan_sales_price
		FROM
			ss
			LEFT JOIN ws ON
					ws_sold_year = ss_sold_year
					AND ws_item_sk = ss_item_sk
					AND ws_customer_sk = ss_customer_sk
			LEFT JOIN cs ON
					cs_sold_year = ss_sold_year
					AND cs_item_sk = ss_item_sk
					AND cs_customer_sk = ss_customer_sk
		WHERE
			(COALESCE(ws_qty, 0) > 0 OR COALESCE(cs_qty, 0) > 0)
			AND ss_sold_year = 1998
		ORDER BY
			ss_customer_sk,
			ss_qty DESC,
			ss_wc DESC,
			ss_sp DESC,
			other_chan_qty,
			other_chan_wholesale_cost,
			other_chan_sales_price,
			ratio
		LIMIT
			100;
		;
		
		vectorize=ON: [same as previous]

More

Artifacts: /tpcdsvec
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@asubiotto
Copy link
Contributor

Are we sure this is not a beta blocker? This issue makes queries return incorrect results, that's pretty bad, no?

@yuzefovich
Copy link
Member

My current best explanation is that there is a bug lurking somewhere that got exposed by #62365. I'm assuming the bug itself has already been present in the previous release, so I don't think it is a beta blocker.

True, it results in incorrect query result, but it seems to be in pretty rare circumstances.

@yuzefovich
Copy link
Member

Ok, a bit more progress. I narrowed the problem down to the vectorized merge joiner by using the join hint for the ws CTE in the comment above (namely, LEFT JOIN replaced with LEFT HASH JOIN always gives the correct answer whereas LEFT MERGE JOIN can give an incorrect one).

@yuzefovich
Copy link
Member

Ok, I figured it out. The bug was introduced in bcdda33 and is only present in 21.1 test releases.

@craig craig bot closed this as completed in d145e9f Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants