
Internal error when pushdown agg into partitionUnion #29705

Closed
wshwsh12 opened this issue Nov 11, 2021 · 12 comments · Fixed by #30231
Labels
affects-5.0, affects-5.1, affects-5.2, affects-5.3, component/tablepartition, help wanted, severity/major, sig/sql-infra, type/bug

Comments

@wshwsh12 (Contributor)

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

drop table if exists t;
create table t(id int) partition by hash(id) partitions 4;
insert into t values(1);
SELECT COUNT(1) FROM ( SELECT COUNT(1) FROM t b GROUP BY id) a;

2. What did you expect to see? (Required)

MySQL [test]> SELECT COUNT(1) FROM ( SELECT COUNT(1)  FROM t b GROUP BY id) a;
+----------+
| COUNT(1) |
+----------+
|        1 |
+----------+
1 row in set (0.000 sec)

3. What did you see instead (Required)

ERROR 1105 (HY000): Internal error: UnionExec chunk column count mismatch, req: 1, result: 2

4. What is your TiDB version? (Required)

master

@wshwsh12 added type/bug and sig/planner labels Nov 11, 2021
@wshwsh12 (Contributor, Author)

Explain result:

+----------------------------------+----------+-----------+-----------------------+-------------------------------------------------------------------------------------+
| id                               | estRows  | task      | access object         | operator info                                                                       |
+----------------------------------+----------+-----------+-----------------------+-------------------------------------------------------------------------------------+
| HashAgg_18                       | 1.00     | root      |                       | funcs:count(1)->Column#4                                                            |
| └─HashAgg_20                     | 24001.00 | root      |                       | group by:test.t.id, funcs:count(1)->Column#7                                        |
|   └─PartitionUnion_21            | 24001.00 | root      |                       |                                                                                     |
|     ├─HashAgg_24                 | 8000.00  | root      |                       | group by:test.t.id, funcs:firstrow(test.t.id)->test.t.id, funcs:count(1)->Column#8  |
|     │ └─TableReader_29           | 10000.00 | root      |                       | data:TableFullScan_28                                                               |
|     │   └─TableFullScan_28       | 10000.00 | cop[tikv] | table:b, partition:p0 | keep order:false, stats:pseudo                                                      |
|     ├─HashAgg_32                 | 1.00     | root      |                       | group by:test.t.id, funcs:firstrow(test.t.id)->test.t.id, funcs:count(1)->Column#9  |
|     │ └─TableReader_37           | 1.00     | root      |                       | data:TableFullScan_36                                                               |
|     │   └─TableFullScan_36       | 1.00     | cop[tikv] | table:b, partition:p1 | keep order:false, stats:pseudo                                                      |
|     ├─HashAgg_40                 | 8000.00  | root      |                       | group by:test.t.id, funcs:firstrow(test.t.id)->test.t.id, funcs:count(1)->Column#10 |
|     │ └─TableReader_45           | 10000.00 | root      |                       | data:TableFullScan_44                                                               |
|     │   └─TableFullScan_44       | 10000.00 | cop[tikv] | table:b, partition:p2 | keep order:false, stats:pseudo                                                      |
|     └─HashAgg_48                 | 8000.00  | root      |                       | group by:test.t.id, funcs:firstrow(test.t.id)->test.t.id, funcs:count(1)->Column#11 |
|       └─TableReader_53           | 10000.00 | root      |                       | data:TableFullScan_52                                                               |
|         └─TableFullScan_52       | 10000.00 | cop[tikv] | table:b, partition:p3 | keep order:false, stats:pseudo                                                      |
+----------------------------------+----------+-----------+-----------------------+-------------------------------------------------------------------------------------+
15 rows in set (1.732 sec)

I debugged tidb-server and found that PartitionUnion's schema only has one column.

@wshwsh12 (Contributor, Author) commented Nov 16, 2021

Needs to be released in 5.1.3/5.2.3.

@qw4990 added sig/sql-infra and removed sig/planner labels Nov 17, 2021
@qw4990 (Contributor) commented Nov 17, 2021

PTAL @mjonss

@mjonss (Contributor) commented Nov 17, 2021

Fails in the old/default static pruning mode; works with set @@tidb_partition_prune_mode='dynamic'.
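
For anyone on an affected build, the workaround above applied to the repro from the bug report looks like this (same table t as defined there):

set @@tidb_partition_prune_mode='dynamic';
SELECT COUNT(1) FROM ( SELECT COUNT(1) FROM t b GROUP BY id) a;
-- returns 1 instead of the internal error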

@mjonss (Contributor) commented Nov 17, 2021

Regression from 68e32bb / #27798 / #27797

@mjonss (Contributor) commented Nov 17, 2021

It also seems that the test case in #27798 does not trigger the issue reported in those bugs, though manually running it on a standalone tidb-server does.
I.e., if I revert the code change in planner/core/rule_aggregation_push_down.go, the test still passes...

@lcwangchao self-assigned this Nov 25, 2021
@lcwangchao (Collaborator) commented Nov 25, 2021

This problem is caused by LogicalAggregation.PruneColumns. See:

if len(la.AggFuncs) == 0 || (!allFirstRow && allRemainFirstRow) {
    // If all the aggregate functions are pruned, we should add an aggregate function to maintain the info of row numbers.
    // For all the aggregate functions except `first_row`, if we have an empty table defined as t(a,b),
    // `select agg(a) from t` would always return one row, while `select agg(a) from t group by b` would return empty.
    // For `first_row` which is only used internally by tidb, `first_row(a)` would always return empty for empty input now.
    var err error
    var newAgg *aggregation.AggFuncDesc
    if allFirstRow {
        newAgg, err = aggregation.NewAggFuncDesc(la.ctx, ast.AggFuncFirstRow, []expression.Expression{expression.NewOne()}, false)
    } else {
        newAgg, err = aggregation.NewAggFuncDesc(la.ctx, ast.AggFuncCount, []expression.Expression{expression.NewOne()}, false)
    }
    if err != nil {
        return err
    }
    la.AggFuncs = append(la.AggFuncs, newAgg)
    col := &expression.Column{
        UniqueID: la.ctx.GetSessionVars().AllocPlanColumnID(),
        RetType:  newAgg.RetTp,
    }
    la.schema.Columns = append(la.schema.Columns, col)
}

Sometimes LogicalAggregation adds a new column to its schema after pruning in order to produce a correct result, but its parent, LogicalUnionAll, does not handle this situation.
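
A minimal, self-contained sketch of the check that then fails at execution time (not TiDB's actual executor code; the function name is a stand-in, only the error text matches the report):

package main

import "fmt"

// checkChunkColumns illustrates the consistency check behind the reported
// error: the union's schema requires `req` columns, but a child's result
// chunk carries `result` columns.
func checkChunkColumns(req, result int) error {
    if req != result {
        return fmt.Errorf("Internal error: UnionExec chunk column count mismatch, req: %d, result: %d", req, result)
    }
    return nil
}

func main() {
    // In this issue, PartitionUnion's schema keeps 1 column after pruning,
    // while the child HashAgg produced 2 (test.t.id plus the count column).
    fmt.Println(checkChunkColumns(1, 2))
}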

Some related PRs:

@lcwangchao added the help wanted label Nov 26, 2021
@lcwangchao (Collaborator) commented Nov 26, 2021

It would be easy to fix by just forcing LogicalAggregation's schema to be the same as its children's after pruning, but I don't know whether that approach has potential problems.

I think this relates to the semantics of PruneColumns. We may need to reconfirm the semantics of PruneColumns:

  • Are operators allowed to add columns after pruning? In fact, it is more intuitive not to allow additional columns.
  • Should we force the columns after pruning to be consistent with the parameters? If this is not enforced, the current implementation of LogicalUnionAll may also have problems. For example, if the original columns are c1, c2, c3 and we prune with argument c1, c2, then child[0] may keep c1, c2, c3 while child[1] keeps only c1, c2. A strict check of this contract is sketched below.
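
To make the second question concrete, here is a minimal self-contained sketch of the strict contract (hypothetical types and names, not TiDB's real planner API): after pruning, a child must expose exactly the requested columns, which child[0] below would fail.

package main

import "fmt"

// plan is a hypothetical stand-in for a logical plan node; only the exposed
// column list matters for this check.
type plan struct {
    name    string
    columns []string
}

// assertPruned enforces the strict reading of PruneColumns: the child's
// post-prune schema must equal the requested column list, element by element.
func assertPruned(requested []string, child plan) error {
    if len(child.columns) != len(requested) {
        return fmt.Errorf("%s kept %d columns, parent requested %d",
            child.name, len(child.columns), len(requested))
    }
    for i, c := range child.columns {
        if c != requested[i] {
            return fmt.Errorf("%s kept %q at position %d, parent requested %q",
                child.name, c, i, requested[i])
        }
    }
    return nil
}

func main() {
    requested := []string{"c1", "c2"}
    // child[0] kept everything; child[1] honored the request: inconsistent siblings.
    fmt.Println(assertPruned(requested, plan{"child[0]", []string{"c1", "c2", "c3"}}))
    fmt.Println(assertPruned(requested, plan{"child[1]", []string{"c1", "c2"}}))
}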

@lcwangchao (Collaborator) commented Nov 26, 2021

Some ways to fix it:

  • Modify the Prune logic of LogicalUnionAll to be compatible with LogicalAggregation. This approach is relatively simple, but it means allowing a child operator's Prune to provide only a loose constraint, such as letting it keep columns that should be pruned or add extra columns. We would still need to consider the problem above, where the children's columns are not compatible with each other. For example, child[0] is a LogicalAggregation that adds an extra column, but child[1] is a plain DataSource that does not.

  • Modify the Prune logic of LogicalUnionAll to insert a Projection. We can check the types of the children when pruning. If a child is not a compatible type (for example, Projection, DataSource, etc.), we can create a Projection as the new child and attach the old child to it. This ensures the children have deterministic behavior. However, be careful that this Projection must not be optimized away by other rules. In addition, the extra executor may affect performance. A sketch of this approach follows the list.

  • Modify the logic of UnionExec. LogicalUnionAll's children are still allowed to keep some extra columns when pruning, but the schema exposed by LogicalUnionAll will not contain these columns. UnionExec then needs to deal with its own column count differing from that of its child operators, and take out only the columns it needs.

  • Modify the logic of the aggregation-related executors. We can modify the aggregation-related executors so they handle the "empty columns" and "all firstrow columns" cases correctly. That way LogicalAggregation no longer needs to add an extra column. The problem is that aggregations are sometimes pushed down to TiKV, so the code in TiKV would have to be modified as well.

  • Separate AggFuncs from schema.Columns so they no longer correspond one to one. Another way is to separate AggFuncs from schema.Columns; in theory they do not need a one-to-one correspondence. LogicalAggregation could then add extra AggFuncs without adding extra columns. This would still need some modifications in TiKV.
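
Below is a minimal sketch of the second approach (inserting a Projection); all types and names here are hypothetical simplifications, not TiDB's real planner API:

package main

import "fmt"

// logicalPlan is a toy stand-in for a logical plan node.
type logicalPlan struct {
    kind     string
    columns  []string
    children []*logicalPlan
}

// ensureProjection wraps a union child whose post-prune schema does not match
// the union's schema in a Projection that exposes exactly the union's columns.
func ensureProjection(child *logicalPlan, unionCols []string) *logicalPlan {
    if len(child.columns) == len(unionCols) {
        return child // schema already matches; no wrapper needed
    }
    return &logicalPlan{kind: "Projection", columns: unionCols, children: []*logicalPlan{child}}
}

func main() {
    // The aggregation child kept 2 columns, while the union's schema has 1.
    agg := &logicalPlan{kind: "LogicalAggregation", columns: []string{"test.t.id", "Column#7"}}
    fixed := ensureProjection(agg, []string{"Column#7"})
    fmt.Printf("%s%v -> %s%v\n", fixed.kind, fixed.columns, fixed.children[0].kind, fixed.children[0].columns)
}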

@time-and-fate (Member)

> It also seems that the test case in #27798 does not trigger the issue reported in those bugs, though manually running it on a standalone tidb-server does. I.e., if I revert the code change in planner/core/rule_aggregation_push_down.go, the test still passes...

The reason behind this is that the default tidb_partition_prune_mode is dynamic in the test and the bug occurs in the static mode. I'll fix this in the PR for this issue.
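
For context, such a regression test would pin the mode back to static before the repro from the bug report; a minimal sketch (test-harness wiring omitted):

set @@tidb_partition_prune_mode='static';
SELECT COUNT(1) FROM ( SELECT COUNT(1) FROM t b GROUP BY id) a;
-- hits the internal error on affected builds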

@mjonss (Contributor) commented Nov 30, 2021

/component tablepartition

@ti-chi-bot added the component/tablepartition label Nov 30, 2021
@github-actions bot commented Dec 1, 2021

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

@time-and-fate added affects-5.0, affects-5.1, affects-5.2, and affects-5.3 labels Dec 17, 2021