Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plan, stats: fix inconsistent row count estimation #7233

Merged
merged 5 commits into from
Aug 6, 2018

Conversation

alivxxx
Copy link
Contributor

@alivxxx alivxxx commented Aug 1, 2018

What have you changed? (mandatory)

Sometimes, the estimated count of a smaller set of expression could be smaller than that of a superset, and it is not consistent. This PR fixes it by preferring the estimation of the superset because it could use more stats info.

What is the type of the changes? (mandatory)

  • Improvement (non-breaking change which is an improvement to an existing feature)

How has this PR been tested? (mandatory)

Unit test.

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

No.

Does this PR affect tidb-ansible update? (mandatory)

No.

Does this PR need to be added to the release notes? (mandatory)

No.

Refer to a related PR or issue link (optional)

Benchmark result if necessary (optional)

Add a few positive/negative examples (optional)

PTAL @coocood @zz-jason @winoros

@alivxxx alivxxx changed the title plan, stats: fix inconsistent estimation plan, stats: fix inconsistent row count estimation Aug 1, 2018
@alivxxx alivxxx added component/statistics type/enhancement The issue or PR belongs to an enhancement. labels Aug 1, 2018
// (1): The stats type, always prefer the primary key or index.
// (2): The number of expression that it covers, the more the better.
// (3): The number of columns that it contains, the less the better.
if (bestTp == colType && set.tp < colType) || bestCount < bits || (bestCount == bits && bestNumCols > set.numCols) {
Copy link
Contributor Author

@alivxxx alivxxx Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is required by this PR, because after the change in logical_plans.go, it would cause failure in TestIndexRead, it would sometimes choose index (b,c).

@@ -41,11 +41,13 @@ type AnalyzeExec struct {
tasks []*analyzeTask
}

// MaxBucketSize is the maximum number of bucket that a histogram could contain.
var MaxBucketSize = int64(256)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good design. Do not export a variable.

@@ -357,3 +358,13 @@ func (e *AnalyzeColumnsExec) buildStats() (hists []*statistics.Histogram, cms []
}
return hists, cms, nil
}

// SetMaxBucketSize sets the `maxBucketSize`.
func SetMaxBucketSize(size int64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only used in test?

winoros
winoros previously approved these changes Aug 2, 2018
Copy link
Member

@winoros winoros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@winoros winoros added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 3, 2018
@alivxxx
Copy link
Contributor Author

alivxxx commented Aug 6, 2018

PTAL @coocood @zz-jason

// (1): The stats type, always prefer the primary key or index.
// (2): The number of expression that it covers, the more the better.
// (3): The number of columns that it contains, the less the better.
if (bestTp == colType && set.tp < colType) || bestCount < bits || (bestCount == bits && bestNumCols > set.numCols) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think set.tp != colType is better, because the type is not a scalar.

@@ -35,6 +35,8 @@ type exprSet struct {
mask int64
// ranges contains all the ranges we got.
ranges []*ranger.Range
// numCols is the number of columns contained in the index or column(which is always 1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is always 1, why don't use a const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always 1 for the column, while it could also greater than 1 for the index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha.

Copy link
Contributor

@zhexuany zhexuany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alivxxx
Copy link
Contributor Author

alivxxx commented Aug 6, 2018

/run-all-tests

@alivxxx alivxxx added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 6, 2018
Copy link
Member

@zz-jason zz-jason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zz-jason zz-jason merged commit 44e6c3c into pingcap:master Aug 6, 2018
@alivxxx alivxxx deleted the stats branch August 6, 2018 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/statistics status/LGT2 Indicates that a PR has LGTM 2. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants