Skip to content

Conversation

@faizaanmadhani
Copy link
Contributor

@faizaanmadhani faizaanmadhani commented Nov 3, 2022

This commit adds support to manually create partial statistics
at the extremes of indexes for individual columns. Columns must
be the key of their index, otherwise an error will be returned.
Partial statistics are stored in system.table_statistics and are
identified by a predicate indicating their coverage. Additionally,
this PR also adds a partial_predicate column to the output to
SHOW STATISTICS.

Epic: CRDB-19449

Release note (sql change): users can now manually create partial
single-column statistics at the extreme values on
columns that are prefixes of their index. The output of SHOW
STATISTICS now includes a column indicating the partial predicate
for a partial statistic, or NULL for a full statistic.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch 15 times, most recently from 24be4da to 623f005 Compare November 11, 2022 16:59
@faizaanmadhani faizaanmadhani changed the title sql: add partial statistics USING EXTREMES to system.table_statistics sql: support collecting and adding single-column partial table statistics USING EXTREMES to system.table_statistics Nov 11, 2022
@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from 623f005 to cc480a2 Compare November 11, 2022 17:12
@faizaanmadhani faizaanmadhani marked this pull request as ready for review November 11, 2022 17:14
@faizaanmadhani faizaanmadhani requested a review from a team as a code owner November 11, 2022 17:14
@faizaanmadhani faizaanmadhani requested a review from a team November 11, 2022 17:14
@faizaanmadhani faizaanmadhani requested a review from a team as a code owner November 11, 2022 17:14
@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from cc480a2 to 25dac52 Compare November 11, 2022 17:25
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @faizaanmadhani and @michae2)


pkg/jobs/jobspb/jobs.proto line 981 at r1 (raw file):

  bool delete_other_stats = 8;

  // If true, will collect partial table statistics at extreme values

nit: end comment with a period.


pkg/sql/create_stats.go line 266 at r1 (raw file):

		var multiColEnabled bool
		// Disable multi-column stats if partial statistics at the extremes are requested
		// TODO (faizaanmadhani): Add support for multi-column stats

nit: end with period.


pkg/sql/create_stats.go line 269 at r1 (raw file):

		if n.Options.UsingExtremes {
			multiColEnabled = false
		} else {

nit: you don't need an else here, I'd just say

if !n.Options.UsingExtremes {
  multiColEnabled = ...
}

pkg/sql/distsql_plan_stats.go line 46 at r1 (raw file):

const histogramSamples = 10000

// constructContraintsForExtremes(

looks like a leftover


pkg/sql/distsql_plan_stats.go line 61 at r1 (raw file):

)

func (dsp *DistSQLPlanner) createPartialStatsPlan(

It looks like a lot of the code in this function is duplicated from createStatsPlan. Can you extract the common logic into a helper function that is called by both?


pkg/sql/distsql_plan_stats.go line 75 at r1 (raw file):

		return nil, err
	}
	// Map the ColumnIDs to their histograms if they exist

nit: end with period.

here and comments below


pkg/sql/distsql_plan_stats.go line 88 at r1 (raw file):

	}

	if len(reqStat.columns) > 1 {

nit: check this at the top of the function so you don't waste time computing the tableStatsMap if it's not needed

In general you should check for valid input at the top of functions if needed


pkg/sql/distsql_plan_stats.go line 90 at r1 (raw file):

	if len(reqStat.columns) > 1 {
		// TODO (faizaanmadhani): Add support for creating multi-column stats
		return nil, errors.Newf("multi-column partial statistics are not currently supported")

Let's use pgerror.Newf with pgcode FeatureNotSupported


pkg/sql/distsql_plan_stats.go line 122 at r1 (raw file):

	sb.Init(planCtx.EvalContext(), planCtx.ExtendedEvalCtx.Codec, desc, scan.index)

	var sketchSpec, invSketchSpec []execinfrapb.SketchSpec

nit: move this closer to where it's used below


pkg/sql/distsql_plan_stats.go line 124 at r1 (raw file):

	var sketchSpec, invSketchSpec []execinfrapb.SketchSpec

	histogram, ok := tableStatsMap[column.GetID()]

It looks like this is the only place you're using the tableStatsMap, so I wouldn't bother creating a map at all -- just loop through the stats here to find the one stat you care about. You also don't need to bother fetching from the cache until you get here either.


pkg/sql/distsql_plan_stats.go line 126 at r1 (raw file):

	histogram, ok := tableStatsMap[column.GetID()]
	if !ok {
		return nil, errors.Newf("column %s does not have a prior statistic", column.GetName())

Let's try to find an appropriate pgcode for these errors. MaybeObjectNotInPrerequisiteState?


pkg/sql/distsql_plan_stats.go line 150 at r1 (raw file):

		return nil, err
	}
	scan.isFull = false

nit: the default value of booleans is false, so you don't need to set this explicitly


pkg/sql/distsql_plan_stats.go line 209 at r1 (raw file):

		invSketchSpec = append(invSketchSpec, spec)
	} else {
		// Append the new sketch specification to the slice of specifications.

nit: this kind of comment is not especially useful -- it's easy to see what's happening just from looking at the code.


pkg/sql/distsql_plan_stats.go line 574 at r1 (raw file):

func constructUpperBoundSpan(histogram []cat.HistogramBucket) constraint.Span {
	var ubSpan constraint.Span
	if len(histogram) == 0 {

but you already checked this before calling this function. Do you want to allow an empty histogram or not?


pkg/sql/distsql_plan_stats.go line 592 at r1 (raw file):

}

// convertUsingExtremesToPredicate returns string of a predicate identifying

nit: convertUsingExtremesToPredicate -> constructUsingExtremesPredicate


pkg/sql/distsql_plan_stats.go line 597 at r1 (raw file):

	lower constraint.Span, upper constraint.Span, columnName string,
) string {
	predicate := fmt.Sprintf("%s <= %s OR %s >= %s", columnName, lower.EndKey().Value(0).String(), columnName, upper.StartKey().Value(0).String())

should this be <= and >= or < and >? Or maybe you should check whether the boundaries are constraint.IncludeBoundary or constraint.ExcludeBoundary?


pkg/sql/scan.go line 236 at r1 (raw file):

func (n *scanNode) initDescSpecificCol(colCfg scanColumnsConfig, prefixCol catalog.Column) error {
	n.colCfg = colCfg
	indexes := n.desc.AllIndexes()

I think we probably want ActiveIndexes


pkg/sql/scan.go line 242 at r1 (raw file):

	// table where prefixCol is the key column.
	foundIndex := false
	for _, idx := range indexes {

We should probably check that this is a forward index (i.e., GetType() == IndexDescriptor_FORWARD)


pkg/sql/scan.go line 253 at r1 (raw file):

	}
	if !foundIndex {
		return errors.Newf("table %s does not contain an index with %s as a prefix column", n.desc.GetName(), prefixCol.GetName())

Let's use a pgcode error for this with pgerror.Newf. Maybe pgcode.InvalidColumnReference or pgcode.InvalidParameterValue?


pkg/sql/scan.go line 260 at r1 (raw file):

		return err
	}
	// Set up the rest of the scanNode

nit: end with period.


pkg/sql/stats/new_stat.go line 98 at r1 (raw file):

	if partialPredicate == "" {
		predicateValue = nil
	} else {

nit: you don't need else here, since the default constructed value should already be nil. just do:

if partialPredicate != "" {
  ...
}

pkg/sql/stats/new_stat.go line 113 at r1 (raw file):

					"avgSize",
					histogram,
          "partialPredicate"

formatting seems messed up


pkg/sql/stats/table_statistic.proto line 59 at r1 (raw file):

  // The average row size of the columns in ColumnIDs.
  uint64 avg_size = 10;
  // Is a predicate representing the range of a partial statistic,

nit: "Is a predicate" -> "A predicate"


pkg/sql/logictest/testdata/logic_test/distsql_stats line 1904 at r1 (raw file):

u_defaults       {d,c,b,a}

# Test single column partial statistics creation

nit: end with a period here and below.


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2028 at r1 (raw file):


statement error pq: table xy does not contain an index with y as a prefix column
CREATE STATISTICS xy_y_partial ON y FROM xy USING EXTREMES;

it would be good to add a couple of tests with null values too -- e.g., if the only value in a table is null, or if there exists a null value along with the other values

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch 2 times, most recently from 79327b0 to 7d39975 Compare November 16, 2022 00:39
Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! I'm not super familiar with the stats code, but I left a few questions and nits. By the way, you can respond to comments you've addressed with Done. in order to mark them as resolved.

Reviewed 3 of 10 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @faizaanmadhani, @michae2, and @rytaft)


pkg/sql/create_stats.go line 273 at r2 (raw file):

			return nil, err
		}
		deleteOtherStats = true

Do we want to be deleting old stats when we collect partial statistics?

Also, will there ever be a case where we don't specify column names for partial stats? It looks from the RFC like this is possible, but that we only collect for the first column in each index, is that correct?


pkg/sql/distsql_plan_stats.go line 234 at r2 (raw file):

	extremesPredicate := constructUsingExtremesPredicate(lowerBoundSpan, upperBoundSpan, column.GetName())
	var extremesSpans constraint.Spans
	extremesSpans.InitSingleSpan(&lowerBoundSpan)

Shouldn't upperBoundSpan be added here?


pkg/sql/distsql_plan_stats.go line 461 at r2 (raw file):

	tableDesc := tabledesc.NewBuilder(&details.Table).BuildImmutableTable()
	if details.UsingExtremes {
		// Currently, we limit the number of requested for partial statistics

[nit] requested -> requests


pkg/sql/scan.go line 242 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We should probably check that this is a forward index (i.e., GetType() == IndexDescriptor_FORWARD)

Should we also check that the index isn't partial? I guess the index predicate could match the partial stats predicate, but that seems complicated to handle.

Similarly, I think the span generation logic needs to know if the index is ascending or descending - we should either handle it there or filter out the case here.


pkg/sql/rowexec/sample_aggregator.go line 492 at r2 (raw file):

			for i, c := range si.spec.Columns {
				columnIDs[i] = s.sampledCols[c]

[super nit] did you mean to add this newline?


pkg/sql/rowexec/sample_aggregator.go line 496 at r2 (raw file):

			// Delete old stats that have been superseded.
			if err := stats.DeleteOldStatsForColumns(

Similar question to above, do we want to be deleting old stats after gathering partial statistics?


pkg/sql/span/span_builder.go line 246 at r2 (raw file):

// spans. A span.Splitter can be used to generate more specific
// family spans from constraints, datums and encdatums.
func (s *Builder) SpansFromConstraintSpan(

[nit] Should we just unexport appendSpansFromConstraintSpan? It seems like that's what you really want to use - you'd just call it once for each constraint span since it already has appending behavior.


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2028 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

it would be good to add a couple of tests with null values too -- e.g., if the only value in a table is null, or if there exists a null value along with the other values

Probably also want tests with a partial index and a descending index.

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from 7d39975 to 650b723 Compare November 16, 2022 16:21
@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from 15ceba1 to b337823 Compare November 17, 2022 00:44
Copy link
Contributor Author

@faizaanmadhani faizaanmadhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reviews! I've addressed all the comments. Adding support for some of the things we discussed, like making sure we would only use the latest full stat collection when doing the next partial stat collection required updating the show statistics code to include a partial_predicate column to avoid some errors being thrown, which I intended to do later but have done in this PR. Tests for it have been included as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rytaft)


pkg/sql/distsql_plan_stats.go line 461 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] requested -> requests

Done.


pkg/sql/rowexec/sample_aggregator.go line 496 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Similar question to above, do we want to be deleting old stats after gathering partial statistics?

Done.


pkg/sql/stats/new_stat.go line 113 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

formatting seems messed up

Done.


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2028 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Probably also want tests with a partial index and a descending index.

Done.

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from b337823 to 4bfe553 Compare November 17, 2022 02:00
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this needs to be conditioned on the V23_1AddPartialStatisticsPredicateCol version gate. I'll try to identify all the spots.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 6 files at r3, 10 of 10 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @faizaanmadhani)


-- commits line 10 at r4:
this should also be added to the release note


pkg/sql/distsql_plan_stats.go line 81 at r4 (raw file):

	// since we only support one reqStat at a time.
	for _, s := range reqStats {
		sampler.MaxFractionIdle = details.MaxFractionIdle

You can remove this since you're doing it above the loop


pkg/sql/distsql_plan_stats.go line 171 at r4 (raw file):

) (*PhysicalPlan, error) {

	// We can only process one statistic

nit: end with period.

Also -- should you return an error if len(reqStats) != 1?


pkg/sql/distsql_plan_stats.go line 420 at r4 (raw file):

	}

	// Estimate the expected number of rows based on existing stats in the cache.

This comment should be moved inside createAndAttachSamplers, just above the declaration of the rowsExpected variable


pkg/sql/distsql_plan_stats.go line 469 at r4 (raw file):

		if len(reqStats) > 1 {
			return nil, errors.New("cannot process multiple partial statistics at once")
		}

Let's put this check at the top of createPartialStatsPlan


pkg/sql/distsql_plan_stats.go line 518 at r4 (raw file):

		predicate = fmt.Sprintf("%s < %s OR %s > %s", columnName, extremesSpans.Get(0).EndKey().Value(0).String(), columnName, extremesSpans.Get(1).StartKey().Value(0).String())
	} else {
		predicate = fmt.Sprintf("%s < %s OR %s > %s", columnName, extremesSpans.Get(1).StartKey().Value(0).String(), columnName, extremesSpans.Get(0).EndKey().Value(0).String())

I realized thanks to your new tests that this predicate is technically not correct right now, since the spans include null values. Either we should change the predicate to col < %s OR col > %s OR col IS NULL, or we should update the spans to exclude NULL values. I'm not sure which is best, but I don't really see any harm in collecting nulls, so for now I think you can just update the predicate here to include IS NULL.


pkg/sql/distsql_plan_stats.go line 524 at r4 (raw file):

// constructExtremesSpans returns a constraint.Spans consisting of a lowerbound and upperbound span
// covering the extremes of an index

nit: comments should be <= 80 columns wide and end with a period.


pkg/sql/distsql_plan_stats.go line 532 at r4 (raw file):

	lowerBound := histogram[0].UpperBound
	upperBound := constraint.MakeKey(histogram[len(histogram)-1].UpperBound)
	// Pick the latest lowerBound that is not null,

nit: latest -> earliest ?


pkg/sql/distsql_plan_stats.go line 534 at r4 (raw file):

	// Pick the latest lowerBound that is not null,
	// but if none exist, return error
	for _, hist := range histogram {

instead of copying each bucket, this would be more efficient:

for i := range histogram {
  if histogram[i].UpperBound.Compare ...
}

or

for i := range histogram {
  hist := &histogram[i]
  if hist.UpperBound.Compare ...
}

pkg/sql/scan.go line 234 at r4 (raw file):

}

// initDescSpecificCol initializes the column structures with the index that the provided column

nit: comments should be <= 80 columns wide


pkg/sql/scan.go line 235 at r4 (raw file):

// initDescSpecificCol initializes the column structures with the index that the provided column
// is the prefix for, and returns that index.

doesn't look like it returns the index....


pkg/sql/show_stats.go line 97 at r4 (raw file):

							"nullCount",
							"avgSize",
						  "partialPredicate",

nit: formatting is messed up


pkg/sql/rowexec/sample_aggregator.go line 497 at r4 (raw file):

			// if the new statistic is not partial
			if si.spec.PartialPredicate == "" {
				if err := stats.DeleteOldStatsForColumns(

FYI -- Inside this DeleteOldStatsForColumns function, I think we'll probably also want to add a check for the partial predicate (i.e., the few old stats we keep should be full, not partial). This won't really matter until you add automatic partial stats, but just wanted to mention it now so I don't forget later. No need to make the change now, but just FYI.


pkg/sql/stats/forecast.go line 248 at r4 (raw file):

			NullCount:     uint64(nullCount),
			AvgSize:       uint64(avgSize),
			IsForecast:    true,

Is this needed? We're already using the ForecastStatsName for this purpose


pkg/sql/stats/stats_cache.go line 720 at r4 (raw file):

	"nullCount",
	"avgSize",
  "partialPredicate",

nit: formatting messed up


pkg/sql/stats/table_statistic.proto line 64 at r4 (raw file):

  // A boolean which is true if the statistic has been
  // forecasted, and false if not
  bool is_forecast = 12;

Like I mentioned above -- I'm not convinced we need this


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2028 at r1 (raw file):

Previously, faizaanmadhani (Faizaan Madhani) wrote…

Done.

I don't think you added any tests with partial indexes


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2044 at r4 (raw file):

# Test descending indexes.
statement ok
create table d_desc (a int, b int, index (a desc, b));

nit: use all caps for SQL keywords. here and below


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2094 at r4 (raw file):

# Test if requesting a partial stat again uses the previous full stat and not the previous partial stat.
statement ok
CREATE STATISTICS sdnp2 ON a FROM d_desc USING EXTREMES

I don't think this is testing what you want, since the previous partial stat didn't actually scan any non-null values.

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch 2 times, most recently from 86ef2db to 3f61eb9 Compare November 17, 2022 20:06
Copy link
Contributor Author

@faizaanmadhani faizaanmadhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)


-- commits line 10 at r4:

Previously, rytaft (Rebecca Taft) wrote…

this should also be added to the release note

Done.


pkg/sql/distsql_plan_stats.go line 81 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You can remove this since you're doing it above the loop

Done.


pkg/sql/distsql_plan_stats.go line 171 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: end with period.

Also -- should you return an error if len(reqStats) != 1?

Right now I'm doing it on line 467 before calling this function. As per your comment below, I've moved it inside this function 🙂


pkg/sql/distsql_plan_stats.go line 420 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This comment should be moved inside createAndAttachSamplers, just above the declaration of the rowsExpected variable

Done.


pkg/sql/distsql_plan_stats.go line 469 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Let's put this check at the top of createPartialStatsPlan

Done.


pkg/sql/distsql_plan_stats.go line 518 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I realized thanks to your new tests that this predicate is technically not correct right now, since the spans include null values. Either we should change the predicate to col < %s OR col > %s OR col IS NULL, or we should update the spans to exclude NULL values. I'm not sure which is best, but I don't really see any harm in collecting nulls, so for now I think you can just update the predicate here to include IS NULL.

Done.


pkg/sql/distsql_plan_stats.go line 534 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

instead of copying each bucket, this would be more efficient:

for i := range histogram {
  if histogram[i].UpperBound.Compare ...
}

or

for i := range histogram {
  hist := &histogram[i]
  if hist.UpperBound.Compare ...
}

Done.


pkg/sql/scan.go line 235 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

doesn't look like it returns the index....

Yes, that was an accident. I wrote the comment when I started editing the function but missed changing it when I was done.


pkg/sql/stats/table_statistic.proto line 64 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Like I mentioned above -- I'm not convinced we need this

Removed.


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2028 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think you added any tests with partial indexes

Yup, sorry. Should have added an error test for partial indexes. It's been added now.


pkg/sql/logictest/testdata/logic_test/distsql_stats line 2094 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think this is testing what you want, since the previous partial stat didn't actually scan any non-null values.

Yeah, I see. I removed this test and made a similar one higher up to an earlier table without null values, where the previous partial stat does retrieve not null columns.


pkg/sql/stats/forecast.go line 248 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is this needed? We're already using the ForecastStatsName for this purpose

Oh, I didn't realize that :( !

Hence, I added this additional field to account for it. I'll use ForecastStatsName instead.

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from 3f61eb9 to 7b8919c Compare November 17, 2022 20:25
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job working through all the comments! No other comments for now, but I'll take another look once you add checks for V23_1AddPartialStatisticsPredicateCol.

Reviewed 8 of 8 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @faizaanmadhani)


pkg/sql/distsql_plan_stats.go line 171 at r4 (raw file):

Previously, faizaanmadhani (Faizaan Madhani) wrote…

Right now I'm doing it on line 467 before calling this function. As per your comment below, I've moved it inside this function 🙂

Thanks. I think you can get rid of this comment now.

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch 2 times, most recently from c8445a9 to e3e1cc6 Compare November 19, 2022 00:43
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 7 of 7 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @faizaanmadhani, and @michae2)


pkg/sql/alter_table.go line 1396 at r7 (raw file):

					"nullCount",
					"avgSize",
          "partialPredicate",

nit: formatting is messed up


pkg/sql/stats/stats_cache.go line 618 at r7 (raw file):

	var err error
	if sc.Settings.Version.IsActive(ctx, clusterversion.V23_1AddPartialStatisticsPredicateCol) {
		tsp, err = NewTableStatisticProto(datums, true)

nit: true /* partialPredColVerActive */ (add the comment here and next to false below)

@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch 2 times, most recently from 295cbca to 6f7dd2e Compare November 21, 2022 16:22
This commit adds support to manually create partial statistics
at the extremes of indexes for individual columns. Columns must
be the key of their index, otherwise an error will be returned.
Partial statistics are stored in `system.table_statistics` and are
identified by a predicate indicating their coverage. Additionally,
this PR also adds a partial_predicate column to the output to
`SHOW STATISTICS`.

Epic: CRDB-19449

Release note (sql change): users can now manually create partial
single-column statistics at the extreme values on
columns that are prefixes of their index. The output of SHOW
STATISTICS now includes a column indicating the partial predicate
for a partial statistic, or NULL for a full statistic.
@faizaanmadhani faizaanmadhani force-pushed the faizaan/manual-partial-stats-jobs branch from 6f7dd2e to 0c2c583 Compare November 21, 2022 20:36
@faizaanmadhani
Copy link
Contributor Author

TFTRs!! 🎉

bors r=rytaft

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @faizaanmadhani, and @michae2)


pkg/sql/stats/stats_cache.go line 621 at r9 (raw file):

	} else {
		tsp, err = NewTableStatisticProto(datums, false /* partialPredColVerActive */)
	}

(No need to stop bors from running since this is just a nit, but you might fix in a future PR...)

nit: instead of passing true and false -- you can just pass partialPredicateVerColActive

Copy link
Contributor Author

@faizaanmadhani faizaanmadhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rytaft)


pkg/sql/stats/stats_cache.go line 621 at r9 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

(No need to stop bors from running since this is just a nit, but you might fix in a future PR...)

nit: instead of passing true and false -- you can just pass partialPredicateVerColActive

I see, I'll fix this in my next merging histograms PR!

@craig
Copy link
Contributor

craig bot commented Nov 21, 2022

Build succeeded:

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm coming to the late! I realize this has already been merged, but I've noticed a few small things that we should fix in another PR.

Nice work, Faizaan! This is awesome!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/distsql_plan_stats.go line 227 at r9 (raw file):

	// cache.
	for _, t := range tableStats {
		if len(t.ColumnIDs) == 1 && column.GetID() == t.ColumnIDs[0] && t.PartialPredicate == "" && t.Name != jobspb.ForecastStatsName {

There's currently an odd situation with STRING and ARRAY types, where they can have two stats for the same column, one for an inverted index and one for a normal (forward) index. We only want to look at histograms for forward indexes. The way to check for this is to add && (!colinfo.ColumnTypeIsInvertedIndexable(column.GetType()) || t.HistogramType().Family() != types.BytesFamily) to the conditions.


pkg/sql/distsql_plan_stats.go line 284 at r9 (raw file):

	}
	var sketchSpec, invSketchSpec []execinfrapb.SketchSpec
	if reqStat.inverted {

If reqStat.inverted is true, we should return an error for now. We're already skipping over inverted indexes in scan.initDescSpecificCol(colCfg, column). Supporting USING EXTREMES on inverted indexes will require some more work.


pkg/sql/distsql_plan_stats.go line 515 at r9 (raw file):

	var predicate string
	if index.GetKeyColumnDirection(0) == catpb.IndexColumn_ASC {
		predicate = fmt.Sprintf("%s < %s OR %s > %s OR %s is NULL", columnName, extremesSpans.Get(0).EndKey().Value(0).String(), columnName, extremesSpans.Get(1).StartKey().Value(0).String(), columnName)

When saving the column names and literal values to disk, we have to be extra careful to write them in a way that will be parsable later on (possibly even by a different version of the software). We have a special way to do this, by calling foo.Format(tree.NewFmtCtx(tree.FmtSerializable)) on any object foo that is a NodeFormatter. This should be guaranteed to be parsable (foo.String() is not).

I believe all tree.Datum types are NodeFormatter. And I think you can call column.ColName() to get a tree.Node that is also a NodeFormatter.


pkg/sql/stats/new_stat.go line 95 at r9 (raw file):

	}

	if !settings.Version.IsActive(ctx, clusterversion.V23_1AddPartialStatisticsPredicateCol) {

This isn't quite right. If we've collected partial stats, we definitely don't want to insert them without a partialPredicate, as if they were full stats. It would probably be better to disallow partial stats collection altogether.

I'm going to do a little more research on upgrades. It's possible we might still be able to insert into the new version of the table even if we haven't crossed the version gate.

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/stats/new_stat.go line 95 at r9 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This isn't quite right. If we've collected partial stats, we definitely don't want to insert them without a partialPredicate, as if they were full stats. It would probably be better to disallow partial stats collection altogether.

I'm going to do a little more research on upgrades. It's possible we might still be able to insert into the new version of the table even if we haven't crossed the version gate.

After a little more research: before the version gate the table still has the old schema. So we cannot insert the predicate, and we cannot simply omit it, either. I think we should throw an error here if there is a predicate.

To make it less expensive, we should also add an additional version check earlier, before we've actually collected the statistics. Probably somewhere in distsql_plan_stats.go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants