-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29368: more conservative NDV combining by PessimisticStatCombiner #6244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
633951c
199c441
f0022f7
bd86e3c
75dbdf8
0ddef8c
8b361de
1285297
40fc7ff
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -832,6 +832,7 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String col | |
| cs.setNumNulls(csd.getBinaryStats().getNumNulls()); | ||
| } else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) { | ||
| cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp()); | ||
| cs.setCountDistint(csd.getTimestampStats().getNumDVs()); | ||
| cs.setNumNulls(csd.getTimestampStats().getNumNulls()); | ||
| Long lowVal = (csd.getTimestampStats().getLowValue() != null) ? csd.getTimestampStats().getLowValue() | ||
| .getSecondsSinceEpoch() : null; | ||
|
|
@@ -862,6 +863,7 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String col | |
| cs.setHistogram(csd.getDecimalStats().getHistogram()); | ||
| } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) { | ||
| cs.setAvgColLen(JavaDataModel.get().lengthOfDate()); | ||
| cs.setCountDistint(csd.getDateStats().getNumDVs()); | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner |
||
| cs.setNumNulls(csd.getDateStats().getNumNulls()); | ||
| Long lowVal = (csd.getDateStats().getLowValue() != null) ? csd.getDateStats().getLowValue() | ||
| .getDaysSinceEpoch() : null; | ||
|
|
@@ -2086,11 +2088,7 @@ private static List<Long> extractNDVGroupingColumns(List<ColStatistics> colStats | |
| // compute product of distinct values of grouping columns | ||
| for (ColStatistics cs : colStats) { | ||
| if (cs != null) { | ||
| long ndv = cs.getCountDistint(); | ||
| if (cs.getNumNulls() > 0) { | ||
| ndv = StatsUtils.safeAdd(ndv, 1); | ||
| } | ||
| ndvValues.add(ndv); | ||
| ndvValues.add(getGroupingColumnNdv(cs, parentStats)); | ||
| } else { | ||
| if (parentStats.getColumnStatsState().equals(Statistics.State.COMPLETE)) { | ||
| // the column must be an aggregate column inserted by GBY. We | ||
|
|
@@ -2109,4 +2107,20 @@ private static List<Long> extractNDVGroupingColumns(List<ColStatistics> colStats | |
|
|
||
| return ndvValues; | ||
| } | ||
|
|
||
| private static long getGroupingColumnNdv(ColStatistics cs, Statistics parentStats) { | ||
| long ndv = cs.getCountDistint(); | ||
|
|
||
| if (ndv == 0L) { | ||
| // Typically, ndv == 0 means "NDV unknown", and no safe GROUPBY adjustments are possible | ||
| // However, there is a special exception for "constant NULL" columns. They are intentionally generated | ||
| // with NDV values of 0 and numNulls == numRows, while their actual NDV is 1 | ||
| if (cs.getNumNulls() >= parentStats.getNumRows()) { | ||
| ndv = 1L; | ||
| } | ||
| } else if (cs.getNumNulls() > 0L) { | ||
| ndv = StatsUtils.safeAdd(ndv, 1L); | ||
| } | ||
| return ndv; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -41,9 +41,15 @@ public void add(ColStatistics stat) { | |
| if (stat.getAvgColLen() > result.getAvgColLen()) { | ||
| result.setAvgColLen(stat.getAvgColLen()); | ||
| } | ||
| if (stat.getCountDistint() > result.getCountDistint()) { | ||
| result.setCountDistint(stat.getCountDistint()); | ||
| } | ||
|
|
||
| // NDVs can only be accurately combined if full information about columns, query branches and | ||
| // their relationships is available. Without that info, there is only one "truly conservative" | ||
| // value of NDV which is 0, which means that the NDV is unknown. It forces optimizer | ||
| // to make the most conservative decisions possible, which is the exact goal of | ||
| // PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it | ||
| // also ensures than the query execution does not "blow up" due to too optimistic stats estimates | ||
| result.setCountDistint(0L); | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could appear counter-intuitive at first, however, when combining statistics of different logical branches of the same column, and having no reliable information about their interdependencies (i.e. in a "truly pessimistic" scenario), every other option appears to introduce undesired under-estimations, which often lead to catastrophic query failures. For example, a simple column generated by an CASE..WHEN clause with three constants produces an NDV of 1 by the original code, while, in most cases, the "true" NDV is 3. If such a column participates in a GROUP BY condition later on, its estimated number of records naturally becomes "1". Even this seemingly small under-estimation could lead to bad decision of converting to a mapjoin or not, especially over large data sets. Alternatively, trying to "total up" NDV values of the same columns could cause over-estimation of the true NDV of such a column, which, it its turn, could lead to a severe underestimation of records matching an "IN" filter, ultimately producing equally bad results as the previous case |
||
|
|
||
| if (stat.getNumNulls() > result.getNumNulls()) { | ||
| result.setNumNulls(stat.getNumNulls()); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner