Percentile speed up and standard error update #1780
Conversation
@wangcj05 The computation should be significantly faster now. Is this fast enough, or do we need to implement the standard normal method? Before merging, I would like to include some notes in the documentation about the sample size needed for the percentile and its standard error calculation, from the paper.
@dgarrett622 Thanks for the updates. I have checked the test: it reduces the time to ~170s now, while the original test took about ~130s. I think the changes are good for now, but I would like to keep the issue open. The only thing I'm worried about is the nonlinear increase in time; we may need to add a benchmark test in the future to prevent this behavior.
I have some comments for your consideration.
@ Out, result, float, the percentile
@ In, percent, float/list/numpy.array, the percentile(s) that need to be computed (between 0.01 and 1.0)
@ Out, result, float/list, the percentile(s)
"""

# only do the argsort once for all requested percentiles
idxs = np.argsort(np.asarray(list(zip(pbWeight, arrayIn)))[:, 1])
# Inserting [0.0, arrayIn[idxs[0]]] is needed when few samples are generated and
# a percentile smaller than the first pb weight is requested. Otherwise the median
# is returned.
sortedWeightsAndPoints = np.insert(np.asarray(list(zip(pbWeight[idxs], arrayIn[idxs]))), 0, [0.0, arrayIn[idxs[0]]], axis=0)
weightsCDF = np.cumsum(sortedWeightsAndPoints[:, 0])
if not hasattr(percent, '__iter__'):
  # percent is int or float
  result = self._computeSingleWeightedPercentile(percent, weightsCDF, sortedWeightsAndPoints)
else:
  # percent is iterable
  result = [self._computeSingleWeightedPercentile(pct, weightsCDF, sortedWeightsAndPoints) for pct in percent]

return result
The computation for a single percentile uses the function "_computeSingleWeightedPercentile". In this case, I would suggest assuming that the _computeWeightedPercentile function accepts a list of percents instead of accepting either a float or a list. If you change the assumption for percent, you can avoid the if...else statement.
"_computeSingleWeightedPercentile" is there to avoid copy/pasting the same code in the if..else.. statement. If I had written this originally, I would have used numpy.interp which would take float, list, or array and would have avoided writing "_computeSingleWeightedPercentile" and the if..else.. statement. The only reason I did not implement this is that it changes the way the percentiles are calculated (linear interpolation instead of the midpoint between the two nearest data points) and would require updating many gold files. I think numpy.interp would speed the calculation because the for loop would be eliminated.
I can change the "_computeWeightedPercentile" function to accept a list of percents instead of accepting a float or a list. If you think it would be worth doing the numpy.interp method, I could try that as well.
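For reference, a minimal sketch of the numpy.interp idea discussed above (illustrative only; the function name and the handling of the zero-weight prepend are my assumptions, not the code in this PR):

```python
import numpy as np

def weighted_percentile_interp(values, weights, percents):
  """Illustrative sketch: weighted percentile via linear interpolation on the weight CDF."""
  values = np.asarray(values, dtype=float)
  weights = np.asarray(weights, dtype=float)
  order = np.argsort(values)
  # prepend the smallest value with zero weight so that very small requested
  # percentiles resolve to the minimum instead of being clipped
  pts = np.concatenate(([values[order][0]], values[order]))
  cdf = np.cumsum(np.concatenate(([0.0], weights[order])))
  cdf /= cdf[-1]
  # np.interp accepts a scalar or an array of percents, so no if/else branch is needed
  return np.interp(percents, cdf, pts)
```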
"_computeWeightedPercentile" has been changed to accept a list of percents. With the new refactoring, "_computeSingleWeightedPercentile" is no longer necessary.
quantile = []
for label, group in targDa.groupby(self.pivotParameter):
  qtl = self._computeWeightedPercentile(group.values, targWeight, percent=percent)
  quantile.append(qtl)
da = xr.DataArray(quantile, dims=(self.pivotParameter, 'percent'), coords={'percent': percent, self.pivotParameter: self.pivotValue})
else:
  da = xr.DataArray(quantile,dims=('percent'),coords={'percent':percent})
  quantile = self._computeWeightedPercentile(targDa.values, targWeight, percent=percent)
  da = xr.DataArray(quantile, dims=('percent'), coords={'percent': percent})
I think the actual improvement happens in the calculation of the quantiles, which is very good for the performance of our code here.
Yes, the calculation of quantiles originally required an argsort every time a new percentile was calculated, so the cost scaled roughly linearly with the number of percentiles requested. I moved the argsort outside of that so the data is sorted once and multiple quantiles are then calculated from the sorted data.
The calculation of quantiles was slowed down by using "_computeWeightedPercentile" instead of the native xarray.DataSet.quantile() methods. I changed this so that native xarray.DataSet.quantile() is used when no probability weights are present or when all probability weights are equal. This speeds up the calculation of percentiles significantly. At this time, xarray does not support weighted quantiles, so if probability weights are present and not all the same, we have to use "_computeWeightedPercentile". The new formulation of "_computeWeightedPercentile" is faster than before, but still slower than native xarray.DataSet.quantile().
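Roughly, the dispatch described here looks something like the following sketch (the names and the 1-D assumption are mine, not the exact RAVEN code):

```python
import numpy as np
import xarray as xr

def quantilesForTarget(da, percents, pbWeight=None):
  """Illustrative: use native xarray quantile when weights are absent or uniform,
     otherwise fall back to a hand-rolled weighted interpolation (1-D data assumed)."""
  percents = np.atleast_1d(percents)
  if pbWeight is None or np.allclose(pbWeight, pbWeight[0]):
    # unweighted (or uniformly weighted) case: xarray handles it natively and quickly
    return da.quantile(percents).values
  # xarray has no weighted quantile yet, so interpolate on the weight CDF manually
  order = np.argsort(da.values)
  cdf = np.cumsum(np.asarray(pbWeight, dtype=float)[order])
  cdf /= cdf[-1]
  return np.interp(percents, cdf, da.values[order])
```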
# all values are the same
for pct in percent:
  subPercentileSte.append(0.0)
subPercentileSte = np.array([0.0]*len(percent))
else:
  # get KDE
  kde = stats.gaussian_kde(group.values, weights=targWeight)
As I said, we are not improving the calculation of the KDE here. We are still looping over both targets and time, which is my concern.
We cannot avoid this when probability weights are present, because xarray does not currently support weighted quantiles. Without going too far down the rabbit hole, here are some of the issues being worked on to enable that functionality:
pydata/xarray#1371
numpy/numpy#8935
numpy/numpy#9211
This is why we have to handle weighted percentiles on our own, which requires looping over targets and time for percentile and percentile standard error. I have made it so that we no longer have to loop over a third parameter (percent) as well.
By switching to the standard normal formulation of the standard error, we no longer have to loop through targets, time, or percent. This speeds up the calculation significantly. I left the KDE implementation commented out in the code in case we want to revisit it.
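For context, a minimal sketch of the standard-normal (asymptotic) form of the percentile standard error follows; the formula is the textbook one, and the weighted effective-sample-size treatment is my assumption, not necessarily line-for-line what the PR implements:

```python
import numpy as np
from scipy import stats

def percentileStandardErrorNormal(values, percents, weights=None):
  """SE of a sample percentile under a normal assumption:
     SE ~= sigma * sqrt(p*(1-p)/n) / phi(z_p), with phi the standard normal pdf."""
  values = np.asarray(values, dtype=float)
  percents = np.atleast_1d(percents)
  if weights is None:
    n = len(values)
    sigma = np.std(values, ddof=1)
  else:
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    n = 1.0 / np.sum(weights**2)                      # Kish effective sample size (assumption)
    mean = np.sum(weights * values)
    sigma = np.sqrt(np.sum(weights * (values - mean)**2))
  z = stats.norm.ppf(percents)
  # fully vectorized: no loop over targets, time, or percent is needed
  return sigma * np.sqrt(percents * (1.0 - percents) / n) / stats.norm.pdf(z)
```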
I discovered some additional enhancements for speed when using _computeWeightedPercentile. They involve using xarray.DataSet methods directly when probability weights are provided but are all equal. This will likely require new gold files for some tests. Since some gold files will already need to be updated, I'll also update _computeWeightedPercentile with numpy.interp to (hopefully) provide a faster computation. The KDE implementation is more accurate; however, it takes far too long to compute for even moderately sized samples. I'll implement the standard normal formulation of the standard error for speed.
…speed up _computeWeightedPercentile
…bility weights equal
I made a number of changes that significantly speed up the percentile and percentile standard error calculations. Whenever possible, native xarray.DataSet.quantile() is used. This yields a slightly different value than before in many cases, so I had to update the gold files for many tests. I suspect this will impact a number of plugin tests that use percentile and median.
@dgarrett622 Some comments for your consideration. Let's place this PR on hold until next Monday.
\nb Determining the percentile requires many samples, especially if the requested percentile is in
the tail of the distribution. The standard error of the percentile requires a large number of
samples for accuracy due to the asymptotic variance method used.
Could you update these lines to reflect the changes in the code? The asymptotic variance method is commented out in the code.
The changes in the code are a simplification of the asymptotic variance method. We are using the asymptotic variance method with the assumption of a normal distribution. The benefit is that this assumption gives an explicit equation that we can use. I don't think we need to make changes in the documentation, but if you would like me to remove the asymptotic variance description, I can.
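If the manual is expanded, the explicit equation under the normal assumption could be stated roughly as follows (a sketch only; \hat{\sigma} is the sample standard deviation, z_p the standard normal quantile of the requested percentile p, \varphi the standard normal density, and N the number of samples):

\widehat{\mathrm{SE}}\left(\hat{x}_p\right) \approx
  \frac{\hat{\sigma}}{\varphi\left(z_p\right)} \sqrt{\frac{p\,(1 - p)}{N}}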
# This step returns the index of the array which is < than the percentile; because
# the insertion creates another entry, this index should shift to the bigger side
indexL = utils.first(np.asarray(weightsCDF >= percent).nonzero())[0]
# This step returns the indices (list of index) of the array which is > than the percentile
indexH = utils.first(np.asarray(weightsCDF > percent).nonzero())
try:
  # if the indices exist, the desired percentile lies between the two data points
  # with indices indexL and indexH[0]. Calculate the midpoint of these two points
  result = 0.5*(sortedWeightsAndPoints[indexL,1]+sortedWeightsAndPoints[indexH[0],1])
except IndexError:
  result = sortedWeightsAndPoints[indexL,1]
weightsCDF /= weightsCDF[-1]
result = np.interp(percent, weightsCDF, sortedWeightsAndPoints[:, 1]).tolist()
I assume the difference between the percentile calculations is: originally we computed the average between two data points, while the new modification uses interpolation, which is also based on the values of the weights CDF. I do have a question here. For example, we had some discussion in the past on how to compute the weighted median. Some suggestions (https://en.wikipedia.org/wiki/Weighted_median) use the mean of the lower and upper weighted median; others argue for using the interpolated value to represent the median, as you did in your modification. I'm not an expert on the definition of the weighted median or weighted percentile. You may take a look at the literature and tell us your thoughts, and we can make a decision in the Monday meeting. @dgarrett622
Sure, I'll discuss my reasoning on Monday.
Adding the user-specified 'interpolation' parameter to median and percentile affected _computeWeightedPercentile, which is used in EconomicRatio. I rolled the fixes for #1775 into this PR since EconomicRatio required updates with this new 'interpolation' parameter.
Several comments for your consideration.
@@ -88,6 +88,11 @@ class cls.
#percent is a string type because otherwise we can't tell 95.0 from 95
# which matters because the number is used in output.
scalarSpecification.addParam("percent", InputTypes.StringListType)
# percentile has additional "interpolation" parameter
scalarSpecification.addParam("interpolation", InputTypes.StringType)
scalarSpecification.addParam("interpolation", InputTypes.StringType) | ||
if scalar == 'median': | ||
# median has additional "interpolation" parameter | ||
scalarSpecification.addParam("interpolation", InputTypes.StringType) |
see above comment
if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
  interpolation = child.parameterValues['interpolation']
else:
  self.raiseAWarning("Unrecognized 'interpolation' in {}, prefix '{}' using 'linear' instead".format(tag, prefix))
  interpolation = 'linear'
If you use makeEnumType in getInputSpecification to define the type for interpolation, you can avoid this if...else check.
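Something along these lines, if I remember the makeEnumType arguments correctly (a sketch only; the exact signature and the "descr" keyword should be double-checked against InputTypes/addParam):

```python
# sketch only: the enum name, XML type name, and descr text here are assumptions
interpolationType = InputTypes.makeEnumType("interpolation", "interpolationType",
                                            ["linear", "midpoint"])
scalarSpecification.addParam("interpolation", interpolationType,
                             descr="interpolation method used for percentile/median")
```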
if 'interpolation' not in child.parameterValues:
  interpolation = 'linear'
else:
  if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
    interpolation = child.parameterValues['interpolation']
  else:
    interpolation = 'linear'
These checks can be avoided, see the above comment.
@@ -589,32 +617,55 @@ def _computeSigma(self,arrayIn,variance,pbWeight=None):
  """
  return np.sqrt(variance)

def _computeWeightedPercentile(self,arrayIn,pbWeight,percent=0.5):
def _computeWeightedPercentile(self,arrayIn,pbWeight,interpolation,percent=[0.5]):
use interpolation='linear' instead
else:
  # currently no other options, but need to return something
  result = [0.0 for pct in percent]
If there is a default for interpolation, I think you do not need the else condition, right?
@@ -68,6 +68,7 @@ class cls.
econSpecification.addParam("threshold", InputTypes.StringType)
elif econ in ["expectedShortfall", "valueAtRisk"]:
  econSpecification.addParam("threshold", InputTypes.FloatType)
  econSpecification.addParam("interpolation", InputTypes.StringType)
Please update the type using makeEnumType and add a description using "descr".
if 'interpolation' not in child.parameterValues:
  interpolation = 'linear'
else:
  if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
    interpolation = child.parameterValues['interpolation']
  else:
    self.raiseAWarning("Unrecognized interpolation in {}, prefix '{}' using 'linear' instead".format(tag, prefix))
    interpolation = 'linear'
This can be avoided
@dgarrett622 One more comment: please update some tests to use the "interpolation=midpoint" option so that both situations are tested. Also, an email to the user group needs to be drafted to reflect the changes.
…eAtRisk, and expectedShortfall
Email draft for users: Dear users, PR #1780 makes a number of changes to the BasicStatistics and EconomicRatio PostProcessors that change some of the calculated statistics. This PR introduces the following changes:
@@ -1,2 +1,2 @@
skew_x,skew_y,vc_x,vc_y,percentile_5_x,percentile_95_x,percentile_5_y,percentile_95_y,mean_x,mean_y,kurt_x,kurt_y,median_x,median_y,max_x,max_y,min_x,min_y,samp_x,samp_y,var_x,var_y,sigma_x,sigma_y,nsen_x_x,nsen_x_y,nsen_y_x,nsen_y_y,sen_x_x,sen_x_y,sen_y_x,sen_y_y,pear_x_x,pear_x_y,pear_y_x,pear_y_y,cov_x_x,cov_x_y,cov_y_x,cov_y_y,vsen_x_x,vsen_x_y,vsen_y_x,vsen_y_y
-0.0691572462104,-0.0180192261137,0.491941113434,0.46467884342,17.732771122,181.912710136,25.5790457803,184.240434035,102.005526367,105.895162869,-0.139930546516,0.184012286626,100.962661272,105.082835069,256.130010971,291.892206598,-41.8819801615,-67.3356620259,1000.0,1000.0,2518.10387863,2421.35264613,50.1807122173,49.2072418057,1.0,-0.0477758254403,-0.042627293499,1.0,1.0,-0.0460209710208,-0.0442527414791,1.0,1.0,-0.0451281966535,-0.0451281966535,1.0,2518.10387863,-111.432999959,-111.432999959,2421.35264613,1.0,-0.0460209710208,-0.0442527414791,1.0
@dgarrett622 I have a question here. Since you have switched the calculation to "midpoint", why is the calculation different from the current devel version? I mean, why do we need to regold it as well?
@wangcj05 The calculation now uses xarray.DataSet.quantile instead of _computeWeightedPercentile and produces a slightly different result, even when using interpolation="midpoint". I think using xarray to calculate statistics whenever we can is better than using user defined functions that are slower and more prone to errors.
@wangcj05 I looked a little more into this discrepancy. When using _computeWeightedPercentile for a variable with N values, the smallest value is added with weight zero. Adding this extra value appears to cause the discrepancy with xarray: _computeWeightedPercentile works with N+1 values where xarray works with N values. For many requested percentiles both methods return the same value, but because the two methods use arrays of different lengths, they may return different values. With the data from this test I get the same value for the 5th percentile using each method and differing values for the 95th percentile. As another example, when requesting a very small percentile (1e-9), _computeWeightedPercentile returns the smallest value where xarray returns the average of the two smallest values.
The differences in this particular example are small, on the order of 0.06% between the values. As written, the majority of cases will be calculated using xarray. The only time _computeWeightedPercentile would be used is if the probability weights are not all the same (xarray does not currently support weighted quantile calculation; I think that is because numpy does not have it implemented). I still think using the standard xarray library wherever possible is the best way forward.
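To make the N versus N+1 anchor-point difference concrete, here is a small self-contained comparison (illustrative only; the exact numbers depend on the data, and np.percentile is used as a stand-in for xarray's quantile):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100.0, scale=50.0, size=1000)
weights = np.full(data.size, 1.0 / data.size)      # equal probability weights

# xarray's quantile on unweighted data behaves essentially like numpy's percentile (N anchors)
xrLike = np.percentile(data, 95)

# weighted-CDF route: the smallest value added with zero weight gives N+1 anchors
order = np.argsort(data)
pts = np.concatenate(([data[order][0]], data[order]))
cdf = np.cumsum(np.concatenate(([0.0], weights[order])))
cdf /= cdf[-1]
weighted = np.interp(0.95, cdf, pts)

print(xrLike, weighted)   # usually very close, but not guaranteed to be identical
```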
Sounds good to me. Thanks for checking it.
@dgarrett622 Could you resolve the conflicts in this PR?
Job Test qsubs sawtooth on 2cf63ff : invalidated by @wangcj05
Job Test CentOS 7 on 2cf63ff : invalidated by @wangcj05
Job Test qsubs sawtooth on 2cf63ff : invalidated by @dgarrett622 (failing test does not use PostProcessor, run again)
PR is good.
Checklist is good.
@dgarrett622 I have merged your PR. Could you send the drafted email to the RAVEN user group?
@wangcj05 Email sent.
* Adapt HERON to use ravenframework
* Fix heron executable script
* Modify get_raven_loc
* Retrain linear_rom.pk
* Fix more import statements
* Regold Labels test due to idaholab/raven#1780 PR
* Add Analytic.xslsx to Cashflows test
* Regold Cashflow tests
* Regold Multirun Opt
Pull Request Description
What issue does this change request address? (Use "#" before the issue to link it, i.e., #42.)
Closes #1779
Closes #1775
What are the significant changes in functionality due to this change request?
For Change Control Board: Change Request Review
The following review must be completed by an authorized member of the Change Control Board.
…<internalParallel> to True.
…(raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.