
Percentile speed up and standard error update #1780

Merged
merged 19 commits into idaholab:devel from speedy_delivery on Apr 6, 2022

Conversation

@dgarrett622 (Collaborator) commented Feb 24, 2022


Pull Request Description

What issue does this change request address? (Use "#" before the issue to link it, i.e., #42.)

Closes #1779
Closes #1775

What are the significant changes in functionality due to this change request?
  • Use the xarray.Dataset.quantile() method whenever possible (for median and percentile when no probability weights are given or when all probability weights are equal)
  • Speed up the calculation of weighted percentiles in _computeWeightedPercentile (xarray's native methods do not support weighted quantiles/percentiles)
  • Add a user parameter 'interpolation' to median, percentile, and value at risk
  • Change the percentile standard error to the standard normal formulation because the KDE implementation increases computation time too much when requesting multiple percentiles on multiple variables
  • Make EconomicRatio compatible with BasicStatistics
  • Change the value at risk standard error to the standard normal formulation because the KDE implementation increases computation time too much

For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.); see the wiki for details.
  • 4. Automated tests should pass, including run_tests, pylint, manual building, and xsd tests. If there are changes to Simulation.py or JobHandler.py, the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added that sets the node <internalParallel> to True in the XML block.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed or added, is the analytic documentation updated or added accordingly?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) has been changed, the associated documentation must be reviewed to ensure the text matches the example.

@dgarrett622 added the task label (This tag should be used for any new capability, improvement or enhancement) on Feb 24, 2022
@dgarrett622 requested a review from wangcj05 on February 24, 2022 16:07
@dgarrett622 self-assigned this on Feb 24, 2022
@dgarrett622
Collaborator Author

@wangcj05 The computation should be significantly faster now. Is this fast enough, or do we need to implement the standard normal method?

Before merging, I would like to include some notes in the documentation about the sample size needed for percentile and its standard error calculation from the paper.

@wangcj05
Collaborator

@dgarrett622 Thanks for the updates. I have checked the test; it reduced the time to ~170s now, whereas the original test took about ~130s. I think the changes are good for now, but I would like to keep the issue open. The only thing I'm worried about is the nonlinear increase in time; we may need to add a benchmark test in the future to prevent this behavior.

@wangcj05 (Collaborator) left a comment

I have some comments for your consideration.

Comment on lines 598 to 615
      @ Out, result, float, the percentile
      @ In, percent, float/list/numpy.array, the percentile(s) that needs to be computed (between 0.01 and 1.0)
      @ Out, result, float/list, the percentile(s)
    """

    # only do the argsort once for all requested percentiles
    idxs = np.argsort(np.asarray(list(zip(pbWeight,arrayIn)))[:,1])
    # Inserting [0.0,arrayIn[idxs[0]]] is needed when few samples are generated and
    # a percentile that is < than the first pb weight is requested. Otherwise the median
    # is returned.
    sortedWeightsAndPoints = np.insert(np.asarray(list(zip(pbWeight[idxs],arrayIn[idxs]))),0,[0.0,arrayIn[idxs[0]]],axis=0)
    weightsCDF = np.cumsum(sortedWeightsAndPoints[:,0])
    if not hasattr(percent, '__iter__'):
      # percent is int or float
      result = self._computeSingleWeightedPercentile(percent, weightsCDF, sortedWeightsAndPoints)
    else:
      # percent is iterable
      result = [self._computeSingleWeightedPercentile(pct, weightsCDF, sortedWeightsAndPoints) for pct in percent]

    return result
Collaborator

The computation for a single percentile uses the function "_computeSingleWeightedPercentile". In this case, I would suggest assuming that the _computeWeightedPercentile function accepts a list of percents, rather than having the option of a float or a list. If you change that assumption for percent, you can avoid the if...else... statement.

Collaborator Author

"_computeSingleWeightedPercentile" is there to avoid copy/pasting the same code in the if..else.. statement. If I had written this originally, I would have used numpy.interp which would take float, list, or array and would have avoided writing "_computeSingleWeightedPercentile" and the if..else.. statement. The only reason I did not implement this is that it changes the way the percentiles are calculated (linear interpolation instead of the midpoint between the two nearest data points) and would require updating many gold files. I think numpy.interp would speed the calculation because the for loop would be eliminated.

I can change the "_computeWeightedPercentile" function to accept a list of percents instead of accepting either a float or a list. If you think it would be worth it to use the numpy.interp method, I could try that as well.
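For illustration, a minimal sketch of the numpy.interp behavior described above (the arrays are hypothetical stand-ins for the weight CDF and the sorted data points, not RAVEN code):

import numpy as np

weightsCDF = np.array([0.0, 0.25, 0.5, 0.75, 1.0])     # cumulative probability weights
sortedPoints = np.array([1.0, 1.0, 2.0, 3.0, 4.0])      # data sorted to match the CDF

# np.interp accepts a scalar or an array-like of query points with the same call,
# so no float-vs-list branching (and no per-percentile loop) is needed
single = np.interp(0.5, weightsCDF, sortedPoints)                  # scalar result
several = np.interp([0.05, 0.5, 0.95], weightsCDF, sortedPoints)   # ndarray result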

Collaborator Author

"_computeWeightedPercentile" has been changed to accept a list of percents. With the new refactoring, "_computeSingleWeightedPercentile" is no longer necessary.

Comment on lines 987 to 994
          quantile = []
          for label, group in targDa.groupby(self.pivotParameter):
            qtl = self._computeWeightedPercentile(group.values, targWeight, percent=percent)
            quantile.append(qtl)
          da = xr.DataArray(quantile, dims=(self.pivotParameter, 'percent'), coords={'percent': percent, self.pivotParameter: self.pivotValue})
        else:
          da = xr.DataArray(quantile,dims=('percent'),coords={'percent':percent})
          quantile = self._computeWeightedPercentile(targDa.values, targWeight, percent=percent)
          da = xr.DataArray(quantile, dims=('percent'), coords={'percent': percent})
Collaborator

I think the actual improvement happens in the calculation of the quantiles, which is very good for the code's performance here.

Collaborator Author

Yes, the calculation of quantiles originally required an argsort every time a new percentile was calculated, so the cost scaled roughly linearly with the number of percents requested. I moved the argsort outside of this so that the data is sorted once and multiple quantiles are then calculated from that sorted data.
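For illustration, a minimal sketch of the "sort once, read many percentiles" idea (names and the simple selection rule are illustrative, not the actual RAVEN implementation):

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(size=1000)
weights = np.full(values.size, 1.0/values.size)
percents = [0.05, 0.25, 0.5, 0.75, 0.95]

# the argsort and the cumulative-weight curve are built once...
order = np.argsort(values)
sortedVals = values[order]
weightsCDF = np.cumsum(weights[order])
weightsCDF /= weightsCDF[-1]

# ...then every requested percentile is read off the same sorted arrays
# (here with a simple "first point whose cumulative weight reaches p" rule)
results = [sortedVals[np.searchsorted(weightsCDF, p)] for p in percents]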

Collaborator Author

The calculation of quantiles was slowed down by using "_computeWeightedPercentile" instead of the native xarray.Dataset.quantile() method. I changed this so that if no probability weights are present, or if all the probability weights are equal, the native xarray.Dataset.quantile() is used. This speeds up the calculation of percentiles significantly. At this time, xarray does not support weighted quantiles, so if probability weights are present and they are not all the same, we have to use "_computeWeightedPercentile." The new formulation of "_computeWeightedPercentile" is faster than before, but it is still slower than the native xarray.Dataset.quantile().
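For illustration, a minimal sketch of that dispatch (illustrative helper for a 1-D DataArray, not RAVEN's actual API):

import numpy as np
import xarray as xr

def quantilesFastPath(da, percents, dim, pbWeights=None):
  # no probability weights, or all weights equal: the native (unweighted)
  # xarray quantile gives the same answer and is much faster
  if pbWeights is None or np.allclose(pbWeights, pbWeights[0]):
    return da.quantile(percents, dim=dim)
  # non-uniform weights: xarray has no weighted quantile yet, so sort once
  # and interpolate on the cumulative-weight curve (1-D case for brevity)
  order = np.argsort(da.values)
  weightsCDF = np.cumsum(np.asarray(pbWeights)[order])
  return np.interp(percents, weightsCDF/weightsCDF[-1], da.values[order])

# e.g. quantilesFastPath(xr.DataArray(np.random.rand(100)), [0.05, 0.95], dim='dim_0')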

          # all values are the same
          for pct in percent:
            subPercentileSte.append(0.0)
          subPercentileSte = np.array([0.0]*len(percent))
        else:
          # get KDE
          kde = stats.gaussian_kde(group.values, weights=targWeight)
Collaborator

As I said, we are not improving the calculation of the KDE here; we are still looping over both targets and time, which is my concern.

Collaborator Author

We cannot avoid this when there are probability weights present, because xarray does not currently support weighted quantiles. Without going too far down the rabbit hole, here are some of the issues being worked on to enable that functionality:

pydata/xarray#1371
numpy/numpy#8935
numpy/numpy#9211

This is why we have to handle weighted percentiles on our own, which requires looping over targets and time for percentile and percentile standard error. I have made it so that we no longer have to loop over a third parameter (percent) as well.

Collaborator Author

By switching to the standard normal formulation of the standard error, we no longer have to loop through targets, time, or percent. This speeds up the calculation significantly. I left the KDE implementation commented out in the code in case we want to revisit it.
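For illustration, a sketch of the standard-normal (asymptotic) formulation described above, vectorized over the requested percents (names and the exact expression are illustrative; the BasicStatistics implementation may differ in detail):

import numpy as np
from scipy import stats

def percentileSteNormal(samples, percents):
  # asymptotic standard error of the p-th sample percentile, assuming the
  # underlying density is roughly normal:
  #   Var(x_p) ~ p*(1-p)/(n*f(x_p)**2), with f(x_p) = phi(z_p)/sigma
  samples = np.asarray(samples, dtype=float)
  n = samples.size
  sigma = samples.std(ddof=1)
  p = np.asarray(percents, dtype=float)     # percents as fractions in (0, 1)
  z = stats.norm.ppf(p)                     # standard-normal quantiles
  return sigma*np.sqrt(p*(1.0 - p)/n)/stats.norm.pdf(z)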

@dgarrett622
Collaborator Author

I discovered some additional enhancements for speed when using _computeWeightedPercentile. They involve using xarray.Dataset methods directly when probability weights are provided but are all equal. This will likely require new gold files for some tests. Since some gold files will already need to be updated, I'll also update _computeWeightedPercentile with numpy.interp to (hopefully) provide a faster computation.

The KDE implementation is more accurate; however, it takes far too long to compute for even moderately sized samples. I'll implement the standard normal formulation of the standard error for speed.

@dgarrett622 dgarrett622 changed the title Percentile and percentile standard error speed up/default behavior Percentile speed up and standard error update Feb 28, 2022
@dgarrett622
Collaborator Author

I made a number of changes that significantly speed up the percentile and percentile standard error calculations. Whenever possible, the native xarray.Dataset.quantile() is used. This yields a slightly different value than before in many cases, so I had to update the gold files for many tests. I suspect this will impact a number of plugin tests that use percentile and median.

@wangcj05 (Collaborator) left a comment

@dgarrett622 Some comments for your consideration. Let's place this PR on hold until next Monday.

Comment on lines +167 to +169
\nb Determining the percentile requires many samples, especially if the requested percentile is in
the tail of the distribution. The standard error of the percentile requires a large number of
samples for accuracy due to the asymptotic variance method used.
Collaborator

Could you update these lines to reflect the changes in the code? The asymptotic variance method is commented out in the code.

Collaborator Author

The changes in the code are a simplification of the asymptotic variance method. We are using the asymptotic variance method with the assumption of a normal distribution. The benefit is that this assumption gives an explicit equation that we can use. I don't think we need to make changes in the documentation, but if you would like me to remove the asymptotic variance description, I can.
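For reference, the explicit expression that the normal assumption yields is the usual asymptotic-variance result with the density evaluated analytically (a sketch; the notation in the manual may differ):

\mathrm{SE}(\hat{x}_p) \approx \sqrt{\frac{p(1-p)}{n\,f(x_p)^2}} = \frac{\sigma}{\varphi(z_p)}\sqrt{\frac{p(1-p)}{n}}, \qquad z_p = \Phi^{-1}(p)

where \varphi and \Phi are the standard normal pdf and cdf, \sigma is the sample standard deviation, and n is the number of samples.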

Comment on lines 607 to 609
      # This step returns the index of the array which is < than the percentile, because
      # the insertion create another entry, this index should shift to the bigger side
      indexL = utils.first(np.asarray(weightsCDF >= percent).nonzero())[0]
      # This step returns the indices (list of index) of the array which is > than the percentile
      indexH = utils.first(np.asarray(weightsCDF > percent).nonzero())
      try:
        # if the indices exists that means the desired percentile lies between two data points
        # with index as indexL and indexH[0]. Calculate the midpoint of these two points
        result = 0.5*(sortedWeightsAndPoints[indexL,1]+sortedWeightsAndPoints[indexH[0],1])
      except IndexError:
        result = sortedWeightsAndPoints[indexL,1]
      weightsCDF /= weightsCDF[-1]
      result = np.interp(percent, weightsCDF, sortedWeightsAndPoints[:, 1]).tolist()
Collaborator

I assume the difference in the percentile calculation is that originally we computed the average of the two bracketing data points, and the new modification changes this to interpolation, which is also based on the values of the weights CDF. I do have a question here. We had some discussion in the past on how to compute the weighted median. Some suggest (https://en.wikipedia.org/wiki/Weighted_median) using the mean of the lower and upper weighted medians; others argue for using the interpolated value to represent the median, as you did in your modification. I'm not an expert on the definition of the weighted median or weighted percentile. You may take a look at the literature and tell us your thoughts, and we can make a decision in Monday's meeting. @dgarrett622

Collaborator Author

Sure, I'll discuss my reasoning on Monday.
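For reference ahead of that discussion, a tiny hypothetical example (equal weights, simplified from the code above) of how the two conventions can differ for the median:

import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0])
weightsCDF = np.array([0.25, 0.5, 0.75, 1.0])    # cumulative equal weights

# "midpoint" convention (previous behavior): average the data points whose
# cumulative weights bracket p = 0.5
lower = vals[np.searchsorted(weightsCDF, 0.5)]                 # first CDF >= 0.5 -> 2.0
upper = vals[np.searchsorted(weightsCDF, 0.5, side='right')]   # first CDF > 0.5  -> 3.0
midpointMedian = 0.5*(lower + upper)                           # 2.5

# "linear" convention (np.interp on the weight CDF): the interpolated value
linearMedian = np.interp(0.5, weightsCDF, vals)                # 2.0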

@dgarrett622
Collaborator Author

Adding the user-specified 'interpolation' parameter to median and percentile affected _computeWeightedPercentile, which is used in EconomicRatio. I rolled the fixes for #1775 into this PR since EconomicRatio required updates for this new 'interpolation' parameter.

@dgarrett622 dgarrett622 requested a review from wangcj05 March 8, 2022 15:48
@moosebuild

All jobs on ba23d17 : invalidated by @wangcj05

retest

@wangcj05 (Collaborator) left a comment

Several comments for your consideration.

@@ -88,6 +88,11 @@ class cls.
      #percent is a string type because otherwise we can't tell 95.0 from 95
      # which matters because the number is used in output.
      scalarSpecification.addParam("percent", InputTypes.StringListType)
      # percentile has additional "interpolation" parameter
      scalarSpecification.addParam("interpolation", InputTypes.StringType)
Collaborator

Please use makeEnumType to create a type for interpolation, and please also add a description using "descr"; here is an example: [screenshot not captured in this transcript]
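A sketch of what that suggestion might look like, assuming the usual RAVEN InputTypes/InputData pattern and reusing scalarSpecification from the diff above (the import path, exact signatures, and wording of the description are assumptions):

from utils import InputTypes   # import path is illustrative

# enumerated type restricts the attribute to the supported values up front
interpolationEnum = InputTypes.makeEnumType("interpolation", "interpolationType",
                                            ["linear", "midpoint"])
scalarSpecification.addParam("interpolation", interpolationEnum,
                             descr="method used to interpolate between data points "
                                   "when computing percentiles or the median")

With an enumerated type, unrecognized values are rejected at input-parsing time, so the downstream if...else... checks and the warning become unnecessary.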

        scalarSpecification.addParam("interpolation", InputTypes.StringType)
      if scalar == 'median':
        # median has additional "interpolation" parameter
        scalarSpecification.addParam("interpolation", InputTypes.StringType)
Collaborator

see above comment

Comment on lines 358 to 362
        if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
          interpolation = child.parameterValues['interpolation']
        else:
          self.raiseAWarning("Unrecognized 'interpolation' in {}, prefix '{}' using 'linear' instead".format(tag, prefix))
          interpolation = 'linear'
Collaborator

If you use makeEnumType in getInputSpecification to define the type for interpolation, you can avoid this if...else... check.

Comment on lines 372 to 378
      if 'interpolation' not in child.parameterValues:
        interpolation = 'linear'
      else:
        if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
          interpolation = child.parameterValues['interpolation']
        else:
          interpolation = 'linear'
Collaborator

These checks can be avoided; see the above comment.

@@ -589,32 +617,55 @@ def _computeSigma(self,arrayIn,variance,pbWeight=None):
"""
return np.sqrt(variance)

def _computeWeightedPercentile(self,arrayIn,pbWeight,percent=0.5):
def _computeWeightedPercentile(self,arrayIn,pbWeight,interpolation,percent=[0.5]):
Collaborator

use interpolation='linear' instead

Comment on lines 642 to 644
    else:
      # currently no other options, but need to return something
      result = [0.0 for pct in percent]
Collaborator

If there is a default for interpolation, I think you do not need the else condition, right?

@@ -68,6 +68,7 @@ class cls.
      econSpecification.addParam("threshold", InputTypes.StringType)
    elif econ in ["expectedShortfall", "valueAtRisk"]:
      econSpecification.addParam("threshold", InputTypes.FloatType)
      econSpecification.addParam("interpolation", InputTypes.StringType)
Collaborator

Please update the type using makeEnumType and add a description using "descr".

Comment on lines 245 to 252
      if 'interpolation' not in child.parameterValues:
        interpolation = 'linear'
      else:
        if child.parameterValues['interpolation'] in ['linear', 'midpoint']:
          interpolation = child.parameterValues['interpolation']
        else:
          self.raiseAWarning("Unrecognized interpolation in {}, prefix '{}' using 'linear' instead".format(tag, prefix))
          interpolation = 'linear'
Collaborator

This can be avoided

@wangcj05
Collaborator

@dgarrett622 One more comment: please update some tests to use the option interpolation="midpoint" so that both situations are tested. Also, an email to the user group needs to be drafted to announce the changes.

@dgarrett622
Collaborator Author

Email draft for users:

Dear users,

PR #1780 makes a number of changes to the BasicStatistics and EconomicRatio PostProcessors that change some of the calculated statistics. This PR introduces the following changes:

  • median and percentile in BasicStatistics now have an additional parameter, "interpolation," that can take the values "linear" or "midpoint." This parameter influences how RAVEN interpolates between data points to calculate the median or percentile. The default is "linear." To get the previous behavior, use "midpoint."
  • valueAtRisk and expectedShortfall in EconomicRatio have the same "interpolation" parameter as median and percentile.
  • Due to the long computational time needed to calculate the standard error for percentile and valueAtRisk, the standard error method has been changed. The new method is much faster; however, it gives slightly different values.
  • xarray has been used wherever possible to speed up the computations. This may result in slightly different values for median, percentile, and valueAtRisk.

@@ -1,2 +1,2 @@
skew_x,skew_y,vc_x,vc_y,percentile_5_x,percentile_95_x,percentile_5_y,percentile_95_y,mean_x,mean_y,kurt_x,kurt_y,median_x,median_y,max_x,max_y,min_x,min_y,samp_x,samp_y,var_x,var_y,sigma_x,sigma_y,nsen_x_x,nsen_x_y,nsen_y_x,nsen_y_y,sen_x_x,sen_x_y,sen_y_x,sen_y_y,pear_x_x,pear_x_y,pear_y_x,pear_y_y,cov_x_x,cov_x_y,cov_y_x,cov_y_y,vsen_x_x,vsen_x_y,vsen_y_x,vsen_y_y
-0.0691572462104,-0.0180192261137,0.491941113434,0.46467884342,17.732771122,181.912710136,25.5790457803,184.240434035,102.005526367,105.895162869,-0.139930546516,0.184012286626,100.962661272,105.082835069,256.130010971,291.892206598,-41.8819801615,-67.3356620259,1000.0,1000.0,2518.10387863,2421.35264613,50.1807122173,49.2072418057,1.0,-0.0477758254403,-0.042627293499,1.0,1.0,-0.0460209710208,-0.0442527414791,1.0,1.0,-0.0451281966535,-0.0451281966535,1.0,2518.10387863,-111.432999959,-111.432999959,2421.35264613,1.0,-0.0460209710208,-0.0442527414791,1.0
Collaborator

@dgarrett622 I have a question here. Since you have switched the calculation to "midpoint", why is the calculation different from the current devel version? I mean, why do we need to regold it?

Collaborator Author

@wangcj05 The calculation now uses xarray.Dataset.quantile instead of _computeWeightedPercentile and produces a slightly different result, even when using interpolation="midpoint". I think using xarray to calculate statistics whenever we can is better than using user-defined functions that are slower and more prone to errors.

Collaborator Author

@wangcj05 I looked a little more into this discrepancy. When using _computeWeightedPercentile for a variable with N values, the smallest value is appended with weight zero. Adding this extra value appears to cause the discrepancy with xarray: _computeWeightedPercentile works with N+1 values where xarray works with N values. For many requested percentiles both methods return the same value, but because the two methods use arrays of different lengths, they may return different values. Using the data from this test, I get the same value for the 5th percentile with each method and differing values for the 95th percentile. As another example, when requesting a very small percentile (1e-9), _computeWeightedPercentile returns the smallest value, whereas xarray returns the average of the two smallest values.

The differences in this particular example are small, on the order of a 0.06% difference between the values. As written, the majority of cases will be calculated using xarray. The only time _computeWeightedPercentile would be used is if the probability weights are not all the same (xarray does not currently support weighted quantile calculation; I think that is because numpy does not have it implemented). I still think using a well-established library like xarray wherever possible is the best way forward.
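For illustration, a small hypothetical example of the N vs. N+1 behavior described above (made-up numbers, not the data from this test; np.percentile with method="midpoint" stands in for xarray and assumes numpy >= 1.22):

import numpy as np

vals = np.array([10.0, 20.0, 30.0, 40.0])
wts = np.full(4, 0.25)

# N-point route (xarray-like midpoint rule): a vanishingly small percentile
# averages the two smallest values. numpy takes percents in [0, 100], so the
# fraction 1e-9 becomes 1e-7 here.
tinyXr = np.percentile(vals, 1e-7, method="midpoint")              # 15.0

# (N+1)-point route of _computeWeightedPercentile: the smallest value is
# prepended with zero weight, so the same tiny percentile lands exactly on it
pts = np.insert(vals, 0, vals[0])
weightsCDF = np.cumsum(np.insert(wts, 0, 0.0))
tinyWeighted = np.interp(1e-9, weightsCDF/weightsCDF[-1], pts)     # 10.0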

Collaborator

Sounds good to me. Thanks for checking it.

@wangcj05
Collaborator

@dgarrett622 Could you resolve the conflicts in this PR?

@moosebuild

Job Test qsubs sawtooth on 2cf63ff : invalidated by @wangcj05

@moosebuild

Job Test CentOS 7 on 2cf63ff : invalidated by @wangcj05

@moosebuild

Job Test qsubs sawtooth on 2cf63ff : invalidated by @dgarrett622

failing test does not use PostProcessor, run again

@wangcj05 (Collaborator) left a comment

PR is good.

@wangcj05
Collaborator

wangcj05 commented Apr 6, 2022

Checklist is good.

@wangcj05 wangcj05 merged commit 8d60b5f into idaholab:devel Apr 6, 2022
@wangcj05
Collaborator

wangcj05 commented Apr 6, 2022

@dgarrett622 I have merged your PR, could you send the drafted email to raven user group?

@dgarrett622
Collaborator Author

@wangcj05 email sent.

dylanjm added a commit to dylanjm/HERON that referenced this pull request Apr 6, 2022
PaulTalbot-INL pushed a commit to idaholab/HERON that referenced this pull request Apr 12, 2022
* Adapt HERON to use ravenframework

* Fix heron executable script

* Modify get_raven_loc

* Retrain linear_rom.pk

* Fix more import statements

* Regold Labels test due to idaholab/raven#1780 PR

* Add Analytic.xslsx to Cashflows test

* Regold Cashflow tests

* Regold Multirun Opt
@dgarrett622 dgarrett622 deleted the speedy_delivery branch April 19, 2022 21:24
@wangcj05 wangcj05 added the RAVENv2.2 for RAVENv2.2 Release label Jun 16, 2022
Labels: RAVENv2.2 (for RAVENv2.2 Release), task (This tag should be used for any new capability, improvement or enhancement)
Projects: None yet
3 participants