
Conversation

@nbodini (Collaborator) commented Sep 7, 2021

This PR updates the filters applied in the plant_analysis class.
Specifically, the proposed changes will:

  1. remove the frozen/unresponsive sensor filter as that is not relevant when dealing with reanalysis atmospheric data.
  2. add a range filter for temperature.
  3. re-implement outlier detection filters, which were lost when the option for multiple time resolutions was added to the code (a rough illustrative sketch of the robust regression approach follows this list). Specifically:
    a) at monthly resolution, a robust linear regression approach is used to detect outliers. This is the same approach that was originally used in the class.
    b) at daily/hourly resolution, a bin filter approach is used. This is consistent with what is done, at the turbine level, in the TIE calculation. A robust linear regression approach is not appropriate at these time resolutions because the relationship between wind speed and power is not linear at such fine scales.
  4. add uncertainty quantification connected to the outlier filtering process, through the parameter uncertainty_outlier.
  5. update the plot_reanalysis_gross_energy_data function so that it now reflects the time resolution used in the main AEP calculation (previously, it always assumed monthly resolution) and detects outliers with the approach specific to that time resolution (robust linear regression vs. bin filter).
  6. update the examples to reflect the changes above. Note that the mean AEP predicted in the examples is now slightly (<5%) larger than in the previous version; this is expected, because a few outliers with lower-than-normal energy production are now removed from the calculation.
  7. update test results to reflect the numerical changes.
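
For readers unfamiliar with the approach in item 3a, the following is a minimal, hypothetical sketch of flagging monthly outliers with a robust (Huber) linear regression of gross energy on wind speed; the function name and inputs are illustrative and are not the plant_analysis implementation.

# Illustrative sketch only -- not the plant_analysis implementation.
import numpy as np
from sklearn.linear_model import HuberRegressor

def flag_monthly_outliers(wind_speed, gross_energy, epsilon=1.35):
    """Flag monthly points that a robust (Huber) linear fit treats as outliers."""
    model = HuberRegressor(epsilon=epsilon)  # epsilon controls sensitivity to outliers
    model.fit(np.asarray(wind_speed).reshape(-1, 1), gross_energy)
    return model.outliers_  # boolean mask: True where a point is flagged

In the PR itself, the strength of the outlier filtering is sampled per Monte Carlo iteration through the uncertainty_outlier parameter described in item 4.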

@codecov-commenter commented Sep 7, 2021

Codecov Report

Merging #173 (757d227) into develop (082d719) will decrease coverage by 0.16%.
The diff coverage is 73.68%.


@@             Coverage Diff             @@
##           develop     #173      +/-   ##
===========================================
- Coverage    66.66%   66.50%   -0.17%     
===========================================
  Files           23       24       +1     
  Lines         1758     1803      +45     
===========================================
+ Hits          1172     1199      +27     
- Misses         586      604      +18     
Impacted Files Coverage Δ
operational_analysis/methods/plant_analysis.py 94.63% <70.58%> (-1.38%) ⬇️
...analysis/methods/turbine_long_term_gross_energy.py 96.95% <100.00%> (ø)
operational_analysis/toolkits/filters.py 86.56% <100.00%> (-5.98%) ⬇️
operational_analysis/__init__.py 68.96% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@nbodini (Collaborator, Author) commented Oct 4, 2021

After way too many commits today, I have a question for the rest of the team (@jordanperr, @RHammond2, @ejsimley).

I have set up the code so that the outlier detection algorithm is NOT the default (outlier_detection=False, line 84 in the plant_analysis class).
If I run the code (with outlier_detection=False and random.seed(42) to remove the random aspect of the Monte Carlo process) on the Engie data with many Monte Carlo iterations (say 2,000), then I get the same results (in terms of mean AEP, etc.) as the previous version of the class, as Jordan asked for the last time we talked about this. To me, this means the code is working as expected.
However, if I run the comparison with only a small number of Monte Carlo iterations (say 10, as in the tests), the results are significantly different from the previous version of the class, even when setting random.seed(42), and this causes the tests to keep failing.

The MANY commits from today were me trying to understand which new part of the plant_analysis class is causing this issue, and it looks like it is coming from lines 792-3 ("outlier_threshold": np.random.randint(self.uncertainty_outlier[0] * 10, (self.uncertainty_outlier[1]) * 10, self.num_sim) / 10.).
The very weird thing is that those lines should have no impact on the calculation when the code is run with outlier_detection=False, and in fact there is no impact on the results when using many Monte Carlo iterations. However, they cause this weird issue when considering only a few iterations.

Right now, I have commented out those two lines, and the tests are basically all passing (except for a VERY small discrepancy in the results of the check_simulation_results_gbm_daily test: 12.794527 vs. 12.794528 for the mean AEP and 10.609839 vs. 10.609942 for the AEP stdev). If I put those two lines back, the test results (with just 10 iterations) are way off for ALL the regression setups (for example, something like 12.09 vs. 11.40 for the mean AEP in the monthly linear regression case). But again, those differences are not there if the Monte Carlo simulation is run with many more iterations.

Do you have an idea of why this is happening? Is there something that can change the order of the random draws even with a set random.seed?
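
One possible explanation, shown here as a minimal standalone sketch rather than code from this PR: any extra call to NumPy's global random stream (such as the np.random.randint on lines 792-3) advances that stream and shifts every subsequent draw, so individual Monte Carlo realizations change even with a fixed seed, while the long-run statistics stay the same.

# Standalone illustration (not PR code): an extra draw from the global stream
# changes all subsequent draws, even though the seed is identical.
import numpy as np

np.random.seed(42)
a = np.random.normal(size=3)           # draws taken directly after seeding

np.random.seed(42)
_ = np.random.randint(0, 10, size=3)   # extra call consumes part of the stream...
b = np.random.normal(size=3)           # ...so these values differ from `a`

print(np.allclose(a, b))  # False: individual draws shift even though their distribution is unchanged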

@ejsimley (Collaborator) left a comment

Overall I think this looks good. Please see my comments in the code where I suggest making some minor changes and adding a bit to the documentation.

I also have a minor comment on the example 02_plant_aep_analysis notebook. In the description for "Step 8: Set up Monte Carlo inputs" I'd suggest specifying that outlier_detection is False by default, so outlier detection is not included in this example. Also, thanks for updating this list!

@nbodini (Collaborator, Author) commented Dec 27, 2021

I have addressed @ejsimley's comments and tested the bin filter on the 6 PRUF projects that have data at daily/hourly resolution. Results are stored here. For each wind plant, I have plotted the wind plant power curve (at daily and, where possible, hourly resolution) calculated using the bin filter with the standard deviation setup. For each case, I have included a plot for each of the following thresholds: 1, 2, and 3 standard deviations.

I have also fixed the bin filter in the TIE class (and the related example) to use the standard deviation option. This slightly changes the results in the TIE test, which I haven't updated yet.
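
For reference, a rough sketch of a standard-deviation bin filter of the kind described here (illustrative only, not the filters.py implementation; the binning and default threshold are assumptions):

# Illustrative sketch only -- not the filters.py implementation.
import numpy as np

def std_bin_filter(power, wind_speed, n_bins=20, n_std=2.0):
    """Flag wind speeds outside mean +/- n_std * std within each power bin."""
    edges = np.linspace(power.min(), power.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(power, edges) - 1, 0, n_bins - 1)
    flag = np.zeros(len(power), dtype=bool)
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.sum() < 2:
            continue  # too few points to estimate a spread
        mean, std = wind_speed[in_bin].mean(), wind_speed[in_bin].std()
        flag[in_bin] = np.abs(wind_speed[in_bin] - mean) > n_std * std
    return flag  # True where a point would be filtered out

The n_std argument here corresponds to the 1, 2, and 3 standard-deviation thresholds tested above.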

Update expected values for check_simulation_results_gam_daily_outliers (this test is being introduced in this PR; the old numbers were obtained with the 'scalar' version of the bin filter and therefore need to be updated).
@nbodini (Collaborator, Author) commented Jan 3, 2022

I was able to fix the last issues. The only one that remains now is for the TIE test. The expected values for that test have changed after I modified how the bin filter is applied in the TIE class. Are we OK with updating these values in the test? I know we had some discussion about this in the past (@jordanperr), so I'm happy to get other ideas.

@nbodini mentioned this pull request on Jan 4, 2022
@ejsimley (Collaborator) left a comment

Hi Nicola, this looks good to me overall! I just had a few small comments in the code. Also, here are some comments on the notebook examples:

  • Notebook 02: Step 8: Again, thanks for updating the Monte Carlo parameters and their descriptions! Although it is not an argument, I think it would be worth mentioning that the operational data are randomly resampled each iteration using bootstrapping to help quantify uncertainty in the results as well.
  • Notebook 02: Step 8: For the "uncertainty_nan_energy" parameter, I might suggest being a little more specific and saying that this is the threshold for removing days/months based on the fraction of NaNs in the data.
  • Notebook 02b: Thanks for adding some additional discussion at the end of the "Comparison 1" section. The only thing I can think of adding would be if there is anything about the results for the different ML methods that you think is worth commenting on. For example, are there any reasons why someone would choose certain methods over others?
  • Notebook 03: In the "TIE calculation without uncertainty quantification" section, I think the text "The long-term TIE value of 13.7..." should be changed to "13.6" to match what gets printed.
  • The filtered power curve plots look like they are removing way too many points, especially at low wind speeds. Any idea what is going on here?
  • In the "TIE calculation including uncertainty quantification" section, can you clear the output of the ta.run cell, so all the outputs don't show up on GitHub and the documentation? For example, see how it is currently saved in the main or develop branch.

One more general comment (no action needed now): after the reanalysis start/end date PR is merged, the results in the notebooks and the test results will change slightly, so we'll need to update the tests and probably the notebook descriptions once that happens. I'm fine with merging this PR first, though.

@ejsimley (Collaborator) commented Jan 5, 2022

@nbodini, also, I think it's fine to update the test results as long as they look reasonable to you.

@nbodini (Collaborator, Author) commented Jan 5, 2022

Thanks for your comments, @ejsimley.

Re: the choice between different ML algorithms in notebook 2b, we haven't really investigated that in much detail (yet). The only consideration I could think of is the following: "Finally, we note how the GBM and ETR regression models are more computationally expensive than the GAM regression model. However, as ensemble-based models, they are expected to be capable of better modeling complex relationships."

For the power curves in the Drive folder, I'm not 100% sure I get what you mean. Especially at hourly resolution, a lot of points are discarded when the bin filter threshold is 1, whereas far fewer points are discarded when the threshold is 3... but I am probably missing something, so let me know!

@ejsimley (Collaborator) commented Jan 5, 2022

@nbodini Regarding the comment "The filtered power curve plots look like they are removing way too many points, especially at low wind speeds", I totally agree that the plots in the Drive folder you shared look good. I should have clarified that I was referring to the plots generated in the example 03 notebook. There it looks to me like too many points are being filtered out at low wind speed/power. Do you know why this might be the case?
[image: filtered power curve plot from the example 03 notebook]

@nbodini (Collaborator, Author) commented Jan 5, 2022

@ejsimley I have now fixed everything, but...

I thought about this issue with the power curve, and I think it is not specific to this example but actually applies to all power curves. It's probably something we didn't think about before (at least, I hadn't).
Let's assume the horizontal 'width' of the power curve stays the same throughout region II. As an example, say that at the bottom of region II (i.e., for low power) most points are between 4 and 6 m/s, while towards the top of region II most points are between 11 and 13 m/s.
Now, the standard deviation (on which the bin filter is now based) will be smaller for bins in the lower part of the power curve and larger in the upper part, simply because the wind speed values themselves are larger there (again, assuming the 'horizontal width' stays the same). The bin filter will then flag more points in the lower part of region II (because its threshold, defined as mean +/- N*stdev, will be stricter there) than in the upper part of region II.

This is not a trivial aspect, and probably one of the reasons why Mike had used the stdev setting of the bin filter instead... what do you think?

@ejsimley (Collaborator) left a comment

Your latest changes all look good to me, @nbodini. I only noticed one thing: since we decided to change the default "UQ" argument to True in the TIE class, some of the notebooks need to be updated to reflect that. In example 03, section 1.1, when the TurbineLongTermGrossEnergy object is created, we now need to explicitly say UQ=False. The same comment applies when TIE is computed in the example 05 notebook.
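
For example (a hypothetical one-liner; the variable name project is a placeholder and the constructor's other arguments are omitted, so this is not the notebook's exact call):

# Placeholder sketch: explicitly disable uncertainty quantification at instantiation.
ta = TurbineLongTermGrossEnergy(project, UQ=False)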

Also, I thought about the bin filter issue for example 03 a little more and looked into the results. First of all, I think the reason the filter seems to be so much stricter than in the plots you shared in Drive is that those plots were for the wind plant power curves vs. reanalysis wind speeds, rather than turbine power curves vs. SCADA wind speeds. Since there is more scatter in the reanalysis data plots, the std. dev. values are higher and a wider range of wind speeds is kept after filtering. I guess that's obvious, but I just forgot about it when I was comparing the two cases.

Then I think what's happening for the TIE power curves in example 03, compared to the TIE power curves from the gap analysis paper, is that the La Haute Borne data are much cleaner (there are fewer outlier points due to underperformance, turbine shutdowns, etc.). So naturally the std. dev. will be much lower for the LHB data and a narrower range of wind speeds is kept after filtering. That being said, I verified that only ~2-3% of data points get removed by the bin filter in example 03, which seems reasonable to me. So I'm thinking the default parameters are still OK, but it would still be a good idea to adjust them depending on the project (for example, if we included the LHB data in a paper, we might consider increasing the thresholds).

@ejsimley (Collaborator) left a comment

This looks good to me now. @RHammond2 were you planning to take a look as well, or should we just merge it when we're ready?

@RHammond2 (Collaborator) left a comment

Overall, this looks good from my perspective, with only three minor things (two noted as comments in the code and one general) that I'll put here for convenience.

  • There was a 30-day assumption for month length; is this always going to be correct for reanalysis data, or should it be dynamic for each month? I might be missing some background on that decision, so apologies if this has already been addressed. (A small pandas sketch of a dynamic alternative is included at the end of this comment.)
  • Docstring parameter descriptions don't need to be indented all the way to the colon, just two spaces in from the start of the parameter name for continuation lines.
  • Would you be able to run pre-commit on this? There were a couple of formatting changes that seemed slightly odd, so I just want to double-check on that. You will also need to change the following in .pre-commit-config.yaml due to flake8 changing repositories.
    Remove lines 39-41 (looks like below):
    -   id: flake8
        additional_dependencies: [flake8-docstrings]
        exclude: ^tests/

And add the following:

-   repo: https://github.com/pycqa/flake8
    rev: '4.0.1'
    hooks:
    -   id: flake8

Then run the following in your terminal:

pre-commit autoupdate
git add .pre-commit-config.yaml
pre-commit run --files operational_analysis/methods/plant_analysis.py operational_analysis/methods/turbine_long_term_gross_energy.py

I can also make the change for the last item if you'd prefer, or, if it's easier for @ejsimley, we can have a small PR making these edits prior to the v2.3 release; but I might be in favor of having it done in the proper PR so the changes can be attributed more accurately.
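
Regarding the first bullet above, here is a minimal pandas sketch of one way to make the month length dynamic (illustrative only; the actual plant_analysis code may index its data differently):

# Illustrative only: number of days in each month of a monthly DatetimeIndex,
# instead of assuming a fixed 30 days.
import pandas as pd

monthly_index = pd.date_range("2020-01-01", periods=12, freq="MS")
days_in_month = monthly_index.days_in_month  # [31, 29, 31, 30, ...] for the leap year 2020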

@nbodini (Collaborator, Author) commented Jan 12, 2022

Thanks for your comments, @RHammond2. I have incorporated the first two comments, and I would leave the third one to you if you don't mind (I am afraid I would mess up something I am not familiar with...). Thanks!

@RHammond2 (Collaborator) commented:

@nbodini that works for me, and so I've just updated the pre-commit file and the two Python methods files this PR modifies, which puts everything where it should be from my perspective.

@RHammond2 self-requested a review on January 12, 2022 at 19:08
@ejsimley (Collaborator) commented:

Thanks Rob and Nicola for the quick review and edits!

@nbodini, would you like to merge this? Or let me know if you'd like me to do that.

@RHammond2, good catch with the flake8 changes. For those who have been using the pre-commit hooks already, is there anything we need to do to update pre-commit, or did your latest commit take care of everything for now? For example, I wasn't sure if everyone needed to run the commands:

pre-commit autoupdate
git add .pre-commit-config.yaml

@nbodini merged commit f2211db into NatLabRockies:develop on Jan 12, 2022