
Values outside QC test domain are incorrectly flagged as in error #191

Closed · AdamRJensen opened this issue May 30, 2023 · 6 comments · Fixed by #214
Labels: enhancement (New feature or request)

@AdamRJensen (Member) commented May 30, 2023

I've found that the QC checks in pvanalytics flag values outside the testing domain as erroneous, which I strongly believe is contrary to the purpose of the testing domain.

As an example, the BSRN comparison and diffuse ratio tests are stated to be performed only for GHI > 50 W/m^2. The reason for this threshold is that the uncertainty at lower irradiance is too high to perform the test reliably. This does NOT mean that values lower than 50 W/m^2 should be flagged.

If values lower than 50 W/m^2 were meant to be flagged, it would make no sense for the BSRN lower limits on the individual components to be -4 W/m^2. For example, the data point below, which satisfies the closure equation exactly (i.e., GHI = DHI + DNI*cos(Z)), is flagged as erroneous simply because GHI is below 50 W/m^2.

    import pandas as pd
    from pvanalytics.quality.irradiance import check_irradiance_consistency_qcrad

    # A data point that satisfies closure exactly:
    # GHI = DHI + DNI*cos(Z) = 40 + 0 = 40 W/m^2
    qcrad_consistency_flag = check_irradiance_consistency_qcrad(
        solar_zenith=pd.Series([80]),
        ghi=pd.Series([40]),
        dhi=pd.Series([40]),
        dni=pd.Series([0]))

I have a series of QC checks that I would like to add, but this issue is a blocker for now.

This issue is related to #188 (comment) and #167 (comment).

@AdamRJensen changed the title from "Values outside QC test domain are incorrectly flagged as erroneously" to "Values outside QC test domain are incorrectly flagged as in error" on May 30, 2023
@cwhanse (Member) commented May 30, 2023

@AdamRJensen if the BSRN tests aren't applied for GHI < 50 W/m^2, what do the tests return for such values? I assume they don't roll up to a True/False flag, or do they?

I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

@AdamRJensen (Member, Author) replied

> @AdamRJensen if the BSRN tests aren't applied for GHI < 50 W/m^2, what do the tests return for such values? I assume they don't roll up to a True/False flag, or do they?
>
> I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

I think this is a problematic way of implementing QC tests, and it is where the real issue lies!

I think my concern is best explained with an example. Consider two hypothetical tests from two different authors, aimed at identifying two different issues. The first test is suitable for SZA < 75 and the second for SZA >= 75. Since their domains do not overlap, one of the two flags will always be False (following the pvanalytics convention), so there are only three possible outcomes: (True, False), (False, True), or (False, False). Once I've applied all of my QC tests, I want to combine them and make a final decision for each data point. Since I only want to use data points that didn't fail any test, I combine the flags with the AND operator, but for the example above the end result is always False, no matter the data point.
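To make the failure mode concrete, here is a minimal sketch (with hypothetical flags, not actual pvanalytics output) of how AND-combining two tests with non-overlapping domains rejects every point:

    import pandas as pd

    sza = pd.Series([30, 60, 80])

    # Under the current convention, each flag is True only where the point
    # is inside the test's domain AND passes; assume every point passes
    # the test whose domain it falls in.
    test_a = sza < 75   # domain of test A: SZA < 75
    test_b = sza >= 75  # domain of test B: SZA >= 75

    combined = test_a & test_b
    print(combined.tolist())  # [False, False, False]: every point is rejected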

I think the flawed part of the pvanalytics QC implementation is that it follows strategy (1), determining whether a data point is valid (defaulting to False), instead of strategy (2), determining whether a data point is suspicious (defaulting to True). QC tests are developed to find suspicious data points, not to validate data, so strategy (2), flagging suspect data, should be applied.

@cwhanse I'm having a hard time conveying this information and would be happy to have a quick chat about it.

@kandersolar (Member) commented

Put another way, the problem is that we are trying to fit three states (pass, fail, N/A) into a two-state variable.

> I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

I suspect most users will not grasp this distinction. I think most will assume "False = Bad", and I'm skeptical that including careful wording in the docstrings would be enough to disabuse them of this notion.

My first thought was to return np.nan for inputs outside the test domain, but pandas doesn't accept mixed boolean/nan masks with .loc (which I assume is the primary use case for these outputs): ValueError: Cannot mask with non-boolean array containing NA / NaN values.
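For illustration, a minimal reproduction of that error (an object-dtype mask holding True/False/NaN):

    import numpy as np
    import pandas as pd

    data = pd.Series([1.0, 2.0, 3.0])
    # Ternary "flag": True = pass, False = fail, NaN = outside test domain
    flags = pd.Series([True, False, np.nan])

    data.loc[flags]  # raises ValueError: Cannot mask with non-boolean
                     # array containing NA / NaN values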

In the absence of first-class ternary functionality in pandas, I think I'd support switching to a convention where False means failed and True means passed or N/A. But maybe we should poll users on how they want to be able to use these functions.

@cwhanse (Member) commented Jun 1, 2023

This would be a major design change, and I see the arguments in its favor. @kperrynrel can you discuss this with your colleagues?

The proposal is to return True for values outside the filter boundary, i.e., values for which the test doesn't apply. Currently, filter functions return False for such values.

@AdamRJensen (Member, Author) replied

@kperrynrel @cwhanse Here's an idea for a possible middle ground and, if anything, a solution that lets the addition of new tests proceed without causing any breaking changes.

I propose introducing an argument that lets the user decide what fill value to use outside the test domain. The existing tests return False for values outside the test domain, so that can be the default fill value, retaining the current functionality. At the same time, it would provide flexibility by letting users choose another fill value, e.g., True (as per my proposal) or NaN.

The concept is based on a discussion with @kandersolar, though he may not endorse it.
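As a rough sketch of the idea (the helper and its fill_value argument are hypothetical, not the actual pvanalytics API):

    import numpy as np
    import pandas as pd

    # Hypothetical QC helper illustrating the proposed fill-value argument.
    def bsrn_style_check(ghi, passed, fill_value=False):
        in_domain = ghi > 50  # test applies only for GHI > 50 W/m^2
        flags = pd.Series(fill_value, index=ghi.index, dtype=object)
        flags[in_domain] = passed[in_domain]
        return flags

    ghi = pd.Series([40, 100, 600])
    passed = pd.Series([True, True, False])  # pretend in-domain test results

    print(bsrn_style_check(ghi, passed).tolist())                    # [False, True, False]
    print(bsrn_style_check(ghi, passed, fill_value=True).tolist())   # [True, True, False]
    print(bsrn_style_check(ghi, passed, fill_value=np.nan).tolist()) # [nan, True, False]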

@cwhanse (Member) commented Jul 13, 2023

Good idea - fill_value, out_of_bounds, or something like that would work.
