
Feature/exact values option #153

Merged

Conversation

@nannau (Contributor) commented May 22, 2020

Hello from a user of PyKrige!

This is my attempt to implement the feature in issue #30. I've added an exact_values flag as described in the documentation of execute() in both ok.py and uk.py. This feature is tagged in the v2.0 milestone.

As the documentation notes:

In the future, the code may include an extra 'exact_values' boolean
flag that can be adjusted to specify whether to treat the measurements
as 'exact'. Setting the flag to false would indicate that the variogram
should not be forced to be zero at zero distance (i.e., when evaluated
at data points). Instead, the uncertainty in the point will be equal to
the nugget. This would mean that the diagonal of the kriging matrix
would be set to the nugget instead of to zero.

To achieve this, I've added the flag to execute(..., exact_values=True) as well as to _get_kriging_matrix(n, exact_values). _get_kriging_matrix contains a few layers of checks to determine whether the model in use is linear, since the nugget parameter shifts position in variogram_model_parameters. If the exact_values flag is True (the default), nothing changes and the kriging remains an exact interpolator. If exact_values is False, the measurement variance is accounted for and the diagonal of the kriging matrix is set to the nugget.
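The idea can be sketched with a minimal, self-contained ordinary-kriging matrix builder. This is illustrative only, not PyKrige's actual _get_kriging_matrix: the 1-D distance computation and the toy linear variogram (nugget 0.5, slope 1) are my own simplifications.

```python
import numpy as np

def kriging_matrix(x, variogram, nugget, exact_values=True):
    """Toy OK matrix for 1-D points x (not PyKrige's real routine)."""
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])      # pairwise distances
    a = np.zeros((n + 1, n + 1))
    a[:n, :n] = -variogram(d)                # PyKrige's sign convention
    idx = np.arange(n)
    # exact interpolator: force zero on the diagonal;
    # non-exact: put the nugget (measurement-error variance) there
    a[idx, idx] = 0.0 if exact_values else nugget
    a[n, :n] = 1.0                           # unbiasedness constraint
    a[:n, n] = 1.0
    return a

gamma = lambda h: 0.5 + 1.0 * np.asarray(h, dtype=float)  # toy linear variogram
x = np.array([0.0, 1.0, 2.0])
print(kriging_matrix(x, gamma, 0.5, exact_values=True)[0, 0])   # 0.0
print(kriging_matrix(x, gamma, 0.5, exact_values=False)[0, 0])  # 0.5
```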

I've also added some basic tests for the new flag, and am aware that there are other cases to add as well. In particular, I'm not quite sure how or whether to include test data, so I opted not to. My understanding is that switching to non-exact interpolation could break tests that assert conformity to existing test data. I'm open to implementing this with some guidance.

I hope this is close to what you had in mind - I just wanted to put it out there, since this change appears to be working well for me.

@nannau nannau changed the base branch from master to v2_init May 22, 2020 17:51
@nannau (Contributor Author) commented May 22, 2020

I'm not sure which base this should compare against; I chose the v2_init branch, which I interpreted to be the v2.0 branch.

@MuellerSeb MuellerSeb self-requested a review May 23, 2020 12:15
@MuellerSeb (Member)

Hey there! Thanks for that PR! Nice to see this code is alive!

I got some questions:

  • I was digging through the literature and found two approaches for implementing the nugget as a possible representation of observational errors. One is to do it this way (replacing the diagonal of the kriging matrix); the other is to do it on the right-hand side of the kriging equation. But this is contradictory to me. Do you have any references for that? (It would be good to have them in the documentation.)
  • Can you show the difference in the results with a simple example? It would be nice to have it in the gallery (maybe similar to the 1D example, with a reduced number of conditions).
  • Since this is just a little change, I would say you can target the develop branch, since the v2 branch will be rebuilt from the ground up.
  • Another nice feature would be to provide different error variances for each observation and fill the diagonal with them (but that is not that urgent).
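The last bullet could look something like the sketch below. This is a hypothetical extension, not an existing PyKrige API: instead of one scalar nugget, each observation gets its own error variance on the data block's diagonal of the kriging matrix.

```python
import numpy as np

def fill_error_diagonal(a, err):
    """Put per-observation error variances err[i] on the data block's
    diagonal of the (n+1)x(n+1) OK matrix a (hypothetical helper)."""
    err = np.asarray(err, dtype=float)
    n = err.size
    a[np.arange(n), np.arange(n)] = err
    return a

a = np.zeros((4, 4))                  # 3 observations + unbiasedness row/col
fill_error_diagonal(a, [0.1, 0.2, 0.3])
print(a.diagonal())                   # [0.1 0.2 0.3 0. ]
```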

@MuellerSeb (Member)

The failing tests are because of #154

@nannau (Contributor Author) commented May 25, 2020

Ok - those points sound good. I will switch to the develop branch, and get back to you on the other points. Unfortunately, I was just using the documentation as a reference, so I don't have external verification. But I will do some searching and asking around.

@nannau nannau changed the base branch from v2_init to develop May 25, 2020 21:08
@MuellerSeb (Member) commented May 25, 2020

Cool! That sounds great!

I also got some references:

  • Wolfram Rühaak: "3-D interpolation of subsurface temperature data with measurement error using kriging"
  • H. Wackernagel: "Multivariate Geostatistics" (There is a chapter called "Kriging with known measurement error variance", but I struggled with the formulation [I think there should be a minus on the diagonal])

@nannau (Contributor Author) commented May 25, 2020

I just found this, too, with regard to setting the nugget as the diagonal of the covariance matrix. I should add the caveat that I am not a geostatistics expert.

Mohammadi: "An analytic comparison of regularization methods for Gaussian Processes"

Specifically,

The most common approach to deal with covariance matrix ill-conditioning
is to introduce a “nugget” [7, 25, 15, 1], that is to say add a small positive
scalar to the diagonal. As a matrix regularization method, it is also known
as Tikhonov regularization or Ridge regression.

But I haven't been able to find anything regarding the formulation of it, and unfortunately I can't find a copy of "Multivariate Geostatistics" anywhere.

In my searching, I also found that this "ill-conditioned" covariance matrix could be dealt with by adding a small positive number to the diagonal (Andrianakis and Challenor, 2012).

From (Andrianakis and Challenor, 2012):

We have addressed this difficulty by adding a small number (10^-6) to the diagonal of the correlation matrix. With this addition, the approximating functions do not exactly interpolate the observed function values; however, they retain their flexibility and predict observations very closely

This serves as a further reason to allow a value on the diagonal that is not necessarily the nugget.
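The conditioning effect of that small diagonal addition is easy to demonstrate with a toy covariance matrix (a Gaussian covariance of my own choosing, unrelated to PyKrige's internals):

```python
import numpy as np

# Two nearly coincident points make the covariance matrix numerically
# singular; adding 1e-6 to the diagonal (Tikhonov/nugget regularization)
# restores a usable condition number, at the cost of exact interpolation.
x = np.array([0.0, 1e-8, 1.0])
cov = np.exp(-(x[:, None] - x[None, :]) ** 2)  # toy Gaussian covariance
cov_reg = cov + 1e-6 * np.eye(3)

print(np.linalg.cond(cov))      # enormous: rows 0 and 1 are identical
print(np.linalg.cond(cov_reg))  # a few million: large but workable
```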

@nannau (Contributor Author) commented May 26, 2020

Here's a simple example of how it works in a 1-D case. Is this what you had in mind?

from pykrige.ok import OrdinaryKriging
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")

x = np.linspace(0, 12.5, 50)
y = np.sin(x) * np.exp(-0.25 * x)

# select a variogram model with non-zero nugget for data used
uk = OrdinaryKriging(x, np.zeros(x.shape), y, variogram_model="linear")

# use exact input locations and the locations we wish to interpolate
# to demonstrate the influence of a non-zero variance at these locations
y_pred, y_std = uk.execute("grid", x, np.array([0.0]), exact_values=False)
# the variance/standard deviation in the exact case is redundant and is all zero
y_pred_exact, y_std_exact = uk.execute("grid", x, np.array([0.0]), exact_values=True)


y_pred = np.squeeze(y_pred)
y_std = np.squeeze(y_std)

y_pred_exact = np.squeeze(y_pred_exact)
y_std_exact = np.squeeze(y_std_exact)

fig, ax = plt.subplots(1, 1, figsize=(10, 4))

ax.scatter(x, y, label="Input Data")
ax.plot(x, y_pred_exact, label="Exact Prediction")
ax.plot(x, y_pred, label="Non Exact Prediction")

ax.fill_between(
    x,
    y_pred - 3 * y_std,
    y_pred + 3 * y_std,
    alpha=0.3,
    label="Confidence interval of Non Exact Prediction",
)
ax.legend(loc=9)
plt.show()

PyKrige_NE_Interp

pykrige/uk.py (Outdated)

np.fill_diagonal(a, 0.0)

if self.variogram_model != "linear" and not exact_values:
    np.fill_diagonal(a, self.variogram_model_parameters[2])
Member:

Could you check what happens when you fill the diagonal with the negative nugget? This is what was done in the paper "3-D interpolation of subsurface temperature data with measurement error using kriging" (Wolfram Rühaak, 2014).
And this is what bothers me: some literature says you should add the error variance, other sources subtract it...

Contributor Author:

Ah, I see your point. I was reading a similar formulation in "Basic Steps in Geostatistics: The Variogram and Kriging" (Oliver and Webster) in Equation 4.11.

image

and comparing this with (Wolfram Rühaak, 2014)

image

PyKrige looks like it's been implemented in the (Wolfram Rühaak, 2014) way, so I think we want to subtract it like you asked, if I'm understanding this correctly. But I think these negatives are redundant.

In PyKrige's implementation of solving the system for the linear estimators, b on the RHS is set to be negative:

b[:n, 0] = -self.variogram_function(self.variogram_model_parameters, bd)

When finding λ = inv(A) b (via a_inv = scipy.linalg.inv(a)), the contents of A and b are defined as negative in both (Wolfram Rühaak, 2014) and PyKrige, meaning λ is positive.

But in Equation 4.11 of Oliver and Webster, the contents of A and b are defined to be positive, which suggests the variance/nugget parameter would be added to the diagonal to keep the solution of the system (λ = inv(A) b) consistent between both methods and ensure λ is positive.

So I would expect equivalent behaviour between subtracting the nugget with A and b having negative entries and adding the nugget with A and b having positive entries.
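That equivalence is easy to sanity-check numerically on an arbitrary nonsingular system (a toy system of my own, not PyKrige's matrices): negating both A and b leaves λ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)  # well-conditioned toy matrix
b = rng.standard_normal(4)

lam_pos = np.linalg.solve(A, b)       # A λ = b
lam_neg = np.linalg.solve(-A, -b)     # (-A) λ = (-b)
print(np.allclose(lam_pos, lam_neg))  # True
```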

Here is the behaviour from subtracting the nugget instead, with PyKrige as is:
PyKrige_NE_Interp

This produces some strange variance, perhaps originating from the added term in the variance from Oliver and Webster, Equation 4.12:

image

So perhaps

sigmasq = np.sum(x[:, :, 0] * -b[:, :, 0], axis=1)

needs another term if PyKrige is used as a non-exact interpolator? In either case, yes, it's rather confusing. Does this make sense to you?

Member:

In GSTools, I was also trying to implement this exact feature. There I was following:
Cressie, N.: Statistics for Spatial Data, 3.2.1 "Effect of Variogram Parameters on Kriging",
where the measurement error (nugget) was not applied to the kriging matrix but to the right-hand side of the kriging equation. These are the resulting differences:
(exact=True)
interpolate_exact
(exact=False)
interpolate_nugget

And this is what I actually expect from the "exact" keyword: to drop these spikes to the exact conditions. I couldn't reproduce this with either of the above-mentioned approaches for the kriging matrix.

I still have to take a look at the estimated kriging variance...

Regarding the formulation: you are right that the negative sign in front of the variogram doesn't change the outcome. The only advantage of this approach is that it can easily be reformulated into a system depending only on the covariance instead of the variogram (this is also a goal of mine for PyKrige+GSTools, see GeoStat-Framework/GSTools#79).

And now I know again why I was struggling with this topic for so long.

@nannau (Contributor Author) May 27, 2020:

That behaviour looks like what I would expect, too. My example plots didn't incorporate noise, sorry, so here they are again, demonstrating that I think something is missing from that matrix approach:

PyKrige_pos_nugget

This is for adding a positive nugget with PyKrige's existing sign conventions. It does not produce a reasonable interpolation if the nugget is subtracted.

I would advocate for the RHS approach based on this expected behaviour alone... but it's also contradictory to some literature (exactly as you pointed out, haha).

Perhaps we might find some insight from existing tried-and-tested implementations of kriging. I'll see what I can dig up.

Member:

I would also vote for the RHS approach, since I think the argumentation of Cressie is the most convincing one we have discussed so far. I will finish my implementation in GSTools, and then we can discuss it, if that's OK with you?

The other approaches also bring up arguments for modifying the kriging matrix to gain numerical stability. This issue is also addressed in #151 with the use of the pseudo-inverse matrix.

Member:

Cressie also gives a hint that modifying the LHS of the equation is incorrect:
kriging_note

Member:

And I noticed that this is in line with H. Wackernagel: "Multivariate Geostatistics" (chapter "Kriging with known measurement error variance"), since there the covariance is used and the jump at the origin of the covariance function is ignored on the LHS.

@nannau (Contributor Author) May 29, 2020:

This approach sounds good. In the meantime, I will try to pull up some examples from R's fields package, which implements mKrig. I think this could help us verify some behaviour.

@MuellerSeb (Member)

MuellerSeb commented Jun 3, 2020

Hey there, I found the places in the code that need to be modified.
The RHS is modified internally to be 0 where the distance is almost 0 (here I only linked the occurrences in ok; the same happens in similar places for ok3d, uk and uk3d). The check appears three times in ok.py:

if zero_value:

It is also present in the cython routines:

check_b_vect(n, bd, b, eps)

If exact_values is False, the RHS needs to be left unmodified. So you need to add the keyword to all types of execution, and the line above should change to:

if zero_value and exact_values:

In the cython routines, you need to add exact_values to skip the call to check_b_vect.

Then you don't need to modify the diagonal, and everything should work.

PS: you haven't modified ok3d and uk3d yet.
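A minimal sketch of the suggested RHS change: the names zero_value/zero_index mirror PyKrige's loop backend, but the function below is a simplified stand-in of my own, not the library's code, and the linear variogram (nugget 0.5) is a toy.

```python
import numpy as np

def krige_rhs(variogram, bd, exact_values=True, eps=1e-10):
    """Build the OK right-hand side for one target point.

    bd: distances from the target to the n data points. With
    exact_values=True, entries at (near-)zero distance are forced to 0,
    so the predictor reproduces the data exactly; with False they keep
    -variogram(0) = -nugget, relaxing the interpolation.
    """
    n = len(bd)
    b = np.zeros((n + 1, 1))
    b[:n, 0] = -variogram(bd)                  # PyKrige's sign convention
    zero_index = np.where(np.absolute(bd) <= eps)[0]
    if zero_index.size > 0 and exact_values:   # the proposed extra condition
        b[zero_index, 0] = 0.0
    b[n, 0] = 1.0                              # unbiasedness constraint
    return b

gamma = lambda h: 0.5 + 1.0 * np.asarray(h, dtype=float)  # toy, nugget 0.5
bd = np.array([0.0, 1.0, 2.0])   # target coincides with the first data point
print(krige_rhs(gamma, bd, exact_values=True)[0, 0])   # 0.0
print(krige_rhs(gamma, bd, exact_values=False)[0, 0])  # -0.5
```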

@MuellerSeb (Member)

@nannau: do you still aim to implement this? I think we have now found a nice solution that should be easy to implement ;-)

@nannau (Contributor Author) commented Jun 22, 2020

@MuellerSeb yes, I will implement it. Good catch! Sorry for the hiatus. Thanks for the reminder!

@nannau (Contributor Author) commented Jun 23, 2020

@MuellerSeb Do you mind taking a look at ok.py and /lib/cok.pyx in my latest commits to see that I'm on the right track? It seems I'm able to get the non-exact behaviour as expected. I admit I don't fully understand the technicality of why it works, but I'm happy to implement these software changes.

@MuellerSeb (Member)

@nannau: That looks very good! This just needs to be applied to ok3d, uk and uk3d as well.
I would also place the exact_values keyword at the end of the input list, so we don't break anything for users who call the algorithms with a plain positional argument list.
And the documentation for the new input keyword is missing ;-)

When adding the extended GridSearch (#128 (comment)), I could then add the exact_values argument (just a reminder for me).

Thanks for your work!

@nannau (Contributor Author) commented Jun 23, 2020

Ok, thanks @MuellerSeb! There remains the question of how best to test this new feature, which I am still not sure how to approach. I've added preliminary simple checks based on what I saw in test_core.py, but perhaps more thorough testing belongs in another pull request?

@nannau (Contributor Author) commented Jun 23, 2020

Also, here is an updated plot similar to my earlier plot showing this new non-exact behaviour. :) It looks great!

from pykrige.ok import OrdinaryKriging
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")

np.random.seed(42)

x = np.linspace(0, 12.5, 50)
xpred = np.linspace(0, 12.5, 393)
y = np.sin(x) * np.exp(-0.25 * x) + np.random.normal(-0.25, 0.25, 50)

# compare OrdinaryKriging as an exact and non exact interpolator
uk = OrdinaryKriging(x, np.zeros(x.shape), y, variogram_model="linear", exact_values=False)
uk_exact = OrdinaryKriging(x, np.zeros(x.shape), y, variogram_model="linear")

y_pred, y_std = uk.execute("grid", xpred, np.array([0.0]), backend="loop")
y_pred_exact, y_std_exact = uk_exact.execute("grid", xpred, np.array([0.0]), backend="loop")


y_pred = np.squeeze(y_pred)
y_std = np.squeeze(y_std)

y_pred_exact = np.squeeze(y_pred_exact)
y_std_exact = np.squeeze(y_std_exact)


fig, ax = plt.subplots(1, 1, figsize=(10, 4))

ax.scatter(x, y, label="Input Data")
ax.plot(xpred, y_pred_exact, label="Exact Prediction")
ax.plot(xpred, y_pred, label="Non Exact Prediction")

ax.fill_between(
    xpred,
    y_pred - 3 * y_std,
    y_pred + 3 * y_std,
    alpha=0.3,
    label="Confidence interval",
)
ax.set_ylim(-1.8, 1.3)
ax.legend(loc=9)
ax.set_xlabel("X")
ax.set_ylabel("Field")
plt.show()

exact_non_exact

@MuellerSeb (Member)

@nannau that looks very nice! Great work. Maybe you could add a hint in the documentation that this only has an effect if a nugget is present, which is then interpreted as measurement error.

And could you add the example from above to the examples folder, so it shows up in the documentation gallery on readthedocs? ;-)

I am OK with the testing so far. Maybe add one test that checks the equality of the results for exact and non-exact interpolation away from the conditioning data points.

Then this is ready to merge!

@MuellerSeb MuellerSeb self-assigned this Jun 24, 2020
@MuellerSeb MuellerSeb added this to the v1.5.1 milestone Jun 24, 2020
@MuellerSeb MuellerSeb linked an issue Jun 24, 2020 that may be closed by this pull request
@nannau (Contributor Author) commented Jun 25, 2020

One more thing for this PR's record. I just did a comparison of this change in PyKrige against results generated from R's fields package for consistency. I used an exponential function for the covariance model, but modified the denominator of the exponent in PyKrige, since R's default covariance model does not include a factor of 3 (see this document).

image

Supplying a custom covariance model in PyKrige identical to the above, while ensuring that the same nugget, sill, and range parameters were used for both spatialProcess and PyKrige, I obtained the kriged field with predictions at identical points. This is really good news and could be a path forward to some stricter testing. I'm not sure what method R uses to deal with exact values, but based on this, it clearly looks consistent.

spatialProcess_vs_PyKrige_modified

@nannau (Contributor Author) commented Jun 26, 2020

One final submission for this PR's record, comparing this behaviour to R's fields spatialProcess. This plot shows the residuals (absolute value of the difference) between the two fields. The residuals show a very minor ~0.01 absolute difference for the 2D Gaussian sample field, which I think can be expected given the different core routines that R and PyKrige employ. This is probably beyond what this PR set out to do, but I think it's a thorough demonstration that this feature behaves as expected and is consistent with other software.

residual_pykrige_spatialProcess

@MuellerSeb (Member) left a comment:

The only thing left for me is the headline for the example. Thank you so much for the effort and the perseverance.

examples/exact_values_example_1D.py (resolved)
@MuellerSeb (Member)

Cool. Here we go. Now I have to bring in my stuff for the 1.5.1... ;-) Thanks again!

@MuellerSeb MuellerSeb merged commit b33e5d8 into GeoStat-Framework:develop Jun 30, 2020
@nannau (Contributor Author) commented Jun 30, 2020

@MuellerSeb thanks for your help!

@nannau nannau deleted the feature/exact-values-option branch June 30, 2020 18:49
Successfully merging this pull request may close these issues:

  • implement the nugget on the diagonal to prevent exact interpolation