-
Notifications
You must be signed in to change notification settings - Fork 0
/
data-preparation.qmd
250 lines (174 loc) · 18.5 KB
/
data-preparation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
title: Data preparation
execute:
eval: false
---
This chapter describes the data used for the process rate estimator as well as how it is pre-processed and how new variables are calculated.
## Description of the measurements
The study uses data collected from a mesocosm experiment -- i.e. an outdoor experiment that examines the natural environment under controlled conditions. The experiment was set up as a randomized complete block design, with 4 varieties and 3 replicates, using 12 non-weighted lysimeters. A non-weighted lysimeter is a device to measure the amount of water that drains through soil, and to determine the types and amounts of dissolved nutrients or contaminants in the water. Each lysimeter had five sampling ports with soil moisture probes and custom-built pore gas sample, at depths of 7.5, 30, 60, 90 and 120 cm below soil surface.
$$4 \times 3 \times 5 \times 161 = 9660$${#eq-dimension}
@eq-dimension shows how many observations we should expect to have. In reality, some observations are missing. Hence, the `measurements` data frame only includes 9558 rows, with 9 variables. @tbl-measurements gives an overview of those variables.
:::{.column-page}
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| Code | Name | Description | Class | Missing | Distribution |
+=============+===================+============================+=============+=========+==============================================+
| `date` | Date | Date of a measurement in | `"Date"` | 0% | ![](resources/date.png){width=300px} |
| | | the `"Date"` format | | | |
| | | (`YYYY-MM-DD`), ranging | | | |
| | | from August 27 2015 to | | | |
| | | February 3 2016. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `column` | Column | Factor variable indicating | `"ordered"` | 0% | ![](resources/column.png){width=300px} |
| | | one out of twelve | | | |
| | | experimental units. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `depth` | Measurement depth | Depth of the measurement, | `"numeric"` | 0% | ![](resources/depth.png){width=300px} |
| | | either 7.5, 30, 60, 90, or | | | |
| | | 120 cm. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `increment` | Depth increment | Height of a soil layer | `"integer"` | 0% | ![](resources/increment.png){width=300px} |
| | | from lower to upper | | | |
| | | boundary. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `variety` | Wheat variety | Either `CH Claro`, | `"factor"` | 0% | ![](resources/variety.png){width=300px} |
| | | `Monte Calme 268`, | | | |
| | | `Probus`, or `Zinal`. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `moisture` | Soil moisture | The moisture is expressed | `"numeric"` | 0% | ![](resources/hist-moisture.png){width=300px}|
| | | as volumetric water | | | |
| | | content in cubic meters of | | | |
| | | water per cubic meters of | | | |
| | | soil [$\frac{m^3}{m^3}$]. | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `N2O` | Corrected N~2~O | | `"numeric"` | 86.9% | ![](resources/hist-N2O.png){width=300px} |
| | concentration | | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `SP` | Site preference | Site preference values | `"numeric"` | 86.3% | ![](resources/hist-SP.png){width=300px} |
| | | for N~2~O | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
| `d18O` | $\delta^{18}\ce O$| Ratio of stable isotopes | `"numeric"` | 86.3% | ![](resources/hist-d18O.png){width=300px} |
| | stable isotope | oxygen-18 (^18^O) and | | | |
| | ratio | oxygen-16 (^16^O). | | | |
+-------------+-------------------+----------------------------+-------------+---------+----------------------------------------------+
: Overview of the variables included in the `measurements` data as included in the R package `PRE`. For the continuous variables, a histogram shows the measurements distribution. {#tbl-measurements tbl-colwidths="[9,15,32,8,8,23]"}
:::
### Accessing the measurement data in R
In the R package `PRE`, this data is available as the object `measurements`. We can load the data with the `data` function in R.
```{r}
data("measurements", package = "PRE")
```
Upon doing so, a data frame called `measurements` will appear in our global environment. Alternatively, the data can be directly accessed as `PRE::measurements` (the prefix `PRE::` tells R that the object can be found in the `PRE` package.)
You can see an interactive overview of the `measurements` data in the column below. For many dates, there are no measurements of $\ce{N2O}$, site preference, or $\delta^{18}\text{O}$.
:::{.column-screen-inset}
---
```{r}
#| eval: true
#| echo: false
library(reactable)
data <- PRE::measurements
rownames(data) <- NULL
data$moisture <- round(data$moisture * 100, 1)
data$N2O <- round(data$N2O,2)
reactable(data,
defaultPageSize = 8,
columns = list(
date = colDef(name = "Date", filterable = TRUE),
column = colDef(name = "Column", filterable = TRUE),
depth = colDef(name = "Depth [cm]", filterable = TRUE, align = "left"),
increment = colDef(name = "Increment [cm]", filterable = TRUE, align = "left"),
variety = colDef(name = "Variety", filterable = TRUE),
moisture = colDef(name = "Moisture [%]"),
N2O = colDef(name = " N₂O conc."),
SP = colDef(name = "Site preference"),
d18O = colDef(name = "δ18O")
)
)
```
---
:::
Luckily, the measurements for N~2~O concentration, site preference as well as $\delta$^18^O are distributed more or less evenly over time. Hence, we'll interpolate the missing values. This step will be the focus of the next section.
## Calculating volumetric and area N~2~O-N
In a first step, we are calculating the N~2~O-N per volume and subsequently per area. This is a necessary step, because the process rate estimator requires these variables.
Since both `N2ONvolume` as well as `N2ONarea` are dependent on initial parameters, their computation is part of the workflow and might change slightly for a changed experimental set-up or different assumptions regarding the environment.
The volumetric N~2~O-N is calculated from the N~2~O concentration (@eq-N2ONvolume).
$$\text N_2 \text {O-N}_\text{volume} = \frac{28[\text{N}_2\text O]}{\text {R} \cdot \text {T}}$${#eq-N2ONvolume}
Here, $\text {R}$ is the gas constant and $T$ is the temperature.
In a next step, we calculate the per-area N~2~O-N from the volumetric N~2~O-N and the soil moisture as well as the total porosity and the increment in meters.
$$\text N_2 \text {O-N}_\text{area} = \text N_2 \text {O-N}_\text{volume} \times \frac{1}{100} \texttt{increment} \times \frac{10^4}{10^3} \left(\theta_t - \texttt{moisture} \right)$${#eq-N2ONarea}
All of these equations are implemented in the `getN2ON` function. This function returns a data frame very similar to `measurements`, but with two extra columns for the computed `N2ONvolume` and `N2ONarea` variables.
```{r}
data <- getN2ON(data = PRE::measurements, parameters = getParameters())
```
Note that the `getParameters` function simply returns a list with the default values for all necessary parameters such as the gas constant $\text R$ or the assumed air temperature $\text T$. If we wish to change these, we can pass alternative parameter values to the function. For example, we could assume an air temperature of 300 K with `getParameters(temperature = 300)`. This makes sure that depended parameter values are updated as well.
## Interpolating the measurements {#sec-interpolation}
The N~2~O concentration, site preference as well as $\delta$^18^O are estimated as a function of time for every depth and every column, separately. To achieve this function approximation, Kernel Regression as implemented in `npreg` is used [@hayfield2008nonparametric].
The model requires a single parameter, the bandwidth (`bws`), which defines how flexible it is.
However, the bandwidth needs to be actively chosen and cannot be intrinsically estimated -- it's a hypertuning parameter a.k.a. a hyperparameter.
### Choosing optimal bandwidths for the interpolation
@fig-explanation gives an intuitive understanding on why we don't want to use the same bandwidth for every single combination of depth and column: In some instances we seem to have a very clear signal, while in others, there seems to be much noise -- we are dealing with unequal variance, a.k.a. heteroskedasticity.
![Animated explanation on two examplary subsets of the data. On the left, there is a very high signal to noise ratio, thus the optimal `bandwith` hyperparameter will be smaller than on the right.](resources/explanation.gif){#fig-explanation}
The bandwidth hyperparameter is individually tuned using 3-fold 10 times repeated cross-validation for every combination of column and depth and variable[^variable], respectively. You can see an overview of the found hyperparameters in [figure @fig-hyperparameters-1], [-@fig-hyperparameters-2], and [-@fig-hyperparameters-3].
[^variable]: I.e. the three variables N~2~O concentration, site preference and $\delta$^18^O.
:::{.column-page-right}
::: {.panel-tabset}
## N~2~O concentration
![Optimal hyperparameter size by depth and column for the N~2~O-N concentration.](resources/hyperparameters-1.svg){#fig-hyperparameters-1 width=100%}
## Site preference
![Optimal hyperparameter size by depth and column for the site preference.](resources/hyperparameters-2.svg){#fig-hyperparameters-2 width=100%}
## Stable isotope ratio ($\delta$^18^O)
![Optimal hyperparameter size by depth and column for the stable isotope ratio $\delta$^18^O concentration.](resources/hyperparameters-3.svg){#fig-hyperparameters-3 width=100%}
:::
:::
Generally, smaller bandwidths correspond to a more flexible model, indicating a larger signal-to-noise ratio.
Notice how deeper measurements seem to correspond to smaller bandwidths, at least for the N~2~O concentration and the stable isotope ratio ($\delta$^18^O).
Note that the optimal bandwidths for N~2~O are also used for the interpolation of N~2~O-N~volume~ and N~2~O-N~area~.
### Numerical approximation of the derivative estimation
With the optimized hyperparameters, we are all set to estimate the functions over time, as well as their derivatives, which are needed for the process rate estimator.
One common way to numerically approximate the derivative of a function $f$ at a point $t$ is through the central difference method. This method computes the average rate of change over a small interval around the point of interest.
$$\frac{df}{dt} \approx \frac{f(t + h) - f(t - h)}{2h}$${#eq-centraldifference}
Here, $h$ is a small positive number known as the step size. The notations $f(t + h)$ and $f(t - h)$ represent the function values at $t + h$ and $t - h$, respectively. The smaller h is, the more accurate the approximation should be. However, if h is too small, then round-off errors from computer arithmetic can become significant. So, an appropriate balance must be struck. The central difference method is implemented as the `fderiv` function in the `pracma` R package.
Both the function estimation as well as the estimation of the derivative are performed with a single function call in the R package `PRE`.
```{r}
data <- getMissing(data = data, hyperparameters = PRE::hyperparameters)
```
Here, `data` is the data frame that we slightly extended with the `getN2ON` function. The object `hyperparameters` is a three-dimensional array containing the optimized hyperparameters for this project. It is also included in the `PRE` package[^2].
[^2]: To optimize the hyperparameters, you can run the R script `hypertuning.R`.
## Calculating the fluxes
### Model parameters
| Symbol | Code | Name | Value | Unit |
|:-------------------|:--------------|:---------------------------------------------------------|:----------------------|:---------|
| $BD$ | `BD` | Bulk density (mass of the many particles of the material divided by the bulk volume) | $1.686$ | g cm^-3^ |
| $\theta_w$ | `theta_w` | Soil volumetric water content | | |
| $\theta_a$ | `theta_a` | Air-filled porosity | $1-\frac{\theta_w}{\theta_t}$ | |
| $\theta_t$ | `theta_t` | Total soil porosity | $1-\frac{BD}{2.65}$ | |
| $\text T$ | `temperature` | Soil temperature | $298$ | K |
| $D_{\text{s}}$ | `D_s` | Gas diffusion coefficient | @eq-Ds | m^2^s^-1^|
| $D_{\text{fw}}$ | `D_fw` | Diffusivity of N~2~O in water | @eq-Dfw | |
| $D_{\text{fa}}$ | `D_fa` | Diffusivity of N~2~O in air | @eq-Dfa | |
| $D_{\text{fa,NTP}}$| | Free air diffusion coefficient under standard conditions | @eq-Dfa | |
| $n$ | `n` | Empirical parameter [@massman1998review] | 1.81 | |
| $H$ | `H` | Dimensionless Henry's solubility constant | @eq-H | |
| $\rho$ | `rho` | Gas density of N~2~O | $1.26 \times 10^6$ | mg m~-3~ |
: Overview of the parameters used in the model. {#tbl-parameters tbl-colwidths="[10,15,55,15,5]"}
The diffusion fluxes between soil increments are described by Frick's law (@eq-frick).
$$F_{\text{calc}} = \frac{dC}{dZ} D_{\text s} \rho$${#eq-frick}
Here, $D_s$ is the gas diffusion coefficient, $\rho$ is the gas density of N~2~O, and $\frac{dC}{dZ}$ is the N~2~O concentration gradient from lower to upper depth.
The fluxes are calculated based on N~2~O concentration gradients between 105-135 cm, 75-105 cm, 45-75 cm, 15-45 cm, and 0-15 cm depth layers, and ambient air above the soil surface.
$\theta_w$ is the soil volumetric water content, $\theta_a$ the air-filled porosity, and $\theta_T$ is the total soil porosity.
The gas diffusion coefficient $D_{\text s}$ was calculated according @eq-Ds as established by Millington and Quirk in 1961 [@millington1961permeability].
$$D_{\text s} = \left( \frac{\theta_w^{\frac{10}{3}} + D_{\text fw}}{H} + \theta_a^{\frac{10}{3}} \times D_{\text fa} \right) \times \theta_T^{-2}$${#eq-Ds}
Here, $H$ represents a dimensionless form of Henry's solubility constant ($H'$) for N~2~O in water at a given temperature. The constant $H$ for N~2~O is calculated as follows:
$$H = \frac{8.5470 \times 10^5 \times \exp \frac{-2284}{\text T}}{\text R \times \text T}$${#eq-H}
Here, $\text R$ is the gas constant, and $\text T$ is the temperature ($\text T = 298 \; \text K$).
$D_{\text{fw}}$ was calculated according to @eq-Dfw as documented by Versteeg and Van Swaaij (1988) [@versteeg1988solubility].
$$D_{\text{fw}} = 5.07 \times 10^{-6} \times \exp \frac{-2371}{\text T}$${#eq-Dfw}
$$D_{\text{fa}} = D_{\text{fa, NTP}} \times \left( \frac{\text T}{273.15} \right)^n \times \left( \frac{101'325}{\text P} \right)$${#eq-Dfa}
Calculate mean moisture between depth increments:
$$
\theta_{w} = \frac{1}{2}(m_d + m_{d+1})
$${#eq-theta-w}
Where $m$ is the moisture and $d$ is the depth index.
All the computations described in [equations @eq-frick] [-@eq-Ds], [-@eq-H], [-@eq-Dfa], [-@eq-Dfw] and [-@eq-theta-w] are performed by calling the following function from the R package `PRE`.
```{r}
data <- calculateFluxes(data = data, parameters = getParameters())
```