# arma-models
- [Time Series Analysis by Wei](http://ruangbacafmipa.staff.ub.ac.id/files/2012/02/Time-Series-Analysis-by.-Wei.pdf)
- [Introduction to Time Series and Forecasting by Brockwell and Davis](https://gejza.nipax.cz/_media/stochasticke_procesy:2002-brockwell-introduction_time_series_and_forecasting.pdf)
# Stationarity
#### Definition
A process is **stationary** if its distribution doesn't change with time. Specifically:
- $n^{th}$ order stationarity is when $(y_{t_1},\dots,y_{t_n}) \overset{d}= (y_{t_1+k},\dots,y_{t_n+k})$ for any $n$-tuple and integer $k$.
- If the property holds for any value of $n$, then the process is **strongly stationary**.
- A process is $n^{th}$ order **weakly stationary** if the $1^{st}$ to $n^{th}$ moments are jointly time-invariant.
Note that neither strong nor weak stationarity necessarily implies the other:
- e.g. A Cauchy iid process is strongly but not weakly stationary because the moments aren't finite.
- e.g. An alternating sequence of $N(0,1)$ and $Uniform(\{-1, 1\})$ is weakly but not strongly stationary.
#### Parameters
If a $2^{nd}$ order weakly stationary process has finite first and second moments, then:
- The mean and variance are time-invariant. We denote these $\mu_t = \mu$ and $\sigma^2_t = \sigma^2$.
- The autocovariance (ACVF) and autocorrelation (ACF) functions only depend on the lag $k$ between two time points. We denote these $\gamma(t_1,t_2) = \gamma_k$ and $\rho(t_1,t_2) = \rho_k$.
This is also called **covariance stationarity**. Assuming this form of weak stationarity is sufficient for most models, since we typically assume Gaussianity (which is uniquely determined by its first and second moments).
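To make the lag-only dependence concrete, here is a minimal numpy sketch (a toy illustration of mine, not from the texts) of the sample ACVF and ACF estimators $\widehat{\gamma}_k$ and $\widehat{\rho}_k$:
```python
import numpy as np

def sample_acvf(y, k):
    """Sample autocovariance at lag k: average of (y_t - ybar)(y_{t+k} - ybar)."""
    y = np.asarray(y, dtype=float)
    ybar, n = y.mean(), len(y)
    return np.sum((y[: n - k] - ybar) * (y[k:] - ybar)) / n

def sample_acf(y, k):
    """Sample autocorrelation at lag k: rho_k = gamma_k / gamma_0."""
    return sample_acvf(y, k) / sample_acvf(y, 0)

rng = np.random.default_rng(0)
y = rng.normal(size=500)  # iid Gaussian noise: rho_k should be near 0 for k >= 1
print([round(sample_acf(y, k), 3) for k in range(4)])
```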
# Partial autocorrelation
The PACF is
$$\frac{Cov[y_t, y_{t+k} \vert y_{t+1}, \dots, y_{t+k-1}]}{\sqrt{Var[y_t \vert y_{t+1}, \dots, y_{t+k-1}]} \sqrt{Var[y_{t+k} \vert y_{t+1}, \dots, y_{t+k-1}]}}$$
which is similar to the ACF $\rho_k$ but conditional on all values between $t$ and $t+k$.
The idea of conditioning on these values is expressed by the **residuals** from the linear regressions
$$\widehat{y}_{t} = \alpha_0 + \alpha_1 y_{t+1} + \dots + \alpha_{k-1} y_{t+k-1}$$
$$\widehat{y}_{t+k} = \beta_0 + \beta_1 y_{t+1} + \dots + \beta_{k-1} y_{t+k-1}$$
in that the PACF can be written like the ACF of the residuals:
$$\frac{Cov[y_t - \widehat{y}_t,\, y_{t+k} - \widehat{y}_{t+k}]}{\sqrt{Var[y_t - \widehat{y}_t]} \sqrt{Var[y_{t+k} - \widehat{y}_{t+k}]}}$$
There is a closed-form expression for this, but it's unnecessary for now. See Section 2.3 of Wei if curious.
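To see the residual-regression view in action, here is a small numpy sketch (my own illustration, not Wei's closed form) that regresses $y_t$ and $y_{t+k}$ on the intermediate values and correlates the residuals:
```python
import numpy as np

def pacf_at_lag(y, k):
    """PACF at lag k via the two regressions on the k-1 intermediate values."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    windows = np.array([y[t : t + k + 1] for t in range(n - k)])  # rows (y_t, ..., y_{t+k})
    left, right, middle = windows[:, 0], windows[:, -1], windows[:, 1:-1]
    X = np.column_stack([np.ones(len(windows)), middle])          # intercept + intermediate values
    res_left = left - X @ np.linalg.lstsq(X, left, rcond=None)[0]
    res_right = right - X @ np.linalg.lstsq(X, right, rcond=None)[0]
    return np.corrcoef(res_left, res_right)[0, 1]

rng = np.random.default_rng(1)
eps, y = rng.normal(size=2000), np.zeros(2000)
for t in range(1, 2000):                       # simulate an AR(1) with phi = 0.6
    y[t] = 0.6 * y[t - 1] + eps[t]
print(round(pacf_at_lag(y, 1), 2), round(pacf_at_lag(y, 2), 2))  # ~0.6, then ~0
```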
# MA(q) process
The MA(q) process is defined:
$$y_t = \mu + \epsilon_t + \psi_1 \epsilon_{t-1} + \dots + \psi_q \epsilon_{t-q}$$
where $\epsilon_t$ is iid with mean $0$ and variance $\sigma_{\epsilon}^2$.
Compactly,
$$y_t - \mu = \psi(B) \epsilon_t$$
where $\psi(B) = (1 + \psi_1 B + \psi_2 B^2 + \dots + \psi_q B^q)$.
#### Application
Useful for modeling phenomena in which events produce an immediate-but-short-lasting effect.
#### Assumptions
The finite-order **MA(q) process is always stationary**.
More generally, for an infinite-order MA process to be stationary, we require that $\sum_{j=1}^{\infty} \psi_j^2 \lt \infty$ to ensure finite variance (see next section).
#### Properties
- Mean
$$\begin{align}
E[y_t] &= \mu + \psi_1 E[\epsilon_{t-1}] + \dots + \psi_q E[\epsilon_{t-q}] \\
&= \mu
\end{align}$$
- Variance (where $\psi_0 = 1$):
$$\begin{align}
Var[y_t] &= Var[\epsilon_t] + \psi_1^2 Var[\epsilon_{t-1}] + \dots + \psi_q^2 Var[\epsilon_{t-q}] \\
&= \sigma_{\epsilon}^2 + \sigma_{\epsilon}^2 \sum_{j=1}^q \psi_j^2 \\
&= \sigma^2_{\epsilon} \sum_{j=0}^q \psi_j^2
\end{align}$$
- Autocovariance disappears for lag $k \gt q$ (with the conventions $\psi_0 = 1$ and $\psi_j = 0$ for $j \gt q$):
$$\begin{align}
\gamma_k &= E[(y_t - \mu)(y_{t+k} - \mu)] \\
&= E\left[ \left( \sum_{i=0}^q \psi_i \epsilon_{t-i} \right) \left( \sum_{j=0}^q \psi_j \epsilon_{t+k-j} \right) \right] \\
&= E\left[ \sum_{i=0}^q \sum_{j=0}^q \psi_i \psi_j \epsilon_{t-i} \epsilon_{t+k-j} \right] \\
&= \begin{cases}
\sigma^2_{\epsilon} \sum_{j=0}^q \psi_j \psi_{j+k} & k \le q\\
0 & k \gt q
\end{cases} \\
&= \begin{cases}
\sigma^2_{\epsilon} \psi_k + \sigma^2_{\epsilon} \sum_{j=1}^q \psi_j \psi_{j+k} & k \le q\\
0 & k \gt q
\end{cases}
\end{align}$$
- PACF exhibits an exponential decay or damped sine wave as $k \to \infty$, depending on the sign of the $\psi$ terms
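As a quick check of these properties, here is a short simulation sketch (coefficients chosen arbitrarily by me) showing that the sample ACF of an MA(2) cuts off after lag $q = 2$ while the PACF tails off:
```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(2)
n, psi1, psi2 = 5000, 0.7, 0.4
eps = rng.normal(size=n + 2)
y = eps[2:] + psi1 * eps[1:-1] + psi2 * eps[:-2]   # y_t = eps_t + psi1*eps_{t-1} + psi2*eps_{t-2}

print(np.round(acf(y, nlags=5), 2))   # lags 3, 4, 5 should be close to 0
print(np.round(pacf(y, nlags=5), 2))  # decays gradually rather than cutting off
```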
# AR(p) process
The AR(p) process is defined:
$$y_t = \mu + \phi_1 (y_{t-1} - \mu) + \dots + \phi_p (y_{t-p} - \mu) + \epsilon_t$$
where $\epsilon_t$ is iid with mean $0$ and variance $\sigma_{\epsilon}^2$.
Compactly,
$$\phi(B)(y_t - \mu) = \epsilon_t$$
where $\phi(B) = (1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p)$.
#### Properties
Assuming stationarity:
- Mean
$$\begin{align}
E[y_t] &= \mu + \phi_1 (E[y_{t-1}] - \mu) + \dots + \phi_p (E[y_{t-p}] - \mu) \\
&= \mu + (E[y_t] - \mu) \sum_{j=1}^p \phi_j \\
&= \frac{\mu - \mu \sum_{j=1}^p \phi_j }{1 - \sum_{j=1}^p \phi_j } \\
&= \mu
\end{align}$$
- Variance (multiply the centered equation by $y_t - \mu$ and take expectations; note $E[\epsilon_t (y_t - \mu)] = \sigma^2_{\epsilon}$, and the lagged terms are correlated so their variances don't simply add):
$$\begin{align}
Var[y_t] = \gamma_0 &= \phi_1 \gamma_1 + \dots + \phi_p \gamma_p + \sigma^2_{\epsilon} \\
&= \gamma_0 \sum_{j=1}^p \phi_j \rho_j + \sigma^2_{\epsilon} \\
&= \frac{\sigma_{\epsilon}^2}{1 - \sum_{j=1}^p \phi_j \rho_j}
\end{align}$$
For AR(1) this reduces to $\sigma^2_{\epsilon} / (1 - \phi_1^2)$.
- Autocovariance (without loss of generality, assume $\mu = 0$) exhibits exponential decay or a damped sine wave as $k \to \infty$, depending on the signs of the $\phi$ terms. For $k \ge 1$, using $E[\epsilon_t y_{t-k}] = 0$:
$$\begin{align}
y_t y_{t-k} &= (\phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t) y_{t-k}\\
\gamma_k = E[y_t y_{t-k}] &= E[\phi_1 y_{t-1}y_{t-k} + \dots + \phi_p y_{t-p}y_{t-k} + \epsilon_t y_{t-k} ]\\
&= \phi_1 \gamma_{k-1} + \dots + \phi_p \gamma_{k-p} \\
&= \sum_{j=1}^p \phi_j \gamma_{k-j}
\end{align}$$
- PACF disappears for lag $k \gt p$.
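Mirroring the MA check above, this sketch (again with arbitrary but stationary coefficients) simulates an AR(2) and shows the PACF cutting off after lag $p = 2$ while the ACF decays:
```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(3)
n, phi1, phi2 = 5000, 0.5, 0.3
eps, y = rng.normal(size=n), np.zeros(n)
for t in range(2, n):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

print(np.round(acf(y, nlags=5), 2))   # decays gradually
print(np.round(pacf(y, nlags=5), 2))  # lags 3, 4, 5 should be close to 0
```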
# Invertibility
A process is **invertible** if it can be written in an AR form. In fact, the finite-order **AR(p) process is always invertible**.
More generally, for an infinite-order AR process to be invertible, we require that $\sum_{j=1}^{\infty} \vert \phi_j \vert \lt \infty$.
Box & Jenkins argued that only invertible processes are worth forecasting.
Note that invertibility is not the same as stationarity:
- A stationary MA process is invertible if the roots of the equation $\psi(B) = 0$ have absolute value $\gt 1$.
- An invertible AR process is stationary if the roots of the equation $\phi(B) = 0$ have absolute value $\gt 1$.
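The root conditions are easy to check numerically. A small sketch (the example coefficients are mine): numpy's polynomial routines take coefficients in increasing powers of $B$, so $\psi(B) = 1 + 0.7B + 0.4B^2$ becomes `[1, 0.7, 0.4]`.
```python
import numpy as np

def roots_outside_unit_circle(coeffs):
    """True if every root of the polynomial (coeffs in increasing powers of B) has modulus > 1."""
    return bool(np.all(np.abs(np.polynomial.polynomial.polyroots(coeffs)) > 1))

# MA(2) with psi = (0.7, 0.4): invertible, since the roots of 1 + 0.7B + 0.4B^2 lie outside the unit circle
print(roots_outside_unit_circle([1, 0.7, 0.4]))   # True
# AR(1) with phi = 1.1: the root of 1 - 1.1B lies inside the unit circle, so not stationary
print(roots_outside_unit_circle([1, -1.1]))       # False
```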
# ARMA
The ARMA(p,q) process is defined:
$$y_t = \mu + \phi_1 (y_{t-1} - \mu) + \dots + \phi_p (y_{t-p} - \mu) + \epsilon_t + \psi_1 \epsilon_{t-1} + \dots + \psi_q \epsilon_{t-q}$$
or compactly,
$$\phi(B)(y_t - \mu) = \psi(B) \epsilon_t$$
#### Method of moments estimation
For example, let's consider the stationary AR(2) process:
$$y_t = \mu + \phi_1 (y_{t-1} - \mu) + \phi_2 (y_{t-2} - \mu) + \epsilon_t$$
First, we estimate the mean with its sample estimator $\mu \approx \bar{y}$.
Next, we estimate $\phi_1$ and $\phi_2$ using the ACF at $k = 1,2$:
$$
\begin{align}
\rho_1 &= \frac{\gamma_1}{\gamma_0} = \frac{\phi_1 \gamma_0 + \phi_2 \gamma_1}{\gamma_0} = \phi_1 + \phi_2 \rho_1 \\
\rho_2 &= \frac{\gamma_2}{\gamma_0} = \frac{\phi_1 \gamma_1 + \phi_2 \gamma_0}{\gamma_0} = \phi_1 \rho_1 + \phi_2
\end{align}
$$
If we estimate $\rho_k \approx \widehat{\rho}_k$, then we're left with two equations and two unknowns. Solving the system of linear equations:
$$
\begin{pmatrix} \widehat{\rho}_1 \\ \widehat{\rho}_2 \end{pmatrix} = \begin{pmatrix} 1 & \widehat{\rho}_1 \\
\widehat{\rho}_1 & 1 \end{pmatrix} \begin{pmatrix}
\phi_1 \\
\phi_2
\end{pmatrix}
$$
will get us the estimators for $\phi_1$ and $\phi_2$.
This procedure is also called Yule-Walker estimation. It's generally not recommended because the moment equations become nonlinear and unwieldy for MA or ARMA processes.
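For the AR(2) case above, the method-of-moments estimate is just a $2 \times 2$ linear solve once the sample autocorrelations are plugged in. A minimal sketch (simulated data, coefficients of my choosing):
```python
import numpy as np

def sample_acf(y, k):
    y = np.asarray(y, dtype=float)
    ybar, n = y.mean(), len(y)
    gamma = lambda j: np.sum((y[: n - j] - ybar) * (y[j:] - ybar)) / n
    return gamma(k) / gamma(0)

rng = np.random.default_rng(4)
n, phi1, phi2 = 5000, 0.5, 0.3
eps, y = rng.normal(size=n), np.zeros(n)
for t in range(2, n):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

r1, r2 = sample_acf(y, 1), sample_acf(y, 2)
# (r1, r2)^T = [[1, r1], [r1, 1]] (phi1, phi2)^T  ->  solve the 2x2 system
phi_hat = np.linalg.solve(np.array([[1.0, r1], [r1, 1.0]]), np.array([r1, r2]))
print(np.round(phi_hat, 2))  # should be close to (0.5, 0.3)
```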
#### Maximum likelihood estimation
For example, let's consider the stationary AR(1) process where $\mu = 0$. Furthermore, we're now assuming $\epsilon_t$ are Gaussian:
$$y_t = \phi y_{t-1} + \epsilon_t$$
Then the likelihood can be factorized:
$$\begin{align}
f(y_1,\dots,y_n) &= f(y_1) f(y_2 \vert y_1) \cdots f(y_n \vert y_{n-1}, \dots, y_1) \\
&= f(y_1) f(y_2 \vert y_1) \cdots f(y_n \vert y_{n-1})
\end{align}$$
The components are derived:
$$
\begin{align}
f(y_1) &= N\left(0, \frac{\sigma^2_{\epsilon}}{1 - \phi^2}\right) \\
\log f(y_1) &= - \frac{1}{2} \log 2\pi \sigma_{\epsilon}^2 + \frac{1}{2} \log (1 - \phi^2) - \frac{(1-\phi^2) y_1^2}{2\sigma^2_{\epsilon}}
\end{align}
$$
and
$$
\begin{align}
f(y_t \vert y_{t-1}) &= N\left(\phi y_{t-1}, \sigma^2_{\epsilon} \right) \\
\log f(y_t \vert y_{t-1}) &= - \frac{1}{2} \log 2\pi \sigma_{\epsilon}^2 - \frac{(y_t - \phi y_{t-1})^2}{2\sigma^2_{\epsilon}}
\end{align}
$$
Therefore, the log-likelihood is
$$\begin{align}
l(\phi, \sigma^2_{\epsilon}) &= \log f(y_1) + \sum_{t=2}^n \log f(y_t \vert y_{t-1}) \\
&= -\frac{n}{2}\log 2\pi\sigma^2_{\epsilon} + \frac{1}{2} \log (1 - \phi^2) - \frac{y_1^2(1-\phi^2) + \sum_{t=2}^n (y_t - \phi y_{t-1})^2}{2\sigma^2_{\epsilon}}
\end{align}$$
Differentiating gives us:
$$\begin{align}
\frac{\partial l}{\partial \sigma^2_{\epsilon}} &= -\frac{n}{2\sigma^2_{\epsilon}} + \frac{y_1^2(1-\phi^2) + \sum_{t=2}^n (y_t - \phi y_{t-1})^2}{2\sigma^4_{\epsilon}} := 0
\end{align}$$
and
$$\begin{align}
\frac{\partial l}{\partial \phi} &= \frac{y_1^2 \phi + \sum_{t=2}^n (y_t - \phi y_{t-1}) y_{t-1}}{\sigma^2_{\epsilon}} - \frac{\phi}{1 - \phi^2} := 0
\end{align}$$
Solving for zero gives us:
$$\begin{align}
\widehat{\sigma}^2_{\epsilon} &= \frac{y_1^2(1-\phi^2) + \sum_{t=2}^n (y_t - \phi y_{t-1})^2}{n}
\end{align}$$
and
$$\begin{align}
\widehat{\phi} \underbrace{\left( \frac{\widehat{\sigma}^2_{\epsilon} }{1 - \widehat{\phi}^2} - y_1^2 \right)}_{\delta} &= \sum_{t=2}^n (y_t - \widehat{\phi} y_{t-1}) y_{t-1}
\end{align}$$
Note that $E[y_1^2] = Var[y_1] = \sigma^2_{\epsilon}/(1 - \phi^2)$, so the $\delta$ term has expected value approximately $0$ and the LHS is near $0$. Setting the RHS to $0$ gives the normal equation for a simple linear regression of $y_t$ on $y_{t-1}$. Hence, the MLE for $\phi$ should be close to the least squares estimate of $\phi$.
The exact Gaussian likelihood can be written down for general ARMA(p,q) models, but the resulting MLEs have no simple closed form and are found numerically; the derivation is too involved to cover here.
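For the AR(1) example, the exact log-likelihood derived above can simply be maximized numerically. A sketch (simulated data, assuming `scipy` is available) comparing the MLE to the least-squares estimate:
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, phi_true = 2000, 0.7
eps, y = rng.normal(size=n), np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + eps[t]

def neg_loglik(params):
    phi, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)          # parameterize by log(sigma^2) to keep it positive
    if abs(phi) >= 1:
        return np.inf                    # outside the stationary region
    rss = y[0] ** 2 * (1 - phi ** 2) + np.sum((y[1:] - phi * y[:-1]) ** 2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.log(1 - phi ** 2) + rss / (2 * sigma2)

phi_mle = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead").x[0]
phi_ols = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # regression of y_t on y_{t-1}
print(round(float(phi_mle), 3), round(float(phi_ols), 3))  # nearly identical
```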
#### Least squares estimation
An AR(p) model is linear in the parameters, so it fits within the standard least squares regression framework.
But models with MA terms aren't linear in the parameters. For example, in an ARMA(1,1):
$$y_t = \mu + \phi(y_{t-1} - \mu) + \epsilon_t + \psi \epsilon_{t-1}$$
the lagged error $\epsilon_{t-1}$ is unobserved and depends on the parameters themselves, so the residuals must be computed recursively and the sum of squares minimized iteratively (conditional least squares).
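A minimal sketch of conditional least squares for this ARMA(1,1) (with $\mu = 0$ and illustrative coefficients): the residuals $\epsilon_t$ are reconstructed recursively for each candidate $(\phi, \psi)$ and the sum of squares is minimized numerically.
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, phi_true, psi_true = 3000, 0.6, 0.3
eps, y = rng.normal(size=n), np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + eps[t] + psi_true * eps[t - 1]

def css(params):
    """Conditional sum of squares, with e_0 set to 0."""
    phi, psi = params
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = y[t] - phi * y[t - 1] - psi * e[t - 1]
    return np.sum(e[1:] ** 2)

res = minimize(css, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(np.round(res.x, 2))   # should be close to (0.6, 0.3)
```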
# time-series-notes
These are notes summarizing the main concepts from the Brockwell and Davis time series forecasting text.
## Classical decomposition
Suppose we observe the sequence $y_1,\dots,y_n$.
#### Trend
Assume the model:
$$y_t = m_t + \epsilon_t$$
where $m_t$ is a deterministic **trend** component, and $\epsilon_t$ is random noise with mean $0$.
We can assume $m_t$ is some function of $t$ parameterized by $\theta$ which we estimate via usual regression model fitting procedures.
For example, the trend could be a polynomial function of time $m_t = \theta_0 + \theta_1 t + \theta_2 t^2$, and we can estimate the parameters using least squares minimization:
$$\min_{\theta} \sum_{t=1}^n (y_t - m_t)^2$$
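For instance, a quadratic trend fit by least squares looks like this in numpy (toy simulated data, coefficients chosen for illustration):
```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
t = np.arange(n)
y = 2.0 + 0.05 * t + 0.001 * t ** 2 + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), t, t ** 2])     # design matrix for theta_0 + theta_1 t + theta_2 t^2
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
m_hat = X @ theta_hat                            # fitted trend m_t
detrended = y - m_hat                            # residuals epsilon_t
print(np.round(theta_hat, 3))
```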
#### Seasonality
Assume the model:
$$y_t = s_t + \epsilon_t$$
where $s_t$ is a deterministic **seasonality** component, and $\epsilon_t$ is still random noise with mean $0$.
We typically use a periodic function like a **Fourier expansion of order $k$** where:
$$s_t = \gamma + \sum_{j=1}^k \left( \alpha_j \cos \frac{2\pi t j}{d} + \beta_j \sin \frac{2\pi t j}{d}\right)$$
where $d$ is the number of observations in a single seasonal **period**. For example, $d=12$ for annual seasonality in monthly observations.
We can estimate $\gamma$, $\alpha_j$'s and $\beta_j$'s using least squares as well. We do this by computing **seasonal features** corresponding to the $\cos$ and $\sin$ terms evaluated at each $t$, and subsequently regressing $y_t$ on these $2k$ features.
Fitting a periodic function is also called **harmonic** regression. Optimal $k$ can be selected via cross validation.
The R package `forecast` has a function that generates the $n \times 2k$ matrix of seasonality features. It requires $k \le d/2$, presumably because harmonics above frequency $d/2$ are aliased with lower-frequency ones at the observed time points.
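A sketch of harmonic regression in Python (the `fourier_features` helper below is my own stand-in for the feature matrix the R `forecast` package generates, not its actual implementation):
```python
import numpy as np

def fourier_features(n, d, k):
    """n x 2k matrix of cos/sin seasonal features for period d and Fourier order k."""
    t = np.arange(1, n + 1)[:, None]
    j = np.arange(1, k + 1)[None, :]
    angles = 2 * np.pi * t * j / d
    return np.hstack([np.cos(angles), np.sin(angles)])

rng = np.random.default_rng(8)
n, d, k = 240, 12, 2                                   # 20 years of monthly data, order-2 expansion
t = np.arange(1, n + 1)
y = 5 + 3 * np.sin(2 * np.pi * t / d) + rng.normal(size=n)

X = np.column_stack([np.ones(n), fourier_features(n, d, k)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)           # gamma, then the alpha_j's and beta_j's
s_hat = X @ coef                                       # fitted seasonal component s_t
print(round(float(np.mean((y - s_hat) ** 2)), 2))      # residual variance should be near 1
```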
#### Usage
Typically, we perform the classical decomposition with both trend and seasonality components on the observed series. Then, we take the residuals $e_t = y_t - m_t - s_t$ as our new series for fitting probabilistic models that assume **stationarity**.
In other words, the classical decomposition is a **preprocessing step** (like smoothing, differencing, removing outliers, Box-Cox transforming, etc.) since we assume the learned $m_t$ and $s_t$ terms are fixed upon performing likelihood-based modeling of $e_t$.
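Putting the two pieces together, a small sketch (toy data) of the preprocessing step: fit trend and seasonality jointly by least squares and keep the residuals $e_t$ for later stationary modeling.
```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 240, 12
t = np.arange(1, n + 1)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / d) + rng.normal(size=n)

X = np.column_stack([
    np.ones(n), t,                                        # trend: intercept + linear term
    np.cos(2 * np.pi * t / d), np.sin(2 * np.pi * t / d)  # order-1 seasonal features
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                                          # e_t = y_t - m_t - s_t
print(round(float(e.mean()), 3), round(float(e.std()), 2))
```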
## Stationary processes
See other set of notes on stationarity and ARMA processes.
#### Differencing
$B$ is the backshift operator such that $B y_t = y_{t-1}$. Then the $k^{th}$-order difference is $\Delta^k y_t = (1 - B)^k y_t$, which we can expand using the binomial theorem. For trend, $k = 1, 2$ is usually enough.
This has benefits over trend modeling:
- Doesn't require parameter estimation
- Allows the trend to change over time (changes in slope are captured by higher-order differencing)
- Can remove seasonal components with seasonal differencing $(1 - B^d) y_t = y_t - y_{t-d}$, followed by ordinary first-order differencing for any remaining trend (see the sketch below)
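A quick numpy sketch of seasonal-then-trend differencing on a toy monthly series (slopes and amplitudes are made up for illustration):
```python
import numpy as np

rng = np.random.default_rng(10)
n, d = 240, 12
t = np.arange(n)
y = 0.1 * t + 3 * np.sin(2 * np.pi * t / d) + rng.normal(scale=0.5, size=n)

seasonal_diff = y[d:] - y[:-d]             # (1 - B^d) y_t removes the seasonal component
trend_diff = np.diff(seasonal_diff)        # (1 - B) then removes the remaining (linear) trend
print(round(float(trend_diff.mean()), 3))  # should be near 0 once trend and season are removed
```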
# Evaluating performance
We have a procedure that takes a stream of values $y_1,\dots,y_n$ and produces a forecast $\widehat{y}_{n+k} \approx y_{n+k}$. We have several candidate procedures and want to select the one that produces the most accurate forecasts. For now, let's say we're only interested in $k = 1$, or one-step forecasts.
Then accuracy is a parameter of interest that we want to estimate. In the cross-sectional data setting, if we're interested in $E[(y - \widehat{y})^2]$, then we can use the estimate $\frac{1}{m} \sum_{i=1}^m (y_i - \widehat{y}_i)^2$ where $m$ is the size of the test set.
But this is more complicated in the time series setting because observations aren't independent.
### Fixed origin
We estimate some model $f_{1:}$
- [Source 1](http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/)
- Source 2
### Rolling origin
Suppose we knew that
We estimate some model $f_{1:100}$
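A hedged sketch of rolling-origin one-step evaluation (the AR(1)-style forecaster below is just a placeholder for whatever candidate procedure is being compared, and the data is simulated): refit on $y_1, \dots, y_m$ for each expanding origin $m$ and score the forecast of $y_{m+1}$.
```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
eps, y = rng.normal(size=n), np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + eps[t]

def fit_and_forecast(history):
    """Fit y_t ~ phi * y_{t-1} by least squares and forecast one step ahead."""
    phi = np.sum(history[1:] * history[:-1]) / np.sum(history[:-1] ** 2)
    return phi * history[-1]

errors = []
for origin in range(100, n - 1):                 # expanding window, starting once ~100 points are available
    forecast = fit_and_forecast(y[: origin + 1])
    errors.append(y[origin + 1] - forecast)
print(round(float(np.mean(np.square(errors))), 2))  # estimated one-step MSE
```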