fastBigStat

A fast, practical introduction to asymptotic normality of statistical estimators.

Includes the necessary support material on random vectors and the multivariate (MV) normal (Gaussian) distribution family, as well as some insight into statistical inference based on normal distributions.

See also my fast introduction to statistics.

Author: Norm Matloff, UC Davis; bio

Prerequisites

  • Calculus-based probability theory -- random variables, cdfs, density functions, expected value etc.

  • Linear algebra.

  • Familiarity with limits.

Notation

  • Vectors will be in column form.

  • Prime (') will indicate matrix transpose.

  • The identity matrix, of whatever given size, will be denoted by I.

Some Preliminaries

  • Consider a random vector X = (X1,...,Xp)'.

  • E(X) means the vector of componentwise means E(Xi).

  • Covariance

    • For random variables U and V, with means α and β, define

      Cov(U,V) = E[(U-α)(V-β)]

      Intuitively, if U tends to be above its mean α when V is above its mean β, the covariance will be positive, and so on. It's like correlation, and in fact the latter is a scaled version of covariance,

      ρ(U,V) = Cov(U,V) / [sqrt(Var(U)) sqrt(Var(V))]

    • The covariance matrix Cov(X) of a random vector X = (X1,...,Xp)' is a p x p matrix with (i,j) element Cov(Xi,Xj), which reduces to Var(Xi) when i = j.

    • In matrix terms, Cov(X) = E[(X - μ) (X - μ)'], where μ = E(X).

    • Let Σ be a covariance matrix. As a real, symmetric matrix, it can be diagonalized, i.e. Σ = P D P' for an orthogonal matrix P and a diagonal matrix D. Let W = P D^(1/2) P', where the square root matrix D^(1/2) simply takes the square roots of the diagonal elements of D. Then W^2 = Σ, and we can thus write W = Σ^(1/2). (An R check of this appears at the end of this section.)

  • Law of Iterated Expectation

    • Let U and V be random quantities (scalars here, for convenience) for which the expectations below exist. Then

      E[ E(V|U) ] = E(V)

      Note that E(V|U) is treated as a random variable, which it in fact is, since it is a function of U.

    • Intuition: Suppose we wish to find the mean height of all UC Davis students. We could ask each academic department to find the mean height of their students, then average those mean heights. (Since some departments are larger than others, that latter average is weighted.)

  • Role of the Data

    • Say we have data W1,...,Wn.

    • The statistical view is to treat the data as a random sample from a population.

    • In computer science, the data are often viewed as fixed, but some analyses describe the data as coming from a "data generating mechanism" and the like.

    • We take the statistical view here. Typically the data are assumed independent and identically distributed (iid). That first trait arises from sampling with replacement (for large n, not much of an issue), and the second means that each entity in the population has the same probability of being sampled.

    • Note that this means that each Wi has the distribution of the population. E.g. if 29% of the units (e.g. people) in the population have value below 221.4, then P(Wi < 221.4) = 0.29.

      Similarly, if the population quantity is normally distributed, that means each Wi has a normal distribution (with the same mean and variance as the population).

    • The sample mean

      Wbar = (1/n) (W1+...+Wn)

      being the sample analog of the population mean, is a natural estimate (i.e. one that we might think of first if we were inventing statistics) of the population mean E(W), where W has the distribution of the population.

    • Similarly, since

      Var(W) = E[(W - E(W))^2],

      a natural estimator of Var(W) is

      s^2 = (1/n) Σi (Wi - Wbar)^2

      Suppose now W is a vector. The similar estimator of Cov(W) is

      (1/n) Σi (Wi - Wbar) (Wi - Wbar)'

      Many books will use n-1 rather than n as a divisor, to make s^2 unbiased, i.e.

      E(s^2) = Var(W)

      This says the average of s^2, over all possible samples, is equal to the population variance. This usually won't matter for us, as we are primarily interested in large-n settings.
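
Here is a minimal R sketch of two of the ideas above: forming the square-root matrix Σ^(1/2) by diagonalization, and computing the divisor-n sample mean and covariance estimates. The particular Σ and the simulated data are made up for illustration.

# square-root matrix W = P D^(1/2) P', checked by squaring it
Sigma <- rbind(c(4, 1),
               c(1, 2))                     # an example covariance matrix
eig <- eigen(Sigma)                         # Sigma = P D P', P orthogonal
P <- eig$vectors
W <- P %*% diag(sqrt(eig$values)) %*% t(P)  # Sigma^(1/2)
W %*% W                                     # recovers Sigma

# sample analogs, on simulated data with covariance matrix Sigma
n <- 10000
X <- matrix(rnorm(2 * n), ncol = 2) %*% W   # rows are iid with Cov = Sigma
Wbar <- colMeans(X)                         # estimate of E(X)
Xc <- sweep(X, 2, Wbar)                     # centered data
(1/n) * t(Xc) %*% Xc                        # divisor-n estimate of Cov(X)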

"Exact" vs. Approximate Statistical Inference: Historical Perspective

  • Say we have scalars Wi as above, with E(Wi) = μ and Var(Wi) = σ^2. The latter two are population values, and we wish to estimate μ in particular.

  • When statistical inference methodology -- confidence intervals (CIs) and hypothesis tests -- was being developed, there were no computers, so to simplify things the developers simply assumed that the sampled quantity, W above, has a normal distribution.

  • The pioneers of statistics didn't have asymptotic theory available either, so the methodology they developed is called "exact"; the probabilities etc. are exactly correct (if the assumption of normal Wi holds).

  • The notion of the normal distribution goes back to the 18th and 19th centuries, especially due to the Central Limit Theorem (CLT).

  • One might say that the normal family was there in Nature, waiting to be discovered.

  • But related distribution families, notably Student-t, chi-squared and the F-distribution, are "man-made," in the following sense.

  • With some clever math, they found that certain quantities of interest had the same distribution as that of the sum of k squares of N(0,1) variables. They thus decided to give this distribution family a name, chi-squared, thus inventing a new distribution family. (And they laboriously tabulated this family's cdf by hand, just as they had for the normal.)

  • With some more clever math, they found that a certain quantity had the distribution of a N(0,1) random variable divided by the square root of an independent chi-squared random variable. So they invented a new distribution family accordingly, the Student-t distribution. For similar reasons, they invented the F-distribution family.

  • This gives us the test taught in all elementary statistics courses: To test the hypothesis H0: μ = c, form the expression T = (Wbar - c) / (s/n^0.5) (where s is as above, but with an n-1 divisor). This has a Student-t distribution with n-1 degrees of freedom, whose probabilities can then be used to form cutoff levels for the test.

  • For n = 50, for instance, the lower and upper 2.5% cutoffs are -2.01 and 2.01. So, we reject H0 if (Wbar - c) / (s/n^0.5) is outside these bounds. (A quick numerical check of these cutoffs appears at the end of this section.)

  • This is called "exact" inference, since the probabilities are exact, if the Wi are normal.

  • Later other distribution families were developed, such as the exponential and gamma, and thus exact methods were developed for them as well. This led to development of a general method known as maximum likelihood estimation (MLE).

  • Yet, people soon realized that in the expression T above, Wbar is approximately normal by the Central Limit Theorem. Moreover, the quantity s is approximately σ. So T is approximately N(0,1) distributed (this intuitive statement can be made rigorous; see below), even if W is not normal.

  • In other words, one can compare T to the N(0,1) distribution, and thus achieve approximately correct inference, even if the sample distribution was not normal. Hence the practice of approximate inference.

  • This led to a branch of mathematical statistics known as large sample theory. Central to it is the fact that if a sequence of random variables Zn is asymptotically normal, then so is any smooth function g(Zn) of it. Since the CLT says, roughly, that sums are approximately normal, sums provide such a sequence Zn, which when combined with a suitable function g gives us asymptotic normality in many useful situations.

  • E.g. for MLEs. The likelihood is a product, so the log likelihood is a sum, just what we need for the Central Limit Theorem! It doesn't quite fit the CLT, as the terms are not independent, but that's close enough; some additional theory takes care of the non-independence.

    We set the derivative of the log likelihood to 0, and solve for our parameter estimate. Here g becomes the inverse function that is needed to solve for the estimate, which is then approximately normal.

    The context here is a little different -- we are still assuming an exact distribution, e.g. exponential, for the sampled population, but the principle is the same. Without large sample theory, in most cases we would not be able to derive the exact distribution of the MLE.

  • The situation with linear regression models is similar, as will be seen below.

  • The famous statistician George Box famously said, "All models are wrong, but some are useful." No parametric model is truly correct in practice. No one is 10 feet tall, and no one's weight is negative, so heights and weights cannot be exactly normal. No actual random variable is continuous, as our measuring instruments are of only finite precision, so again models such as normal, exponential, gamma and so on are necessarily inexact. Even the iid assumption is often problematic. So the term exact inference is illusory.

  • Approximate methods are very popular, including with me; I always use N(0,1) instead of the Student-t distribution, for instance. But many analysts today still favor "exact" statistical methods. They note for example that there is always the question of "How large is 'large'?" if one uses asymptotics.
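
As a quick numerical check of the n = 50 example above (a sketch using only base R):

qt(0.975, df = 49)   # about 2.01, the "exact" Student-t cutoff for n = 50
qnorm(0.975)         # about 1.96, the large-sample N(0,1) approximation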

The Multivariate Normal Distribution Family

  • Consider a random vector X = (X1,...,Xp)'. The mean is now a vector μ, and the variance σ^2 now becomes the p x p covariance matrix Σ. Notation: N(μ,Σ).

  • Density: c exp[-(1/2) (t-μ)' Σ^(-1) (t-μ)], where the constant c makes the p-fold integral equal to 1. This is for the case in which Σ is invertible.

  • We will always assume Σ is invertible unless stated otherwise. This simply means that no Xi is a linear combination of the others.

  • For p = 1 we have the familiar "bell-shaped curve", while for p = 2, we get a 3-dimensional bell.

bivariate normal density

  • Every marginal distribution of X is also multivariate normal.

  • For a constant m x p matrix A, AX is also multivariate normal. In particular, if m = 1, then AX is univariate normal; the converse is also true, i.e. if a'X is normal for all a, then X is multivariate normal.

  • Conditional distributions: For simplicity, consider the conditional distribution of Xj given (X1,...,Xj-1, Xj+1,...,Xp) = w. Then (inspiration for the familiar linear model):

    • That distribution is normal.

    • The mean of that distribution is linear in w, i.e. of the form c'w for some c. (A fairly simple formula for c is available but not shown here.)

    • The variance of that distribution is constant in w.

  • For a symmetric matrix A and vector u, the scalar quantity u'Au is called a quadratic form. For p-variate normal X, the quadratic form M = (X-μ)' Σ^(-1) (X-μ) turns out to have a chi-squared distribution with p degrees of freedom. (An R check of this appears at the end of this section.)

  • So if η is the q-th quantile of the χ2 distribution with p degrees of freedom, then

    P(M < η) = q

  • In the above picture, horizontal slices, i.e. level sets, are ellipses. For p > 2, the regions are p-dimensional ellipsoids. This can be used to derive confidence regions in statistical applications, as will be seen later.
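
Here is a small R check of the quadratic-form fact, a sketch using MASS::mvrnorm (discussed again later) to simulate the vectors and base R's mahalanobis(), which computes (x-μ)' Σ^(-1) (x-μ); the particular μ and Σ are arbitrary.

library(MASS)   # for mvrnorm

mu <- c(0, 0)
Sigma <- rbind(c(4, 1),
               c(1, 2))
x <- mvrnorm(10000, mu, Sigma)     # 10000 bivariate normal vectors
M <- mahalanobis(x, mu, Sigma)     # the quadratic form, one value per row
eta <- qchisq(0.90, df = 2)        # 0.90 quantile of chi-squared, p = 2 df
mean(M < eta)                      # should be close to 0.90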

Example: Linear Models, Exact and Approximate Inference

Let's see how all this plays out with the linear model.

Let Y be a random variable and X = (X1,...,Xp)' a p-dimensional random vector. E.g. we might wish to predict Y = college GPA from X1 = high school GPA, X2 = SAT score and so on. Let's call this the "GPA example."

Classical linear model

The classic assumptions of the model are as follows. Let t = (t1,...,tp)'.

  • Linearity: E(Y | X = t) = β0 + β1 t1 + ... + βp tp, for some unknown population vector β = (β0, β1,..., βp)'.

  • Normality: The distribution of Y | X = t is normal for all t.

  • Homoscedasticity: Var(Y | X = t) is the same for all t, which we will denote by σ^2.

  • Independence: The random vectors (X1,Y1),..., (Xn,Yn) are independent.

The function r(t) = E(Y|X=t) is called the regression function of Y on X. The term does NOT necessarily mean a linear model. So the first bulleted item above says, "The regression function of Y on X is linear."

Fixed-X vs. random-X models.

  • In many if not most applications, the Xi are random, so that the vectors (X1,Y1),..., (Xn,Yn) are iid. In our GPA example above, we have sampled n students at random, and their X values -- HS GPA, SAT -- are thus random as well.

  • But historically, the X values have usually been taken to be fixed. A typical example would be, say, some industrial process in which a designed experiment is conducted, with pre-specified X values.

  • Today, the fixed-X view is still the typical one. In the GPA example, it would mean that we had sampled students with pre-specified HS GPA and SAT values, which would probably not be the case. However, we can still take the X values to be fixed, by conditioning on them.

  • In the fixed-X setting, we no longer have the second 'i' in 'iid', but it doesn't matter much. For instance, there are versions of the CLT that cover this situation.

Estimation

  • One estimates β from the data, using the formula for the estimates b = (b0, b1,...,bp)':

    b = (A'A)^(-1) A'W

    where A is the n x (p+1) matrix with all 1s in column 1 and, in column j > 1, all n values of Xj-1. Here W is the vector of all n values of Y. (An R check of this formula appears at the end of this subsection.)

  • Write column j+1 of A as (Xj1,...,Xjn)'. If say Xj is human age, then this vector consists of the ages of all people in our dataset.

  • The vector b, as a linear combination of W, has an exact MV normal distribution if Y|X is exactly normally distributed.

  • By the way, since

    E(Y | X) = Xextend' β

    where Xextend = (1,X)'

    we have that

    E(W | A) = A β

    and thus

    E(b | A) = β

    (unbiasedness).

  • What about Cov(b|A)? Start with the fact that, by assumption,

    Cov(W|A) = σ^2 I

    Write B = (A'A)^(-1) A', so that b = BW. Then after some algebra, we have

    Cov(b|A) = B (σ^2 I) B' = σ^2 (A'A)^(-1)

    Note that A is a function of the X data, but here we take it to be a constant, by considering the conditional distribution of b given A.

    Thus the confidence intervals (CIs) and regions we obtain below are also conditional. However, note that if, e.g., we have a conditional CI at the 95% level, then the unconditional coverage probability is also 95%, by the Law of Iterated Expectation above.
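
As promised, a minimal R check of the estimation formula, computing b = (A'A)^(-1) A'W directly and comparing to lm(); the choice of mtcars variables is arbitrary.

W <- mtcars$mpg
A <- cbind(1, mtcars$wt, mtcars$hp)       # all 1s in column 1, then the Xj
b <- solve(t(A) %*% A, t(A) %*% W)        # (A'A)^(-1) A'W
b
coef(lm(mpg ~ wt + hp, data = mtcars))    # same numbers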

Exact inference

  • The unknown parameter σ2 must be estimated from the data, as

    s^2 = [1/(n-p-1)] Σi (Yi - Ri b)^2

    where Ri is row i of A.

  • So the estimated conditional covariance matrix of b is

    s^2 (A'A)^(-1)

  • The stringent assumptions above enable exact statistical inference, i.e. exact confidence intervals and tests:

    • The quantity bi - βi, divided by its estimated standard deviation (the square root of diagonal element i+1 of the estimated Cov(b|A)), has a Student t-distribution with n-p-1 df, thus setting up a CI for βi.

    • In some cases, we may be interested in something like, say,

      β2 - β1 = c'β

      where c = (0,-1,1,0,0,...,0)'. Since c'b will then have a univariate normal distribution (recall the above material on MV normal distributions), we again can easily obtain a CI for the desired quantity. Note that we need the estimated variance of c'b, which is

      s^2 c'(A'A)^(-1)c

Approximate inference

The quantity A'W consists of sums, so b is asymptotically MV normally distributed. Thus for large n, we can perform inference without assuming a normal Y|X:

CIs for c'β: An approximate 95% confidence interval for, e.g. β2 - β1 is

c'b ± 1.96 [s^2 c'(A'A)^(-1)c]^0.5
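
A sketch of this interval in R, using vcov() to obtain the estimated s^2 (A'A)^(-1) matrix; the model and the contrast (here the wt coefficient minus the hp coefficient) are made up for illustration.

fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(fit)
V <- vcov(fit)                               # estimated s^2 (A'A)^(-1)
cc <- c(0, 1, -1)                            # c'beta = beta_wt - beta_hp
est <- sum(cc * b)                           # c'b
se <- sqrt(as.numeric(t(cc) %*% V %*% cc))   # [s^2 c'(A'A)^(-1) c]^0.5
est + c(-1.96, 1.96) * se                    # approximate 95% CI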

Relaxing the assumptions

Just as Box pointed out that no model is perfect, no assumption (really, part of a model) is perfect. Let's look at the normality assumption (Y|X is normally distributed), and the homoscedasticity assumption, that says Var(Y|X=t) is constant in t.

What can we say, assuming that linearity still holds?

  • Typically, the larger E(Y|X=t) is, the larger Var(Y|X=t) is. Say we are predicting weight from height. Since taller people tend to have more (interperson) variability in weight, homoscedasticity would be seriously violated.

  • If so, the estimator b will not be optimal, i.e. not Minimum Variance Unbiased. For optimality, weighted least squares must then be used, with weights equal to the reciprocal of the conditional variance. That is, we should minimize

    Σi [Var(Y|X=Xi)]^(-1) [Yi - b'Xi]^2

    The word "should" applies here, as we usually don't know that conditional variance function. (A short R sketch of weighted least squares appears after this list.)

  • But in the sense of consistency, it doesn't matter what weights we use. If the linearity assumption holds, then b will still be a consistent estimator of β, i.e. b → β as n → ∞. (See informal proof at the end of this section.)

  • As noted earlier, lack of normality of Y|X alone does not invalidate statistical inference, due to the CLT. But assuming homoscedasticity when it is not approximately correct does invalidate inference, as the conditional covariance matrix of b is no longer σ^2 (A'A)^(-1).
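
If the conditional variance function were known (it rarely is), weighted least squares is available via lm()'s weights argument. Below is a hedged sketch on simulated data, where the assumed variance form Var(Y|X=x) = x^2 is purely hypothetical.

set.seed(111)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)         # heteroscedastic: sd grows with x
olsFit <- lm(y ~ x)                       # unweighted; still consistent
wlsFit <- lm(y ~ x, weights = 1 / x^2)    # weights = 1 / conditional variance
coef(olsFit)
coef(wlsFit)                              # typically the less variable estimate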

The sandwich estimator

Concerning that latter point, an important application of large-sample theory is the sandwich estimator, which makes inference asymptotically valid even when Var(Y|X=t) is not constant in t. It makes a simple adjustment to the estimated Cov(b) matrix computed under the assumption of homogeneous variance.

One of the CRAN packages that implements this adjustment is named, of course, "sandwich." Here is an example:

library(sandwich)  # CRAN package for the adjusted covariance matrix
z <- lm(mpg ~ .,mtcars)  # built-in R dataset
covz <- vcov(z)  # estimated covariance matrix of b
covzadj <- sandwich(z)  # adjusted version of covz
covz['carb','carb']  # prints 0.6868307
covzadj['carb','carb']  # prints 0.4209969

Informal proof of consistency

For convenience, assume Random-X regression. Let w(t) be our weight function.

Note that for any nonnegative weight function w(t), the quantity

E[w(X)(Y - m(X))^2]

is minimized across all functions m by m(t) = E(Y|X=t), as can be seen by conditioning on X and applying ordinary calculus. So if linearity holds,

E[w(X)(Y - v'X)^2]

achieves its minimum at v = β for any weight function w, in particular the w having constant value 1. In other words,

E[(Y - v'X)^2]

achieves its minimum at v = β.

Say we blindly do OLS (ordinary least squares, i.e. unweighted), choosing v = b to minimize

(1/n) Σi [Yi - v'Xi]^2

which is the sample analog of

E[(Y - v'X)^2]

For each fixed v, that sample quantity will converge to the corresponding population quantity. Then intuitively, the v that minimizes

(1/n) Σi [Yi - v'Xi]^2

will converge to the v that minimizes

E[(Y - v'X)^2]

i.e. b will converge to β. However, a formal proof of this requires some real analysis.
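
A small simulation sketch of this consistency claim, on an invented heteroscedastic model: unweighted OLS slope estimates still approach the true value 2 as n grows.

set.seed(222)
estSlope <- function(n) {
   x <- runif(n, 1, 10)
   y <- 1 + 2 * x + rnorm(n, sd = x)   # heteroscedastic, but linearity holds
   coef(lm(y ~ x))[2]                  # unweighted OLS slope
}
sapply(c(100, 1000, 10000, 100000), estSlope)   # values drifting toward 2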

No-intercept models

In some settings, it is known from the nature of the problem that β0 = 0. Thus our model is

E(Y | X = t) = β1 t1 + ... + βp tp

R's lm function allows us to specify this via -1 ("delete the constant term") in the model specification.

In simulation settings, it's often inconvenient to have an intercept term, but one can always transform to a no-intercept model by subtracting off the mean of Y and of each X feature. This is easily done using the scale function, with center set to TRUE.
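
A brief R sketch of both points; the mtcars variables are arbitrary.

# no-intercept fit
fitNoInt <- lm(mpg ~ wt + hp - 1, data = mtcars)

# or center Y and the X features first, so the intercept becomes 0
ctr <- as.data.frame(scale(mtcars[, c('mpg', 'wt', 'hp')],
                           center = TRUE, scale = FALSE))
fitCtr <- lm(mpg ~ wt + hp - 1, data = ctr)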

Multiple Inference Procedures

Suppose we form two 95% confidence intervals from our dataset. Though each is individually valid at the 95% level, this is not the case jointly: in repeated sampling, the proportion of samples in which both CIs contain their associated population parameters is generally less than 95%. What if we want to form many CIs, and still have their joint confidence level at or above 95%?

This branch of statistics is called multiple inference or simultaneous inference.

The Bonferroni Inequality

One solution is to use the Bonferroni Inequality, which says that for any two events A and B,

P(A and B) ≥ 1 - P(not A) - P(not B)

This implies that if we want the joint confidence level of two CIs to be at least 95%, we can form each at the 97.5% level.
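
In R terms, with k intervals and a desired joint level of 95%, each individual interval is formed at level 1 - 0.05/k; for k = 2:

k <- 2
alpha <- 0.05 / k        # per-interval error rate, here 0.025
qnorm(1 - alpha / 2)     # z cutoff for each 97.5% CI, about 2.24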

The Scheffe' method

But this conservative approach is not feasible for forming more than two or three CIs. Another approach is the Scheffe' method, based on confidence ellipsoids.

How to form confidence ellipsoids

  • Consider the linear model first. If all the assumptions hold, the quantity

    (b-β)'[s^(-2) (A'A)] (b-β) / (p+1)

    has an F-distribution with (p+1,n-p-1) df. This is the finite-sample version of the quadratic form discussed earlier. This sets up a confidence ellipsoid for β:

    To form a (1-α) 100% confidence ellipsoid for β, let q be the upper-α quantile of the F distribution with (p+1,n-p-1) df. Then the confidence ellipsoid is the set of all t such that

    (b-t)'[s^(-2) (A'A)] (b-t) ≤ (p+1) q

  • For large n, and now not assuming sampling from a normal population, we form an approximate (1-α) 100% confidence ellipsoid for β as follows: Let q be the upper-α quantile of the χ2 distribution with p+1 df. Then the confidence ellipsoid is the set of all t such that

    (b-t)'[s^(-2) (A'A)] (b-t) ≤ q

  • More generally, consider any asymptotically normal estimator θest of a p-dimensional population parameter θ. Let Q denote its estimated (asymptotic) covariance matrix, and let q be the upper-α quantile of the χ2 distribution with p df. The confidence ellipsoid for the true θ is the set of all t such that

    (θest - t)' Q^(-1) (θest - t) ≤ q
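
A sketch of the large-n version in R: given the estimate, its estimated covariance matrix and a candidate parameter value t, we check whether t lies inside the approximate confidence ellipsoid. The regression model and the candidate value are invented for illustration.

fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(fit)
Q <- vcov(fit)                       # estimated covariance matrix of b
q <- qchisq(0.95, df = length(b))    # upper-alpha chi-squared quantile, p+1 df
t0 <- c(35, -4, -0.03)               # a hypothetical candidate value for beta
t(b - t0) %*% solve(Q) %*% (b - t0) <= q   # TRUE if t0 is in the ellipsoid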

Forming simultaneous confidence intervals

Now let's see how we can go from confidence ellipsoid to confidence intervals for scalar quantities, and in such a way that the intervals have a given simultaneous confidence level.

  • Again, consider any asymptotically normal estimator θest of a p-dimensional population parameter θ. Let Q denote its (asymptotic) covariance matrix, and form the confidence ellipsoid as above.

  • Consider a linear combination w'θ. Choose numbers bw and cw such that the lines (p=2), planes (p=3) or hyperplanes (p > 3)

    w't = bw

    and

    w't = cw

    are tangent to the ellipsoid from above and below. Say cw is the smaller of the two.

  • Let's visualize this, using the CRAN package ellipse. Its function ellipse.lm draws a confidence ellipse for the output of a linear model.

    library(ellipse)  # CRAN package
    fit <- lm(mpg ~ qsec + wt , mtcars)
    plot(ellipse(fit, which = c('qsec', 'wt'), level = 0.90), type = 'l') 
    # draw a pair of parallel tangent lines, w = (2.5,-1)'
    abline(-9.05,2.5)
    abline(-5.70,2.5)

    This is an example from the package, using a 90% confidence level.

confidence ellipse

  • With 90% probability, θ is somewhere inside the ellipsoid. (Note that it is the ellipsoid that is random, not θ.)

  • And, if θ is in the ellipsoid, then w'θ will be in (cw,bw) -- though not conversely; there are some values of θ outside the ellipse but still between the two lines. In other words, (cw,bw) is a confidence interval for w'θ of level at least 90%.

  • For each possible w, we get the above pair of tangents. Geometrically it is clear that θ will be inside the ellipsoid if and only if w'θ is in (cw,bw) for all w.

  • Thus the probability that all the intervals (cw,bw) simultaneously cover their targets w'θ is 90%.
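
Using the standard fact that those two tangent planes touch the ellipsoid at the values w'θest ± [q w'Qw]^0.5, here is a sketch of the resulting simultaneous (Scheffe-style) interval in R, based on the same fitted model as in the picture but using the full 3-dimensional ellipsoid; the particular w is arbitrary.

fit <- lm(mpg ~ qsec + wt, data = mtcars)
b <- coef(fit)
Q <- vcov(fit)
q <- qchisq(0.90, df = length(b))             # 90% ellipsoid, p = 3 here
w <- c(0, 2.5, -1)                            # an arbitrary linear combination
ctr <- sum(w * b)                             # w'b
hw <- sqrt(q * as.numeric(t(w) %*% Q %*% w))  # half-width from the tangents
c(ctr - hw, ctr + hw)                         # interval (cw, bw) for w'beta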

Simulation of MV Normal Random Vectors

Suppose we wish to generate random vectors V having a p-variate normal distribution with mean ν and covariance matrix Γ. How can we do this?

We could find a square-root matrix G for Γ. Then we generate vectors X consisting of p independent N(0,1) random variables, and set

V = G X + ν

Since X has covariance matrix equal to the identity matrix I, V will have the specified mean and covariance matrix, and it will be MV normal by linearity.

R is famous for its CRAN package collection, so it is not surprising that this operation has already been coded, first in an early package, MASS and then in others. The function MASS::mvrnorm does this.

What about regression contexts, say linear models with the standard assumptions? The model

E(Y | X = t) = β0 + β1 t1 + ... + βp tp,

Var(Y|X=t) = σ^2

can be (and in most books is) written as

Y = β0 + β1 X1 + ... + βp Xp + ε

where ε has a N(0,σ^2) distribution and is independent of X. (In the fixed-X context, the latter assumption is automatic.)

So, one can use mvrnorm to generate X, then rnorm to generate ε, thus obtaining Y.
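
A sketch of the G X + ν recipe in R, with MASS::mvrnorm as a cross-check; the ν and Γ values are made up.

library(MASS)

nu <- c(1, 2)
Gamma <- rbind(c(4, 1),
               c(1, 2))
eig <- eigen(Gamma)
G <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)  # Gamma^(1/2)

n <- 100000
X <- matrix(rnorm(2 * n), nrow = 2)   # columns are iid N(0, I) vectors
V <- G %*% X + nu                     # each column is N(nu, Gamma)
rowMeans(V)                           # approximately nu
cov(t(V))                             # approximately Gamma

V2 <- mvrnorm(n, nu, Gamma)           # the packaged equivalent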

Types of Convergence

  • A sequence of random variables Vn is said to converge in probability to a random variable V if for every ε > 0,

    limn → ∞ P(|Vn - V| > ε) = 0

  • For random vectors Vn, replace | | by, e.g., Euclidean distance.

  • Say we have random variables Qn. If for some constant c we have

    P(limn → ∞ Qn = c) = 1

    then we say Qn converges almost surely to c. (The term "almost surely" is just a fancy way of saying, "With probability 1.")

  • Example: the Strong Law of Large Numbers (SLLN). If the Qn are iid with mean μ, then the sample average converges to the distributional average: Set

    An = (1/n) (Q1+...+Qn)

    Then An converges almost surely to μ. (A small numerical illustration appears at the end of this section.)

  • If the cdfs of random variables converge, i.e. for each t at which the limiting cdf is continuous,

    limn → ∞ P(Qn ≤ t) = P(Q ≤ t)

    for some random variable Q, we say Qn converges in distribution to Q. Sometimes people omit mention of Q, simply referring to its distribution.

  • Some types of convergence imply others:

    almost sure => in probability => in distribution

  • In p dimensions, we say that Xn converges in distribution to X if the p-dimensional cdfs converge.

    That definition is hard to deal with, but the Cramer-Wold device often makes things easy: Consider a sequence of random vectors Xn and another random vector X. Then Xn converges in distribution to X if and only if for all p-vectors c, c'Xn converges in distribution to c'X.

  • Slutsky Theorem (due to Russian mathematician Evgeny Slutsky). Say Xn converges in distribution to X, and Yn converges in probability to a constant c. Then:

    • (i) Xn + Yn converges in distribution to X+c.

    • (ii) Xn Yn converges in distribution to Xc.

    • (iii) Xn / Yn converges in distribution to X/c.

    The quantities can be matrix valued, in which case u/v means u v^(-1), i.e. matrix inverse. Of course, the inverse must exist for all this to be valid.
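
As mentioned above, a quick numerical illustration of the SLLN: running sample means of iid exponential(1) variables (population mean 1) settle down to 1.

set.seed(333)
q <- rexp(100000, rate = 1)       # iid, mean 1
a <- cumsum(q) / seq_along(q)     # An for n = 1, 2, ...
a[c(10, 100, 1000, 100000)]       # drifting toward 1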

Central Limit Theorems (CLTs)

  • Univariate Central Limit Theorem: Let Xi, i = 1,2,... be iid random variables with mean μ and variance σ^2. Define Tn = X1+...+Xn, which has mean n μ and variance n σ^2. Then

    (1/n^(1/2)) σ^(-1) (Tn - n μ)

    converges in distribution to N(0,1).

    Many generalizations of the CLT exist, in which the iid assumption is relaxed, especially the second 'i'. (A small simulation illustrating the univariate CLT appears at the end of this section.)

  • Scaling by the factor n^(1/2) is "just right" to make things work here. If, say, we were to divide instead by n, the quantity would go to 0, by the SLLN. That certainly would not be helpful in terms of finding probabilities, confidence intervals and so on.

    Other powers may apply for other quantities. For example, if we are interested in the maximum of X1,...,Xn rather than their sum, then we divide by (2 log n)^(1/2) (and the limiting distribution will be something other than normal).

  • We will often encounter a sequence of random variables Qn that is asymptotically normal with mean ν and variance γ. Note carefully that that does NOT mean that E(Qn) ≈ ν, with a similar statement for variance. On the contrary, E(Qn) could be infinite or undefined. Instead, it merely means that the cdf of

    n^0.5 (Qn - ν)/γ^0.5

    converges (pointwise) to the cdf of N(0,1).

  • Multivariate Central Limit Theorem: Let Xi, i = 1,2,... be iid random vectors with mean vector μ and covariance matrix Σ. Define Tn = X1+...+Xn, which has mean n μ and covariance matrix n Σ. Then

    (1/n^(1/2)) (Tn - n μ)

    converges in distribution to N(0,Σ). This can easily be proved using the Cramer-Wold device.
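
As promised, a small simulation sketch of the univariate CLT, standardizing sums of iid exponential(1) variables (μ = σ = 1) and comparing a tail probability with the N(0,1) value.

set.seed(444)
n <- 100
z <- replicate(10000, (sum(rexp(n)) - n) / sqrt(n))   # (Tn - n mu)/(n^0.5 sigma)
mean(z <= 1.96)     # close to pnorm(1.96), about 0.975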

Example: Method of Moments (MM) Estimators

This is a serious application of the above methodology, and will take time to prepare and apply the concepts. The reader's patience is requested.

Background on MM estimators

  • Suppose we are modeling a random variable with some parametric distribution family, such as normal, exponential or gamma. We want to estimate the parameter(s) from our data X1,...,Xn. How can we do this?

  • The most famous general method is Maximum Likelihood Estimation (MLE), but its somewhat less-famous cousin, the Method of Moments (MM), is sometimes the handier one. (It is also the easier one to explain.)

  • This is best introduced by example. Say we are modeling our Xi as having an exponential distribution, i.e. they have density

    τ exp(-τt)

    for some unknown parameter τ. How do we construct the MM estimate T of τ?

    The mean of this distribution is 1/τ, which could be estimated by 1/T once we obtain T. On the other hand, the classical estimate of a population mean is the sample mean,

    Xbar = (1/n) ( X1+...+Xn)

    So set our estimated mean under the exponential assumption, 1/T, equal to our general estimated mean, formed without that assumption:

    1/T = Xbar

    That gives us our MM estimate,

    T = 1/Xbar

    Done!

  • How does MM work in general, when we have more than one parameter? Let's use k to denote the number of parameters in the given parametric family. For instance, k = 2 for the univariate normal family, corresponding to the two parameters μ and σ^2. We then must bring in E(X^2) and so on, as follows.

  • The r-th moment of a random variable X is defined to be mr = E(X^r), r = 1,2,.... We will need to use the first k moments.

    Note that this is a population value; its intuitive sample estimate is

    Mr = (1/n) (X1^r+...+Xn^r)

  • Alternatively, for k > 1 one may use the r-th central moment,

    E[(X-E(X))^r]

    Since for r = 2 this is Var(X) and for r = 3 we get the skewness of the distribution, things may be more convenient this way.

    It will be clear from context whether we mean the central or noncentral moment.

  • Use τ to denote the population value of the parameter, and T to denote its sample estimate under MM. Both are vectors if k > 1.

  • Before continuing, let's review what happened when we set 1/T = Xbar above. The left-hand side is our estimate of m1 under the exponential model, while the right-hand side is a general estimator of m1, not assuming that model. This is how the Method of Moments works, by matching the two. In general we will have k such equations, which we then solve for T.

  • For an example with k = 2, consider the gamma distribution family:

    c τ1^τ2 t^(τ2 - 1) exp(-τ1 t)

    where c is a constant to make the density integrate to 1.0. This is a common model used in network communications and medical survival analysis for example.

    Here is what the gamma family of densities looks like. (From Probability and Statistics for Data Science: Math + R + Data, N. Matloff, 2019. Here r and λ refer to our τ2 and τ1.)

    Some densities from the gamma family

    The model is handy when one has a distribution on (0,∞), assumed unimodal.

  • The population moments are

    E(X) = τ2 / τ1

    and

    Var(X) = τ2 / τ1^2

  • So, we equate:

    M1 = T2 / T1

    and (using the central moment for M2)

    M2 = T2 / T1^2

  • These are nonlinear equations, but we are lucky here; they are solvable from simple algebra (otherwise we would need to use numerical approximation methods):

    T1 = M1 / M2

    and

    T2 = M1 T1 = M1^2 / M2

    (An R check of both the exponential and gamma examples appears at the end of this subsection.)
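
As mentioned above, here is a short R check of both MM examples, on simulated data with parameter values chosen arbitrarily.

set.seed(555)

# exponential example: true tau = 2, MM estimate T = 1/Xbar
x <- rexp(10000, rate = 2)
1 / mean(x)                       # close to 2

# gamma example: true tau1 (rate) = 3, tau2 (shape) = 5
x <- rgamma(10000, shape = 5, rate = 3)
M1 <- mean(x)                     # first moment
M2 <- mean((x - M1)^2)            # second central moment
c(T1 = M1 / M2, T2 = M1^2 / M2)   # close to (3, 5)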

Consistency

  • Say we have an estimator θn for a parameter θ based on a sample of size n. As n goes to ∞, we would like our estimator to have the property that θn goes to θ. If it does so almost surely, we say it is strongly consistent; if the convergence is just in probability, it is known as weak consistency.

    Since the SLLN implies that Mi is strongly consistent for mi, the Ti are strongly consistent too, as long as they are continuous functions of the Mi. In the example above, for instance, M1 and M2 converge almost surely, hence the Ti do too.

Asymptotic normality

  • OK, we now have estimators for τ1 and τ2, and they are consistent. But the latter is a very weak property. For good practical value, it would be desirable to obtain confidence intervals for the τi from the Ti. For this we need asymptotic normality, as follows.

  • For simplicity, let's consider the case k = 1, so the density family has a single scalar parameter, τ, as we saw in the exponential example above. Then E(X) is some function g(τ). In the exponential example, g(τ) = 1/τ.

    Assume g is differentiable with a continuous derivative g1. Then Taylor's Theorem from calculus says

    g(T) = g(τ) + g1(Tbetw) (T - τ)

    for some Tbetw between τ and T.

  • Now to set up using the CLT, recall that we will set

    g(T) = M1

    Then write

    n^0.5 (M1 - m1) = n^0.5 [g(T) - g(τ)] = n^0.5 [g1(Tbetw) (T - τ)]

    and then

    n^0.5 (T - τ) = n^0.5 (M1 - m1) / g1(Tbetw)

  • We know that T is a strongly consistent estimator of τ, so by the continuity of g1, the denominator in the RHS converges almost surely to g1(τ). Applying the Slutsky Theorem and the CLT (M1 is a sum of iid terms), we see that the RHS is asymptotically normal (though not with standard deviation 1).

  • In other words

    n^0.5 (T - τ) =asymp. n^0.5 (M1 - m1) / g1(τ)

    Here =asymp. should not be viewed as "asymptotically equal to," but rather "having the same asymptotic distribution as."

    The variance in that asymptotically normal distribution will be

    Var(X)/[g1(τ)]^2

  • To compute a CI, we replace the quantities by their sample analogs. E.g. our estimate of Var(X) is

    s^2 = (1/n) Σi [Xi - M1]^2

    (Divide by n-1 instead of n if you prefer, though there really is no reason to do so.)

    Our estimate of g1(τ) is

    g1(T) = -1/T^2

    Our CI, say for a 95% confidence level, is then

    T ± 1.96 s T^2 / n^0.5

    (A simulation check of this interval appears at the end of this section.)

  • For k = 2, we now have two functions g and h, with m1 = g(τ) and m2 = h(τ). Their derivatives g1 and h1 are now gradient vectors, and the first-order approximations become

    n^0.5 (M1 - m1) =asymp. n^0.5 g1' (T - τ)

    n^0.5 (M2 - m2) =asymp. n^0.5 h1' (T - τ)

    Write M = (M1,M2)' and similarly for m. Then

    n^0.5 (M - m) =asymp. n^0.5 A (T - τ)

    where the matrix A consists of g1' in row 1 and h1' in row 2.

  • So, finally, we have that the asymptotic distribution of

    n^0.5 (T - τ) is MV normal with covariance matrix

    A^(-1) C A^(-1)'

    where C is the covariance matrix in the asymptotic distribution of n^0.5 (M - m). This enables confidence intervals and ellipsoids as before. Of course, the matrix A must be estimated, by substituting T for τ. Again, this can be justified using the Slutsky properties.
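
As mentioned above, a simulation sketch checking the coverage of the k = 1 interval, for exponential data; the value τ = 2 and the sample size are arbitrary.

set.seed(666)
tau <- 2
n <- 500
covered <- replicate(5000, {
   x <- rexp(n, rate = tau)
   Tn <- 1 / mean(x)                   # the MM estimate T
   s <- sqrt(mean((x - mean(x))^2))    # divisor-n estimate of sd(X)
   hw <- 1.96 * s * Tn^2 / sqrt(n)     # half-width from the formula above
   abs(Tn - tau) <= hw
})
mean(covered)                          # close to 0.95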

The Delta Method

  • Earlier we said, "if a sequence of random variables Zn (including the case of random vectors) is asymptotically normal, then so is any smooth function g(Zn) of it." This fact is quite useful.

  • Before we go on, note again that "asymptotically normal" means in the sense of the CLT. So if we say "g(Zn) is asymptotically normal," we mean that its cdf is approximately equal to a normal cdf, and we must scale by n^0.5.

  • The proof is similar to our derivation for MM estimators above. E.g. for dimension 2, let μ = (μ1,μ2)' denote the mean of the asymptotic normal distribution, with Σ representing the corresponding covariance matrix.

    Now write

    g(Zn) - g(μ) ≈ g1(μ)(Z1n - μ1) + g2(μ)(Z2n - μ2) = A (Zn - μ)

    where the 1 x 2 matrix A = (g1(μ), g2(μ)), with g1 and g2 denoting the two partial derivatives of g.

    The right-hand side is now a linear form in an asymptotically MV normally distributed vector, thus asymptotically MV (actually univariate) normal, with mean 0 and variance A Σ A'. Again, A must be estimated, as we saw in the MM derivation above.
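
A one-dimensional sketch of the Delta Method in R, with g(x) = log(x) applied to a sample mean of exponential(1) data (both choices arbitrary): the delta-method standard error closely matches the standard deviation seen in simulation.

set.seed(777)
n <- 400
sim <- replicate(20000, {
   x <- rexp(n)                        # mean 1, variance 1
   xbar <- mean(x)
   c(log(xbar),                        # g(Zn)
     (1 / xbar) * sd(x) / sqrt(n))     # |g'(xbar)| times the SE of xbar
})
sd(sim[1, ])      # simulated SD of g(Xbar)
mean(sim[2, ])    # average delta-method SE, about the same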
