The method of '''least squares''' is a standard approach in [[regression analysis|regression analysis]] to approximate the solution of [[overdetermined system|overdetermined system]]s (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the [[Residuals (statistics)|residuals]] (a residual being the difference between an observed value and the fitted value provided by a model) made in the results of each individual equation.
The most important application is in [[curve fitting|data fitting]]. When the problem has substantial uncertainties in the independent variable (the <math>X</math> variable), then simple regression and least-squares methods have problems; in such cases, the methodology required for fitting [[errors-in-variables models|errors-in-variables models]] may be considered instead of that for least squares.
Least-squares problems fall into two categories: linear or [[ordinary least squares|ordinary least squares]] and [[nonlinear least squares|nonlinear least squares]], depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical [[regression analysis|regression analysis]]; it has a closed-form solution. The nonlinear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.
The following discussion is mostly presented in terms of linear functions but the use of least squares is valid and practical for more general families of functions. Also, by iteratively applying local quadratic approximation to the likelihood (through the [[Fisher information|Fisher information]]), the least-squares method may be used to fit a [[generalized linear model|generalized linear model]].
The least-squares method was officially discovered and published by [[Adrien-Marie Legendre|Adrien-Marie Legendre]] (1805),<ref>Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares"</ref> though it is usually also co-credited to [[Carl Friedrich Gauss|Carl Friedrich Gauss]] (1795)<ref name=brertscher>{{cite book |last=Bretscher |first=Otto |title=Linear Algebra With Applications |edition=3rd |publisher=Prentice Hall |year=1995 |location=Upper Saddle River, NJ}}</ref><ref>{{cite journal |first=Stephen M. |last=Stigler |year=1981 |title=Gauss and the Invention of Least Squares |journal=Ann. Stat. |volume=9 |issue=3 |pages=465–474 |doi=10.1214/aos/1176345451 |doi-access=free }}</ref> who contributed significant theoretical advances to the method and may have previously used it in his work.<ref>Britannica, "Least squares method"</ref><ref>R. L. Plackett, "Studies in the History of Probability and Statistics. XXIX: The Discovery of the Method of Least Squares"</ref>
==Problem statement==
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of <math>n</math> points (data pairs) <math>(x_i,y_i)\!</math>, <math>i=1,\ldots,n</math>, where <math>x_i</math> is an independent variable and <math>y_i</math> is a [[dependent variable|dependent variable]] whose value is found by observation. The model function has the form <math>f(x, \boldsymbol \beta)</math>, where <math>m</math> adjustable parameters are held in the vector <math>\boldsymbol \beta</math>. The goal is to find the parameter values for the model that "best" fits the data. The fit of a model to a data point is measured by its [[errors and residuals in statistics|residual]], defined as the difference between the observed value of the dependent variable and the value predicted by the model:
<math display="block">r_i = y_i - f(x_i, \boldsymbol \beta).</math>[[File:Linear Residual Plot Graph.png|thumb|The residuals are plotted against corresponding <math>x</math> values. The random fluctuations about <math>r_i=0</math> indicate a linear model is appropriate.|251x251px]]
The least-squares method finds the optimal parameter values by minimizing the [[sum of squared residuals|sum of squared residuals]], <math>S</math>:<ref name=":0" />
<math display="block">S=\sum_{i=1}^{n}r_i^2.</math>
An example of a model in two dimensions is that of the straight line. Denoting the y-intercept as <math>\beta_0</math> and the slope as <math>\beta_1</math>, the model function is given by <math>f(x,\boldsymbol \beta)=\beta_0+\beta_1 x</math>. See [[#Linear Least Squares|linear least squares]] for a fully worked out example of this model.
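To make the notation concrete, the short Python sketch below evaluates the residuals <math>r_i</math> and the objective <math>S</math> for the straight-line model just described. It is illustrative only; the data values are invented and are not part of the text above.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data set (x_i, y_i), i = 1..n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def residuals(beta0, beta1):
    """r_i = y_i - f(x_i, beta) for the straight-line model f(x) = beta0 + beta1*x."""
    return y - (beta0 + beta1 * x)

def S(beta0, beta1):
    """Sum of squared residuals, the quantity minimized over (beta0, beta1)."""
    return np.sum(residuals(beta0, beta1) ** 2)

print(S(0.0, 2.0))   # objective for one candidate parameter vector
print(S(0.1, 1.98))  # a slightly different candidate gives a different S
</syntaxhighlight>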
A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, <math>x</math> and <math>z</math>, say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.
==Limitations==
This regression formulation considers only observational errors in the dependent variable (but the alternative [[total least squares|total least squares]] regression can account for errors in both variables). There are two rather different contexts with different implications:
*Regression for prediction. Here a model is fitted to provide a prediction rule for application in a similar situation to which the data used for fitting apply. Here the dependent variables corresponding to such future application would be subject to the same types of observation error as those in the data used for fitting. It is therefore logically consistent to use the least-squares prediction rule for such data.
*Regression for fitting a "true relationship". In standard [[regression analysis|regression analysis]] that leads to fitting by least squares there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible. When errors in the independent variable are non-negligible, [[Errors-in-variables models|models of measurement error]] can be used; such methods can lead to [[parameter estimation|parameter estimates]], [[hypothesis testing|hypothesis testing]] and [[confidence interval|confidence interval]]s that take into account the presence of observation errors in the independent variables.<ref>For a good introduction to error-in-variables, please see {{cite book |last=Fuller |first=W. A. |author-link=Wayne Fuller |title=Measurement Error Models |publisher=John Wiley & Sons |year=1987 |isbn=978-0-471-86187-4 }}</ref> An alternative approach is to fit a model by [[total least squares|total least squares]]; this can be viewed as taking a pragmatic approach to balancing the effects of the different sources of error in formulating an objective function for use in model-fitting.
==Linear Least Squares==
'''Linear least squares''' ('''LLS''') is the [[least squares approximation|least squares approximation]] of [[linear functions|linear functions]] to data.
It is a set of formulations for solving statistical problems involved in [[linear regression|linear regression]], including variants for [[#Ordinary Least Squares (OLS)|ordinary]] (unweighted), [[Weighted least squares|weighted]], and [[Generalized least squares|generalized]] (correlated) [[residuals (statistics)|residuals]].
===Main formulations===
The three main linear least squares formulations are:
{| class="table table-bordered" | |||
|- | |||
! Formulation !! Description | |||
|- | |||
| '''[[Ordinary least squares|Ordinary least squares]]''' (OLS) || The most common estimator. OLS estimates are commonly used to analyze both [[experiment|experiment]]al and [[observational study|observational]] data. The OLS method minimizes the sum of squared [[Errors and residuals in statistics|residuals]], and leads to a closed-form expression for the estimated value of the unknown parameter vector <math>\beta</math>:
<math display="block"> | <math display="block"> | ||
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y}, | \hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y}, | ||
</math> where <math>\mathbf{y}</math> is a vector whose <math>i</math>th element is the <math>i</math>th observation of the [[ | </math> where <math>\mathbf{y}</math> is a vector whose <math>i</math>th element is the <math>i</math>th observation of the [[dependent variable|dependent variable]], and <math>\mathbf{X}</math> is a matrix whose <math>ij</math> element is the <math>i</math>th observation of the <math>j</math>th independent variable. The estimator is [[bias of an estimator|unbiased]] and [[consistent estimator|consistent]] if the errors have finite variance and are uncorrelated with the regressors:<ref>{{cite journal | last1=Lai | first1=T.L. | last2=Robbins | first2=H. | last3=Wei | first3=C.Z. | journal=[[Proceedings of the National Academy of Sciences|PNAS]] | year=1978 | volume=75 | title=Strong consistency of least squares estimates in multiple regression | issue=7 | pages=3034–3036 | doi= 10.1073/pnas.75.7.3034 | pmid=16592540 | jstor=68164 | bibcode=1978PNAS...75.3034L | pmc=392707 | doi-access=free }}</ref> <math display="block"> | ||
\operatorname{E}[\,\mathbf{x}_i\varepsilon_i\,] = 0, | \operatorname{E}[\,\mathbf{x}_i\varepsilon_i\,] = 0, | ||
</math> where <math>\mathbf{x}_i</math> is the transpose of row <math>i</math> of the matrix <math>\mathbf{X}.</math> It is also [[efficiency (statistics)|efficient]] under the assumption that the errors have finite variance and are [[Homoscedasticity|homoscedastic]], meaning that <math>\operatorname{E}[\varepsilon_i^2|x_i]</math> does not depend on <math>i</math>. The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data, it is difficult to exclude the possibility of an omitted covariate <math>z</math> that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the response variable, and hence to an inconsistent estimator of <math>\beta</math>. The condition of homoscedasticity can fail with either experimental or observational data. If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if [[multicollinearity|multicollinearity]] is present, unless the sample size is large. (A numerical sketch of this estimator is given below the table.)
|-
| '''[[Weighted least squares|Weighted least squares]]''' (WLS) || Used when [[heteroscedasticity|heteroscedasticity]] is present in the error terms of the model.
|-
| '''[[Generalized least squares|Generalized least squares]]''' (GLS) || An extension of the OLS method that allows efficient estimation of <math>\beta</math> when either [[heteroscedasticity|heteroscedasticity]], or correlations, or both are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue to the sum of squared residuals from OLS regression, where the weight for the <math>i</math><sup>th</sup> case is inversely proportional to <math>\operatorname{Var}(\varepsilon_i)</math>. This special case of GLS is called "weighted least squares". The GLS solution to an estimation problem is <math display="block">
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T} \boldsymbol\Omega^{-1} \mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\boldsymbol\Omega^{-1}\mathbf{y},
</math> where <math>\Omega</math> is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. For GLS to be applied, the covariance structure of the errors must be known up to a multiplicative constant.
|}
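The closed-form expressions in the table can be checked numerically. The sketch below is illustrative only: it simulates data (all variable names and values are assumptions made for the example) and evaluates the OLS formula <math>(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}</math> as well as the GLS formula with a known diagonal <math>\boldsymbol\Omega</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])

# OLS with homoscedastic, uncorrelated errors
y = X @ beta_true + rng.normal(scale=0.3, size=n)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'y without forming the inverse

# GLS with a known diagonal covariance Omega (heteroscedastic, uncorrelated errors)
omega_diag = np.linspace(0.1, 2.0, n) ** 2          # Var(eps_i), assumed known up to a constant
y_het = X @ beta_true + rng.normal(scale=np.sqrt(omega_diag))
Omega_inv = np.diag(1.0 / omega_diag)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y_het)

print(beta_ols, beta_gls)   # both should be close to beta_true
</syntaxhighlight>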
===Properties===
If the experimental errors, <math>\varepsilon</math>, are uncorrelated, have a mean of zero and a constant variance, <math>\sigma^2</math>, the [[Gauss–Markov theorem|Gauss–Markov theorem]] states that the least-squares estimator, <math>\hat{\boldsymbol{\beta}}</math>, has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical [[Cumulative distribution function|distribution function]] of the errors. In other words, ''the distribution function of the errors need not be a [[normal distribution|normal distribution]]''. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.
For example, it is easy to show that the [[arithmetic mean|arithmetic mean]] of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a [[maximum likelihood|maximum likelihood]] estimator.<ref>{{cite book |title=The Mathematics of Physics and Chemistry |last=Margenau |first=Henry | author2=Murphy, George Moseley |year=1956 | publisher=Van Nostrand |location=Princeton |url=https://archive.org/details/mathematicsofphy0002marg| url-access=registration }}</ref>
These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
===Limitations===
An assumption underlying the treatment given above is that the independent variable, <math>X</math>, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, [[total least squares|total least squares]] or more generally errors-in-variables models, or ''rigorous least squares'', should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.<ref name="pg">{{cite book |title=Data fitting in the Chemical Sciences |last=Gans |first=Peter |year=1992 |publisher=Wiley |location=New York |isbn=978-0-471-93412-7 }}</ref><ref>{{cite book |title=Statistical adjustment of Data |last=Deming |first=W. E. |year=1943 |publisher=Wiley | location=New York }}</ref>
In some cases the (weighted) normal equations matrix <math>X^{\mathsf{T}}X</math> is [[ill-conditioned|ill-conditioned]]. When fitting polynomials the normal equations matrix is a [[Vandermonde matrix|Vandermonde matrix]]. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various [[regularization (mathematics)|regularization]] techniques can be applied in such cases, the most common of which is called [[Tikhonov regularization|ridge regression]]. If further information about the parameters is known, for example, a range of possible values of <math>\mathbf{\hat{\boldsymbol{\beta}}}</math>, then various techniques can be used to increase the stability of the solution. For example, see [[#Constrained_linear_least_squares|constrained least squares]].
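As a rough illustration of this ill-conditioning (a sketch with simulated data, not part of the original discussion), the following fits a polynomial through a Vandermonde design matrix and compares the plain normal-equations solution with a ridge-regularized one.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.size)

degree = 10
X = np.vander(x, degree + 1)            # Vandermonde design matrix, badly conditioned
print(np.linalg.cond(X.T @ X))          # very large condition number

# Plain least squares via the normal equations: prone to amplifying noise
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (Tikhonov) regularization: adds lambda*I to the normal matrix
lam = 1e-6
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The regularized coefficient vector typically has a much smaller norm
print(np.linalg.norm(beta_ls), np.linalg.norm(beta_ridge))
</syntaxhighlight>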
Another drawback of the least squares estimator is the fact that the norm of the residuals, <math>\| \mathbf y - X\hat{\boldsymbol{\beta}} \|</math> is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter <math>\mathbf{\hat{\boldsymbol{\beta}}}</math>, e.g., a small value of <math>\|{\boldsymbol{\beta}}-\hat{\boldsymbol{\beta}}\|</math>. However, since the true parameter <math>{\boldsymbol{\beta}}</math> is necessarily unknown, this quantity cannot be directly minimized. If a [[prior probability|prior probability]] on <math>{\boldsymbol{\beta}}</math> is known, then a [[Minimum mean square error|Bayes estimator]] can be used to minimize the [[mean squared error|mean squared error]], <math>E \left\{ \| {\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}} \|^2 \right\} </math>. The least squares method is often applied when no prior is known. Surprisingly, when several parameters are being estimated jointly, better estimators can be constructed, an effect known as [[Stein's phenomenon|Stein's phenomenon]]. For example, if the measurement error is [[Normal distribution|Gaussian]], several estimators are known which [[dominating decision rule|dominate]], or outperform, the least squares technique; the best known of these is the [[James–Stein estimator|James–Stein estimator]]. This is an example of more general [[shrinkage estimator|shrinkage estimator]]s that have been applied to regression problems.
==Ordinary Least Squares (OLS)==
In [[statistics|statistics]], '''ordinary least squares''' ('''OLS''') is a type of [[linear least squares|linear least squares]] method for estimating the unknown [[statistical parameter|parameters]] in a [[linear regression|linear regression]] model. OLS chooses the parameters of a [[linear function|linear function]] of a set of [[explanatory variable|explanatory variable]]s by the principle of [[least squares|least squares]]: minimizing the sum of the squares of the differences between the observed [[dependent variable|dependent variable]] (values of the variable being observed) in the given [[dataset|dataset]] and those predicted by the linear function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting [[Statistical estimation|estimator]] can be expressed by a simple formula, especially in the case of a [[simple linear regression|simple linear regression]], in which there is a single [[regressor|regressor]] on the right side of the regression equation.
The OLS estimator is [[consistent estimator|consistent]] when the regressors are [[exogenous|exogenous]], and—by the [[Gauss–Markov theorem|Gauss–Markov theorem]]—[[best linear unbiased estimator|optimal in the class of linear unbiased estimators]] when the [[statistical error|error]]s are [[homoscedastic|homoscedastic]] and [[autocorrelation|serially uncorrelated]]. Under these conditions, the method of OLS provides [[UMVU|minimum-variance mean-unbiased]] estimation when the errors have finite [[variance|variance]]s. Under the additional assumption that the errors are [[normal distribution|normally distributed]], OLS is the [[maximum likelihood estimator|maximum likelihood estimator]].
=== Linear model ===
[[File:Okuns law quarterly differences.svg|300px|thumb|[[Okun's law|Okun's law]] in [[macroeconomics|macroeconomics]] states that in an economy the [[GDP|GDP]] growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.]]
Suppose the data consists of <math>n</math> [[statistical unit|observations]] <math>\left\{\mathbf{x}_i, y_i\right\}_{i=1}^n</math>. Each observation <math>i</math> includes a scalar response <math>y_i</math> and a column vector <math>\mathbf{x}_i</math> of <math>p</math> regressors, i.e., <math>\mathbf{x}_i=\left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^\mathsf{T}</math>. In a [[linear regression model|linear regression model]], the response variable, <math>y_i</math>, is a linear function of the regressors:
<math display="block">y_i = \beta_1\ x_{i1} + \beta_2\ x_{i2} + \cdots + \beta_p\ x_{ip} + \varepsilon_i,</math> | <math display="block">y_i = \beta_1\ x_{i1} + \beta_2\ x_{i2} + \cdots + \beta_p\ x_{ip} + \varepsilon_i,</math> | ||
or in [[ | or in [[row and column vectors|vector]] form, | ||
<math display = "block"> y_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i, \, </math> | <math display = "block"> y_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i, \, </math> | ||
where <math>\mathbf{x}_i</math>, as introduced previously, is a column vector of the <math>i</math>-th observation of all the explanatory variables; <math>\boldsymbol{\beta}</math> is a <math>p \times 1</math> vector of unknown parameters; and the scalar <math>\varepsilon_i</math> represents an unobserved random variable ([[errors and residuals in statistics|error]]) of the <math>i</math>-th observation. <math>\varepsilon_i</math> accounts for the influences upon the responses <math>y_i</math> from sources other than the explanators <math>\mathbf{x}_i</math>. This model can also be written in matrix notation as
<math display = "block"> \mathbf{y} = \mathrm{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \, </math> | <math display = "block"> \mathbf{y} = \mathrm{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \, </math> | ||
where <math>\mathbf{y}</math> and <math>\boldsymbol{\varepsilon}</math> are <math>n \times 1</math> vectors of the response variables and the errors of the <math> n </math> observations, and <math>\mathrm{X}</math> is an <math>n \times p</math> matrix of regressors, also sometimes called the [[ | where <math>\mathbf{y}</math> and <math>\boldsymbol{\varepsilon}</math> are <math>n \times 1</math> vectors of the response variables and the errors of the <math> n </math> observations, and <math>\mathrm{X}</math> is an <math>n \times p</math> matrix of regressors, also sometimes called the [[design matrix|design matrix]], whose row <math>i</math> is <math>\mathbf{x}_i^\mathsf{T}</math> and contains the <math>i</math>-th observations on all the explanatory variables. | ||
As a rule, the constant term is always included in the set of regressors <math>\mathrm{X}</math>, say, by taking <math>x_{i1}=1</math> for all <math>i=1, \dots, n</math>. The coefficient <math>\beta_1</math> corresponding to this regressor is called the ''intercept''. | As a rule, the constant term is always included in the set of regressors <math>\mathrm{X}</math>, say, by taking <math>x_{i1}=1</math> for all <math>i=1, \dots, n</math>. The coefficient <math>\beta_1</math> corresponding to this regressor is called the ''intercept''. | ||
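This convention can be made explicit in code. The following minimal sketch (with invented data, used only for illustration) builds the design matrix with a leading column of ones so that <math>\beta_1</math> plays the role of the intercept.
<syntaxhighlight lang="python">
import numpy as np

# Two observed regressors for n = 5 hypothetical observations
z1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
z2 = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Design matrix X: x_{i1} = 1 for all i, so beta_1 is the intercept and p = 3
X = np.column_stack([np.ones_like(z1), z1, z2])
print(X.shape)   # (n, p) = (5, 3)
</syntaxhighlight>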
====Matrix/vector formulation====
Consider an [[overdetermined system|overdetermined system]]
<math display="block">\sum_{j=1}^{p} X_{ij} \beta_j = y_i,\ (i=1, 2, \dots, n),</math>
of <math> n </math> [[linear equation|linear equation]]s in <math>p</math> unknown [[coefficients|coefficients]], <math> \beta_1, \beta_2, \dots, \beta_p </math>, with <math> n > p </math>. (Note: for a linear model as above, not all elements in <math> \mathrm{X} </math> contain information on the data points. The first column is populated with ones, <math>X_{i1} = 1</math>. Only the other columns contain actual data. So here <math>p</math> is equal to the number of regressors plus one.) This can be written in [[matrix (mathematics)|matrix]] form as
<math display="block">\mathrm{X} \boldsymbol{\beta} = \mathbf {y},</math> | <math display="block">\mathrm{X} \boldsymbol{\beta} = \mathbf {y},</math> | ||
where
<math display="block">\mathrm{X}=\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix} , \qquad
\boldsymbol \beta = \begin{bmatrix}
\beta_1 \\ \beta_2 \\ \vdots \\ \beta_p
\end{bmatrix} , \qquad
\mathbf y = \begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{bmatrix}. </math>
Such a system usually has no exact solution, so the goal is instead to find the coefficients <math>\boldsymbol{\beta}</math> which fit the equations "best", in the sense of solving the quadratic minimization problem
<math display="block">\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,S(\boldsymbol{\beta}), </math>
where the objective function <math>S</math> is given by
<math display="block">S(\boldsymbol{\beta}) = \sum_{i=1}^n \biggl| y_i - \sum_{j=1}^p X_{ij}\beta_j\biggr|^2 = \bigl\|\mathbf y - \mathrm{X} \boldsymbol \beta \bigr\|^2.</math> | <math display="block">S(\boldsymbol{\beta}) = \sum_{i=1}^n \biggl| y_i - \sum_{j=1}^p X_{ij}\beta_j\biggr|^2 = \bigl\|\mathbf y - \mathrm{X} \boldsymbol \beta \bigr\|^2.</math> | ||
A justification for choosing this criterion is given in [[#Properties|Properties]] below. This minimization problem has a unique solution, provided that the <math>p</math> columns of the matrix <math> \mathrm{X} </math> are [[linearly independent|linearly independent]], given by solving the so-called ''normal equations'':
<math display="block">(\mathrm{X}^{\mathsf T} \mathrm{X} )\hat{\boldsymbol{\beta}} = \mathrm{X}^{\mathsf T} \mathbf y\ .</math>
The matrix <math>\mathrm{X}^{\mathsf T} \mathrm X</math> is known as the ''normal matrix'' or [[Gram matrix|Gram matrix]] and the matrix <math>\mathrm{X}^{\mathsf T} \mathbf y</math> is known as the [[moment matrix|moment matrix]] of regressand by regressors.<ref>{{cite book |first=Arthur S. |last=Goldberger |author-link=Arthur Goldberger |chapter=Classical Linear Regression |title=Econometric Theory |location=New York |publisher=John Wiley & Sons |year=1964 |isbn=0-471-31101-4 |pages=[https://archive.org/details/econometrictheor0000gold/page/158 158] |chapter-url=https://books.google.com/books?id=KZq5AAAAIAAJ&pg=PA156 |url=https://archive.org/details/econometrictheor0000gold/page/158 }}</ref> Finally, <math>\hat{\boldsymbol{\beta}}</math> is the coefficient vector of the least-squares [[hyperplane|hyperplane]], expressed as
<math display="block">\hat{\boldsymbol{\beta}}= \left( \mathrm{X}^{\mathsf T} \mathrm{X} \right)^{-1} \mathrm{X}^{\mathsf T} \mathbf y.</math>
== Estimation ==
Suppose <math>b</math> is a "candidate" value for the parameter vector <math>\beta</math>. The quantity <math>y_i-x_i^{\mathsf{T}}b</math>, called the '''[[ | Suppose <math>b</math> is a "candidate" value for the parameter vector <math>\beta</math>. The quantity <math>y_i-x_i^{\mathsf{T}}b</math>, called the '''[[errors and residuals in statistics|residual]]''' for the <math>i</math>-th observation, measures the vertical distance between the data point <math>(x_i,y_i)</math> and the hyperplane <math>y=X^{\mathsf{T}}b</math>, and thus assesses the degree of fit between the actual data and the model. The '''sum of squared residuals''' ('''SSR''') (also called the '''error sum of squares''' ('''ESS''') or '''residual sum of squares''' ('''RSS'''))<ref>{{cite book |last=Hayashi |first=Fumio |author-link=Fumio Hayashi |title=Econometrics |publisher=Princeton University Press |year=2000 |page=15 }}</ref> is a measure of the overall model fit: | ||
<math display = "block"> | <math display = "block"> | ||
S(b) = \sum_{i=1}^n (y_i - x_i ^\mathrm{T} b)^2 = (y-Xb)^\mathrm{T}(y-Xb), | S(b) = \sum_{i=1}^n (y_i - x_i ^\mathrm{T} b)^2 = (y-Xb)^\mathrm{T}(y-Xb), | ||
</math> | </math> | ||
where <math>\mathsf{T}</math> denotes the matrix [[transpose|transpose]], and the rows of <math>X</math>, denoting the values of all the independent variables associated with a particular value of the dependent variable, are ''X<sub>i</sub> = x<sub>i</sub>''<sup>T</sup>. The value of ''b'' which minimizes this sum is called the '''OLS estimator for <math>\beta</math>'''. The function <math>S(b)</math> is quadratic in <math>b</math> with positive-definite [[Hessian matrix|Hessian]], and therefore this function possesses a unique global minimum at <math>b =\hat\beta</math>, which can be given by the explicit formula:<ref>{{harvtxt|Hayashi|2000|loc=page 18}}</ref>
<math display = "block"> | <math display = "block"> | ||
Line 181: | Line 162: | ||
</math> | </math> | ||
The product <math>N=X^{\mathsf{T}}X</math> is a [[Gram matrix|Gram matrix]] and its inverse, <math>Q=N^{-1}</math>, is the ''cofactor matrix'' of <math>\beta</math>,<ref>{{Cite book|url=https://books.google.com/books?id=hZ4mAOXVowoC&pg=PA160|title=Adjustment Computations: Spatial Data Analysis|isbn=9780471697282|last1=Ghilani|first1=Charles D.|last2=Paul r. Wolf|first2=Ph. D.|date=12 June 2006}}</ref><ref>{{Cite book|url=https://books.google.com/books?id=Np7y43HU_m8C&pg=PA263|title=GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more|isbn=9783211730171|last1=Hofmann-Wellenhof|first1=Bernhard|last2=Lichtenegger|first2=Herbert|last3=Wasle|first3=Elmar|date=20 November 2007}}</ref><ref>{{Cite book|url=https://books.google.com/books?id=peYFZ69HqEsC&pg=PA134|title=GPS: Theory, Algorithms and Applications|isbn=9783540727156|last1=Xu|first1=Guochang|date=5 October 2007}}</ref> closely related to its [[#Covariance matrix|covariance matrix]], <math>C_{\beta}</math>.
The matrix <math display = "block">(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}=QX^{\mathsf{T}}</math> is called the [[ | The matrix <math display = "block">(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}=QX^{\mathsf{T}}</math> is called the [[Moore–Penrose pseudoinverse|Moore–Penrose pseudoinverse]] matrix of <math>X</math>. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect [[multicollinearity|multicollinearity]] between the explanatory variables (which would cause the gram matrix to have no inverse). | ||
After we have estimated <math>\beta</math>, the '''fitted values''' (or '''predicted values''') from the regression will be
<math display = "block">
\hat{y} = X\hat\beta = Py,
</math>
where <math display = "block">P=X(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}</math> is the [[ | where <math display = "block">P=X(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}</math> is the [[projection matrix|projection matrix]] onto the space <math>V</math> spanned by the columns of<math>X</math>. This matrix <math>P</math> is also sometimes called the [[hat matrix|hat matrix]] because it "puts a hat" onto the variable <math>y</math>. Another matrix, closely related to <math>P</math> is the ''annihilator'' matrix <math>M=I_n-P</math>; this is a projection matrix onto the space orthogonal to <math>V</math>. Both matrices <math>P</math> and <math>M</math> are [[symmetric matrix|symmetric]] and [[idempotent matrix|idempotent]] (meaning that <math>P^2 = P</math> and <math>M^2=M</math>), and relate to the data matrix <math>X</math> via identities <math>PX=X</math> and <math>MX=0</math>.<ref name="Hayashi 2000 loc=page 19">{{harvtxt|Hayashi|2000|loc=page 19}}</ref> Matrix <math>M</math> creates the '''residuals''' from the regression: | ||
<math display = "block"> | <math display = "block"> | ||
\hat\varepsilon = y - \hat y = y - X\hat\beta = My = M(X\beta+\varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon. | \hat\varepsilon = y - \hat y = y - X\hat\beta = My = M(X\beta+\varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon. | ||
</math> | </math> | ||
Using these residuals we can estimate the value of <math>\sigma^2</math> using the '''[[reduced chi-squared|reduced chi-squared]]''' statistic:
<math display = "block">
s^2 = \frac{\hat\varepsilon ^\mathrm{T} \hat\varepsilon}{n-p} = \frac{(My)^\mathrm{T} My}{n-p} = \frac{y^\mathrm{T} M^\mathrm{T}My}{n-p}= \frac{y ^\mathrm{T} My}{n-p} = \frac{S(\hat\beta)}{n-p},\qquad
\hat\sigma^2 = \frac{n-p}{n}\;s^2
</math>
The denominator, <math>n-p</math>, is the [[Degrees of freedom (statistics)|statistical degrees of freedom]]. The first quantity, <math>s^2</math>, is the OLS estimate for <math>\sigma^2</math>, whereas the second, <math style="vertical-align:0">\scriptstyle\hat\sigma^2</math>, is the MLE estimate for <math>\sigma^2</math>. The two estimators are quite similar in large samples; the first estimator is always [[estimator bias|unbiased]], while the second estimator is biased but has a smaller [[mean squared error|mean squared error]]. In practice <math>s^2</math> is used more often, since it is more convenient for the hypothesis testing. The square root of <math>s^2</math> is called the '''regression standard error''',<ref>[https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf Julian Faraway (2000), ''Practical Regression and Anova using R'']</ref> '''standard error of the regression''',<ref>{{cite book |last1=Kenney |first1=J. |last2=Keeping |first2=E. S. |year=1963 |title=Mathematics of Statistics |publisher=van Nostrand |page=187 }}</ref><ref>{{cite book |last=Zwillinger |first=D. |year=1995 |title=Standard Mathematical Tables and Formulae |publisher=Chapman&Hall/CRC |isbn=0-8493-2479-3 |page=626 }}</ref> or '''standard error of the equation'''.<ref name="Hayashi 2000 loc=page 19"/>
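The identities satisfied by <math>P</math> and <math>M</math>, the relation between <math>QX^{\mathsf{T}}</math> and the Moore–Penrose pseudoinverse, and the estimates <math>s^2</math> and <math>\hat\sigma^2</math> can all be verified numerically. The sketch below is illustrative only; the data are simulated and the variable names are assumptions made for the example.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=n)

Q = np.linalg.inv(X.T @ X)     # cofactor matrix (X'X)^{-1}
P = X @ Q @ X.T                # hat / projection matrix
M = np.eye(n) - P              # annihilator matrix

print(np.allclose(P @ P, P), np.allclose(M @ M, M))      # both idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))      # PX = X, MX = 0
print(np.allclose(Q @ X.T, np.linalg.pinv(X)))           # QX' is the Moore-Penrose pseudoinverse

beta_hat = Q @ X.T @ y
resid = M @ y                          # equals y - X @ beta_hat
s2 = resid @ resid / (n - p)           # unbiased estimate of sigma^2
sigma2_mle = (n - p) / n * s2          # (biased) MLE estimate
print(s2, sigma2_mle)
</syntaxhighlight>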
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto <math>X</math>. The '''[[coefficient of determination|coefficient of determination]]''' <math>R^2</math> is defined as a ratio of "explained" variance to the "total" variance of the dependent variable <math>y</math>, in the cases where the regression sum of squares equals the sum of squares of residuals:<ref>{{harvtxt|Hayashi|2000|loc=page 20}}</ref>
<math display = "block">
R^2 = \frac{\sum(\hat y_i-\overline{y})^2}{\sum(y_i-\overline{y})^2} = \frac{y ^\mathrm{T} P ^\mathrm{T} LPy}{y ^\mathrm{T} Ly} = 1 - \frac{y ^\mathrm{T} My}{y ^\mathrm{T} Ly} = 1 - \frac{\rm RSS}{\rm TSS}
</math>
where TSS is the '''[[total sum of squares|total sum of squares]]''' for the dependent variable, <math display="inline">L=I_n-\frac{1}{n}J_n</math>, and <math display="inline">J_n</math> is an <math>n\times n</math> matrix of ones. (<math>L</math> is a [[centering matrix|centering matrix]] which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for <math>R^2</math> to be meaningful, the matrix <math>X</math> of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, <math>R^2</math> will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
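A minimal numerical sketch of the two equivalent expressions for <math>R^2</math> (illustrative only; the data are simulated, and the regression includes a constant column as required above):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # column of ones so that R^2 is meaningful
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)          # "explained" sum of squares

print(1 - rss / tss, ess / tss)                # the two expressions for R^2 agree
</syntaxhighlight>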
The variance in the prediction of the independent variable as a function of the dependent variable is given in the article [[Polynomial least squares|Polynomial least squares]].
=== Simple linear regression model ===
In the simple case of a single regressor with an intercept, <math>y_i=\alpha+\beta x_i+\varepsilon_i</math>, the least squares estimates have the explicit form
<math display="block">\hat\beta = \frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n (x_i-\bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta\,\bar x,</math>
where <math>\bar x</math> and <math>\bar y</math> denote the sample means of the <math>x_i</math> and the <math>y_i</math>.
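A minimal sketch of these closed-form estimates, checked against a generic polynomial least-squares routine (illustrative only; the data values are invented):
<syntaxhighlight lang="python">
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check against numpy's degree-1 polynomial least-squares fit
slope, intercept = np.polyfit(x, y, 1)
print(alpha_hat, beta_hat)
print(intercept, slope)      # should match the closed-form estimates
</syntaxhighlight>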
=== Alternative derivations ===
In the previous section the least squares estimator <math>\hat\beta</math> was obtained as a value that minimizes the sum of squared residuals of the model. However it is also possible to derive the same estimator from other approaches. In all cases the formula for OLS estimator remains the same: <math display = "block">\hat{\beta}=(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y</math>; the only difference is in how we interpret this result.
==== Projection ====
The OLS estimator <math>\hat\beta</math> can equivalently be viewed as the solution of the minimization problem
<math display = "block">
\hat\beta = {\rm arg}\min_\beta\,\lVert y - X\beta \rVert,
</math>
where <math>||\cdot||</math> is the standard [[Norm (mathematics)#Euclidean norm|''L''<sup>2</sup> norm]] in the <math>n</math>-dimensional [[Euclidean space|Euclidean space]] <math>\mathbb{R}^n</math>. The predicted quantity <math>X\beta</math> is just a certain linear combination of the vectors of regressors. Thus, the residual vector <math>y-X\beta</math> will have the smallest length when <math>y</math> is [[projection (linear algebra)|projected orthogonally]] onto the [[linear subspace|linear subspace]] [[linear span|spanned]] by the columns of <math>X</math>. The OLS estimator <math>\hat\beta</math> in this case can be interpreted as the coefficients of [[vector decomposition|vector decomposition]] of <math>\hat{y}=Py</math> along the basis of <math>X</math>.
In other words, the gradient equations at the minimum can be written as:
<math display="block">(\mathbf y - X \hat{\boldsymbol{\beta}})^{\rm T} X=0.</math> | <math display="block">(\mathbf y - X \hat{\boldsymbol{\beta}})^{\rm T} X=0.</math> | ||
A geometrical interpretation of these equations is that the vector of residuals, <math>\mathbf y - X \hat{\boldsymbol{\beta}}</math> is orthogonal to the [[column space|column space]] of <math>X</math>, since the dot product <math>(\mathbf y-X\hat{\boldsymbol{\beta}})\cdot X \mathbf v</math> is equal to zero for ''any'' conformal vector, <math>\mathbf{v}</math>. This means that <math>\mathbf y - X \boldsymbol{\hat \beta}</math> is the shortest of all possible vectors <math>\mathbf{y}- X \boldsymbol \beta</math>, that is, the variance of the residuals is the minimum possible.
Introducing <math>\hat{\boldsymbol{\gamma}}</math> and a matrix <math>K</math> with the assumption that a matrix <math>[X \ K]</math> is non-singular and <math>K^{\mathsf{T}}X=0</math> (cf. [[Linear projection#Orthogonal projections|Orthogonal projections]]), the residual vector should satisfy the following equation:
<math display="block">\hat{\mathbf{r}} \triangleq \mathbf{y} - X \hat{\boldsymbol{\beta}} = K \hat{{\boldsymbol{\gamma}}}.</math>
The equation and solution of linear least squares are thus described as follows:
<math display="block">\mathbf{y} = \begin{bmatrix}X & K\end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \qquad
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix}X & K\end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} (X^{\mathsf T}X)^{-1}X^{\mathsf T} \\ (K^{\mathsf T}K)^{-1}K^{\mathsf T} \end{bmatrix} \mathbf{y}.</math>
==== Maximum likelihood ====
The OLS estimator is identical to the [[maximum likelihood estimator|maximum likelihood estimator]] (MLE) under the normality assumption for the error terms.<ref>{{harvtxt|Hayashi|2000|loc=page 49}}</ref> From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the [[Cramér–Rao bound|Cramér–Rao bound]] for variance) if the normality assumption is satisfied.<ref name="Hayashi 2000 loc=page 52">{{harvtxt|Hayashi|2000|loc=page 52}}</ref>
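This equivalence can be illustrated numerically: with the error variance held fixed, the Gaussian negative log-likelihood is, up to constants, proportional to the sum of squared residuals, so a general-purpose optimizer recovers the closed-form OLS coefficients. The sketch below uses simulated data and SciPy's generic minimizer; it is an illustration, not part of the original derivation.
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.4, size=n)

def neg_log_likelihood(beta):
    # Gaussian negative log-likelihood in beta (sigma fixed); its minimizer
    # coincides with the minimizer of the sum of squared residuals.
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x   # numerical MLE
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                # closed-form OLS
print(beta_mle)
print(beta_ols)   # the two agree up to optimizer precision
</syntaxhighlight>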
=== Properties ===
==== Assumptions ====
There are several different frameworks in which the [[linear regression model|linear regression model]] can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case ('''random design''') the regressors <math>x_i</math> are random and sampled together with the <math>y_i</math>'s from some [[statistical population|population]], as in an [[observational study|observational study]]. This approach allows for more natural study of the [[asymptotic theory (statistics)|asymptotic properties]] of the estimators. In the other interpretation ('''fixed design'''), the regressors <math>X</math> are treated as known constants set by a [[design of experiments|design]], and <math>y</math> is sampled conditionally on the values of <math>X</math> as in an [[experiment|experiment]]. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on <math>X</math>. All results stated in this article are within the random design framework.
===== Classical linear regression model =====
The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations ''n'' is fixed. This contrasts with the other approaches, which study the [[asymptotic theory (statistics)|asymptotic behavior]] of OLS, and in which the number of observations is allowed to grow to infinity.
{| class="table table-bordered"
|-
! Condition !! Description
|-
| Correct specification || The linear functional form must coincide with the form of the actual data-generating process.
|-
| Strict exogeneity || The errors in the regression should have [[conditional expectation|conditional mean]] zero:<ref>{{harvtxt|Hayashi|2000|loc=page 7}}</ref><math>
\operatorname{E}[\,\varepsilon\mid X\,] = 0.
</math>The immediate consequence of the exogeneity assumption is that the errors have mean zero: <math>\operatorname{E}[\epsilon] = 0 </math>, and that the regressors are uncorrelated with the errors: <math>\operatorname{E}[X^{\mathsf{T}}\epsilon] = 0</math>.
|-
| Linear Independence || The regressors in <math>X</math> must all be [[linearly independent|linearly independent]]. Mathematically, this means that the matrix <math>X</math> must have full [[column rank|column rank]] almost surely:<ref name="Hayashi 2000 loc=page 10">{{harvtxt|Hayashi|2000|loc=page 10}}</ref><math display = "block">
\Pr\!\big[\,\operatorname{rank}(X) = p\,\big] = 1.
</math>Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix <math>Q_{xx}=\operatorname{E}[X^{\mathsf{T}}X]/n</math> is finite and positive semi-definite. When this assumption is violated the regressors are called linearly dependent or [[multicollinearity|perfectly multicollinear]]. In such case the value of the regression coefficient <math>\beta</math> cannot be learned, although prediction of <math>y</math> values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
|-
| Spherical Errors<ref name="Hayashi 2000 loc=page 10"/> || <math display = "block">
\operatorname{Var}[\,\varepsilon \mid X\,] = \sigma^2 I_n,
</math>
:where <math>I_n</math> is the [[identity matrix|identity matrix]] in dimension <math>n</math>, and <math>\sigma^2</math> is a parameter which determines the variance of each observation. This <math>\sigma^2</math> is considered a [[nuisance parameter|nuisance parameter]] in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient.
:It is customary to split this assumption into two parts:
:* '''[[Homoscedasticity|Homoscedasticity]]''': <math>\operatorname{E}[\epsilon_i^2|X] = \sigma^2</math>, which means that the error term has the same variance <math>\sigma^2</math> in each observation. When this requirement is violated this is called [[heteroscedasticity|heteroscedasticity]], in such case a more efficient estimator would be [[weighted least squares|weighted least squares]]. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the [[law of large numbers|law of large numbers]] they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, [[robust regression|robust estimation]] techniques are recommended.
:* '''No [[autocorrelation|autocorrelation]]''': the errors are [[correlation|uncorrelated]] between observations: <math>\operatorname{E}[\epsilon_i\epsilon_j|X]=0</math> for <math>i\neq j</math>. This assumption may be violated in the context of [[guide:662dfe6591|time series]] data, [[panel data|panel data]], cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases [[generalized least squares|generalized least squares]] provides a better alternative than the OLS. Another expression for autocorrelation is ''serial correlation''.
|-
| Normality || It is sometimes additionally assumed that the errors have [[multivariate normal distribution|normal distribution]] conditional on the regressors:<ref>{{harvtxt|Hayashi|2000|loc=page 34}}</ref>
<math display = "block"> | <math display = "block"> | ||
\varepsilon \mid X\sim \mathcal{N}(0, \sigma^2I_n). | \varepsilon \mid X\sim \mathcal{N}(0, \sigma^2I_n). | ||
</math>This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it does hold (especially in the area of hypothesis testing). Also when the errors are normal, the OLS estimator is equivalent to the [[maximum likelihood estimator|maximum likelihood estimator]] (MLE), and therefore it is asymptotically efficient in the class of all [[regular estimator|regular estimator]]s. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.<ref>{{cite journal|last1=Williams|first1=M. N|last2=Grajales|first2=C. A. G|last3=Kurkiewicz|first3=D|title=Assumptions of multiple regression: Correcting two misconceptions|journal=Practical Assessment, Research & Evaluation|date=2013|volume=18|issue=11|url=https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1308&context=pare}}</ref>
|}
===== Independent and identically distributed (iid) =====
In some applications an additional assumption is imposed: that all observations are independent and identically distributed. This means that all observations are taken from a [[random sample|random sample]], which makes all the assumptions listed earlier simpler and easier to interpret. The list of assumptions in this case is:
{| class="table" | {| class="table" | ||
Line 300: | Line 281: | ||
! Assumption !! Description | ! Assumption !! Description | ||
|- | |- | ||
| iid obesrvations || <math>(x_i,y_i)</math> is [[ | | iid obesrvations || <math>(x_i,y_i)</math> is [[independent random variables|independent]] from, and has the same [[Probability distribution|distribution]] as, <math>(x_j,y_j)</math> for all <math>i\neq j</math>. | ||
|- | |- | ||
| no perfect multicollinearity || <math>Q_{xx}=\operatorname{E}[x_ix_i^{\mathsf{T}}]</math> is a [[ | | no perfect multicollinearity || <math>Q_{xx}=\operatorname{E}[x_ix_i^{\mathsf{T}}]</math> is a [[positive-definite matrix|positive-definite matrix]] | ||
|- | |- | ||
| Exogeneity || <math>\operatorname{E}[\varepsilon_ix_i]=0</math> | | Exogeneity || <math>\operatorname{E}[\varepsilon_ix_i]=0</math> | ||
Line 310: | Line 291: | ||
==== Finite sample properties ==== | ==== Finite sample properties ==== | ||
First of all, under the ''strict exogeneity'' assumption the OLS estimators <math>\scriptstyle\hat\beta</math> and <math>s^2</math> are [[ | First of all, under the ''strict exogeneity'' assumption the OLS estimators <math>\scriptstyle\hat\beta</math> and <math>s^2</math> are [[Bias of an estimator|unbiased]], meaning that their expected values coincide with the true values of the parameters:<ref>{{harvtxt|Hayashi|2000|loc=pages 27, 30}}</ref> | ||
<math display = "block"> | <math display = "block"> | ||
\operatorname{E}[\, \hat\beta \mid X \,] = \beta, \quad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2. | \operatorname{E}[\, \hat\beta \mid X \,] = \beta, \quad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2. | ||
</math> | </math> | ||
If the strict exogeneity does not hold (as is the case with many [[ | If the strict exogeneity does not hold (as is the case with many [[time series|time series]] models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples. | ||
The ''[[ | The ''[[variance-covariance matrix|variance-covariance matrix]]'' (or simply ''covariance matrix'') of <math>\scriptstyle\hat\beta</math> is equal to<ref name="HayashiFSP">{{harvtxt|Hayashi|2000|loc=page 27}}</ref> | ||
<math display = "block"> | <math display = "block"> | ||
\operatorname{Var}[\, \hat\beta \mid X \,] = \sigma^2(X ^T X)^{-1} = \sigma^2 Q. | \operatorname{Var}[\, \hat\beta \mid X \,] = \sigma^2(X ^T X)^{-1} = \sigma^2 Q. | ||
Line 330: | Line 311: | ||
</math> | </math> | ||
The ''[[ | The ''[[Gauss–Markov theorem|Gauss–Markov theorem]]'' states that under the ''spherical errors'' assumption (that is, the errors should be [[uncorrelated|uncorrelated]] and [[homoscedastic|homoscedastic]]) the estimator <math >\scriptstyle\hat\beta</math> is efficient in the class of linear unbiased estimators. This is called the ''best linear unbiased estimator'' (BLUE). Efficiency should be understood as if we were to find some other estimator <math >\scriptstyle\tilde\beta</math> which would be linear in<math>y</math>and unbiased, then <ref name="HayashiFSP"/> | ||
<math display = "block"> | <math display = "block"> | ||
\operatorname{Var}[\, \tilde\beta \mid X \,] - \operatorname{Var}[\, \hat\beta \mid X \,] \geq 0 | \operatorname{Var}[\, \tilde\beta \mid X \,] - \operatorname{Var}[\, \hat\beta \mid X \,] \geq 0 | ||
</math> | </math> | ||
in the sense that this is a [[ | in the sense that this is a [[nonnegative-definite matrix|nonnegative-definite matrix]]. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ''ε'', other, non-linear estimators may provide better results than OLS. | ||
===== Assuming normality ===== | ===== Assuming normality ===== | ||
Line 343: | Line 324: | ||
\hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2(X ^\mathrm{T} X)^{-1}\big) | \hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2(X ^\mathrm{T} X)^{-1}\big) | ||
</math> | </math> | ||
where <math>Q</math> is the | where <math>Q</math> is the cofactor matrix. This estimator reaches the [[Cramér–Rao bound|Cramér–Rao bound]] for the model, and thus is optimal in the class of all unbiased estimators.<ref name="Hayashi 2000 loc=page 52"/> Note that unlike the [[Gauss–Markov theorem|Gauss–Markov theorem]], this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms. | ||
The estimator <math>s^2</math> will be proportional to the [[ | The estimator <math>s^2</math> will be proportional to the [[chi-squared distribution|chi-squared distribution]]:<ref>{{harvtxt|Amemiya|1985|loc=page 14}}</ref> | ||
<math display = "block"> | <math display = "block"> | ||
s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p} | s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p} | ||
</math> | </math> | ||
The variance of this estimator is equal to <math>2\sigma^4/(n-p)</math>, which does not attain the [[ | The variance of this estimator is equal to <math>2\sigma^4/(n-p)</math>, which does not attain the [[Cramér–Rao bound|Cramér–Rao bound]] of <math>2\sigma^4/n</math>. However it was shown that there are no unbiased estimators of <math>\sigma^2</math> with variance smaller than that of the estimator <math>s^2</math>.<ref>{{cite book |first=C. R. |last=Rao |author-link=C. R. Rao |title=Linear Statistical Inference and its Applications |location=New York |publisher=J. Wiley & Sons |year=1973 |edition=Second |page=319 |isbn=0-471-70823-2 }}</ref> If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the [[mean squared error|mean squared error]]) estimator in this class will be <math>\sim \sigma^2/(n-p+2)</math>, which even beats the Cramér–Rao bound in case when there is only one regressor (<math>p=1</math>).<ref>{{harvtxt|Amemiya|1985|loc=page 20}}</ref> | ||
Moreover, the estimators <math >\hat{\beta}</math> and <math>s^2</math> are [[independent random variables|independent]],<ref>{{harvtxt|Amemiya|1985|loc=page 27}}</ref> the fact which comes in useful when constructing the t- and F-tests for the regression. | |||
==== Large sample properties ==== | ==== Large sample properties ==== | ||
The least squares estimators are [[ | The least squares estimators are [[point estimate|point estimate]]s of the linear regression model parameters <math>\beta</math>. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the [[interval estimate|interval estimate]]s. | ||
Since we haven't made any assumption about the distribution of error term <math>\epsilon_i</math>, it is impossible to infer the distribution of the estimators <math>\hat\beta</math> and <math>\hat\sigma^2</math>. Nevertheless, we can apply the [[ | Since we haven't made any assumption about the distribution of error term <math>\epsilon_i</math>, it is impossible to infer the distribution of the estimators <math>\hat\beta</math> and <math>\hat\sigma^2</math>. Nevertheless, we can apply the [[central limit theorem|central limit theorem]] to derive their ''asymptotic'' properties as sample size <math>n</math> goes to infinity. While the sample size is necessarily finite, it is customary to assume that <math>n</math> is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit. | ||
We can show that under the model assumptions, the least squares estimator for <math>\beta</math> is [[ | We can show that under the model assumptions, the least squares estimator for <math>\beta</math> is [[consistent estimator|consistent]] (that is <math>\hat\beta</math> [[Convergence of random variables#Convergence in probability|converges in probability]] to <math>\beta</math>) and asymptotically normal: | ||
<math display = "block">(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2Q_{xx}^{-1}\big),</math> | <math display = "block">(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2Q_{xx}^{-1}\big),</math> | ||
where <math>Q_{xx} = X ^T X.</math> | where <math>Q_{xx} = X ^T X.</math> | ||
Line 401: | Line 352: | ||
\bigg] | \bigg] | ||
</math> at the <math>1-\alpha</math> confidence level, | </math> at the <math>1-\alpha</math> confidence level, | ||
where <math>Q</math> denotes the [[ | where <math>Q</math> denotes the [[quantile function|quantile function]] of standard normal distribution, and [·]<sub>''jj''</sub> is the <math>j</math>-th diagonal element of a matrix. | ||
Similarly, the least squares estimator for <math>\sigma^2</math> is also consistent and asymptotically normal (provided that the fourth moment of <math>\epsilon_i</math> exists) with limiting distribution | Similarly, the least squares estimator for <math>\sigma^2</math> is also consistent and asymptotically normal (provided that the fourth moment of <math>\epsilon_i</math> exists) with limiting distribution | ||
<math display = "block">(\hat{\sigma}^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N} \left(0,\;\operatorname{E}\left[\varepsilon_i^4\right] - \sigma^4\right). </math> | <math display = "block">(\hat{\sigma}^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N} \left(0,\;\operatorname{E}\left[\varepsilon_i^4\right] - \sigma^4\right). </math> | ||
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc.. As an example consider the problem of prediction. Suppose <math>x_0</math> is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The [[ | These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc.. As an example consider the problem of prediction. Suppose <math>x_0</math> is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The [[mean response|mean response]] is the quantity <math>y_0 = x_0^\mathrm{T} \beta</math>, whereas the [[predicted response|predicted response]] is <math>\hat{y}_0 = x_0^\mathrm{T} \hat\beta</math>. Clearly the predicted response is a random variable, its distribution can be derived from that of <math>\hat{\beta}</math>: | ||
<math display = "block">\left(\hat{y}_0 - y_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\sigma^2 x_0^\mathrm{T} Q_{xx}^{-1} x_0\right),</math> | <math display = "block">\left(\hat{y}_0 - y_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\sigma^2 x_0^\mathrm{T} Q_{xx}^{-1} x_0\right),</math> | ||
Latest revision as of 01:22, 12 April 2024
The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the residuals (a residual being the difference between an observed value and the fitted value provided by a model) made in the results of each individual equation.
The most important application is in data fitting. When the problem has substantial uncertainties in the independent variable (the [math]X[/math] variable), then simple regression and least-squares methods have problems; in such cases, the methodology required for fitting errors-in-variables models may be considered instead of that for least squares.
Least-squares problems fall into two categories: linear or ordinary least squares and nonlinear least squares, depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical regression analysis; it has a closed-form solution. The nonlinear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.
The following discussion is mostly presented in terms of linear functions but the use of least squares is valid and practical for more general families of functions. Also, by iteratively applying local quadratic approximation to the likelihood (through the Fisher information), the least-squares method may be used to fit a generalized linear model.
The least-squares method was officially discovered and published by Adrien-Marie Legendre (1805),[1] though it is usually also co-credited to Carl Friedrich Gauss (1795)[2][3] who contributed significant theoretical advances to the method and may have previously used it in his work.[4][5]
Problem statement
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of [math]n[/math] points (data pairs) [math](x_i,y_i)\![/math], [math]i=1,\ldots,n[/math], where [math]x_i[/math] is an independent variable and [math]y_i[/math] is a dependent variable whose value is found by observation. The model function has the form [math]f(x, \boldsymbol \beta)[/math], where [math]m[/math] adjustable parameters are held in the vector [math]\boldsymbol \beta[/math]. The goal is to find the parameter values for the model that "best" fits the data. The fit of a model to a data point is measured by its residual, defined as the difference between the observed value of the dependent variable and the value predicted by the model:
The least-squares method finds the optimal parameter values by minimizing the sum of squared residuals, [math]S[/math]:[6]

[[math]]
S = \sum_{i=1}^n r_i^2 ,
[[/math]]

where [math]r_i[/math] is the residual of the [math]i[/math]-th data point.
An example of a model in two dimensions is that of the straight line. Denoting the y-intercept as [math]\beta_0[/math] and the slope as [math]\beta_1[/math], the model function is given by [math]f(x,\boldsymbol \beta)=\beta_0+\beta_1 x[/math]. See linear least squares for a fully worked out example of this model.
A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, [math]x[/math] and [math]z[/math], say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.
A residual plot that shows random fluctuations about [math]r_i=0[/math] indicates that a linear model [math](Y_i = \alpha + \beta x_i + U_i)[/math] is appropriate. [math]U_i[/math] is an independent, random variable.[6]

If the residual points had some sort of a shape and were not randomly fluctuating, a linear model would not be appropriate. For example, if the residual plot had a parabolic shape, a parabolic model [math](Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i)[/math] would be appropriate for the data. The residuals for a parabolic model can be calculated via [math]r_i=y_i-\hat{\alpha}-\hat{\beta} x_i-\widehat{\gamma} x_i^2[/math].[6]
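The model-checking idea just described is easy to reproduce numerically. The following is a minimal, illustrative sketch (not taken from the source; the simulated data, seed, and variable names are my own assumptions) that fits both a straight line and a parabola by least squares and compares their residual patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# Simulated data with a genuinely parabolic trend plus noise (an assumed toy example).
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, size=x.size)

# Fit a line and a parabola; np.polyfit minimizes the sum of squared residuals.
line = np.polyfit(x, y, deg=1)
parab = np.polyfit(x, y, deg=2)

resid_line = y - np.polyval(line, x)
resid_parab = y - np.polyval(parab, x)

# A systematic pattern in the residuals signals a misspecified model: the linear fit's
# residuals track the curvature term (x - mean)^2, the parabolic fit's residuals do not.
curvature = (x - x.mean()) ** 2
print("corr(linear residuals, curvature):   ", np.corrcoef(resid_line, curvature)[0, 1])
print("corr(parabolic residuals, curvature):", np.corrcoef(resid_parab, curvature)[0, 1])
```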
Limitations
This regression formulation considers only observational errors in the dependent variable (but the alternative total least squares regression can account for errors in both variables). There are two rather different contexts with different implications:
- Regression for prediction. Here a model is fitted to provide a prediction rule for application in a similar situation to which the data used for fitting apply. Here the dependent variables corresponding to such future application would be subject to the same types of observation error as those in the data used for fitting. It is therefore logically consistent to use the least-squares prediction rule for such data.
- Regression for fitting a "true relationship". In standard regression analysis that leads to fitting by least squares there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible. When errors in the independent variable are non-negligible, models of measurement error can be used; such methods can lead to parameter estimates, hypothesis testing and confidence intervals that take into account the presence of observation errors in the independent variables.[7] An alternative approach is to fit a model by total least squares; this can be viewed as taking a pragmatic approach to balancing the effects of the different sources of error in formulating an objective function for use in model-fitting.
Linear Least Squares
Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals.
Main formulations
The three main linear least squares formulations are:
Formulation | Description |
---|---|
Ordinary least squares (OLS) | Is the most common estimator. OLS estimates are commonly used to analyze both experimental and observational data. The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated value of the unknown parameter vector [math]\beta[/math]:
[[math]]
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y},
[[/math]] where [math]\mathbf{y}[/math] is a vector whose [math]i[/math]th element is the [math]i[/math]th observation of the dependent variable, and [math]\mathbf{X}[/math] is a matrix whose [math]ij[/math] element is the [math]i[/math]th observation of the [math]j[/math]th independent variable. The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors:[8] [[math]]
\operatorname{E}[\,\mathbf{x}_i\varepsilon_i\,] = 0,
[[/math]] where [math]\mathbf{x}_i[/math] is the transpose of row [math]i[/math] of the matrix [math]\mathbf{X}.[/math] It is also efficient under the assumption that the errors have finite variance and are homoscedastic, meaning that [math]\operatorname{E}[\varepsilon_i^2|x_i][/math] does not depend on [math]i[/math]. The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data, it is difficult to exclude the possibility of an omitted covariate [math]z[/math] that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the response variable, and hence to an inconsistent estimator of [math]\beta[/math]. The condition of homoscedasticity can fail with either experimental or observational data. If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if multicollinearity is present, unless the sample size is large.
|
Weighted least squares (WLS) | Used when heteroscedasticity is present in the error terms of the model. |
Generalized least squares (GLS) | Is an extension of the OLS method, that allows efficient estimation of [math]\beta[/math] when either heteroscedasticity, or correlations, or both are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue to the sum of squared residuals from OLS regression, where the weight for the [math]i[/math]th case is inversely proportional to [math]\operatorname{Var}(\varepsilon_i)[/math]. This special case of GLS is called "weighted least squares". The GLS solution to an estimation problem is [[math]]
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T} \boldsymbol\Omega^{-1} \mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\boldsymbol\Omega^{-1}\mathbf{y},
[[/math]] where [math]\Omega[/math] is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. For GLS to be applied, the covariance structure of the errors must be known up to a multiplicative constant; a numerical sketch of this whitening view follows the table.
|
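As a concrete illustration of the whitening view of GLS mentioned in the table above, here is a minimal NumPy sketch. It is a toy example under assumed data and a diagonal, known error covariance (the weighted-least-squares special case); none of the names or values come from the source. It checks that the GLS formula agrees with OLS applied to suitably transformed data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])   # design matrix with intercept
beta_true = np.array([2.0, -1.5])                          # assumed true coefficients
sigma_i = 0.2 + 0.5 * X[:, 1]                              # heteroscedastic error std devs
y = X @ beta_true + rng.normal(0.0, sigma_i)

Omega = np.diag(sigma_i**2)                                # known error covariance (diagonal case)

# GLS formula: (X' Omega^-1 X)^-1 X' Omega^-1 y
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# Equivalent whitening view: premultiply by Omega^{-1/2} and run plain OLS on the result.
W = np.diag(1.0 / sigma_i)                                 # Omega^{-1/2} for a diagonal Omega
beta_white, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)

print(beta_gls, beta_white)   # the two estimates coincide
```

The design choice is the standard one: premultiplying by [math]\Omega^{-1/2}[/math] makes the transformed errors homoscedastic and uncorrelated, so the ordinary least-squares machinery applies to the transformed data.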
Properties
If the experimental errors, [math]\varepsilon[/math], are uncorrelated, have a mean of zero and a constant variance, [math]\sigma^2[/math], the Gauss–Markov theorem states that the least-squares estimator, [math]\hat{\boldsymbol{\beta}}[/math], has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a normal distribution. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.
For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
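To make the claim about the arithmetic mean concrete, here is the standard one-line verification (stated in the notation of this article; it is a textbook derivation rather than something quoted from the source). For repeated measurements [math]y_1,\ldots,y_n[/math] of a single quantity, the model function is the constant [math]f(x,\mu)=\mu[/math], so

[[math]]
S(\mu)=\sum_{i=1}^n (y_i-\mu)^2, \qquad \frac{dS}{d\mu} = -2\sum_{i=1}^n (y_i-\mu) = 0 \quad\Longrightarrow\quad \hat\mu = \frac{1}{n}\sum_{i=1}^n y_i = \bar y.
[[/math]]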
However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator.[9]
These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
Limitations
An assumption underlying the treatment given above is that the independent variable, [math]X[/math], is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares or more generally errors-in-variables models, or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.[10][11]
In some cases the (weighted) normal equations matrix [math]X^{\mathsf{T}}X[/math] is ill-conditioned. When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of [math]\mathbf{\hat{\boldsymbol{\beta}}}[/math], then various techniques can be used to increase the stability of the solution. For example, see constrained least squares.
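The ill-conditioning of polynomial (Vandermonde) fits and the stabilizing effect of ridge regression can be seen in a few lines of NumPy. This is an illustrative sketch only; the polynomial degree, grid, noise level, and ridge penalty are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

degree = 12
V = np.vander(x, degree + 1)        # Vandermonde design matrix for a polynomial fit
N = V.T @ V                         # normal-equations matrix

# The condition number of V'V grows rapidly with the degree, so measurement noise
# gets amplified in a plain least-squares solve.
print("condition number of V'V:", np.linalg.cond(N))

# Plain least squares versus a ridge-regularized (Tikhonov) solution.
beta_ls, *_ = np.linalg.lstsq(V, y, rcond=None)
lam = 1e-6
beta_ridge = np.linalg.solve(N + lam * np.eye(degree + 1), V.T @ y)

print("norm of least-squares coefficients:", np.linalg.norm(beta_ls))
print("norm of ridge coefficients:        ", np.linalg.norm(beta_ridge))
```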
Another drawback of the least squares estimator is the fact that the norm of the residuals, [math]\| \mathbf y - X\hat{\boldsymbol{\beta}} \|[/math] is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter [math]\mathbf{\hat{\boldsymbol{\beta}}}[/math], e.g., a small value of [math]\|{\boldsymbol{\beta}}-\hat{\boldsymbol{\beta}}\|[/math]. However, since the true parameter [math]{\boldsymbol{\beta}}[/math] is necessarily unknown, this quantity cannot be directly minimized. If a prior probability on [math]\hat{\boldsymbol{\beta}}[/math] is known, then a Bayes estimator can be used to minimize the mean squared error, [math]E \left\{ \| {\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}} \|^2 \right\} [/math]. The least squares method is often applied when no prior is known. Surprisingly, when several parameters are being estimated jointly, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator. This is an example of more general shrinkage estimators that have been applied to regression problems.
Ordinary Least Squares (OLS)
In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
The OLS estimator is consistent when the regressors are exogenous, and—by the Gauss–Markov theorem—optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator.
Linear model
Suppose the data consists of [math]n[/math] observations [math]\left\{\mathbf{x}_i, y_i\right\}_{i=1}^n[/math]. Each observation [math]i[/math] includes a scalar response [math]y_i[/math] and a column vector [math]\mathbf{x}_i[/math] of [math]p[/math] parameters (regressors), i.e., [math]\mathbf{x}_i=\left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^\mathsf{T}[/math]. In a linear regression model, the response variable, [math]y_i[/math], is a linear function of the regressors:

[[math]]
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
[[/math]]

or in vector form,

[[math]]
y_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i,
[[/math]]

where [math]\mathbf{x}_i[/math], as introduced previously, is a column vector of the [math]i[/math]-th observation of all the explanatory variables; [math]\boldsymbol{\beta}[/math] is a [math]p \times 1[/math] vector of unknown parameters; and the scalar [math]\varepsilon_i[/math] represents unobserved random variables (errors) of the [math]i[/math]-th observation. [math]\varepsilon_i[/math] accounts for the influences upon the responses [math]y_i[/math] from sources other than the explanators [math]\mathbf{x}_i[/math]. This model can also be written in matrix notation as

[[math]]
\mathbf{y} = \mathrm{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},
[[/math]]
where [math]\mathbf{y}[/math] and [math]\boldsymbol{\varepsilon}[/math] are [math]n \times 1[/math] vectors of the response variables and the errors of the [math] n [/math] observations, and [math]\mathrm{X}[/math] is an [math]n \times p[/math] matrix of regressors, also sometimes called the design matrix, whose row [math]i[/math] is [math]\mathbf{x}_i^\mathsf{T}[/math] and contains the [math]i[/math]-th observations on all the explanatory variables.
As a rule, the constant term is always included in the set of regressors [math]\mathrm{X}[/math], say, by taking [math]x_{i1}=1[/math] for all [math]i=1, \dots, n[/math]. The coefficient [math]\beta_1[/math] corresponding to this regressor is called the intercept.
Regressors do not have to be independent: there can be any desired relationship between the regressors (so long as it is not a linear relationship). For instance, we might suspect the response depends linearly both on a value and its square; in which case we would include one regressor whose value is just the square of another regressor. In that case, the model would be quadratic in the second regressor, but nonetheless is still considered a linear model because the model is still linear in the parameters ([math]\boldsymbol{\beta}[/math]).
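The point that "linear" refers to the parameters can be illustrated with a short sketch (toy data and assumed coefficients, not taken from the source): one column of the design matrix is simply the square of another, yet the fit is still an ordinary linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 100)
y = 1.0 + 2.0 * x - 0.75 * x**2 + rng.normal(0, 0.1, x.size)

# One regressor is simply the square of another; the model stays linear in beta.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [1.0, 2.0, -0.75]
```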
Matrix/vector formulation
Consider an overdetermined system

[[math]]
\sum_{j=1}^{p} X_{ij}\beta_j = y_i, \qquad (i=1, 2, \dots, n),
[[/math]]

of [math] n [/math] linear equations in [math]p[/math] unknown coefficients, [math] \beta_1, \beta_2, \dots, \beta_p [/math], with [math] n \gt p [/math]. (Note: for a linear model as above, not all elements in [math] \mathrm{X} [/math] contain information on the data points. The first column is populated with ones, [math]X_{i1} = 1[/math]. Only the other columns contain actual data. So here [math]p[/math] is equal to the number of regressors plus one.) This can be written in matrix form as

[[math]]
\mathrm{X} \boldsymbol{\beta} = \mathbf{y},
[[/math]]

where

[[math]]
\mathrm{X}=\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
[[/math]]

Such a system usually has no exact solution, so the goal is instead to find the coefficients [math]\boldsymbol{\beta}[/math] which fit the equations "best", in the sense of solving the quadratic minimization problem

[[math]]
\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\, S(\boldsymbol{\beta}),
[[/math]]

where the objective function [math] S [/math] is given by

[[math]]
S(\boldsymbol{\beta}) = \sum_{i=1}^n \biggl| y_i - \sum_{j=1}^p X_{ij}\beta_j \biggr|^2 = \bigl\| \mathbf y - \mathrm{X} \boldsymbol{\beta} \bigr\|^2 .
[[/math]]

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the [math]p[/math] columns of the matrix [math] \mathrm{X} [/math] are linearly independent, given by solving the so-called normal equations:

[[math]]
(\mathrm{X}^{\mathsf T} \mathrm{X}) \hat{\boldsymbol{\beta}} = \mathrm{X}^{\mathsf T} \mathbf{y}.
[[/math]]

The matrix [math]\mathrm{X}^{\mathsf T} \mathrm X[/math] is known as the normal matrix or Gram matrix and the matrix [math]\mathrm{X}^{\mathsf T} \mathbf y[/math] is known as the moment matrix of regressand by regressors.[12] Finally, [math]\hat{\boldsymbol{\beta}}[/math] is the coefficient vector of the least-squares hyperplane, expressed as

[[math]]
\hat{\boldsymbol{\beta}} = (\mathrm{X}^{\mathsf T} \mathrm{X})^{-1} \mathrm{X}^{\mathsf T} \mathbf{y},
[[/math]]

or

[[math]]
\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathrm{X}^{\mathsf T} \mathrm{X})^{-1} \mathrm{X}^{\mathsf T} \boldsymbol{\varepsilon}.
[[/math]]
Estimation
Suppose [math]b[/math] is a "candidate" value for the parameter vector [math]\beta[/math]. The quantity [math]y_i-x_i^{\mathsf{T}}b[/math], called the residual for the [math]i[/math]-th observation, measures the vertical distance between the data point [math](x_i,y_i)[/math] and the hyperplane [math]y=x^{\mathsf{T}}b[/math], and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[13] is a measure of the overall model fit:

[[math]]
S(b) = \sum_{i=1}^n \bigl(y_i - x_i^{\mathsf{T}} b\bigr)^2 = (\mathbf y - Xb)^{\mathsf{T}} (\mathbf y - Xb),
[[/math]]

where [math]\mathsf{T}[/math] denotes the matrix transpose, and the rows of [math]X[/math], denoting the values of all the independent variables associated with a particular value of the dependent variable, are [math]X_i = x_i^{\mathsf{T}}[/math]. The value of [math]b[/math] which minimizes this sum is called the OLS estimator for [math]\beta[/math]. The function [math]S(b)[/math] is quadratic in [math]b[/math] with positive-definite Hessian, and therefore this function possesses a unique global minimum at [math]b =\hat\beta[/math], which can be given by the explicit formula:[14]

[[math]]
\hat\beta = \underset{b \in \mathbb{R}^p}{\operatorname{arg\,min}}\, S(b) = (X^{\mathsf{T}}X)^{-1} X^{\mathsf{T}} \mathbf y .
[[/math]]
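A minimal NumPy sketch of this closed form (with simulated data; the solver choices and names are my own, not prescribed by the text): it computes [math]\hat\beta[/math] by solving the normal equations, checks it against the Moore–Penrose pseudoinverse route discussed next, and verifies that the residuals are orthogonal to the regressors.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0])                            # assumed true coefficients
y = X @ beta_true + rng.normal(0, 0.5, n)

# Explicit formula via the normal equations (X'X) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same result through the Moore-Penrose pseudoinverse of X.
beta_pinv = np.linalg.pinv(X) @ y

residuals = y - X @ beta_hat
print(np.allclose(beta_hat, beta_pinv))      # True
print(np.round(X.T @ residuals, 10))         # ~0: residuals are orthogonal to the regressors
```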
The product [math]N=X^{\mathsf{T}}X[/math] is a Gram matrix and its inverse, [math]Q=N^{-1}[/math], is the cofactor matrix of [math]\beta[/math],[15][16][17] closely related to its covariance matrix, [math]C_{\beta}[/math]. The matrix

[[math]]
X^{+} = (X^{\mathsf{T}}X)^{-1} X^{\mathsf{T}}
[[/math]]

is called the Moore–Penrose pseudoinverse matrix of [math]X[/math]. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
After we have estimated [math]\beta[/math], the fitted values (or predicted values) from the regression will be

[[math]]
\hat{\mathbf y} = X\hat\beta = P\mathbf y,
[[/math]]

where

[[math]]
P = X(X^{\mathsf{T}}X)^{-1} X^{\mathsf{T}}
[[/math]]

is the projection matrix onto the space [math]V[/math] spanned by the columns of [math]X[/math]. This matrix [math]P[/math] is also sometimes called the hat matrix because it "puts a hat" onto the variable [math]y[/math]. Another matrix, closely related to [math]P[/math] is the annihilator matrix [math]M=I_n-P[/math]; this is a projection matrix onto the space orthogonal to [math]V[/math]. Both matrices [math]P[/math] and [math]M[/math] are symmetric and idempotent (meaning that [math]P^2 = P[/math] and [math]M^2=M[/math]), and relate to the data matrix [math]X[/math] via identities [math]PX=X[/math] and [math]MX=0[/math].[18] Matrix [math]M[/math] creates the residuals from the regression:

[[math]]
\hat\varepsilon = \mathbf y - \hat{\mathbf y} = \mathbf y - X\hat\beta = M\mathbf y = M(X\beta + \varepsilon) = M\varepsilon.
[[/math]]

Using these residuals we can estimate the value of [math]\sigma^2[/math] using the reduced chi-squared statistic:

[[math]]
s^2 = \frac{\hat\varepsilon^{\mathsf{T}}\hat\varepsilon}{n-p} = \frac{\mathbf y^{\mathsf{T}} M \mathbf y}{n-p}, \qquad \hat\sigma^2 = \frac{n-p}{n}\, s^2 = \frac{\hat\varepsilon^{\mathsf{T}}\hat\varepsilon}{n}.
[[/math]]

The denominator, [math]n-p[/math], is the statistical degrees of freedom. The first quantity, [math]s^2[/math], is the OLS estimate for [math]\sigma^2[/math], whereas the second, [math]\scriptstyle\hat\sigma^2[/math], is the MLE estimate for [math]\sigma^2[/math]. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice [math]s^2[/math] is used more often, since it is more convenient for hypothesis testing. The square root of [math]s^2[/math] is called the regression standard error,[19] standard error of the regression,[20][21] or standard error of the equation.[18]
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto [math]X[/math]. The coefficient of determination [math]R^2[/math] is defined as a ratio of "explained" variance to the "total" variance of the dependent variable [math]y[/math], in the cases where the regression sum of squares equals the sum of squares of residuals:[22]

[[math]]
R^2 = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2} = \frac{\mathbf y^{\mathsf{T}} P^{\mathsf{T}} L P \mathbf y}{\mathbf y^{\mathsf{T}} L \mathbf y} = 1 - \frac{\mathbf y^{\mathsf{T}} M \mathbf y}{\mathbf y^{\mathsf{T}} L \mathbf y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
[[/math]]
where TSS is the total sum of squares for the dependent variable, [math]L=I_n-\frac{1}{n}J_n[/math], and [math]J_n[/math] is an [math]n\times n[/math] matrix of ones. ([math]L[/math] is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for [math]R^2[/math] to be meaningful, the matrix [math]X[/math] of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, [math]R^2[/math] will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
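Continuing the same kind of toy example, the following sketch (assumed data and names, purely illustrative) forms the hat and annihilator matrices, the two variance estimates [math]s^2[/math] and [math]\hat\sigma^2[/math], and [math]R^2[/math] computed from the centering matrix exactly as in the formulas above:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(0, 1.0, n)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
M = np.eye(n) - P                         # annihilator matrix
residuals = M @ y                         # same as y - X @ beta_hat

s2 = residuals @ residuals / (n - p)      # unbiased OLS estimate of sigma^2
sigma2_mle = residuals @ residuals / n    # biased MLE estimate

L = np.eye(n) - np.ones((n, n)) / n       # centering matrix
R2 = 1 - (y @ M @ y) / (y @ L @ y)        # coefficient of determination

print(s2, sigma2_mle, R2)
```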
The variance in the prediction of the independent variable as a function of the dependent variable is given in the article Polynomial least squares.
Simple linear regression model
If the data matrix [math]X[/math] contains only two variables, a constant and a scalar regressor [math]x_i[/math], then this is called the "simple regression model".[23] This case is often considered in beginner statistics classes, as it provides much simpler formulas even suitable for manual calculation. The parameters are commonly denoted as [math](\alpha,\beta)[/math]:

[[math]]
y_i = \alpha + \beta x_i + \varepsilon_i.
[[/math]]

The least squares estimates in this case are given by simple formulas

[[math]]
\hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta\,\bar x,
[[/math]]

where [math]\bar x[/math] and [math]\bar y[/math] are the sample means of the regressor and of the response.
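The simple-regression formulas can be checked directly against the general matrix expression; the sketch below does so on simulated data (all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 40)
y = 3.0 + 0.8 * x + rng.normal(0, 0.3, x.size)

# Simple-regression formulas.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# General matrix formula for comparison.
X = np.column_stack([np.ones_like(x), x])
coef = np.linalg.solve(X.T @ X, X.T @ y)

print(alpha_hat, beta_hat)   # matches coef[0], coef[1]
```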
Alternative derivations
In the previous section the least squares estimator [math]\hat\beta[/math] was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same,

[[math]]
\hat\beta = (X^{\mathsf{T}}X)^{-1} X^{\mathsf{T}} \mathbf y;
[[/math]]

the only difference is in how we interpret this result.
Projection
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations [math]X\beta = y[/math], where [math]\beta[/math] is the unknown. Assuming the system cannot be solved exactly (the number of equations [math]n[/math] is much larger than the number of unknowns [math]p[/math]), we are looking for a solution that could provide the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

[[math]]
\hat\beta = \underset{\beta}{\operatorname{arg\,min}}\, \| \mathbf y - X\beta \|,
[[/math]]

where [math]||\cdot||[/math] is the standard L2 norm in the [math]n[/math]-dimensional Euclidean space [math]\mathbb{R}^n[/math]. The predicted quantity [math]X\beta[/math] is just a certain linear combination of the vectors of regressors. Thus, the residual vector [math]y-X\beta[/math] will have the smallest length when [math]y[/math] is projected orthogonally onto the linear subspace spanned by the columns of [math]X[/math]. The OLS estimator [math]\hat\beta[/math] in this case can be interpreted as the coefficients of vector decomposition of [math]\hat{y}=Py[/math] along the basis of [math]X[/math].

In other words, the gradient equations at the minimum can be written as:

[[math]]
(\mathbf y - X\hat{\boldsymbol{\beta}})^{\mathsf{T}} X = 0.
[[/math]]

A geometrical interpretation of these equations is that the vector of residuals, [math]\mathbf y - X \hat{\boldsymbol{\beta}}[/math] is orthogonal to the column space of [math]X[/math], since the dot product [math](\mathbf y-X\hat{\boldsymbol{\beta}})\cdot X \mathbf v[/math] is equal to zero for any conformal vector, [math]\mathbf{v}[/math]. This means that [math]\mathbf y - X \boldsymbol{\hat \beta}[/math] is the shortest of all possible vectors [math]\mathbf{y}- X \boldsymbol \beta[/math], that is, the variance of the residuals is the minimum possible.
Introducing [math]\hat{\boldsymbol{\gamma}}[/math] and a matrix [math]K[/math] with the assumption that a matrix [math][X \ K][/math] is non-singular and [math]K^{\mathsf{T}}X=0[/math] (cf. Orthogonal projections), the residual vector should satisfy the following equation:
The equation and solution of linear least squares are thus described as follows:
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[24] Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.
Maximum likelihood
The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[25] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[26]
Properties
Assumptions
There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors [math]x_i[/math] are random and sampled together with the [math]y_i[/math]'s from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors [math]X[/math] are treated as known constants set by a design, and [math]y[/math] is sampled conditionally on the values of [math]X[/math], as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on [math]X[/math]. All results stated in this article are within the random design framework.
Classical linear regression model
The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, and in which the number of observations is allowed to grow to infinity.
Condition | Description |
---|---|
Correct specification | The linear functional form must coincide with the form of the actual data-generating process. |
Strict exogeneity | The errors in the regression should have conditional mean zero:[27] [math] \operatorname{E}[\,\varepsilon\mid X\,] = 0. [/math] The immediate consequence of the exogeneity assumption is that the errors have mean zero: [math]\operatorname{E}[\epsilon] = 0 [/math], and that the regressors are uncorrelated with the errors: [math]\operatorname{E}[X^{\mathsf{T}}\epsilon] = 0[/math]. |
Linear Independence | The regressors in [math]X[/math] must all be linearly independent. Mathematically, this means that the matrix [math]X[/math] must have full column rank almost surely:[28] [[math]]
\Pr\!\big[\,\operatorname{rank}(X) = p\,\big] = 1.
[[/math]] Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix [math]Q_{xx}=\operatorname{E}[X^{\mathsf{T}}X]/n[/math] is finite and positive semi-definite. When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such case the value of the regression coefficient [math]\beta[/math] cannot be learned, although prediction of [math]y[/math] values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
|
Spherical Errors[28] | [[math]]
\operatorname{Var}[\,\varepsilon \mid X\,] = \sigma^2 I_n,
[[/math]] This combines two conditions. Homoscedasticity: [math]\operatorname{E}[\epsilon_i^2|X] = \sigma^2[/math], which means that the error term has the same variance [math]\sigma^2[/math] in each observation; when this requirement is violated (heteroscedasticity), a more efficient estimator is weighted least squares, and if the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean), in which case robust estimation techniques are recommended. No autocorrelation (also called serial correlation): the errors are uncorrelated between observations, [math]\operatorname{E}[\epsilon_i\epsilon_j|X]=0[/math] for [math]i\neq j[/math]; this assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies, in which case generalized least squares provides a better alternative than OLS.
|
Normality | It is sometimes additionally assumed that the errors have normal distribution conditional on the regressors:[29]
[[math]]
\varepsilon \mid X\sim \mathcal{N}(0, \sigma^2I_n).
[[/math]] This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it does hold (especially in the area of hypothesis testing). Also, when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.[30]
|
Independent and identically distributed (iid)
In some applications an additional assumption is imposed — that all observations are independent and identically distributed. This means that all observations are taken from a random sample which makes all the assumptions listed earlier simpler and easier to interpret. The list of assumptions in this case is:
Assumption | Description |
---|---|
iid observations | [math](x_i,y_i)[/math] is independent from, and has the same distribution as, [math](x_j,y_j)[/math] for all [math]i\neq j[/math]. |
no perfect multicollinearity | [math]Q_{xx}=\operatorname{E}[x_ix_i^{\mathsf{T}}][/math] is a positive-definite matrix |
Exogeneity | [math]\operatorname{E}[\varepsilon_ix_i]=0[/math] |
homoscedasticity | [math]\operatorname{Var}[\varepsilon_i \mid x_i] = \sigma^2[/math] |
Finite sample properties
First of all, under the strict exogeneity assumption the OLS estimators [math]\scriptstyle\hat\beta[/math] and [math]s^2[/math] are unbiased, meaning that their expected values coincide with the true values of the parameters:[31]

[[math]]
\operatorname{E}[\, \hat\beta \mid X \,] = \beta, \quad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2.
[[/math]]
If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.
The variance-covariance matrix (or simply covariance matrix) of [math]\scriptstyle\hat\beta[/math] is equal to[32]

[[math]]
\operatorname{Var}[\, \hat\beta \mid X \,] = \sigma^2 (X^{\mathsf{T}} X)^{-1} = \sigma^2 Q.
[[/math]]
In particular, the standard error of each coefficient [math]\scriptstyle\hat\beta_j[/math] is equal to the square root of the [math]j[/math]-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity [math]\sigma^2[/math] with its estimate [math]s^2[/math]. Thus,

[[math]]
\widehat{\operatorname{s.e.}}(\hat\beta_j) = \sqrt{s^2 \left[(X^{\mathsf{T}} X)^{-1}\right]_{jj}}.
[[/math]]
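The covariance matrix and the coefficient standard errors follow directly from the quantities above. Here is a minimal sketch with simulated data (the design and true coefficients are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - p)       # unbiased estimate of sigma^2

cov_beta = s2 * XtX_inv                    # estimated Var[beta_hat | X]
std_err = np.sqrt(np.diag(cov_beta))       # standard error of each coefficient
print(beta_hat, std_err)
```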
It can also be easily shown that the estimator [math]\scriptstyle\hat\beta[/math] is uncorrelated with the residuals from the model:[32]

[[math]]
\operatorname{Cov}[\, \hat\beta, \hat\varepsilon \mid X\,] = 0.
[[/math]]
The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator [math]\scriptstyle\hat\beta[/math] is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as if we were to find some other estimator [math]\scriptstyle\tilde\beta[/math] which would be linear in [math]y[/math] and unbiased, then [32]

[[math]]
\operatorname{Var}[\, \tilde\beta \mid X \,] - \operatorname{Var}[\, \hat\beta \mid X \,] \geq 0
[[/math]]
in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.
Assuming normality
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if we are willing to assume that the normality assumption holds (that is, that [math]\varepsilon \sim \mathcal{N}(0,\sigma^2 I_n)[/math]), then additional properties of the OLS estimators can be stated.
The estimator [math]\scriptstyle\hat\beta[/math] is normally distributed, with mean and variance as given before:[33]

[[math]]
\hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2 (X^{\mathsf{T}} X)^{-1}\big),
[[/math]]
where [math]Q[/math] is the cofactor matrix. This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[26] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.
The estimator [math]s^2[/math] will be proportional to the chi-squared distribution:[34]

[[math]]
s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}.
[[/math]]
The variance of this estimator is equal to [math]2\sigma^4/(n-p)[/math], which does not attain the Cramér–Rao bound of [math]2\sigma^4/n[/math]. However, it was shown that there are no unbiased estimators of [math]\sigma^2[/math] with variance smaller than that of the estimator [math]s^2[/math].[35] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be [math]\tilde\sigma^2 = \mathrm{SSR}/(n-p+2)[/math], which even beats the Cramér–Rao bound when there is only one regressor ([math]p=1[/math]).[36]
Moreover, the estimators [math]\hat{\beta}[/math] and [math]s^2[/math] are independent,[37] a fact which comes in useful when constructing the t- and F-tests for the regression.
Large sample properties
The least squares estimators are point estimates of the linear regression model parameters [math]\beta[/math]. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.
Since we haven't made any assumption about the distribution of error term [math]\epsilon_i[/math], it is impossible to infer the distribution of the estimators [math]\hat\beta[/math] and [math]\hat\sigma^2[/math]. Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as sample size [math]n[/math] goes to infinity. While the sample size is necessarily finite, it is customary to assume that [math]n[/math] is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.
We can show that under the model assumptions, the least squares estimator for [math]\beta[/math] is consistent (that is [math]\hat\beta[/math] converges in probability to [math]\beta[/math]) and asymptotically normal:

[[math]]
(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2 Q_{xx}^{-1}\big),
[[/math]]
where [math]Q_{xx} = X ^T X.[/math]
Intervals
Using this asymptotic distribution, approximate two-sided confidence intervals for the [math]j[/math]-th component of the vector [math]\hat{\beta}[/math] can be constructed as

[[math]]
\beta_j \in \bigg[ \hat\beta_j - Q\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{\hat\sigma^2 \left[Q_{xx}^{-1}\right]_{jj}},\ \ \hat\beta_j + Q\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{\hat\sigma^2 \left[Q_{xx}^{-1}\right]_{jj}}\, \bigg]
[[/math]]

at the [math]1-\alpha[/math] confidence level, where [math]Q[/math] denotes the quantile function of the standard normal distribution, and [math][\cdot]_{jj}[/math] is the [math]j[/math]-th diagonal element of a matrix.
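Putting the asymptotic distribution to work, the sketch below builds normal-approximation confidence intervals for each coefficient. It uses simulated data, an assumed 95% level, and SciPy's standard normal quantile function; all of these are illustrative choices rather than anything prescribed by the source:

```python
import numpy as np
from scipy.stats import norm   # standard normal quantile function

rng = np.random.default_rng(8)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, -1.5])
y = X @ beta_true + rng.normal(0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)           # Q_xx^{-1} with Q_xx = X'X
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / n

alpha = 0.05
q = norm.ppf(1 - alpha / 2)
half_width = q * np.sqrt(sigma2_hat * np.diag(XtX_inv))
for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:.3f}  CI = "
          f"[{beta_hat[j] - half_width[j]:.3f}, {beta_hat[j] + half_width[j]:.3f}]")
```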
Similarly, the least squares estimator for [math]\sigma^2[/math] is also consistent and asymptotically normal (provided that the fourth moment of [math]\epsilon_i[/math] exists) with limiting distribution

[[math]]
(\hat{\sigma}^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N} \left(0,\;\operatorname{E}\left[\varepsilon_i^4\right] - \sigma^4\right).
[[/math]]
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose [math]x_0[/math] is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity [math]y_0 = x_0^\mathrm{T} \beta[/math], whereas the predicted response is [math]\hat{y}_0 = x_0^\mathrm{T} \hat\beta[/math]. Clearly the predicted response is a random variable; its distribution can be derived from that of [math]\hat{\beta}[/math]:

[[math]]
\left(\hat{y}_0 - y_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\sigma^2 x_0^\mathrm{T} Q_{xx}^{-1} x_0\right),
[[/math]]

which allows confidence intervals for the mean response [math]y_0[/math] to be constructed:

[[math]]
y_0 \in \bigg[ x_0^\mathrm{T} \hat\beta - Q\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{\hat\sigma^2\, x_0^\mathrm{T} Q_{xx}^{-1} x_0},\ \ x_0^\mathrm{T} \hat\beta + Q\!\left(1-\tfrac{\alpha}{2}\right)\sqrt{\hat\sigma^2\, x_0^\mathrm{T} Q_{xx}^{-1} x_0}\, \bigg]
[[/math]]

at the [math]1-\alpha[/math] confidence level.
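A corresponding sketch for a confidence interval on the mean response at a new point [math]x_0[/math] (again with simulated data and an assumed 95% level, as an illustration only):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 150
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])
y = X @ np.array([2.0, 0.7]) + rng.normal(0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / n

x0 = np.array([1.0, 3.0])                          # new point (intercept term included)
y0_hat = x0 @ beta_hat                             # predicted mean response
se = np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)       # its standard error
q = norm.ppf(0.975)
print(f"mean response CI at x0: [{y0_hat - q * se:.3f}, {y0_hat + q * se:.3f}]")
```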
References
- Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares"
- Bretscher, Otto (1995). Linear Algebra With Applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
- Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474.
- Britannica, "Least squares method"
- Studies in the History of Probability and Statistics. XXIX: The Discovery of the Method of Least Squares R. L. Plackett
- 6.0 6.1 6.2 Dekking, Michel; et al. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. London: Springer. ISBN 978-1-85233-896-1. OCLC 262680588.
- For a good introduction to error-in-variables, please see Fuller, W. A. (1987). Measurement Error Models. John Wiley & Sons. ISBN 978-0-471-86187-4.
- "Strong consistency of least squares estimates in multiple regression" (1978). PNAS 75 (7): 3034–3036. doi: . PMID 16592540. Bibcode: 1978PNAS...75.3034L.
- Margenau, Henry; Murphy, George Moseley (1956). The Mathematics of Physics and Chemistry. Princeton: Van Nostrand.
- Gans, Peter (1992). Data fitting in the Chemical Sciences. New York: Wiley. ISBN 978-0-471-93412-7.
- Deming, W. E. (1943). Statistical adjustment of Data. New York: Wiley.
- Goldberger, Arthur S. (1964). "Classical Linear Regression". Econometric Theory. New York: John Wiley & Sons. pp. 158. ISBN 0-471-31101-4.
- Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 15.
- Hayashi (2000, page 18)
- Ghilani, Charles D.; Paul r. Wolf, Ph. D. (12 June 2006). Adjustment Computations: Spatial Data Analysis. ISBN 9780471697282.
- Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more. ISBN 9783211730171.
- Xu, Guochang (5 October 2007). GPS: Theory, Algorithms and Applications. ISBN 9783540727156.
- 18.0 18.1 Hayashi (2000, page 19)
- Julian Faraway (2000), Practical Regression and Anova using R
- Kenney, J.; Keeping, E. S. (1963). Mathematics of Statistics. van Nostrand. p. 187.
- Zwillinger, D. (1995). Standard Mathematical Tables and Formulae. Chapman&Hall/CRC. p. 626. ISBN 0-8493-2479-3.
- Hayashi (2000, page 20)
- Hayashi (2000, page 5)
- Akbarzadeh, Vahab (7 May 2014). "Line Estimation".
- Hayashi (2000, page 49)
- 26.0 26.1 Hayashi (2000, page 52)
- Hayashi (2000, page 7)
- 28.0 28.1 Hayashi (2000, page 10)
- Hayashi (2000, page 34)
- "Assumptions of multiple regression: Correcting two misconceptions" (2013). Practical Assessment, Research & Evaluation 18 (11).
- Hayashi (2000, pages 27, 30)
- 32.0 32.1 32.2 Hayashi (2000, page 27)
- Amemiya, Takeshi (1985). Advanced Econometrics. Harvard University Press. p. 13. ISBN 9780674005600.
- Amemiya (1985, page 14)
- Rao, C. R. (1973). Linear Statistical Inference and its Applications (Second ed.). New York: J. Wiley & Sons. p. 319. ISBN 0-471-70823-2.
- Amemiya (1985, page 20)
- Amemiya (1985, page 27)
Wikipedia References
- Wikipedia contributors. "Least Squares". Wikipedia. Wikipedia. Retrieved 17 August 2022.
- Wikipedia contributors. "Linear least squares". Wikipedia. Wikipedia. Retrieved 17 August 2022.
- Wikipedia contributors. "Ordinary least squares". Wikipedia. Wikipedia. Retrieved 17 August 2022.