Least-squares estimation of linear regression coefficients

In parametric statistics, the least-squares estimator is often used to estimate the coefficients of a linear regression. The least-squares estimator optimizes a certain criterion (namely, it minimizes the sum of the squares of the residuals). In this article, after setting the mathematical context of linear regression, we will motivate the use of the least-squares estimator $\widehat{\theta}_{LS}$ and derive its expression (as seen for example in the article regression analysis):

$$\widehat{\theta}_{LS}=(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\vec{Y}$$

We conclude by giving some qualities of this estimator and a geometrical interpretation.


For $p\in\mathbb{N}^+$, let $Y$ be a random variable taking values in $\mathbb{R}$, which we call the observation.

We next define the function $\eta$, linear in $\theta$:

$$\eta(X;\theta)=\sum_{j=1}^p \theta_j X_j,$$

where:

* For $j\in\{1,\ldots,p\}$, $X_j$ is a random variable taking values in $\mathbb{R}$ and is called a factor, and
* $\theta_j$ is a scalar, for $j\in\{1,\ldots,p\}$, and $\theta^t=(\theta_1,\cdots,\theta_p)$, where $\theta^t$ denotes the transpose of the vector $\theta$.

Let $X^t=(X_1,\cdots,X_p)$. We can write $\eta(X;\theta)=X^t\theta$. Define the error to be:

$$\varepsilon(\theta)=Y-X^t\theta$$

We suppose that there exists a true parameter $\overline{\theta}\in\mathbb{R}^{p}$ such that $\mathbb{E}[\varepsilon(\overline{\theta})|X]=0$. This means that, given the random variables $(X_1,\cdots,X_p)$, the best prediction we can make of $Y$ is $\eta(X;\overline{\theta})=X^t\overline{\theta}$. Henceforth, $\varepsilon$ will denote $\varepsilon(\overline{\theta})$ and $\eta$ will represent $\eta(X;\overline{\theta})$.

Least-squares estimator

The idea behind the least-squares estimator is to see linear regression as an orthogonal projection. Let $F$ be the $L^2$-space of all random variables whose square has a finite Lebesgue integral. Let $G$ be the linear subspace of $F$ generated by $X_1,\cdots,X_p$ (supposing that $Y\in F$ and $(X_1,\cdots,X_p)\in F^p$). We show in this paragraph that the function $\eta$ is an orthogonal projection of $Y$ on $G$, and we construct the least-squares estimator.

Seeing linear regression as an orthogonal projection

We have $\mathbb{E}(Y|X)=\eta$, and $Y\mapsto\mathbb{E}(Y|X)$ is a projection, which means that $\eta$ is a projection of $Y$ on $G$. What is more, this projection is an orthogonal one.

To see this, we can build a scalar product in $F$: for all couples of random variables $X,Y\in F$, we define $\langle X,Y\rangle_2:=\mathbb{E}[XY]$. It is indeed a scalar product because if $\|X\|_2^2=0$, then $X=0$ almost everywhere (where $\|X\|_2^2:=\langle X,X\rangle_2$ is the squared norm corresponding to this scalar product).

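For all $1\leq j\leq p$,

$$\langle\varepsilon,X_j\rangle_2=\mathbb{E}[\varepsilon X_j]=\mathbb{E}\bigl[\mathbb{E}[\varepsilon X_j|X]\bigr]=\mathbb{E}\bigl[X_j\,\mathbb{E}[\varepsilon|X]\bigr]=0.$$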

Therefore, $\varepsilon$ is orthogonal to any $X_j$ and hence to the whole of the subspace $G$, which means that $\eta$ is a projection of $Y$ on $G$, orthogonal with respect to the scalar product we have just defined. We have therefore shown:

$$\eta(X;\overline{\theta})=\arg\min_{f\in G}\|Y-f\|^2_2.$$
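
The step from orthogonality to the minimizing property can be spelled out with a Pythagoras-type identity. The following is a short sketch using only the objects defined above; $f$ stands for an arbitrary element of $G$.

```latex
% f is an arbitrary element of G; \eta and \varepsilon = Y - \eta are as defined in the text.
\begin{align*}
\|Y-f\|_2^2
  &= \|\varepsilon + (\eta - f)\|_2^2 \\
  &= \|\varepsilon\|_2^2 + 2\langle \varepsilon, \eta - f\rangle_2 + \|\eta - f\|_2^2 \\
  &= \|Y-\eta\|_2^2 + \|\eta - f\|_2^2
   \;\geq\; \|Y-\eta\|_2^2 .
\end{align*}
```

The cross term vanishes because $\eta-f\in G$ and $\varepsilon$ is orthogonal to $G$. Equality holds only when $\|\eta-f\|_2=0$, that is, when $f=\eta$ almost everywhere.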

Estimating the coefficients

If, for each $j\in\{1,\cdots,p\}$, we have a sample of size $n>p$, $(X^1_j,\cdots,X^n_j)$ of $X_j$, along with a vector $\vec{Y}$ of $n$ observations of $Y$, we can build an estimate of the coefficients of this orthogonal projection. To do this, we can use an estimate of the scalar product defined earlier.

For all couples of samples of size $n$, $\vec{U},\vec{V}\in F^n$, of random variables $U$ and $V$, we define $\langle\vec{U},\vec{V}\rangle:=\vec{U}^t\vec{V}$, where $\vec{U}^t$ is the transpose of the vector $\vec{U}$, and $\|\cdot\|:=\sqrt{\langle\cdot,\cdot\rangle}$. Note that the scalar product $\langle\cdot,\cdot\rangle$ is defined in $F^n$ and no longer in $F$.

Let us define the design matrix (or random design), an $n\times p$ random matrix:

$$\mathbf{X}=\left[\begin{matrix}X_1^1&\cdots&X^1_p\\\vdots&&\vdots\\X^n_1&\cdots&X^n_p\end{matrix}\right]$$
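
As a minimal sketch of these definitions, the following Python snippet (using NumPy) assembles a design matrix from one sample per factor and evaluates the empirical scalar product. The sizes, the random values and the names `factor_samples`, `inner_uv` and `norm_u` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3  # hypothetical sizes with n > p

# One sample of size n per factor X_j (made-up values for illustration).
factor_samples = [rng.normal(size=n) for _ in range(p)]

# Stack the samples as columns: row i holds (X_1^i, ..., X_p^i).
X = np.column_stack(factor_samples)

# Empirical scalar product <U, V> = U^t V and the norm it induces.
U, V = factor_samples[0], factor_samples[1]
inner_uv = U @ V
norm_u = np.sqrt(U @ U)

print(X.shape)  # (6, 3)
```

Each column of `X` is the sample of one factor, which matches the layout of the matrix above.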

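We can now adapt the minimization of the sum of the squared residuals: the least-squares estimator $\widehat{\theta}_{LS}$ will be the value, if it exists, of $\theta$ which minimizes $\|\mathbf{X}\theta-\vec{Y}\|^2$. At such a minimizer, the residual vector $\vec{\varepsilon}(\widehat{\theta}_{LS})=\vec{Y}-\mathbf{X}\widehat{\theta}_{LS}$ is orthogonal to every column of $\mathbf{X}$. Therefore, $\langle\mathbf{X},\vec{\varepsilon}(\widehat{\theta}_{LS})\rangle=\mathbf{X}^t(\vec{Y}-\mathbf{X}\widehat{\theta}_{LS})=0$.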

This yields $\mathbf{X}^t\mathbf{X}\widehat{\theta}_{LS}=\mathbf{X}^t\vec{Y}$. If $\mathbf{X}$ is of full rank, then so is $\mathbf{X}^t\mathbf{X}$. In that case we can compute the least-squares estimator explicitly by inverting the $p\times p$ matrix $\mathbf{X}^t\mathbf{X}$:

$$\widehat{\theta}_{LS}=(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\vec{Y}$$
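
The closed form can be checked numerically. The sketch below is an illustration on simulated data, not a reference implementation: the sample size, the value of `theta_true` and the noise level are all assumptions. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, then compares the result with NumPy's general solver `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical data for illustration: n = 200 samples, p = 3 factors.
rng = np.random.default_rng(0)
n, p = 200, 3
theta_true = np.array([1.5, -2.0, 0.5])             # assumed "true" parameter
X = rng.normal(size=(n, p))                         # design matrix, one row per sample
Y = X @ theta_true + rng.normal(scale=0.1, size=n)  # observations with centered noise

# The closed form requires X to be of full rank.
rank = np.linalg.matrix_rank(X)

# Normal equations from the text: (X^t X) theta = X^t Y.
theta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's general least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The residual vector should be orthogonal to every column of X.
residual = Y - X @ theta_ls
orthogonality = X.T @ residual

print("rank of X:", rank)
print("theta_ls:", theta_ls)
print("max |X^t residual|:", np.max(np.abs(orthogonality)))
```

Solving the linear system is usually preferred over computing $(\mathbf{X}^t\mathbf{X})^{-1}$ directly, since it is cheaper and numerically more stable. When $\mathbf{X}$ is badly conditioned, a solver such as `np.linalg.lstsq` is the safer choice.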

Qualities and geometrical interpretation

Qualities of this estimator

Not only is the least-squares estimator easy to compute, but under the Gauss-Markov assumptions, the Gauss-Markov theorem states that the least-squares estimator is the best linear unbiased estimator (BLUE) of $\overline{\theta}$.

The vector of errors $\vec{\varepsilon}=\vec{Y}-\mathbf{X}\overline{\theta}$ is said to fulfil the Gauss-Markov assumptions if:

* $\mathbb{E}\vec{\varepsilon}=\vec{0}$
* $\mathbb{V}\vec{\varepsilon}=\sigma^2\mathbf{I}_n$ (uncorrelated but not necessarily independent; homoscedastic but not necessarily identically distributed)

where $\sigma^2<+\infty$ and $\mathbf{I}_n$ is the $n\times n$ identity matrix.
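
Unbiasedness under these assumptions can be illustrated by simulation. The following sketch assumes a fixed design, a made-up parameter `theta_bar`, and centered uniform (hence non-Gaussian) errors with variance $\sigma^2$. It compares the average of many estimates with `theta_bar`, and their empirical covariance with $\sigma^2(\mathbf{X}^t\mathbf{X})^{-1}$, the standard covariance of the estimator for a fixed design under the two conditions above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 2, 1.0
theta_bar = np.array([2.0, -1.0])   # assumed true parameter (illustrative)
X = rng.normal(size=(n, p))         # design held fixed across replications

n_rep = 5000
estimates = np.empty((n_rep, p))
half_width = sigma * np.sqrt(3.0)   # uniform on [-a, a] has variance a^2 / 3
for r in range(n_rep):
    # Centered, homoscedastic, uncorrelated errors that are not Gaussian.
    eps = rng.uniform(-half_width, half_width, size=n)
    Y = X @ theta_bar + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

mean_estimate = estimates.mean(axis=0)           # should be close to theta_bar
empirical_cov = np.cov(estimates, rowvar=False)  # covariance across replications
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print("mean estimate:", mean_estimate)
```

The simulation only illustrates unbiasedness and the covariance formula. It does not demonstrate the optimality part of the theorem, which is a statement about all linear unbiased estimators.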

This decisive advantage has led to a sometimes abusive use of least squares. The optimality of least squares depends on the Gauss-Markov assumptions being fulfilled, and applying this method in a situation where these conditions are not met can lead to inaccurate results. For example, in the study of time series, it is often difficult to assume that the residuals are uncorrelated.

Geometrical interpretation

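The situation described by the linear regression problem can be seen geometrically as follows: the fitted vector $\mathbf{X}\widehat{\theta}_{LS}$ is the orthogonal projection of $\vec{Y}$ onto the subspace spanned by the columns of $\mathbf{X}$, so the residual vector $\vec{Y}-\mathbf{X}\widehat{\theta}_{LS}$ is perpendicular to that subspace.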

The least-squares estimator is also an M-estimator of $\rho$-type for $\rho(r):=\frac{r^2}{2}$.
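
To see why this choice of $\rho$ recovers least squares, the short derivation below writes out the M-estimation criterion and its first-order condition. The notation $Y^i$ for the $i$-th entry of $\vec{Y}$ and $X^i$ for the $i$-th row of $\mathbf{X}$ (written as a column vector) is an assumption introduced only for this sketch.

```latex
% Notation assumed for this sketch: Y^i is the i-th entry of \vec{Y},
% and X^i = (X^i_1, \dots, X^i_p)^t is the i-th row of \mathbf{X} written as a column.
\begin{align*}
\sum_{i=1}^n \rho\bigl(Y^i - (X^i)^t\theta\bigr)
  &= \frac{1}{2}\sum_{i=1}^n \bigl(Y^i - (X^i)^t\theta\bigr)^2
   = \frac{1}{2}\,\|\vec{Y}-\mathbf{X}\theta\|^2, \\
\psi(r) := \rho'(r) &= r, \\
\sum_{i=1}^n \psi\bigl(Y^i - (X^i)^t\theta\bigr)\,X^i = 0
  &\iff \mathbf{X}^t(\vec{Y}-\mathbf{X}\theta) = 0 .
\end{align*}
```

The criterion is therefore half the least-squares objective, and its first-order condition is exactly the system $\mathbf{X}^t\mathbf{X}\theta=\mathbf{X}^t\vec{Y}$ solved by $\widehat{\theta}_{LS}$.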

Wikimedia Foundation. 2010.
