Linear least squares/Proposed


Linear least squares/Proposed

Linear least squares is an important computational problem, that arises primarily in applications when it is desired to fit a linear mathematical model to observations obtained from experiments. Mathematically, it can be stated as the problem of finding an approximate solution of an overdetermined system of equations.

Linear least square problems admits a closed-form solution, in contrast to non-linear least squares problems, which have to be solved by an iterative procedure.

Problem statement and solutions

Consider an overdetermined system

:sum_{j=1}^{n} X_{ij}eta_j = y_i, (i=1, 2, dots, m),

of m linear equations in n unknowns, eta_1, eta_2, dots, eta_n.. Such a system usually has no solution, and the goal is then to find the numbers eta_j which fit "best" the equations, in the sense of minimizing the sum of squares of differences between the left- and right-hand sides of the equations.

The primary application of linear least squares is in data fitting. Given a set of "m" data points consisting of experimentally measured values, y_1, y_2,dots, y_m taken at "m" values of an independent variable, x_1, x_2,dots, x_m, (x_i may be a scalar or a vector) it is desired to find a model function y=f(x, oldsymbol eta), with oldsymbol eta = (eta_1, eta_2, dots, eta_n), that fits best the data. The model function is assumed to be linear in the parameters eta_j, so

:f(x, oldsymbol eta) = sum_{j=1}^{n} eta_j phi_j(x).

Here, the functions phi_j may be nonlinear in the variable x.

A best fit is realized when each difference between an observed value and the value calculated for the model is made as small as possible by varying the parameters. The difference is known as a residual, "r". :r_i= y_i - f(x_i, oldsymbol eta), (i=1, 2, dots, m) However, there are more residuals than parameters, so the parameters are overdetermined and no set of parameter values exists that can make the residuals equal to zero. In the least squares method the criterion chosen for best fit is that the sum of squared residuals

:S=sum_{i=1}^{m}r_i^2

is minimized. The problem then reduces to the overdetermined linear system mentioned earlier, with X_{ij}=phi_j(x_i).

The justification for choosing this criterion is given in properties, below. There is a unique set of parameter values that corresponds to the minimum value of the sum of squared residuals.

Specific solution, straight line fitting, with example

For straight line fitting there are only two parameters. This means that a complete algebraic solution may be worked out with relative ease. For the model:f(x_i,oldsymbol eta)=alpha+eta x_ithe normal equations (for derivation see below) are:left( sum 1^2 ight) alpha+ left( sum x_i ight) eta=sum y_i:left( sum x_i ight) alpha+ left( sum x_i^2 ight) eta=sum x_i y_i.All the summations go from "i=1" to "m". Each summation can be represented by a single symbol: sum 1^2=m: sum x_i=S_x: sum y_i=S_y: sum x_i^2=S_{x^2}: sum x_i y_i=S_{xy}.In terms of these symbols the normal equations become:m alpha+ S_x eta=S_y,:S_x alpha+ S_{x^2} eta=S_{xy},and the solution, by Cramer's rule is:hatalpha = frac {S_{x^2} S_y - S_x S_{xy {D} ,:hateta = frac {m S_{xy} - S_x S_y} {D} ,:D=m S_{x^2} - (S_x)^2These expressions have been used in hand calculators because each time a data point is added or removed, the five sums are adjusted and the parameters are recalculated, only seven operations in all. [Since S_x=mar x and S_y=mar y an alternative expression can be given for the slope.:hat{eta}=frac{sum(x_i-ar{x})(y_i-ar{y})}{sum(x_i-ar{x})^2}=frac{sum(x_i y_i -ar x y_i -ar y x_i+ar x ar y)}{sum(x_i^2-2x_iar{x}+ar x^2)}=frac{mS_{xy} -S_xS_y}{mS_{x^2}-(S_x)^2}Also, from the first normal equation, m alpha+S_xeta=S_y, hatalpha= ar y-ar x hateta] The standard deviations of the parameter estimates (often called their standard errors) are:sigma(alpha)=sqrt{frac{S}{m-2}frac{S_{x^2{D:sigma(eta)=sqrt{frac{S}{m-2}frac{m}{DThe correlation coefficient between the parameter estimates is : ho_{12} = frac{-S_x}{sqrt{mS_{x^2}

Example

With a set of observed data points y=2,3,3,4 obtained with the independent variable, "x" at values -1,0,2,4. :m=4!:sum x=5!:sum y=12!:sum x^2=21!:sum xy=20!:D=59!

:alpha=frac{21 imes 12-5 imes 20}{59}=frac{152}{59}:eta=frac{4 imes 20 - 5 imes 12}{59}=frac{20}{59}

Now, the residuals are calculated:egin{matrix}r_1=&-0.24 \r_2=&0.42 \r_3=&-0.25 \r_4=&0.07 \ end{matrix}

and S=0.305. After calculating the standard deviations the final result is obtained.:alpha =2.6 pm 0.2:eta =0.3 pm 0.1Note that the error is only quoted to one significant digit.

Normal equations method

"S" is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of "S" with respect to the parameters. :frac{partial S}{partial eta_j}=2sum_i r_ifrac{partial r_i}{partial eta_j}=0 (j=1,2,dots, n).The gradient equations are a set of "n" simultaneous equations in the "n" parameters. They are solved using the methods of linear algebra. Since r_i= y_i - sum_{j=1}^{n} X_{ij}eta_j, the derivatives are:frac{partial r_i}{partial eta_j}=-X_{ij}.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives:frac{partial S}{partial eta_j}=-2sum_{i=1}^{m}X_{ij} left( y_i-sum_{k=1}^{n} X_{ik}eta_k ight)=0.

Upon rearrangement, the "n" simultaneous linear equations, the normal equations:sum_{i=1}^{m}sum_{k=1}^{n} X_{ij}X_{ik}hat eta_k=sum_{i=1}^{m} X_{ij}y_i (j=1,2,dots, n),

are obtained. The normal equations are written in matrix notation as:mathbf{left(X^TX ight)hat oldsymbol eta=X^Ty}.Solution of the normal equations yields the least squares estimators,hat oldsymbol eta, of the parameter values.

General solutionAlthough the algebraic solution of the normal equations can be written as:mathbf{ hat oldsymboleta=left(X^TX ight)^{-1}X^Ty}it is not good practice to invert the normal equations matrix. An exception occurs in numerical smoothing and differentiation where an analytical expression is required.

If the matrix mathbf{X^TX} is well-conditioned and positive definite, that is, it has full rank, the normal equations can be solved directly by using the Cholesky decomposition mathbf{X^TX=R^TR}, where R is an upper triangular matrix, giving: mathbf{ R^T R hat oldsymboleta = X^Ty}. The solution is obtained in two stages, a forward substitution, mathbf{R^Tz=X^Ty}, followed by a backward substitution mathbf{Rhat oldsymboleta=z}. Both subtitutions are facilitated by the triangular nature of R.

See example of linear regression for a worked-out numerical example with three parameters.

Orthogonal decomposition methods

Orthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable.

The extra stability results from not having to form the product mathbf{X^TX}.The residuals are written in matrix notation as:mathbf{r=y-Xoldsymboleta}The matrix X is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process. :mathbf{X=QR}where Q is an orthogonal m imes m matrix and R is an m imes n matrix which is partitioned into a n imes n block, mathbfR_n, and a m-n imes n zero block. mathbfR_n is upper triangular.:mathbf{R}= egin{bmatrix}mathbf{R}_n \mathbf{0}end{bmatrix}The residual vector is left-multiplied by mathbf {Q^T}. :mathbf{Q^Tr=Q^T y -left(Q^TQ ight)R oldsymboleta}= egin{bmatrix}mathbf{left(Q^T y ight)}_n -mathbf{R}_n oldsymboleta \mathbf{left(Q^T y ight)}_{m-n}end{bmatrix}= egin{bmatrix}mathbf{U}\mathbf{L}end{bmatrix}The sum of squares of the transformed residuals, S=mathbf{r^T Q Q^Tr}, is the same as before, S=mathbf{r^Tr} because Q is orthogonal.:S=mathbf{U^TU+L^TL} The minimum value of "S" is attained when the upper block, U, is zero. Therefore the parameters are found by solving:mathbf{R}_n hatoldsymboleta =mathbf{left(Q^T y ight)}_nThese equations are easily solved as mathbf{R}_n is upper triangular.

An alternative decomposition of X is the singular value decomposition (SVD) [C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice-Hall,1974] :mathbf{ X = SSigma V^*}.This is effectively another kind of orthogonal decomposition as both U and V are orthogonal. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, mathbf{X^TX}, is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured using the truncated SVD approach, giving a more stable and exact answer, by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related to factor analysis.

Weighted linear least squares

When the observations are not equally reliable, a weighted sum of squares:S=sum_{i=1}^{m}W_{ii}r_i^2may be minimized.

Each element of the diagonal weight matrix, W should,ideally, be equal to the reciprocal of the variance of the measurement. [This implies that the observations are uncorrelated. If the observations are correlated, the expression:S=sum_k sum_j r_k W_{kj} r_j,applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.] The normal equations are then:mathbf{left(X^TWX ight)hat oldsymbol eta=X^TWy}.

Properties of the least-squares estimators

The gradient equations at the minimum can be written as:mathbf{(y-Xhatoldsymboleta)X}=0A geometrical interpretation of these equations is that the vector of residuals, mathbf{y-Xhatoldsymboleta} is orthogonal to the column space of mathbf{X}, since the dot product mathbf{(y-Xhatoldsymboleta). Xv} is equal to zero for "any" conformal vector, mathbf{v}. This means that mathbf{y}-mathbf{X}oldsymbol hat eta is the shortest of all possible vectors mathbf{y}-mathbf{X}oldsymbol eta, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.

If the experimental errors, epsilon ,, are uncorrelated, have a mean of zero and a constant variance, sigma, the Gauss-Markov theorem states that the least-squares estimator, hat eta, has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, "the distribution function of the errors need not be a normal distribution".

For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.

However, in the case that the experimental errors do belong to a Normal distribuition, the least-squares estimator is also a maximum likelihood estimator. [H. Margenau and G.M. Murphy, The Mathematics of Physics and Chemistry, Van Nostrand, 1943, 1956]

These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.

Limitations

An assumption underlying the treatment given above is that the independent variable, "x", is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares also known as "Errors-in-variables model", or "Rigorous least squares", should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.P. Gans, Data fitting in the Chemical Sciences, Wiley, 1992] [W.E. Deming, Statistical adjustment of Data, Wiley, 1943]

In some cases the (weighted) normal equations matrix mathbf{X^TX} is ill-conditioned; this occurs when the measurements have only a marginal effect on one or more of the estimated parameters.When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermode matrices become increasingly ill-conditioned as the order of the matrix increases.] In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called Tikhonov regularization. If further information about the parameters is known, for example, a range of possible values of x, then minimax techniques can also be used to increase the stability of the solution.

Another drawback of the least squares estimator is the fact that the norm of the residuals, |mathbf{y-Xoldsymboleta}| is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter mathbf{oldsymboleta}, e.g., a small value of |oldsymboleta-hatoldsymboleta|. However, since oldsymboleta is unknown, this quantity cannot be directly minimized. If a prior probability on oldsymboleta is known, then a Bayes estimator can be used to minimize the mean squared error, E left{ | oldsymboleta - hatoldsymboleta |^2 ight} . The least squares method is often applied when no prior is known. Surprisingly, however, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James-Stein estimator.

Parameter errors, correlation and confidence limits

The parameter values are linear combinations of the observed values:mathbf{hat eta=(X^TWX)^{-1}X^TWy}Therefore an expression for the errors on the parameter can be obtained by error propagation from the errors on the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by Meta. Then,:mathbf{M^eta=(X^TWX)^{-1}X^TW M W^TX(X^TWX)^{-1When mathbf{W=M^{-1, this simplifies to:mathbf{M^eta=(X^TWX)^{-1.

When unit weights are used (mathbf{W=I, hat eta=(X^TX)^{-1}X^Ty}) it is implied that the experimental errors are uncorrelated and all equal: mathbf{M}=sigma^2 mathbf{I}, where sigma^2, is known as the variance of an observation of unit weight, and mathbf{I} is an identity matrix. In this case sigma^2, is approximated by frac{S}{m-n}, where "S" is the minimum value of the objective function:mathbf{M^eta=}frac{S}{m-n}mathbf{(X^TX)^{-1.In all cases, the variance of the parameter eta_i is given by M^eta_{ii} and the covariance between parameters eta_i and eta_j is given by M^eta_{ij}. Standard deviation is the square root of variance and the correlation coefficient is given by ho_{ij} = M^eta_{ij}/sigma_i/sigma_j. These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors which, by definition, cannot be quantified.Note that even though the observations may be un-correlated, the parameters are always correlated.

It is often "assumed", for want of any concrete evidence, that the error on a parameter belongs to a Normal distribution with a mean of zero and standard deviation sigma. Under that assumption the following confidence limits can be derived.:68% confidence limits, hat eta pm sigma:95% confidence limits, hat eta pm 2sigma:99% confidence limits, hat eta pm 2.5sigmaThe assumption is not unreasonable when "m>>n". If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with "m-n" degrees of freedom. When "m>>n" Student's t-distribution approximates to a Normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error. [J. Mandel, The Statistical Analysis of Experimental Data, Interscience, 1964]

When the number of observations is relatively small, Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.

Residual values and correlation

The residuals are related to the observations by:mathbf{hat r=y-X hat eta=y-X left(X^TWX ight)^{-1}X^T y}The symmetric, idempotent matrix mathbf{X left(X^TWX ight)^{-1}X^T} is known in the statistics literature as the hat matrix, mathbf{H}. Thus,:mathbf{hat r=left(I-H ight) y}where I is an identity matrix. The variance-covariance matrice of the residuals, Mr is given by:mathbf{M^r=left(I-H ight) M left(I-H ight)}.This shows that even though the observations may be uncorrelated, the residuals are always correlated.

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals, [K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate analysis, Academic Press, 1979] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

Objective function

The objective function can be written as:S=mathbf{ y^T(I-H)^T(I-H)y=y^T(I-H)y}since mathbf{ (I-H)} is also symmetric and idempotent. It can be shown from this, [W. C. Hamilton, Statistics in Physical Science, The Ronald Press, New York, 1964] that the expected value of "S" is "m-n". Note, however, that this is true only if the weights have been assigned correctly. If unit weights are assumed, the expected value of "S" is (m-n)sigma^2, where sigma^2 is the variance of an observation.

If it is assumed that the residuals belong to a Normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a Chi-square (chi ^2) distribution with "m-n" degrees of freedom. Some illustrative percentile values of chi ^2 are given in the following table. [M.R. Spiegel, Probability and Statistics, Schaum's Outline Series, McGraw-Hill 1982] :These values can be used for a statistical criterion as to the goodness-of-fit. When unit weights are used, the numbers should be divided by the variance of an observation.

Applications

*Polynomials in an independent variable, "x"
** Straight line: f(x_i, oldsymbol eta)=alpha +eta x. [F.S. Acton, Analysis of Straight-Line Data, Wiley, 1959]
** Quadratic: f(x_i, oldsymbol eta)=alpha + eta x +gamma x^2.
** Cubic, quartic and higher polynomials. For high-order polynomials the use of orthogonal polynomials is recommended. [P.G. Guest, Numerical Methods of Curve Fitting, Cambridge University Press, 1961.]
*Numerical smoothing and differentiation This is an application of polynomial fitting.
*Multinomials in more than one independent variable, including surface fitting
*Curve fitting with B-splines
*Chemometrics, Calibration curve, Standard addition, Gran plot, analysis of mixtures

Notes and references


*Cite book | author=Björck, Åke | authorlink= | coauthors= | title=Numerical methods for least squares problems | date=1996 | publisher=SIAM | location=Philadelphia | isbn=0-89871-360-9 | pages=

External links

* [http://mathworld.wolfram.com/LeastSquaresFitting.html Least Squares Fitting – From MathWorld]
* [http://mathworld.wolfram.com/LeastSquaresFittingPolynomial.html Least Squares Fitting-Polynomial – From MathWorld]


Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Non-linear least squares — is the form of least squares analysis which is used to fit a set of m observations with a model that is non linear in n unknown parameters (m > n). It is used in some forms of non linear regression. The basis of the method is to… …   Wikipedia

  • Least-squares spectral analysis — (LSSA) is a method of estimating a frequency spectrum, based on a least squares fit of sinusoids to data samples, similar to Fourier analysis. [cite book | title = Variable Stars As Essential Astrophysical Tools | author = Cafer Ibanoglu |… …   Wikipedia

  • least squares approximation — ▪ statistics       in statistics, a method for estimating the true value of some quantity based on a consideration of errors (error) in observations or measurements. In particular, the line (function) that minimizes the sum of the squared… …   Universalium

  • least squares method — Statistical method for finding a line or curve the line of best fit that best represents a correspondence between two measured quantities (e.g., height and weight of a group of college students). When the measurements are plotted as points on a… …   Universalium

  • Linear — The word linear comes from the Latin word linearis , which means created by lines .In advanced mathematics, a linear map or function f ( x ) is a function which satisfies the following two properties:* Additivity (also called the superposition… …   Wikipedia

  • Lack-of-fit sum of squares — In statistics, a sum of squares due to lack of fit, or more tersely a lack of fit sum of squares, is one of the components of a partition of the sum of squares in an analysis of variance, used in the numerator in an F test of the null hypothesis… …   Wikipedia

  • Problems in Latin squares — In mathematics, the theory of Latin squares is an active research area with many open problems. As in other areas of mathematics, such problems are often made public at professional conferences and meetings. Problems posed here appeared in, for… …   Wikipedia

  • Robust regression — In robust statistics, robust regression is a form of regression analysis designed to circumvent some limitations of traditional parametric and non parametric methods. Regression analysis seeks to find the effect of one or more independent… …   Wikipedia

  • Structural equation modeling — (SEM) is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. This definition of SEM was articulated by the geneticist Sewall Wright (1921),[1] the… …   Wikipedia

  • Scale-invariant feature transform — Feature detection Output of a typical corner detection algorithm …   Wikipedia