Robust regression

In robust statistics, robust regression is a form of regression analysis designed to circumvent some limitations of traditional parametric and non-parametric methods. Regression analysis seeks to find the effect of one or more independent variables upon a dependent variable. Certain widely used methods of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results if those assumptions are not true; thus ordinary least squares is said to be not robust to violations of its assumptions. Robust regression methods are designed to be not overly affected by violations of assumptions by the underlying data-generating process.

In particular, least squares estimates for regression models are highly non-robust to outliers. While there is no precise definition of an outlier, outliers are observations which do not follow the pattern of the other observations. This is not normally a problem if the outlier is simply an extreme observation drawn from the tail of a normal distribution, but if the outlier results from non-normal measurement error or some other violation of standard ordinary least squares assumptions, then it compromises the validity of the regression results if a non-robust regression technique is used.

1 Applications
- 1.1 Heteroscedastic errors
- 1.2 Presence of outliers
2 History and unpopularity of robust regression
3 Methods for robust regression
- 3.1 Least squares alternatives
- 3.2 Parametric alternatives
4 Example: BUPA liver data
- 4.1 Outlier detection
5 See also
6 References
7 External links

Applications

Heteroscedastic errors

One instance in which robust estimation should be considered is when there is a strong suspicion of heteroscedasticity. In the homoscedastic model, it is assumed that the variance of the error term is constant for all values of x. Heteroscedasticity allows the variance to be dependent on x, which is more accurate for many real scenarios. For example, the variance of expenditure is often larger for individuals with higher income than for individuals with lower incomes. Software packages usually default to a homoscedastic model, even though such a model may be less accurate than a heteroscedastic model. One simple approach (Tofallis, 2008) is to apply least squares to percentage errors as this reduces the influence of the larger values of the dependent variable compared to ordinary least squares.

Presence of outliers

Another common situation in which robust estimation is used occurs when the data contain outliers. In the presence of outliers that do not come from the same data-generating process as the rest of the data, least squares estimation is inefficient and can be biased. Because the least squares predictions are dragged towards the outliers, and because the variance of the estimates is artificially inflated, the result is that outliers can be masked. (In many situations, including some areas of geostatistics and medical statistics, it is precisely the outliers that are of interest.)

Although it is sometimes claimed that least squares (or classical statistical methods in general) are robust, they are only robust in the sense that the type I error rate does not increase under violations of the model. In fact, the type I error rate tends to be lower than the nominal level when outliers are present, and there is often a dramatic increase in the type II error rate. The reduction of the type I error rate has been labelled as the conservatism of classical methods. Other labels might include inefficiency or inadmissibility.

History and unpopularity of robust regression

Despite their superior performance over least squares estimation in many situations, robust methods for regression are still not widely used. Several reasons may help explain their unpopularity (Hampel et al. 1986, 2005). One possible reason is that there are several competing methods and the field got off to many false starts. Also, computation of robust estimates is much more computationally intensive than least squares estimation; in recent years however, this objection has become less relevant as computing power has increased greatly. Another reason may be that some popular statistical software packages failed to implement the methods (Stromberg, 2004). The belief of many statisticians that classical methods are robust may be another reason.

Although uptake of robust methods has been slow, modern mainstream statistics text books often include discussion of these methods (for example, the books by Seber and Lee, and by Faraway; for a good general description of how the various robust regression methods developed from one another see Andersen's book). Also, modern statistical software packages such as R, Stata and S-PLUS include considerable functionality for robust estimation (see, for example, the books by Venables and Ripley, and by Marrona et al.). It is possible that these methods will come into wider use in the future.

Methods for robust regression

Least squares alternatives

The simplest methods of estimating parameters in a regression model that are less sensitive to outliers than the least squares estimates, is to use least absolute deviations. Even then, gross outliers can still have a considerable impact on the model, motivating research into even more robust approaches.

In 1973, Huber introduced M-estimation for regression (see robust statistics for additional details of M-estimation). The M in M-estimation stands for "maximum likelihood type". The method is robust to outliers in the response variable, but turned out not to be resistant to outliers in the explanatory variables (leverage points). In fact, when there are outliers in the explanatory variables, the method has no advantage over least squares.

In the 1980s, several alternatives to M-estimation were proposed as attempts to overcome the lack of resistance. See the book by Rousseeuw and Leroy for a very practical review. Least trimmed squares (LTS) is a viable alternative and is presently (2007) the preferred choice of Rousseeuw and Ryan (1997, 2008). The Theil–Sen estimator has a lower breakdown point than LTS but is statistically efficient and popular. Another proposed solution was S-estimation. This method finds a line (plane or hyperplane) that minimizes a robust estimate of the scale (from which the method gets the S in its name) of the residuals. This method is highly resistant to leverage points, and is robust to outliers in the response. However, this method was also found to be inefficient.

MM-estimation attempts to retain the robustness and resistance of S-estimation, whilst gaining the efficiency of M-estimation. The method proceeds by finding a highly robust and resistant S-estimate that minimizes an M-estimate of the scale of the residuals (the first M in the method's name). The estimated scale is then held constant whilst a close-by M-estimate of the parameters is located (the second M).

Parametric alternatives

Another approach to robust estimation of regression models is to replace the normal distribution with a heavy-tailed distribution. A t-distribution with between 4 and 6 degrees of freedom has been reported to be a good choice in various practical situations. Bayesian robust regression, being fully parametric, relies heavily on such distributions.

Under the assumption of t-distributed residuals, the distribution is a location-scale family. That is, $x \leftarrow (x-\mu)/\sigma$ . The degrees of freedom of the t-distribution is sometimes called the kurtosis parameter. Lange, Little and Taylor (1989) discuss this model in some depth from a non-Bayesian point of view. A Bayesian account appears in Gelman et al. (2003).

An alternative parametric approach is to assume that the residuals follow a mixture of normal distributions; in particular, a contaminated normal distribution in which the majority of observations are from a specified normal distribution, but a small proportion are from a normal distribution with much higher variance. That is, residuals have probability $1 - ε$ of coming from a normal distribution with variance $σ 2$ , where $ε$ is small, and probability $ε$ of coming from a normal distribution with variance $c σ 2$ for some $c > 1$

e i \sim(1 - ε) N (0,σ 2) + ε N (0, c σ 2).

Typically, $ε < 0.1$ . This is sometimes called the $ε$ -contamination model.

Parametric approaches have the advantage that likelihood theory provides an 'off the shelf' approach to inference (although for mixture models such as the $ε$ -contamination model, the usual regularity conditions might not apply), and it is possible to build simulation models from the fit. However, such parametric models still assume that the underlying model is literally true. As such, they do not account for skewed residual distributions or finite observation precisions.

Example: BUPA liver data

The BUPA liver data have been studied by various authors, including Breiman (2001). The data can be found via the classic data sets page and there is some discussion in the article on the Box-Cox transformation. A plot of the logs of ALT versus the logs of γGT appears below. The two regression lines are those estimated by ordinary least squares (OLS) and by robust MM-estimation. The analysis was performed in R using software made available by Venables and Ripley (2002).

The two regression lines appear to be very similar (and this is not unusual in a data set of this size). However, the advantage of the robust approach comes to light when the estimates of residual scale are considered. For ordinary least squares, the estimate of scale is 0.420, compared to 0.373 for the robust method. Thus, the relative efficiency of ordinary least squares to MM-estimation in this example is 1.266. This inefficiency leads to loss of power in hypothesis tests, and to unnecessarily wide confidence intervals on estimated parameters.

Outlier detection

Another consequence of the inefficiency of the ordinary least squares fit is that several outliers are masked. Because the estimate of residual scale is inflated, the scaled residuals are pushed closer to zero than when a more appropriate estimate of scale is used. The plots of the scaled residuals from the two models appear below. The variable on the x-axis is just the observation number as it appeared in the data set. Rousseeuw and Leroy (1986) contains many such plots.

The horizontal reference lines are at 2 and -2 so that any observed scaled residual beyond these boundaries can be considered to be an outlier. Clearly, the least squares method leads to many interesting observations being masked.

Whilst in one or two dimensions outlier detection using classical methods can be performed manually, with large data sets and in high dimensions the problem of masking can make identification of many outliers impossible. Robust methods automatically detect these observations, offering a serious advantage over classical methods when outliers are present.

References

Andersen, R. (2008). Modern Methods for Robust Regression. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-152.
Ben-Gal I., Outlier detection, In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers," Kluwer Academic Publishers, 2005, ISBN 0-387-24435-2.
Breiman, L. (2001). "Statistical Modeling: the Two Cultures". Statistical Science 16 (3): 199–231. doi:10.1214/ss/1009213725. JSTOR 2676681.
Faraway, J. J. (2004). Linear Models with R. Chapman & Hall/CRC.
Draper, David (1988). "Rank-Based Robust Analysis of Linear Models. I. Exposition and Review". Statistical Science 3 (2): 239–257. doi:10.1214/ss/1177012915. JSTOR 2245578.
McKean, Joseph W. (2004). "Robust Analysis of Linear Models". Statistical Science 19 (4): 562–570. doi:10.1214/088342304000000549. JSTOR 4144426.
Gelman, A.; J. B. Carlin, H. S. Stern and D. B. Rubin (2003). Bayesian Data Analysis (Second Edition). Chapman & Hall/CRC.
Hampel, F. R.; E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986, 2005). Robust Statistics: The Approach Based on Influence Functions. Wiley.
Lange, K. L.; R. J. A. Little and J. M. G. Taylor (1989). "Robust statistical modeling using the t-distribution". Journal of the American Statistical Association 84 (408): 881–896. doi:10.2307/2290063. JSTOR 2290063.
Maronna, R.; D. Martin and V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley.
Radchenko S.G. (2005). Robust methods for statistical models estimation: Monograph. (on russian language). Кiev: РР «Sanspariel» ISBN 966-96574-0-7. pp. 504.
Rousseeuw, P. J.; A. M. Leroy (1986, 2003). Robust Regression and Outlier Detection. Wiley.
Ryan, T. P. (1997, 2008). Modern Regression Methods. Wiley.
Seber, G. A. F.; A. J. Lee (2003). Linear Regression Analysis (Second Edition). Wiley.
Stromberg, A. J. (2004). "Why write statistical software? The case of robust statistical methods". Journal of Statistical Software.
Strutz, Tilo (2010). Data Fitting and Uncertainty - A practical introduction to weighted least squares and beyond. Vieweg+Teubner. ISBN 978-3-8348-1022-9.
Tofallis, Chris (2008). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods 7: 526–534. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1406472.
Venables, W. N.; B. D. Ripley (2002). Modern Applied Statistics with S. Springer.

External links

R programming wikibooks
Brian Ripley's robust statistics course notes.
Nick Fieller's course notes on Statistical Modelling and Computation contain material on robust regression.
Olfa Nasraoui's Overview of Robust Statistics
Why write statistical software? The case of robust statistical methods, A. J. Stromberg

Robust regression is available in many statistical software packages:

Robust Regression Modeling Package for the R programming language
the 'MASS' and 'robust' packages for the R programming language
S-PLUS statistical software package
GraphPad Prism can perform robust nonlinear regression.

Statistics

Descriptive statistics

Continuous data

Location	Mean (Arithmetic, Geometric, Harmonic) · Median · Mode

Dispersion	Range · Standard deviation · Coefficient of variation · Percentile · Interquartile range

Shape	Variance · Skewness · Kurtosis · Moments · L-moments

Count data

Index of dispersion

Summary tables

Grouped data · Frequency distribution · Contingency table

Dependence

Pearson product-moment correlation · Rank correlation (Spearman's rho, Kendall's tau) · Partial correlation · Scatter plot

Statistical graphics

Bar chart · Biplot · Box plot · Control chart · Correlogram · Forest plot · Histogram · Q-Q plot · Run chart · Scatter plot · Stemplot · Radar chart

Data collection

Designing studies	Effect size · Standard error · Statistical power · Sample size determination

Survey methodology	Sampling · Stratified sampling · Opinion poll · Questionnaire

Controlled experiment	Design of experiments · Factorial experiment · Randomized experiment · Random assignment · Replication · Blocking · Optimal design

Uncontrolled studies	Natural experiment · Quasi-experiment · Observational study

Statistical inference

Statistical theory	Sampling distribution · Sufficient statistic · Meta-analysis

Bayesian inference	Bayesian probability · Prior · Posterior · Credible interval · Bayes factor · Bayesian estimator · Maximum posterior estimator

Frequentist inference	Confidence interval · Hypothesis testing · Likelihood-ratio

Specific tests	Z-test (normal) · Student's t-test · F-test · Pearson's chi-squared test · Wald test · Mann–Whitney U · Shapiro–Wilk · Signed-rank · Kolmogorov–Smirnov test

General estimation	Mean-unbiased · Median-unbiased · Maximum likelihood · Method of moments · Minimum distance · Density estimation

Correlation and regression analysis

Correlation	Pearson product-moment correlation · Partial correlation · Confounding variable · Coefficient of determination

Regression analysis	Errors and residuals · Regression model validation · Mixed effects models · Simultaneous equations models

Linear regression	Simple linear regression · Ordinary least squares · General linear model · Bayesian regression

Non-standard predictors	Nonlinear regression · Nonparametric · Semiparametric · Isotonic · Robust

Generalized linear model	Exponential families · Logistic (Bernoulli) · Binomial · Poisson

Partition of variance	Analysis of variance (ANOVA) · Analysis of covariance · Multivariate ANOVA · Degrees of freedom

Categorical, multivariate, time-series, or survival analysis

Categorical data	Cohen's kappa · Contingency table · Graphical model · Log-linear model · McNemar's test

Multivariate statistics	Multivariate regression · Principal components · Factor analysis · Cluster analysis · Copulas

Time series analysis	Decomposition (Trend · Stationary process) · ARMA model · ARIMA model · Vector autoregression · Spectral density estimation

Survival analysis	Survival function · Kaplan–Meier · Logrank test · Failure rate · Proportional hazards models · Accelerated failure time model

Applications

Biostatistics	Bioinformatics · Biometrics · Clinical trials & studies · Epidemiology · Medical statistics · Pharmaceutical statistics

Engineering statistics	Methods engineering · Probabilistic design · Process & Quality control · Reliability · System identification

Social statistics	Actuarial science · Census · Crime statistics · Demography · Econometrics · National accounts · Official statistics · Population · Psychometrics

Spatial statistics	Cartography · Environmental statistics · Geographic information system · Geostatistics · Kriging

Category · Portal · Outline · Index

Least squares and regression analysis

Computational statistics

Least squares · Linear least squares · Non-linear least squares · Iteratively reweighted least squares

Correlation and dependence

Pearson product-moment correlation · Rank correlation (Spearman's rho, Kendall's tau) · Partial correlation · Confounding variable

Regression analysis

Ordinary least squares · Partial least squares · Total least squares · Ridge regression

Regression as a
statistical model

Linear regression	Simple linear regression · Ordinary least squares · Generalized least squares · Weighted least squares · General linear model

Predictor structure	Polynomial regression · Growth curve · Segmented regression · Local regression

Non-standard	Nonlinear regression · Nonparametric · Semiparametric · Robust · Quantile · Isotonic

Non-normal errors	Generalized linear model · Binomial · Poisson · Logistic

Decomposition of variance

Analysis of variance · Analysis of covariance · Multivariate AOV

Model exploration

Mallows' Cp · Stepwise regression · Model selection · Regression model validation

Background

Mean and predicted response · Gauss–Markov theorem · Errors and residuals · Goodness of fit · Studentized residual · Minimum mean-square error

Design of experiments

Response surface methodology · Optimal design · Bayesian design

Numerical approximation

Numerical analysis · Approximation theory · Numerical integration · Gaussian quadrature · Orthogonal polynomials · Chebyshev polynomials · Chebyshev nodes

Applications

Curve fitting · Calibration curve · Numerical smoothing and differentiation · System identification · Moving least squares

Regression analysis category - Statistics category · Statistics portal · Statistics outline · Statistics topics

Categories:

Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

Robust statistics — provides an alternative approach to classical statistical methods. The motivation is to produce estimators that are not unduly affected by small departures from model assumptions. Contents 1 Introduction 2 Examples of robust and non robust… … Wikipedia
Regression analysis — In statistics, regression analysis is a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (response variable) and of one or more independent variables (explanatory… … Wikipedia
Regression testing — is any type of software testing which seeks to uncover software regressions. Such regressions occur whenever software functionality that was previously working correctly stops working as intended. Typically regressions occur as an unintended… … Wikipedia
Regression toward the mean — In statistics, regression toward the mean (also known as regression to the mean) is the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on a second measurement, and a fact that may… … Wikipedia
Regression discontinuity design — In statistics, econometrics, epidemiology and related disciplines, a regression discontinuity design (RDD) is a design that elicits the causal effects of interventions by exploiting a given exogenous threshold determining assignment to treatment … Wikipedia
Outline of regression analysis — In statistics, regression analysis includes any technique for learning about the relationship between one or more dependent variables Y and one or more independent variables X. The following outline is an overview and guide to the variety of… … Wikipedia
Linear regression — Example of simple linear regression, which has one independent variable In statistics, linear regression is an approach to modeling the relationship between a scalar variable y and one or more explanatory variables denoted X. The case of one… … Wikipedia
Nonlinear regression — See Michaelis Menten kinetics for details In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or… … Wikipedia
Kernel-Regression — Unter Kernel Regression versteht man eine Reihe nichtparametrischer statistischer Methoden, bei denen die Abhängigkeit einer zufälligen Größe von Ausgangsdaten mittels Kerndichteschätzung geschätzt werden. Die Art der Abhängigkeit, dargestellt… … Deutsch Wikipedia
Unit-weighted regression — In statistics, unit weighted regression is perhaps the easiest form of multiple regression analysis, a method in which two or more variables are used to predict the value of an outcome. At a conceptual level, the example of weight loss can… … Wikipedia

Academic Dictionaries and Encyclopedias

Robust regression

Contents

Applications

Heteroscedastic errors

Presence of outliers

History and unpopularity of robust regression

Methods for robust regression

Least squares alternatives

Parametric alternatives

Example: BUPA liver data

Outlier detection

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Robust regression

Contents

Applications

Heteroscedastic errors

Presence of outliers

History and unpopularity of robust regression

Methods for robust regression

Least squares alternatives

Parametric alternatives

Example: BUPA liver data

Outlier detection

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Direct link