Box plot

In descriptive statistics, a boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries (the smallest observation, lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation). A boxplot may also indicate which observations, if any, might be considered outliers. The boxplot was invented in 1977 by the American statistician John Tukey.

Boxplots can be useful for displaying differences between populations without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and help identify outliers. Boxplots can be drawn either horizontally or vertically.

Construction

For a data set, one constructs a horizontal box plot in the following manner:
* Calculate the first quartile (x_{.25}), the median (x_{.50}) and the third quartile (x_{.75}).
* Calculate the interquartile range (IQR) by subtracting the first quartile from the third quartile: IQR = x_{.75} - x_{.25}.
* Construct a box above the number line bounded on the left by the first quartile (x_{.25}) and on the right by the third quartile (x_{.75}).
* Indicate where the median lies inside the box with a symbol or a line dividing the box at the median value.
* The mean value of the data can also be labeled with a point.
* Any observation that lies more than 1.5·IQR below the first quartile or more than 1.5·IQR above the third quartile is considered an outlier. Connect the smallest value that is not an outlier to the box with a horizontal line or "whisker". Optionally, also mark the position of this value more clearly using a small vertical line. Likewise, connect the largest value that is not an outlier to the box with a "whisker" (and optionally mark it with another small vertical line).
* Indicate outliers by open and closed dots. "Extreme" outliers, those which lie more than 3·IQR to the left of the first quartile or to the right of the third quartile, are indicated by a closed dot. "Mild" outliers, those which lie more than 1.5·IQR from the first or third quartile but are not extreme outliers, are indicated by an open dot. (Sometimes no distinction is made between "mild" and "extreme" outliers.)
* Add an appropriate label to the number line and title the boxplot.
* A boxplot may be constructed vertically instead of horizontally by interchanging "bottom" for "left", "top" for "right" and "vertical" for "horizontal" in the above description.
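
The construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `boxplot_stats` and the sample data are hypothetical, and the quartiles are computed with the standard library's `statistics.quantiles` using its "inclusive" method, which is just one of several quartile definitions in use.

```python
from statistics import quantiles


def boxplot_stats(data, k_mild=1.5, k_extreme=3.0):
    """Return the quantities needed to draw a Tukey-style box plot.

    Assumption: quartiles come from statistics.quantiles(method="inclusive");
    other quartile definitions can give slightly different values.
    """
    xs = sorted(data)
    q1, median, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1

    # Inner fences (1.5 * IQR) separate ordinary points from outliers;
    # outer fences (3 * IQR) separate mild from extreme outliers.
    inner_lo, inner_hi = q1 - k_mild * iqr, q3 + k_mild * iqr
    outer_lo, outer_hi = q1 - k_extreme * iqr, q3 + k_extreme * iqr

    # Whiskers end at the most extreme observations that are not outliers.
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    whiskers = (inside[0], inside[-1])

    mild = [x for x in xs
            if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi]
    extreme = [x for x in xs if x < outer_lo or x > outer_hi]

    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": whiskers, "mild": mild, "extreme": extreme,
    }


# Hypothetical sample, chosen only to exercise every branch.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30])
print(stats)
```

For this made-up sample the quartiles are 3.5, 6 and 8.5, so the IQR is 5, the whiskers end at 1 and 9, the value 17 is a mild outlier (beyond 16 but not beyond 23.5) and 30 is an extreme outlier.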

Example

A plain-text version might look like this:

                             +-----+-+
   *           o     |-------|     | |---
                             +-----+-+
 +---+---+---+---+---+---+---+---+---+---+---+---+   number line
 0   1   2   3   4   5   6   7   8   9   10  11  12

For this data set:
* smallest non-outlier observation = 5 (left "whisker"); the left "whisker" would have ended at 4 (Q1 - 1.5·IQR) had there been an observation with a value of 4
* lower quartile (Q1, x_{.25}) = 7
* median (Med, x_{.5}) = 8.5
* upper quartile (Q3, x_{.75}) = 9
* largest non-outlier observation = 10 (right "whisker")
* interquartile range, IQR = Q3 - Q1 = 2
* the value 3.5 is a "mild" outlier, between 1.5·IQR and 3·IQR below Q1
* the value 0.5 is an "extreme" outlier, more than 3·IQR below Q1
* the data are skewed to the left ("negatively skewed")
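
The classification of the two low values can be checked by hand from the quartiles listed above. The arithmetic below is only a worked restatement of those figures; the upper thresholds are included for completeness.

```latex
\begin{align*}
\mathrm{IQR} &= Q_3 - Q_1 = 9 - 7 = 2 \\
Q_1 - 1.5\cdot\mathrm{IQR} &= 7 - 3 = 4 \\
Q_1 - 3\cdot\mathrm{IQR} &= 7 - 6 = 1 \\
Q_3 + 1.5\cdot\mathrm{IQR} &= 9 + 3 = 12 \\
Q_3 + 3\cdot\mathrm{IQR} &= 9 + 6 = 15
\end{align*}
```

Since 1 ≤ 3.5 < 4, the value 3.5 is a mild outlier, and since 0.5 < 1, the value 0.5 is an extreme outlier. No observation exceeds 12, so there are no outliers on the upper side and the right whisker ends at the largest observation, 10.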

The horizontal lines (the "whiskers") extend to at most 1.5 times the box width (the interquartile range) from either or both ends of the box. They must end at an observed value, thus connecting all the values outside the box that are not more than 1.5 times the box width away from the box. Three times the box width marks the boundary between "mild" and "extreme" outliers. In this boxplot, "mild" and "extreme" outliers are differentiated by open and closed dots, respectively.

There are alternative implementations of this detail of the box plot in various software packages, such as whiskers extending to at most the 5th and 95th (or some more extreme) percentiles. Such approaches do not conform to Tukey's definition, with its emphasis on the median in particular and counting methods in general, and they tend to produce "outliers" for all data sets larger than ten, no matter what the shape of the distribution. There are also several minor variations on how to calculate the quartiles (see also quantile), and Tukey (1977) originally proposed using another variant that he named "hinges". The difference between the definitions is no more than the difference between two consecutive data values, however, so it is always dwarfed by sampling variability and so is of little practical consequence.
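
A small Python sketch can illustrate why percentile-based whiskers flag points regardless of the distribution's shape. The function names and the evenly spaced sample are hypothetical, and the percentiles are computed with `statistics.quantiles` using its "inclusive" method, which is an assumption rather than the rule used by any particular package.

```python
from statistics import quantiles


def outside_percentile_whiskers(data, lower_pct=5, upper_pct=95):
    """Points beyond whiskers drawn at the given percentiles."""
    cuts = quantiles(data, n=100, method="inclusive")  # 1st..99th percentiles
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [x for x in data if x < lo or x > hi]


def outside_tukey_fences(data, k=1.5):
    """Points more than k * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]


# Hypothetical, perfectly regular sample: nothing here is unusual.
evenly_spaced = list(range(1, 21))

print(outside_percentile_whiskers(evenly_spaced))  # flags the two end points
print(outside_tukey_fences(evenly_spaced))         # flags nothing
```

With 20 evenly spaced values the 5th and 95th percentiles fall at 1.95 and 19.05, so the smallest and largest observations are drawn as "outliers" even though the sample contains nothing unusual, whereas the 1.5·IQR rule flags none of them.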

Alternative forms

Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25th and 75th percentile (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several possible alternative values, among them:

* the minimum and maximum of all the data
* the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile
* one standard deviation above and below the mean of the data
* the 9th percentile and the 91st percentile
* the 2nd percentile and the 98th percentile

Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.

Some box plots include an additional dot or cross plotted inside the box to represent the mean of the data in addition to the median.

On some box plots a crosshatch is placed on each whisker, before the end of the whisker.

Fairly rarely, box plots can be presented with no whiskers at all.

Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.

The unusual percentiles 2%, 9%, 91% and 98% are sometimes used for whisker cross-hatches and whisker ends to show the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be approximately equally spaced.
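
The spacing claim for the seven-number summary can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the function and variable names are hypothetical and chosen for illustration.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary, as listed in the text.
SEVEN_NUMBER_PERCENTILES = (2, 9, 25, 50, 75, 91, 98)


def normal_seven_number_summary(mu=0.0, sigma=1.0):
    """Locations of the seven marks for a normal distribution."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(p / 100) for p in SEVEN_NUMBER_PERCENTILES]


marks = normal_seven_number_summary()
# Distance between each pair of neighbouring marks.
gaps = [b - a for a, b in zip(marks, marks[1:])]

print([round(m, 3) for m in marks])
print([round(g, 3) for g in gaps])
```

For the standard normal distribution every gap comes out between roughly 0.67 and 0.71 standard deviations, so the marks are close to evenly spaced, though not exactly so.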

Visualization

The boxplot is a quick graphic for examining one or more sets of data. Boxplots may seem more primitive than a histogram or kernel density estimate, but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). They also depend less on tuning choices: the number and width of bins can heavily influence the appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a kernel density estimate.

Because looking at a statistical distribution is more intuitive than looking at a boxplot, comparing the boxplot against the probability density function (theoretical histogram) of a normal N(0, σ²) distribution may be a useful tool for understanding the boxplot (Figure 2).
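
The landmarks of a boxplot for a normal distribution can be computed directly. The following is a minimal sketch using `statistics.NormalDist`; the function name and its parameters are hypothetical, and the 1.5·IQR rule is assumed for the whisker limits.

```python
from statistics import NormalDist


def normal_boxplot_landmarks(sigma=1.0, k=1.5):
    """Quartiles, IQR, fences and tail mass for a N(0, sigma^2) distribution."""
    dist = NormalDist(0.0, sigma)
    q1, q3 = dist.inv_cdf(0.25), dist.inv_cdf(0.75)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Probability that a single observation falls beyond either fence.
    frac_outside = dist.cdf(lower_fence) + (1.0 - dist.cdf(upper_fence))
    return q1, q3, iqr, lower_fence, upper_fence, frac_outside


q1, q3, iqr, lower_fence, upper_fence, frac_outside = normal_boxplot_landmarks()
print(f"quartiles at {q1:.4f} and {q3:.4f}, IQR = {iqr:.4f}")
print(f"fences at {lower_fence:.4f} and {upper_fence:.4f}")
print(f"fraction beyond the fences: {frac_outside:.4%}")
```

In units of σ, the quartiles lie at about ±0.67, the IQR is about 1.35 and the 1.5·IQR limits lie at about ±2.70, so only about 0.7% of observations from a normal distribution would be expected to be drawn as outliers.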

See also

* Exploratory data analysis
* Five-number summary
* Seven-number summary

References

* John W. Tukey. "Exploratory Data Analysis". Addison-Wesley, Reading, MA. 1977.
* Michael Frigge, David C. Hoaglin and Boris Iglewicz. "[http://links.jstor.org/sici?sici=0003-1305%28198902%2943%3A1%3C50%3ASIOTB%3E2.0.CO%3B2-E Some Implementations of the Boxplot]". "The American Statistician". Vol. 43 (1), February 1989. 50–54.
* Yoav Benjamini. "[http://links.jstor.org/sici?sici=0003-1305%28198811%2942%3A4%3C257%3AOTBOAB%3E2.0.CO%3B2-%23 Opening the Box of a Boxplot]". "The American Statistician". Vol. 42 (4), November 1988. 257–262.
* Peter J. Rousseeuw, Ida Ruts and John W. Tukey. "[http://links.jstor.org/sici?sici=0003-1305%28199911%2953%3A4%3C382%3ATBABB%3E2.0.CO%3B2-K The Bagplot: A Bivariate Boxplot]". "The American Statistician". Vol. 53 (4), November 1999. 382–387.

External links

* [http://www.lcgceurope.com/lcgceurope/data/articlestandard/lcgceurope/132005/152912/article.pdf Visual Presentation of Data by Means of Box Plots] (PDF)
* [http://www.physics.csbsju.edu/stats/box2.html On-line box plot calculator with explanations and examples]
* [http://www.duncanwil.co.uk/boxplot.html Box and Whisker Diagrams: getting Microsoft Excel to plot them for you]
* [http://peltiertech.com/Excel/Charts/BoxWhisker.html Box and Whisker Plots in Microsoft Excel]
* [http://blog.immeria.net/2007/01/box-plot-and-whisker-plots-in-excel.html Box plot and whisker plots in Excel 2007]
* [http://informationandvisualization.de/blog/box-plot Box plot explanation, examples and a javascript/css-based box plot]


Wikimedia Foundation. 2010.
