Box Plot


In descriptive statistics, a box plot is a convenient way to depict numeric data graphically through their quartiles. The spacing between the different parts of the box indicates the degree of spread and skewness in the data, and the plot can also show outliers.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. They are also known as box-and-whisker plots because of the lines (whiskers) that extend from the box and indicate variability outside the upper and lower quartiles.[1][2]
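As a rough illustration of the numbers a box plot is built from, the sketch below computes the minimum, the lower quartile (Q1), the median, the upper quartile (Q3) and the maximum using Python's standard library. The function name five_number_summary and the sample values are made up for this example, and the "inclusive" quartile method is only one of several conventions; other conventions can give slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile convention. Other conventions
    # (for example method="exclusive") can shift Q1 and Q3 slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Made-up sample values, for illustration only.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(sample))
```

The box spans Q1 to Q3, the line inside the box marks the median, and in the simplest version of the plot the whiskers reach to the minimum and the maximum.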


Visualization


There are two versions of the box plot: one that displays the outliers and one that does not. See this page for a detailed description of the two versions.
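The difference between the two versions comes down to where the whiskers end. The sketch below is one possible way to compute this, not a definitive implementation. It assumes the common convention that a point counts as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the box; this rule is not stated in this tutorial, and the function name whisker_ends and the sample values are hypothetical.

```python
import statistics


def whisker_ends(values, show_outliers):
    """Return (lower whisker end, upper whisker end, outliers)."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    if not show_outliers:
        # Version without outliers: whiskers reach the minimum and maximum.
        return data[0], data[-1], []
    # Version with outliers. Assumption: the 1.5 * IQR fence convention.
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at the most extreme values that are still inside the fences.
    return inside[0], inside[-1], outliers


# Made-up sample with one unusually large value.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]
print(whisker_ends(sample, show_outliers=False))
print(whisker_ends(sample, show_outliers=True))
```

With these sample values, the version without outliers stretches its upper whisker all the way to 30, while the version with outliers ends the whisker at 11 and reports 30 as a separate point.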

Now, let's take a look at how to draw a box plot, from Khan Academy.[3] (Note: they use the version of the box plot without outliers.)

The relative position and size of the box, the whiskers and the median tell us about the shape of the data.[4]

  1. Symmetric data:

  2. Left-skewed (negatively-skewed) data:

  3. Right-skewed (positively-skewed) data:

How do the mean, median and mode compare in these three shapes of data? Here is a demonstration!
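A small numeric sketch can make the comparison concrete. The three data sets below are invented for illustration, and the helper name center_measures is hypothetical. The ordering they show (mean below median below mode for left-skewed data, and the reverse for right-skewed data) is typical of skewed data with a single peak, but it is a rule of thumb rather than a guarantee.

```python
import statistics


def center_measures(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


# Invented data sets, one per shape.
symmetric = [1, 2, 3, 3, 3, 4, 5]
left_skewed = [1, 4, 6, 7, 8, 8, 8]   # long tail toward small values
right_skewed = [1, 1, 1, 2, 3, 4, 9]  # long tail toward large values

for name, data in [
    ("symmetric", symmetric),
    ("left-skewed", left_skewed),
    ("right-skewed", right_skewed),
]:
    mean, median, mode = center_measures(data)
    print(name, "mean:", mean, "median:", median, "mode:", mode)
```

In the symmetric set all three measures coincide. In the skewed sets the mean is pulled furthest toward the long tail, the mode stays at the peak, and the median falls between them.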


Example


Box plots are widely used in life science research. Look at the example below and consider the sample Q & A.

Example 1: Link In this paper, the authors investigated the toxic effect of bisphenol A (BPA) on male reproduction by monitoring the levels of two n-6 fatty acids, LA and AA, in the testes of exposed and control groups of rats. They concluded that BPA caused variation in testicular n-6 fatty acid composition and decreased antioxidant enzyme levels.

Sample Q & A:

Q: Describe the box plots in Figure 6.

A: The relative LA level of the exposed group, the relative AA level of the unexposed group and the AA/LA ratio of the unexposed group show less variation than their counterparts, as reflected by a narrower box and shorter whiskers. The median AA/LA ratio of the exposed group lies in the lower part of the box, which means the data are skewed to the right (i.e. the half of the data greater than the median spans a wider range than the half smaller than the median). We would expect the mean to be larger than the median for this group.

Now look at another example and try to answer the following questions yourself.

Example 2: Link In this paper, the authors investigated the influence of mutations in the TP53 gene on the outcomes of breast carcinoma patients. Figure 1 contains box plots showing correlations of tumor proliferation status versus the expression levels of various genes.

Questions:

  1. In Figure 1a, what shape does the p53 mRNA expression data have for tumors with low proliferation and for tumors with high proliferation? Would you expect each to be symmetric, left-skewed or right-skewed?
  2. In Figure 1d, what are the median p53 mRNA expression levels for wild-type and mutated, respectively? Would you expect the difference between the means of wild-type and mutated to be greater or smaller than the difference between the medians?



About

This tutorial was first written by Mingshu Huang and is maintained as part of quabinet.org. The project was funded through a Spark Grant from the Harvard Initiative for Learning and Teaching to Abha Ahuja and Melanie I Stefan at the Curriculum Fellows Program at Harvard Medical School.