# Statistical Concepts for Biologists

## Box Plot

In descriptive statistics, a box plot is a convenient way to graphically depict numeric data through their quartiles. The spacing between the different parts of the box indicates the degree of spread and skewness in the data, and the plot can also show outliers.

Box plots are non-parametric: they display variation in samples of a
statistical population without making any assumptions about the underlying
statistical distribution. They are also known as box-and-whisker plots because of
the lines (whiskers) extending from the boxes, which indicate variability outside
the upper and lower quartiles.^{[1][2]}
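The numbers behind a box plot can be worked out by hand or with a few lines of code. The sketch below is only an illustration: the function name `five_number_summary` and the sample values are made up for this page, and it uses the "inclusive" quartile convention from Python's `statistics` module. Other software may use a slightly different convention and give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    # method="inclusive" is an assumption; quartile conventions vary.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical measurements, invented for illustration only.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(sample))
```

The box spans Q1 to Q3, the line inside the box marks the median, and in the simplest version of the plot the whiskers reach the minimum and maximum.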

## Visualization

There are two versions of the box plot: one that displays the outliers and one that does not. See this page for a detailed description of the two versions.
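The difference between the two versions comes down to where the whiskers end. The following sketch assumes the common 1.5 × IQR rule for flagging outliers, a convention that this page does not itself specify; the function name `whiskers`, the parameter `k` and the sample values are hypothetical.

```python
import statistics


def whiskers(values, show_outliers=True, k=1.5):
    """Return (lower whisker, upper whisker, outliers) for a box plot."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    if not show_outliers:
        # Version without outliers: whiskers reach the minimum and maximum.
        return ordered[0], ordered[-1], []
    # Version with outliers: points beyond the fences are drawn separately.
    # k=1.5 is an assumed convention, not a fixed rule.
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return inside[0], inside[-1], outliers


# Hypothetical measurements with one unusually large value.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]
print(whiskers(sample, show_outliers=True))
print(whiskers(sample, show_outliers=False))
```

With these values, the largest observation falls beyond the upper fence, so the first version draws it as a separate point and ends the whisker at the largest remaining value, while the second version simply extends the whisker to it.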

Now, let's take a look at how to draw a box plot, from Khan
Academy.^{[3]} (Note: they use the version of the box plot without
outliers.)

The relative positions and sizes of the box, the whiskers and the median
tell us about the shape of the data.^{[4]}

- Symmetric data:

- Left-skewed (negatively-skewed) data:

- Right-skewed (positively-skewed) data:

How do the mean, median and mode compare in these three shapes of data? Here is a demonstration!
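A small numeric sketch can make the comparison concrete. The three data sets below are invented for illustration, and the helper name `centre_measures` is hypothetical. The ordering they show is typical of data with a single peak, but it is a rule of thumb rather than a guarantee.

```python
import statistics


def centre_measures(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )


# Invented data sets, one for each shape discussed above.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
left_skewed = [1, 5, 7, 8, 9, 9, 10, 10, 10]   # long tail of small values
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10]    # long tail of large values

for name, data in [
    ("symmetric", symmetric),
    ("left-skewed", left_skewed),
    ("right-skewed", right_skewed),
]:
    mean, median, mode = centre_measures(data)
    print(f"{name}: mean={mean:.2f}, median={median}, mode={mode}")
```

In these examples the three measures coincide for the symmetric data, the mean falls below the median and mode for the left-skewed data, and the mean rises above them for the right-skewed data, because the mean is pulled toward the long tail.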

## Example

Box plots are widely used in life science research. Look at the example below and consider the sample Q & A.

**Example 1:** Link
In this paper, the authors investigated the toxic effect of bisphenol A (BPA)
on male reproduction by monitoring the level of two n-6 fatty acids, LA and AA,
in the testes of exposed and control groups of rats. They concluded that BPA
caused testicular n-6 fatty acid composition variation and decreased
antioxidant enzyme levels.

**Sample Q & A:**

**Q:** Describe the box plots in Figure 6.

**A:** The LA relative level of the exposed group, the AA relative level of
the unexposed group and the AA/LA ratio of the unexposed group show less
variation than their counterparts, as reflected by narrower boxes and
shorter whiskers. The median AA/LA ratio of the exposed group lies in the
lower part of the box, which means the data are skewed to the right (i.e.
the half of the data greater than the median spans a wider range than the
half smaller than the median). We would expect the mean to be larger than
the median for this group.

Now look at another example and try to answer the following questions yourself.

**Example 2:** Link
In this paper, the authors investigated the influence of mutations in the
*TP53* gene on the outcomes of breast carcinoma patients. Figure 1 contains box
plots showing the correlation of tumor proliferation status versus the expression
levels of various genes.

**Questions:**

- In Figure 1a, what is the shape of the p53 mRNA expression data for tumors with low proliferation and with high proliferation, respectively? Would you expect each to be symmetric, left-skewed or right-skewed?
- In Figure 1d, what are the median p53 mRNA expression levels for wild-type and mutated tumors, respectively? Would you expect the difference between the means of wild-type and mutated to be greater or smaller than the difference between the medians?