Box Plots
Understanding the Concept of a Box Plot
- A box plot, also known as a box and whisker plot, is a graphical representation used in statistics to visualise the distribution of a dataset.
- The plot showcases five key data points: the minimum, the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the maximum.
- Quartiles split the ordered data into four equal parts. The lower quartile (Q1) is the median of the lower half of the data and the upper quartile (Q3) is the median of the upper half. When the dataset has an odd number of values, the overall median is left out of both halves.
- The box spans the interquartile range (IQR), which is calculated by subtracting the lower quartile (Q1) from the upper quartile (Q3). It contains the middle 50% of the dataset.
- The lines (or whiskers) extend from the box to the minimum and maximum data values. When outliers are marked separately, the whiskers stop at the most extreme values that are not outliers.
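The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the "median of each half" quartile method described here and a dataset with at least two values. The function name `five_number_summary` and the sample data are made up for the example. Other quartile conventions exist and can give slightly different results.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # values below the median position
    # When n is odd, skip the median itself so it is in neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

# Illustrative data only.
summary = five_number_summary([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1
print(summary, iqr)
```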
Constructing a Box Plot
- To construct a box plot, you first need to find the five-number summary: minimum, Q1, median, Q3, and maximum.
- Draw a horizontal or vertical number line, depending on your preference.
- Mark the five points on the line to map the range of the dataset.
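- Draw a box from Q1 to Q3 and draw a line across the box at the median (a vertical line if the number line is horizontal).
- Draw the whiskers from the ends of the box out to the minimum and maximum.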
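The construction steps can be mimicked with a rough text drawing. The sketch below is only an illustration: `text_box_plot`, its `width` parameter, and the characters used for the box and whiskers are all made up for this example, and it assumes the five-number summary is already known and that the maximum is greater than the minimum.

```python
def text_box_plot(minimum, q1, med, q3, maximum, width=41):
    """Return a one-line horizontal box plot as a string."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")

    def col(value):
        # Map a data value to a column on the number line.
        return round((value - minimum) / span * (width - 1))

    row = [" "] * width
    for i in range(col(minimum), col(maximum) + 1):
        row[i] = "-"                      # whiskers across the full range
    row[col(minimum)] = "|"               # minimum
    row[col(maximum)] = "|"               # maximum
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                      # box from Q1 to Q3
    row[col(q1)] = "["
    row[col(q3)] = "]"
    row[col(med)] = "M"                   # median line inside the box
    return "".join(row)

# Illustrative five-number summary only.
print(text_box_plot(7, 36, 40.5, 43, 49))
```

A plotting library would normally be used for a real chart. The point here is only the order of the steps: number line, five marked points, box, median line, whiskers.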
Features and Interpretation
- A box plot summarises the distribution and skewness of the dataset. If the median sits closer to Q1 than to Q3, the data is likely positively skewed (most values are low, with a tail of higher values). If it sits closer to Q3, the data is likely negatively skewed (most values are high, with a tail of lower values).
- The plot also shows the spread of the data: the longer the box or the whiskers, the more spread out the data is.
- It can also help identify outliers, which are values that are unusually high or low compared with the rest of the dataset. When outliers are marked, they appear as individual points beyond the ends of the whiskers.
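These interpretation rules can be checked numerically. The sketch below assumes the common 1.5 × IQR convention for flagging outliers, which is an assumption added here and not something these notes prescribe. The skew is judged only from where the median sits between Q1 and Q3. The function name `describe_box` and the sample data are illustrative.

```python
def describe_box(q1, med, q3, values, k=1.5):
    """Return (outliers, skew) using a k * IQR fence rule (assumed convention)."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)

    if med - q1 < q3 - med:
        skew = "positive"            # median closer to Q1
    elif med - q1 > q3 - med:
        skew = "negative"            # median closer to Q3
    else:
        skew = "roughly symmetric"
    return outliers, skew

# Illustrative data with Q1 = 36, median = 40.5, Q3 = 43.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
print(describe_box(36, 40.5, 43, data))
```

With these values the fences fall at 25.5 and 53.5, so 7 and 15 are flagged, and the median sits closer to Q3, which suggests negative skew.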
Revising Box Plots
- Engage in regular practice of constructing and interpreting box plots.
- Understand how quartiles divide the data and what their positions imply.
- Pay attention to the position of the median within the box as an indicator of skewness.
- Regularly study real-world datasets and their box plots to become comfortable identifying outliers and judging data spread.