Using Statistics to Analyse and Compare Data Sets
Comparing Data Sets:
- Comparative statistics refers to the use of statistical measures to determine relationships, similarities and differences between two or more sets of data.
- It draws on descriptive statistics: measures of centre (mean, median and mode), measures of spread (range and standard deviation), and percentiles.
- When comparing data sets, it is often helpful to visualise the data. Bar graphs compare values across categories, histograms compare the shapes of distributions, and scatter plots show how two variables relate.
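As a rough illustration of the measures listed above, the following sketch summarises two small data sets side by side using Python's standard `statistics` module. The helper name `summarise` and the two sets of scores (`class_a`, `class_b`) are made up for this example.

```python
import statistics

def summarise(values):
    """Return common descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points
    # (25th, 50th and 75th percentiles).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "q1": q1,
        "q3": q3,
    }

# Hypothetical test scores for two classes (illustrative data only).
class_a = [52, 61, 61, 68, 70, 74, 79]
class_b = [35, 48, 61, 66, 66, 88, 95]

summary_a = summarise(class_a)
summary_b = summarise(class_b)

# Print the two summaries next to each other for comparison.
for name in summary_a:
    print(f"{name:>6}: A = {summary_a[name]:7.2f}   B = {summary_b[name]:7.2f}")
```

In this made-up data the two classes have similar means and medians, but class B has a much larger range and standard deviation, which is the kind of difference a comparison of averages alone would miss.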
Analysing Data Sets:
- Data analysis is the process of interpreting the meaning of the data we have collected, organised, and displayed in the form of a table, bar chart, line graph, or other representation.
- Anecdotal evidence should not be used in lieu of a thorough analysis of the data.
- A statistical hypothesis is a claim about a statistical characteristic of a population. Statistical hypothesis testing is a method used in making statistical decisions using experimental data.
- Inferential statistics involves drawing conclusions about a population from sample data that are subject to random variation. This is done by inferring properties of the underlying probability distribution from the sample.
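One simple way to see hypothesis testing in action is a permutation test, sketched below. It is only one of many possible tests and is chosen here because it needs nothing beyond the standard library. The function name `permutation_test`, the number of resamples, and the two groups of measurements are all assumptions made for this example.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Null hypothesis: both groups come from the same population, so the
    group labels are interchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical measurements for two groups (illustrative data only).
group_a = [12.1, 11.8, 12.6, 12.9, 12.3, 12.4]
group_b = [13.9, 14.2, 13.5, 14.8, 14.1, 13.7]

p_value = permutation_test(group_a, group_b)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value means that a difference as large as the observed one would rarely arise if the group labels made no difference, which counts as evidence against the null hypothesis.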
Correlation and Regression:
- Correlation is a statistic that measures the strength and direction of the linear relationship between two variables. The correlation coefficient r ranges from -1 to 1. If it is positive, the variables tend to increase together. If it is negative, one tends to decrease as the other increases.
- Regression analysis is a statistical process for estimating the relationships among variables. It includes methods like linear and multiple linear regression.
- The coefficient of determination (r^2) measures how well the regression predictions approximate the real data points. It is the proportion of the variation in the dependent variable that the model explains.
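The three ideas above can be sketched together for the simplest case, a straight-line fit with one explanatory variable. The function names (`pearson_r`, `fit_line`, `r_squared`) and the hours/scores data are hypothetical and written out by hand so that each formula is visible.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - (residual variation / total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: hours of revision and test score (illustrative only).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

r = pearson_r(hours, scores)
slope, intercept = fit_line(hours, scores)
r2 = r_squared(hours, scores, slope, intercept)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}, r^2 = {r2:.3f}")
```

For a straight-line fit with a single explanatory variable, the value returned by `r_squared` equals the square of the correlation coefficient, which is where the notation r^2 comes from.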
Common Pitfalls:
- A common misconception is that correlation equals causation. Even if two sets of data are strongly correlated, they may not have a cause-and-effect relationship: the link may be due to a third variable that influences both, or to coincidence.
- Outliers can distort the overall picture of the data. These extreme values can have a significant impact on mean calculations or correlation coefficients.
- Misinterpretation: Results need careful interpretation. For example, a high r^2 value shows that a model fits the data it was built from, but does not by itself establish that the model will predict new data well.
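The effect of an outlier can be checked directly. The sketch below uses made-up numbers to show a single extreme value pulling the mean far from the median, and a single extreme point reversing the sign of a correlation coefficient. The helper `pearson_r` and all data values are assumptions for this illustration.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical salaries in thousands (illustrative data only).
salaries = [28, 30, 31, 33, 35]
salaries_with_outlier = salaries + [250]

# The mean shifts a lot when the outlier is added; the median barely moves.
print("mean:  ", statistics.mean(salaries), "->", statistics.mean(salaries_with_outlier))
print("median:", statistics.median(salaries), "->", statistics.median(salaries_with_outlier))

# Five points on a perfectly decreasing line, then one extreme point added.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 3, 2, 1]
r_clean = pearson_r(xs, ys)
r_outlier = pearson_r(xs + [20], ys + [20])

# The single extra point changes the correlation from negative to positive.
print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

This is one reason to plot the data and report the median alongside the mean before drawing conclusions from summary statistics alone.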
In summary, using statistics to compare and analyse data sets involves carefully identifying the correct measurements for comparison, appropriately analysing and interpreting these measurements, and avoiding common pitfalls in data analysis.