# Inferential Testing

## Statistical Testing & the Sign Test

**Statistical testing:** Statistical tests are used to determine whether the result of an experiment is statistically **significant**. If a difference is found between the scores of two groups, it may be due to the variable being tested (for example, age), but it might instead be due to chance factors. **Statistical tests** are used to decide between these explanations. These tests use a **probability level** (level of significance) set by the researcher. In psychology this is usually 5% (0.05), as this is generally thought to be acceptable. It means that, having done the statistical test, there is a 5% (or lower) probability that the results occurred due to chance factors, so the result is highly likely to be due to the IV; it is therefore a statistically significant result. In some cases, such as research on a sensitive topic, researchers may use a stricter level of significance (for example, 1% or 0.01) to be even more sure that the results are significant. This may be used in drug trials, for instance.

In a statistical test, the calculation is done, and the result of the test is known as the **calculated (or observed) value**. This is then compared with a table of **critical values**. Depending on the test, the calculated value must be either equal to or greater than, or equal to or less than, the critical value for the result to be significant. To find out which critical value to use, the researcher needs to know the probability level, the number of participants, and whether the hypothesis was one-tailed (directional) or two-tailed (non-directional).

**The sign test:** This test is used when investigating a difference (not an association), using a repeated measures design, and the data are **nominal** (arranged into categories).

For example, a researcher investigated whether listening to music hinders the ability to solve puzzles. The participants had to complete a puzzle in silence, and a different one when listening to music. The time taken to solve the puzzles was calculated in each case. The hypothesis was that ‘participants will take longer to complete a puzzle when listening to music than they will to complete a puzzle in silence’.

**Step 1**: The data need to be converted to nominal data. To do this, work out which participants took more time in the music condition and which took more time in the silence condition. The score for the silence condition is subtracted from the score for the music condition, and the sign of the result (+ or -) is recorded. A plus means the participant took longer with music.

**Step 2**: The pluses and minuses are added up. The number of pluses is 10 and the number of minuses is 2: 10 participants took longer to complete a puzzle whilst listening to music, whereas 2 took longer to complete a puzzle in silence. Two participants showed no difference between the conditions, so their scores are not counted.

**Step 3**: The total for the less frequent sign (in this case, the minuses) is used as ‘S’. In this case S = 2.

**Step 4**: The calculated value (S= 2) is compared to a table of critical values:

*(Critical values table for the sign test: the columns give the level of significance for a **one-tailed test** and for a **two-tailed test**; the rows give N.)*

The number of participants **(N) is 12**: the two participants who scored the same in both conditions are discarded, leaving the 10 pluses and 2 minuses. The **level of significance is 0.05**, as this is the generally accepted level used for psychology experiments. As the effect of music on solving puzzles is not a sensitive issue, there is no need to make the probability level any stricter. The **hypothesis was directional, meaning it was a one-tailed test**. Therefore, the critical value is **2**.

To be significant, the calculated value (S) must be **equal to or less than** the critical value. As S = 2, this is equal to the critical value (2 ≤ 2), so the result is significant. It can be concluded that participants took significantly longer to complete a puzzle when listening to music than in silence.
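The sign test steps above can also be checked without a printed table. The following is a minimal sketch, assuming the usual logic behind the sign test: if the null hypothesis is true, each non-tied participant is equally likely to produce a plus or a minus. The function names are illustrative and not part of any library.

```python
from math import comb


def cumulative_probability(s: int, n: int) -> float:
    """Probability of s or fewer of one sign among n non-tied participants,
    assuming pluses and minuses are equally likely (the null hypothesis)."""
    return sum(comb(n, k) for k in range(s + 1)) / 2 ** n


def sign_test_p(pluses: int, minuses: int) -> float:
    """One-tailed probability for S, the count of the less frequent sign."""
    n = pluses + minuses      # tied participants are already excluded
    s = min(pluses, minuses)  # S is the less frequent sign
    return cumulative_probability(s, n)


def sign_test_critical_value(n: int, alpha: float = 0.05) -> int:
    """Largest S that is still significant at alpha (one-tailed); -1 if none."""
    critical = -1
    for s in range(n + 1):
        if cumulative_probability(s, n) <= alpha:
            critical = s
        else:
            break
    return critical


# Counts from Step 2: 10 pluses and 2 minuses.
p = sign_test_p(pluses=10, minuses=2)
print(f"S = 2, one-tailed p = {p:.4f}")
print("Significant at 0.05:", p <= 0.05)
```

A one-tailed reading is only appropriate when the more frequent sign is in the direction the hypothesis predicted, as it is here.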

## Probability & Significance

**Probability and significance:** Statistical tests are used to decide whether to accept the alternative/experimental hypothesis (predicting there will be a difference or association) or the null hypothesis (predicting there will be no difference or no association). Tests use a **significance level**, which is the probability level at which the researcher is prepared to reject the null hypothesis. As seen in the above example, 0.05 or 5% is the significance level generally used in psychology. This means that when a result is declared significant, there is a 5% (or lower) probability that it occurred due to chance, so the researcher can be 95% confident in the decision. A researcher could never be 100% sure, as this would involve testing every single member of the population in every possible circumstance. 95% confidence is seen as an acceptable level for most psychological research.

**Use of statistical tables and critical values:** As seen in the example of the sign test, the result of a statistical test (the observed/calculated value) must be compared to a **critical value** to decide whether the result is significant. If the statistical test has an ‘r’ in its name (Spearman’s, Pearson’s, Chi-Squared, and the related and unrelated t-tests), the observed value must be equal to or greater than the critical value for significance to be shown. If not (sign test, Mann-Whitney, Wilcoxon), the observed value must be equal to or less than the critical value for significance to be shown. To work out what the critical value is, the researcher must know:

- Whether a one-tailed (directional) or two-tailed (non-directional) hypothesis is being used
- The number of participants in the study (N)- for some tests ‘degrees of freedom’ (df) are used instead
- The level of significance- which will be (unless stated otherwise) 0.05 or 5%.
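The ‘r’ rule of thumb above can be written as a small decision function. This is only a sketch of the rule as stated in these notes; the function name `is_significant` is made up for illustration, and checking that a result is in the predicted direction for a directional hypothesis is a separate step.

```python
def is_significant(test_name: str, observed: float, critical: float) -> bool:
    """Apply the 'r' rule: tests with an 'r' in the name need an observed
    value equal to or greater than the critical value; the others need an
    observed value equal to or less than it."""
    if "r" in test_name.lower():
        # The sign of t, rho or r is ignored when comparing with the table.
        return abs(observed) >= critical
    return observed <= critical


# Values taken from the worked examples in these notes.
print(is_significant("Mann-Whitney", 7.5, 17))          # True
print(is_significant("Wilcoxon", 27, 21))               # False
print(is_significant("Unrelated t-test", 1.780, 2.101))  # False
print(is_significant("Pearson's r", 0.691, 0.497))      # True
```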

**Type I and Type II errors:** As the probability level used is never 100%, there is always a chance that the researcher may mistakenly accept one of the hypotheses.

**Type I error:** The alternative/experimental hypothesis is mistakenly accepted, so the null is mistakenly rejected. Therefore, the researcher says that there is a significant difference between the groups, but in reality there isn’t (the null hypothesis should have been accepted). The chance of this in psychological research is usually 5%, due to the conventional significance level of 0.05. Type I errors are more likely when the significance level is too lenient, for example 10% (0.1).

**Type II error:** The null hypothesis is mistakenly accepted, so the alternative/experimental hypothesis is mistakenly rejected. Therefore, the researcher says that there is not a significant difference between the groups, but in reality there is (the alternative hypothesis should have been accepted). Type II errors are more likely when the significance level is too strict, for example 1% (0.01). A 5% level is a good balance between the risk of making a Type I error and the risk of making a Type II error.

## Factors Affecting the Choice of Statistical Test

To choose which statistical test to use, there are three factors to consider:

**Difference or correlation:** This relates to the aim of the investigation, and the method used. The hypothesis will reveal whether the researcher is investigating a difference or correlation.

**Experimental design:** If using repeated measures or matched pairs, the design is a **related** one. If an independent measures design has been used, it is an **unrelated** design. This consideration won’t apply if the study is investigating a correlation.

**Level of measurement:** Whether the data being used in the test is **nominal**, **ordinal** or **interval**.
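The three factors can be combined into a simple lookup. The sketch below encodes only the pairings described in these notes; the function name `choose_test` and the labels used as keys are illustrative choices.

```python
# (aim, design, level of measurement) -> test, as described in these notes.
TESTS = {
    ("difference", "unrelated", "nominal"): "Chi-Squared",
    ("difference", "related", "nominal"): "Sign test",
    ("difference", "unrelated", "ordinal"): "Mann-Whitney",
    ("difference", "related", "ordinal"): "Wilcoxon",
    ("difference", "unrelated", "interval"): "Unrelated t-test",
    ("difference", "related", "interval"): "Related t-test",
    ("association", None, "nominal"): "Chi-Squared",
    ("correlation", None, "ordinal"): "Spearman's rho",
    ("correlation", None, "interval"): "Pearson's r",
}


def choose_test(aim, level, design=None):
    """Return the test for a given aim, level of measurement and design."""
    if aim in ("correlation", "association"):
        design = None  # experimental design does not apply here
    return TESTS[(aim, design, level)]


print(choose_test("difference", "ordinal", "related"))  # Wilcoxon
print(choose_test("correlation", "interval"))           # Pearson's r
```

Repeated measures and matched pairs both count as `"related"`; independent groups counts as `"unrelated"`.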

## Tests of Difference: Mann-Whitney & Wilcoxon

**Mann-Whitney:** Used when looking for a **difference**, using an **independent groups** design, and the data is **ordinal**. For example, a researcher aimed to investigate whether there is a difference in the number of chocolate chips found in cookies of two different brands (i.e. value supermarket own brand vs. expensive brand name). No previous research had been done, so the alternative hypothesis was that ‘there will be a difference in the number of chocolate chips in the supermarket brand cookies compared to the brand name cookies’. The chocolate chips were counted in 10 supermarket cookies (NA) and in 8 brand name cookies (NB). The researcher conducted a Mann-Whitney statistical test and produced a calculated value (known as ‘U’) of 7.5. The critical values table for a two-tailed test at 0.05 is as follows:

| NB (rows) / NA (columns) | 7 | 8 | 9 | 10 |
|---|---|---|---|---|
| 7 | 8 | 10 | 12 | 14 |
| 8 | 10 | 13 | 15 | 17 |
| 9 | 12 | 15 | 17 | 20 |
| 10 | 14 | 17 | 20 | 23 |

The critical value where NA is 10 and NB is 8 is **17**. For significance to be shown, the calculated value of U must be **equal to or less than** the critical value. As U= **7.5**, the results are significant at the 5% level, so the alternative hypothesis can be accepted and the null rejected.
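The notes give only the final value of U, not the raw counts, so the following sketch shows how U could be calculated using small hypothetical data sets. It assumes the standard rank-sum method (tied scores share the average rank); `mann_whitney_u` is an illustrative name, not a library function.

```python
def mann_whitney_u(group_a, group_b):
    """Return U, the smaller of the two U values for the groups."""
    combined = sorted(list(group_a) + list(group_b))

    def rank(value):
        # Tied scores share the average of the ranks they occupy.
        first = combined.index(value) + 1
        count = combined.count(value)
        return first + (count - 1) / 2

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank(v) for v in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


# Hypothetical chip counts, for illustration only.
print(mann_whitney_u([1, 3, 5], [2, 4, 6]))  # 3.0
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0 (no overlap between groups)
```

The smaller the value of U, the less the two groups overlap, which is why U must be equal to or less than the critical value.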

**Wilcoxon:** Used when looking for a **difference**, using a **repeated measures** design, and the data is **ordinal**. For example, a researcher aimed to investigate whether smiling whilst watching cartoons means that the cartoons seem funnier. Participants had to watch a cartoon whilst frowning, then rate how funny they found it out of 10. Then they watched another cartoon whilst smiling, and rated how funny they found it out of 10. No previous research had been done, so the alternative hypothesis was that ‘there will be a difference in the humour ratings of cartoons watched whilst frowning compared to those watched whilst smiling’. 14 participants (N) took part in the study. The researcher conducted a Wilcoxon statistical test and produced a calculated value (known as ‘T’) of 27. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 | 0.01 |
|---|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 | 0.02 |
| N = 13 | 21 | 17 | 12 |
| N = 14 | 25 | 21 | 15 |
| N = 15 | 30 | 25 | 19 |

The critical value for a two-tailed test at 0.05, where N= 14, is **21**. For significance to be shown, the calculated value of T must be **equal to or less than** the critical value. As T= **27**, the results are not significant at the 5% level, so the alternative hypothesis must be rejected, and the null accepted- ‘there is no difference in the humour ratings of cartoons watched whilst frowning compared to those watched whilst smiling’.
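The raw humour ratings are not given, so this sketch uses hypothetical ratings to show how T could be calculated. It assumes the standard Wilcoxon method: differences of zero are dropped, the remaining differences are ranked by size ignoring their sign, and T is the smaller of the two rank totals. `wilcoxon_t` is an illustrative name.

```python
def wilcoxon_t(condition_a, condition_b):
    """Return (T, N): the smaller rank total and the number of non-zero differences."""
    diffs = [b - a for a, b in zip(condition_a, condition_b) if b != a]
    sizes = sorted(abs(d) for d in diffs)

    def rank(size):
        # Tied differences share the average of the ranks they occupy.
        first = sizes.index(size) + 1
        count = sizes.count(size)
        return first + (count - 1) / 2

    positive_total = sum(rank(abs(d)) for d in diffs if d > 0)
    negative_total = sum(rank(abs(d)) for d in diffs if d < 0)
    return min(positive_total, negative_total), len(diffs)


# Hypothetical ratings out of 10, for illustration only.
frowning = [5, 6, 4, 3, 5]
smiling = [6, 4, 7, 7, 5]
print(wilcoxon_t(frowning, smiling))  # (2.0, 4)
```

Note that N for the critical values table is the number of participants left after zero differences are removed.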

## Parametric Tests of Difference: Unrelated & Related Tests

**Unrelated t-test:** Used when looking for a **difference**, using an **independent groups** design, and the data is **interval**. It is assumed that participants are drawn from a **normally distributed population**. It is also assumed that the standard deviations in both groups will be similar (this is known as **homogeneity of variance**). For example, a researcher aimed to investigate whether eating chocolate affects the time taken to solve a puzzle. Half of the participants ate a chocolate bar before completing a wordsearch, and half completed the wordsearch without eating the chocolate. The time taken to complete the wordsearch was recorded for each participant. No previous research had been done, so the alternative hypothesis was that ‘there will be a difference in the time taken to complete a wordsearch after eating a chocolate bar compared to not eating a chocolate bar’. 20 participants took part in the study, 10 in each group. This means that the degrees of freedom (df) is 10 + 10 - 2 = **18**. The researcher conducted an unrelated *t*-test and produced a calculated value (known as ‘t’) of 1.780. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 |
|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 |
| df = 17 | 1.740 | 2.110 |
| df = 18 | 1.734 | 2.101 |
| df = 19 | 1.729 | 2.093 |

The critical value for a two-tailed test at 0.05, where df= 18, is **2.101**. For significance to be shown, the calculated value of t must be **equal to or more than** the critical value. As t= **1.780**, the results are not significant at the 5% level, so the alternative hypothesis must be rejected, and the null accepted- ‘there is no difference in the time taken to complete a wordsearch after eating a chocolate bar compared to not eating a chocolate bar’.
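The wordsearch times themselves are not given, so the sketch below uses hypothetical scores to show how t and df could be calculated for an unrelated design. It assumes the pooled-variance form of the test, which matches the homogeneity of variance assumption described above; `unrelated_t` is an illustrative name.

```python
from math import sqrt
from statistics import mean, variance


def unrelated_t(group_a, group_b):
    """Return (t, df) for two independent groups using a pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scores, for illustration only.
t, df = unrelated_t([1, 2, 3], [2, 3, 4])
print(round(t, 3), df)  # -1.225 4
```

The sign of t only shows which group had the higher mean; it is ignored when comparing t with the critical value.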

**Related t-test:** Used when looking for a **difference**, using a **repeated measures** design, and the data is **interval**. It is assumed that participants are drawn from a **normally distributed population**. It is also assumed that the standard deviations in both conditions will be similar (this is known as **homogeneity of variance**). For example, a researcher aimed to investigate whether heart rate decreases following meditation. Participants’ heart rates were measured before and after a 15-minute meditation session. Previous research had suggested heart rate would lower following meditation, so the alternative hypothesis was that ‘there will be a reduction in heart rate following a meditation session’. 25 participants took part in the study. This means that the degrees of freedom (df) is 25 - 1 = **24**. The researcher conducted a related *t*-test and produced a calculated value (known as ‘t’) of 1.822. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 |
|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 |
| df = 23 | 1.714 | 2.069 |
| df = 24 | 1.711 | 2.064 |
| df = 25 | 1.708 | 2.060 |

The critical value for a one-tailed test at 0.05, where df= 24, is **1.711**. For significance to be shown, the calculated value of t must be **equal to or more than** the critical value. As t= **1.822**, the results are significant at the 5% level, so the alternative hypothesis can be accepted.
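The heart rate data are not given, so this sketch uses four hypothetical participants to show how t and df could be calculated for a related design. It assumes the standard method of working on the difference between each participant's two scores; `related_t` is an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev


def related_t(before, after):
    """Return (t, df) from each participant's before-minus-after difference."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical heart rates (beats per minute), for illustration only.
before = [72, 80, 76, 90]
after = [70, 75, 77, 84]
t, df = related_t(before, after)
print(round(t, 3), df)  # 1.897 3
```

Because the differences are calculated as before minus after, a positive t here corresponds to the predicted reduction in heart rate.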

## Tests of Correlation: Spearman’s & Pearson’s

**Spearman’s rho:** Used when looking for a **correlation**, and the data of one or both of the co-variables is **ordinal** or **interval** level. For example, a researcher aimed to investigate whether there is a correlation between shoe size and how many oranges can be peeled in five minutes. Participants were given five minutes to peel oranges, and the number they peeled was recorded alongside their shoe size. There was no previous research, so the alternative hypothesis was that ‘there will be a correlation between shoe size and number of oranges that can be peeled in five minutes’. 15 participants (N) took part in the study. The researcher conducted a Spearman’s rho statistical test and produced a calculated value (known as ‘rho’) of 0.277. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 |
|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 |
| N = 14 | 0.464 | 0.538 |
| N = 15 | 0.443 | 0.521 |
| N = 16 | 0.429 | 0.503 |

The critical value for a two-tailed test at 0.05, where N = 15, is **0.521**. For significance to be shown, the calculated value of rho (ignoring whether there is a + or - sign) must be **equal to or more than** the critical value. As rho = **0.277**, the results are not significant at the 5% level, so the alternative hypothesis must be rejected, and the null accepted: ‘there is no correlation between shoe size and number of oranges that can be peeled in five minutes’.
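The shoe sizes and orange counts are not given, so this sketch uses hypothetical scores to show how rho could be calculated. It assumes the rank-difference formula, rho = 1 - 6Σd² / (n(n² - 1)), which is only exact when there are no tied scores; `spearman_rho` is an illustrative name.

```python
def spearman_rho(x, y):
    """Spearman's rho from rank differences (assumes no tied scores)."""

    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]

    n = len(x)
    # d is the difference between each participant's two ranks.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical scores, for illustration only.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 3))  # 0.8
```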

**Pearson’s r:** Used when looking for a **correlation**, and the data of both co-variables is **interval** level. For example, a researcher aimed to investigate whether there is a correlation between height and the time taken to walk 400 metres. Participants’ heights were measured, then they were timed walking around a 400m athletics track. Previous research suggested that taller participants would take longer to walk 400m, so the alternative hypothesis was that ‘there will be a positive correlation between height and time taken to walk 400m’. 12 participants took part in the study. This means that the degrees of freedom (df) is 12 - 2 = **10**. The researcher conducted a Pearson’s r statistical test and produced a calculated value (known as ‘r’) of 0.691. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 |
|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 |
| df = 9 | 0.521 | 0.602 |
| df = 10 | 0.497 | 0.576 |
| df = 11 | 0.476 | 0.553 |

The critical value for a one-tailed test at 0.05, where df= 10, is **0.497**. For significance to be shown, the calculated value of r must be **equal to or more than** the critical value. As r= **0.691**, the results are significant at the 5% level, so the alternative hypothesis can be accepted in this case. If the result of r was negative (for example, -0.691), then the alternative hypothesis would have been rejected, even though the result was significant- this is because there is a significant negative correlation, rather than a positive one, as predicted by the alternative hypothesis.
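The two-part decision described above (is r large enough, and is it in the predicted direction?) can be sketched as follows. The heights and times are not given, so `pearson_r` is shown with hypothetical scores; both function names are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's r for two lists of interval-level scores."""
    mean_x, mean_y = mean(x), mean(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def supports_directional_hypothesis(r, critical, predicted):
    """True only if r is significant AND in the predicted direction."""
    significant = abs(r) >= critical
    right_direction = r > 0 if predicted == "positive" else r < 0
    return significant and right_direction


# Hypothetical scores, for illustration only.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 3))  # 1.0

# Values from the worked example: critical value 0.497, positive prediction.
print(supports_directional_hypothesis(0.691, 0.497, "positive"))   # True
print(supports_directional_hypothesis(-0.691, 0.497, "positive"))  # False
```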

## Test of Association: Chi-Squared

**Chi-Squared:** Used when looking for a **difference** or **association**, and the data is **nominal** level (in categories). For example, a researcher aimed to investigate whether there is a difference in digit (finger) ratios between males and females. Participants’ digit ratios were calculated by dividing the length of their index finger by the length of their ring finger on their right hand. Those who had a ratio of 1 or greater were recorded, as were those who had a ratio of less than 1. Previous research suggested that males have shorter index fingers than females (leading to a lower digit ratio), so the alternative hypothesis was that ‘more females will have a digit ratio of 1 or above than males’. 33 male and 37 female participants took part in the study. The results were recorded in a 2x2 **contingency table**:

| | Male | Female | Totals |
|---|---|---|---|
| Digit ratio ≥ 1 | 6 | 28 | 34 |
| Digit ratio < 1 | 27 | 9 | 36 |
| Totals | 33 | 37 | 70 |

As the contingency table is 2x2, the degrees of freedom (df) is (rows - 1) x (columns - 1), so (2 - 1) x (2 - 1) = **1**. The researcher conducted a Chi-Squared statistical test and produced a calculated value (known as ‘χ²’) of 23.1. The critical values table is as follows:

| One-tailed test | 0.05 | 0.025 |
|---|---|---|
| **Two-tailed test** | 0.10 | 0.05 |
| df = 1 | 2.71 | 3.84 |
| df = 2 | 4.60 | 5.99 |
| df = 3 | 6.25 | 7.82 |

The critical value for a one-tailed test at 0.05, where df = 1, is **2.71**. For significance to be shown, the calculated value of χ² must be **equal to or more than** the critical value. As χ² = **23.1**, the results are significant at the 5% level, so the alternative hypothesis can be accepted in this case.
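The calculated value of 23.1 can be reproduced from the contingency table. This is a minimal sketch, assuming the basic formula χ² = Σ (observed - expected)² / expected with no continuity correction; `chi_squared` is an illustrative name.

```python
def chi_squared(table):
    """Return (chi-squared, df) for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if there were no association at all.
            expected = row_totals[i] * column_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Observed counts from the digit ratio table (rows: ratio >= 1, ratio < 1).
chi2, df = chi_squared([[6, 28], [27, 9]])
print(round(chi2, 1), df)  # 23.1 1
```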

## Designing Your Own Study

Within this topic, you may be asked to design a study, which may be worth several marks. When faced with any ‘design a study’ exam question, use the following framework to ensure that you cover all bases:

1: Hypothesis: State your hypothesis and then say whether it is directional or non-directional and why you have chosen this type.

2: Independent and dependent variables: write out succinct, operationalised variables.

3: Experimental design: state whether you will use a repeated measures, independent groups or matched participants design and why this is appropriate.

4: Sample: note what sampling technique you would use and what sample of people you will take (e.g. volunteer sample of 16-19 year olds); try to justify this.

5: Experimental method and procedure: state whether you are using a lab experiment, field experiment, naturalistic observation, case study, natural experiment, correlational analysis, questionnaire, interview etc. Explain how you will carry out the study in practical terms. If you can, try to explain what controls you will use to minimise extraneous variables.

6: Materials: if you haven’t already done so in the experimental method and procedure section, note any materials that you may need to use, e.g. clip board; computer; questionnaire.

7: Results: state the descriptive statistics you will use (graph, chart, table, scatterplot, mean, median, mode, range).

Learn this acronym to help you remember what you need to include:

**H**appy **I**guanas **E**at **S**mall **E**laborate **M**eals **R**egularly

#### Exam Question

A group of 20 five-year-old children on a housing estate have attended a special early-years education project since they were three years old. At the time their parents volunteered for the programme, a control group of 20 children was found by selecting every tenth family from a list of 200 other families on the estate. The two groups were fairly similar in IQ score at the start of the project. The researchers predict that, among other things, the IQ scores of the project group will now be higher than that of the control group. The IQ of the two groups at age 5 is measured using a standardised test. The mean of all 40 children is 100. The following results are found:

| | Special project children | Control group children |
|---|---|---|
| Above mean | 16 | 12 |
| Below mean | 4 | 8 |

- What is the IV and what is the DV in this study?
  - Your answer should include: Attendance
- Suggest a directional hypothesis for this study.
  - Your answer should include: IQ
- Suggest a non-directional hypothesis for this study.
  - Your answer should include: Attendance
- Suggest a null hypothesis for this study.
  - Your answer should include: Attendance
- Has the control group been randomly selected? Give a reason for your answer.
  - Your answer should include: No / Equal / Chance
- Describe one important way in which the two groups differ. Why does this difference matter?
  - Your answer should include: Volunteer
- Suggest one possible extraneous variable.
  - Your answer should include: Parents / Lessons