# Data Handling & Analysis

## Kinds of Data

**Qualitative data:** Data in the form of words, which is rich and detailed. Often this is produced from case studies, and unstructured interviews and observations.

**Quantitative data:** Data in the form of numbers, which is often produced from lab experiments or closed questions.

The two types of data can overlap, for example by interviewing participants who have taken part in a lab study, or by converting the responses to open questions into some form of quantitative data.

**Evaluation:**

- Qualitative data is rich in detail and reflects human experiences and behaviours more fully, so is higher in internal validity than quantitative data. However, quantitative data is much easier to analyse and draw conclusions from, and is less open to bias and subjective opinion than qualitative data.

**Primary data:** Data that has been collected by the researcher for the purposes of the study (e.g., conducting interviews, running a lab experiment).

**Secondary data:** Data collected by someone other than the researcher (data that already exists), for example census information. The researcher makes use of this as part of their study, but the information was not collected for the purpose of that study. One example of this is a meta-analysis, which is where a researcher looks at the results of a number of studies on a particular topic in order to establish general trends and conclusions.

**Evaluation:**

- Primary data fits the study closely, as it has been collected for this specific purpose and the researcher has control over it. However, collecting it requires more time and effort from the researcher.
- Secondary data is potentially less time-consuming and expensive to obtain, but its quality cannot be controlled by the researcher and it may not perfectly match the needs or aims of the study.
- Meta-analyses can be useful as they reflect a (potentially) very large sample, making it easier to generalise results. On the other hand, the quality of the included studies may vary, and studies may be included because they show significant results (as these are more likely to be published), so studies which found no significant result may be left out.

## Measures of Central Tendency

These measure the typical score in a data set (the average).

**Mean:** Calculated by adding up all of the scores, then dividing by the number of scores there are. For example, 5, 8, 6, 3, 8, 6, 7, 7 gives a mean of 6.25.

**Evaluation:** Takes into account all of the data, so is the most ‘sensitive’ measure, but is affected by extreme values (e.g. 5, 8, 6, 3, 8, 6, 7, 75 would give an unrepresentative mean of 14.75).

**Median:** The scores are put in numerical order, and the middle score is taken as the median. If there are two middle scores, they are added together and divided by 2 to give the median. For example, 3, 5, 6, 6, 7, 7, 8, 8 gives a median of 6.5 (6+7 divided by 2).

**Evaluation:** The median is much less affected by extreme scores than the mean, and is easy to calculate, but is less sensitive as not all scores are taken into account.

**Mode:** The mode is the most commonly occurring score in a set of data. If there are two modes, the data set is bi-modal. If all the scores are different then there is no mode.

**Evaluation:** Easy to work out, but it is not very sensitive, and there may be several modes in a data set.
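The three measures can be checked against the example scores above. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only, and `multimode` assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

# The example scores used above, with and without the extreme value.
scores = [5, 8, 6, 3, 8, 6, 7, 7]
with_outlier = [5, 8, 6, 3, 8, 6, 7, 75]

mean_score = mean(scores)          # sum of scores divided by number of scores
median_score = median(scores)      # middle value (mean of the two middle scores)
modes = multimode(scores)          # every score tied for most frequent

mean_outlier = mean(with_outlier)      # pulled upwards by the 75
median_outlier = median(with_outlier)  # unchanged by the 75

print(mean_score, median_score, sorted(modes))
print(mean_outlier, median_outlier)
```

In this data set 6, 7 and 8 each occur twice, so it has three modes, which illustrates why the mode can be an awkward summary. The extreme score changes the mean considerably but leaves the median where it was.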

## Measures of Dispersion

These measure how far the scores in a data set are spread out.

**Range:** The difference between the lowest and highest score in a data set. Usually, 1 is added to the difference, to allow for the fact that scores are often rounded up or down in research. The range for the data set mentioned previously (3, 5, 6, 6, 7, 7, 8, 8) would be 6 (8 - 3, + 1).

**Evaluation:** Easy to calculate, but only takes the highest and lowest score into account, so can be affected by ‘outliers’ (extreme values). For example, the range for 5, 8, 6, 3, 8, 6, 7, 75 would be 73 (75 - 3, + 1), but this is not representative of how most of the scores are spread.
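The range calculation, including the convention of adding 1, can be sketched as follows. The function name `score_range` and its `add_one` parameter are hypothetical, chosen only for this illustration.

```python
def score_range(scores, add_one=True):
    """Highest score minus lowest score, optionally adding 1 for rounding."""
    spread = max(scores) - min(scores)
    return spread + 1 if add_one else spread

scores = [5, 8, 6, 3, 8, 6, 7, 7]
with_outlier = [5, 8, 6, 3, 8, 6, 7, 75]

range_scores = score_range(scores)          # (8 - 3) + 1
range_outlier = score_range(with_outlier)   # (75 - 3) + 1

print(range_scores, range_outlier)
```

A single extreme score accounts for almost all of the second range, even though the other seven scores are unchanged.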

**Standard deviation:** Measures the spread of scores around the mean, in other words, roughly the average distance of each of the scores from the mean. The higher the standard deviation, the more spread out the scores are, suggesting a large variation in the results. The lower the standard deviation, the more similar all the participants’ scores were.

**Evaluation:** A more precise measure of dispersion than the range, but because it is calculated from the mean, it can be distorted by extreme values in the same way that the mean is.
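As a rough sketch of what lies behind the figure, the code below calculates a standard deviation step by step. It assumes the sample formula (dividing by the number of scores minus 1), which is one common convention; the notes above do not say which version is intended. The helper name `sample_sd` is illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(scores):
    """Sample standard deviation, written out step by step."""
    m = mean(scores)
    squared_deviations = [(x - m) ** 2 for x in scores]  # squared distance from the mean
    variance = sum(squared_deviations) / (len(scores) - 1)
    return sqrt(variance)

scores = [5, 8, 6, 3, 8, 6, 7, 7]
with_outlier = [5, 8, 6, 3, 8, 6, 7, 75]

sd_scores = sample_sd(scores)
sd_outlier = sample_sd(with_outlier)

# The standard library gives the same answer as the manual calculation.
print(round(sd_scores, 2), round(stdev(scores), 2))
print(round(sd_outlier, 2))
```

Strictly, the standard deviation is the square root of the average squared distance from the mean, which is why "average distance" is only an approximation. The single extreme score inflates it heavily because its distance from the mean is squared.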

## Mathematical Calculations

**Percentages:** Calculated by dividing a score or number by the total, then multiplying by 100. For example, to work out the percentage of participants who got full marks on a memory test, the number who got full marks (12) is divided by the total number of participants (30), then multiplied by 100 (40%).

**Decimals:** To convert a percentage to a decimal, the percentage sign is removed and the decimal point is moved two places to the left (for example, 40% becomes 0.4).

**Fractions:** To convert a decimal to a fraction, the digits after the decimal point are placed over 10 if there is one decimal place, over 100 if there are two, and over 1000 if there are three. In the example above, 0.4 becomes 4/10. This can be reduced to 2/5 by dividing the top and bottom by 2; it cannot be reduced further because 2 and 5 share no common factor (the fraction is in its simplest form). This means that two-fifths of participants got full marks.

**Ratios:** Using the above example, the ratio of participants who got full marks to all participants is 12:30. This is reduced in the same way as a fraction, becoming 2:5, as 2 and 5 share no common factor.
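The conversions between percentage, decimal, fraction and ratio can be sketched with the memory test example (12 participants with full marks out of 30). The ratio here is taken as full marks to all participants, which is an assumption about how the comparison is framed; variable names are illustrative.

```python
from fractions import Fraction
from math import gcd

full_marks = 12
total = 30

percentage = full_marks * 100 / total      # divide by the total, multiply by 100
decimal = percentage / 100                 # move the decimal point two places left
fraction = Fraction(full_marks, total)     # Fraction reduces to simplest form itself

# Reduce the ratio 12:30 by dividing both sides by their greatest common divisor.
divisor = gcd(full_marks, total)
ratio = (full_marks // divisor, total // divisor)

print(f"{percentage}%", decimal, fraction, f"{ratio[0]}:{ratio[1]}")
```

If the comparison is instead between those who got full marks and those who did not, the starting ratio would be 12:18, which reduces to 2:3.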

**Estimations:** This is where a judgement is made, for example on what the mean or range might be.

**Mathematical symbols:** Include the following:

- `=` (equality)
- `>` (greater than)
- `<` (less than)
- `>>` (much greater than)
- `<<` (much less than)
- `∝` (proportional to)
- `≈` (approximately equal)

**Probability:** The accepted level of probability in psychological research is 5%, often represented as p = 0.05, meaning there is a 5% possibility that the results of an experiment were caused by chance factors rather than the IV. Another way of representing this is p ≤ 0.05, meaning there is a 5% or less possibility that the results occurred by chance.

**Significant figures/decimal places:** An appropriate number of decimal places to use is usually 2-3. For example, 1.326486 could be represented as 1.33, which uses three significant figures and is rounded to two decimal places. If the next digit is 5 or higher, the previous digit is rounded up (as in this example). If it is lower than 5, the previous digit stays the same (rounding down).
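The rounding rule described above ("5 or higher rounds up") can be sketched with Python's `decimal` module. Python's built-in `round` uses a different tie-breaking rule (round half to even), so this sketch sets the rounding mode explicitly. The helper name `round_half_up` is hypothetical, and values are passed as strings to avoid floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round a number (given as a string) to a set number of decimal places."""
    quantum = Decimal(10) ** -places   # e.g. places=2 gives Decimal("0.01")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

rounded_example = round_half_up("1.326486", 2)  # next digit is 6, so round up
rounded_tie = round_half_up("1.325", 2)         # next digit is 5, so round up
rounded_down = round_half_up("1.324", 2)        # next digit is 4, so round down

print(rounded_example, rounded_tie, rounded_down)
```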

**Positive, negative and zero correlations:** A positive correlation occurs when, as one variable increases, the other also increases. A negative correlation occurs when, as one variable increases, the other decreases. A zero correlation is when there is no relationship between the two variables, so a change in one is not associated with a consistent change in the other.

## Presentation & Display of Quantitative Data

**Tables:** A way of presenting data. Raw data tables are the records of each participant’s results. Summary tables are used to present descriptive statistics such as the mean, range and so on. A summary paragraph below the table usually explains the results.

**Graphs: bar charts:** Are used to visually represent data such as the mean scores of two conditions. These are used when the data are in discrete categories (for example, mean score on a memory test for 20-25 year-olds, compared to mean score for 60-65 year-olds). The DV is plotted on the vertical y-axis, and the IV on the horizontal x-axis, and the bars do not touch. The graph has a label for each axis and a title describing what it shows.

**Graphs: histograms:** Like a bar chart, but it displays continuous data, so the bars are touching (for example, the percentages of scores on a memory test).

**Graphs: line graphs:** Used to represent continuous data, typically to show the change in something over time. A continuous line is used instead of bars.

**Scattergrams:** Used to represent correlational data, showing the relationship between two variables. The two co-variables can appear on either the x or y-axis.

## Distributions

**Normal distributions:** Certain variables should produce normal distributions, which form a bell-shaped curve. Variables such as height and IQ of a population form normal distributions. Most people are located in the middle of the curve, and the mean, median and mode are all the same.

**Skewed distributions:** Some variables and tests produce skewed distributions, where the majority of results appear on the left or the right hand side of the graph. A **positive skew** is when most of the scores are on the left, and there is a long ‘tail’ on the right. This would happen in the case of a test which was difficult, so most people get a low score. In this situation, the mean is ‘pulled’ to the right, and is higher than the mode, as some people got high scores. A **negative skew** is when most of the scores are on the right, and there is a long ‘tail’ on the left. This would happen in the case of a test which was easy, so most people get a high score. In this situation, the mean is ‘pulled’ to the left, and is lower than the mode, as some people got low scores.
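The way the mean is "pulled" towards the tail can be illustrated with two small, made-up sets of test scores. Comparing the mean with the mode, as below, is only a rough rule of thumb that follows the description above, not a formal test of skewness; the data and the helper name `describe_skew` are hypothetical.

```python
from statistics import mean, mode

def describe_skew(scores):
    """Rough rule of thumb: compare the mean with the mode."""
    average, most_common = mean(scores), mode(scores)
    if average > most_common:
        return "positive"   # mean pulled to the right by a few high scores
    if average < most_common:
        return "negative"   # mean pulled to the left by a few low scores
    return "none"

hard_test = [1, 2, 2, 2, 3, 3, 4, 9]   # hypothetical: most scores low, one high
easy_test = [9, 8, 8, 8, 7, 7, 6, 1]   # hypothetical: most scores high, one low

print(mean(hard_test), mode(hard_test), describe_skew(hard_test))
print(mean(easy_test), mode(easy_test), describe_skew(easy_test))
```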

## Analysis & Interpretation of Correlations

Correlations measure the association between two co-variables, the results of which are plotted on a scattergram. To determine the strength of a correlation, a measure known as the **correlation coefficient** is calculated. This is a numerical value between -1 and +1. If the number is negative, there is a negative correlation between the two variables: as one increases, the other decreases. If it is positive, there is a positive correlation: as one variable increases, so does the other. The closer the number is to +1 or -1, the stronger the correlation. A value of 0 means that there is no correlation at all, and the closer the number is to 0, the weaker the correlation. For example, 0.8 and -0.75 would be strong correlations, while 0.15 and -0.09 would be weak. However, if the sample size is very large, even a seemingly weak correlation could be statistically significant; a statistical test is the only way to know this.
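As an illustration of where a coefficient comes from, the sketch below calculates Pearson's r, which is one common correlation coefficient (the notes above do not name a specific one, so this choice is an assumption). The data are invented for the example and the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How far the two variables move together, relative to their own spread.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(spread_x * spread_y)

# Hypothetical co-variables: hours of revision and test score.
hours = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]

r_positive = pearson_r(hours, test_scores)        # fairly strong positive
r_negative = pearson_r(hours, [5, 4, 3, 2, 1])    # perfect negative

print(round(r_positive, 2), round(r_negative, 2))
```

The coefficient describes strength and direction only. Whether it is statistically significant still depends on the sample size and needs a separate statistical test.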

## Levels of Measurement

This is a way of classifying quantitative (numerical) data. The level of measurement is a key factor in deciding which inferential test to use.

**Nominal:** The level of data used when categorising something. Named categories are established by the researcher and an item is counted when it falls into this category. For example, the number of males and females in a psychology class, or the number of monolingual, bilingual and multilingual students in the school. Each ‘item’ only appears in one category.

**Ordinal:** This is when data is ranked so that it is possible to see the order of scores in relation to one another. For example, in a 100m race, ranking who came first, second, third and so on. There is not an equal interval between each unit. For example, the person who won the race may have finished 0.1 seconds ahead of the 2nd place runner, but this runner may have finished 0.3 seconds ahead of the 3rd place runner. Due to this, the ranks rather than the raw scores are used in the statistical test.

**Interval:** This is a more sophisticated level of data. It not only gives the rank order of scores but it also details the precise intervals between scores. The measurement being used might be temperature or weight, where there is a universally accepted scale of measurement. For example, in the 100m race the finishing times of runners would be interval data: Clarke, N: 11.4 seconds; Smith, H: 11.9 seconds; Lloyd, P: 12.1 seconds.
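The difference between ordinal and interval data can be seen by converting the finishing times above into ranks. In this minimal sketch, tied scores share the mean of the positions they occupy, which is a common convention but an assumption here; the function name `rank_scores` is illustrative.

```python
def rank_scores(scores):
    """Rank scores from lowest (rank 1); tied scores share the mean rank."""
    ordered = sorted(scores)
    ranks = {}
    for value in set(ordered):
        first_position = ordered.index(value) + 1
        tie_count = ordered.count(value)
        ranks[value] = first_position + (tie_count - 1) / 2
    return [ranks[score] for score in scores]

# Finishing times from the example above (interval data).
finishing_times = {"Clarke, N": 11.4, "Smith, H": 11.9, "Lloyd, P": 12.1}
times = list(finishing_times.values())

ranks = rank_scores(times)   # ordinal data: order only
ordered_times = sorted(times)
gaps = [round(later - earlier, 1)
        for earlier, later in zip(ordered_times, ordered_times[1:])]

print(ranks)  # positions in the race
print(gaps)   # seconds between consecutive finishers
```

The ranks are evenly spaced (1, 2, 3) even though the gaps between runners are not, which is the information lost when interval data is reduced to ordinal data.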

## Content Analysis & Coding

**Coding:** This generates quantitative data. Data is categorised into units for the purposes of analysis. For example, when studying the portrayal of gender in TV adverts, a list of characteristics may be drawn up (‘aggressive’, ‘competitive’, ‘domestic’) and these behaviours are recorded each time they appear in the adverts.
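Once behaviours have been coded, producing the quantitative data is a matter of tallying how often each category was recorded. The sketch below uses the three categories named above with a made-up list of observations; the advert labels and counts are hypothetical.

```python
from collections import Counter

# Coding categories drawn up before the analysis.
categories = ["aggressive", "competitive", "domestic"]

# Hypothetical observations: (advert, behaviour recorded).
observations = [
    ("advert 1", "domestic"),
    ("advert 1", "competitive"),
    ("advert 2", "domestic"),
    ("advert 3", "aggressive"),
    ("advert 3", "domestic"),
]

tally = Counter(code for _, code in observations)
# Report every category, including any that were never observed.
counts = {category: tally[category] for category in categories}

print(counts)
```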

**Thematic analysis:** This generates qualitative data. Recurring themes will be identified using coding, then these will be described in greater detail. For example, ‘women are portrayed as the primary child-carer in adverts’ or ‘men primarily appear in a professional, working role in adverts’. These themes may then be tested by conducting further analyses, to be sure that they represent the content of the data.