The Testing Effect in the Psychology Classroom: A Meta-Analytic Perspective

Authors: Juliane Schwieren, Jonathan Barenberg, Stephan Dutke
Journal: Psychology Learning & Teaching
Year: 2017
DOI: 10.1177/1475725717695149
Citations: 97

TL;DR

Taking practice tests during learning (retrieval practice) improves final exam performance by a moderate amount (Cohen's d = 0.56) compared to re-studying or no additional practice, and this effect holds across psychology classrooms — meaning you can boost your own learning by replacing some study time with self-testing.

What they tested

This is a meta-analysis, meaning the authors combined results from multiple studies to answer one question: Does taking tests during the learning phase (retrieval practice) improve later performance on a final test, specifically when the learning material is psychology content?

**Intervention:** Retrieval practice — taking an intermediate test (quiz, recall exercise, or practice exam) after initial exposure to psychology material. Some studies gave feedback on the test; others did not.

**Comparators:** Two types of control conditions were used across the included studies:

**Restudy condition:** Participants re-read or re-studied the material instead of being tested.

**No-test condition:** Participants had no additional exposure to the material after initial learning.

**Outcome measures:** Performance on a final test (e.g., exam, quiz, recall test) administered hours to weeks after the learning phase. Outcomes included recognition, cued recall, free recall, and sometimes transfer or inference questions.

Who was studied

The meta-analysis included **19 publications** (individual studies) that collectively yielded **72 effect sizes**. The participants were primarily:

University students enrolled in psychology courses (both psychology majors and non-majors taking psychology classes).

Age range: typically 18–25 years (college-aged).

Settings: real psychology classrooms (field studies) and some lab-based studies using psychology content.

Sample sizes per study ranged from approximately 20 to over 200 participants.

The authors specifically excluded studies that used psychology students but non-psychology learning materials (e.g., word lists, prose texts about history). They only included studies where the *content* being learned was psychology.

How they measured it

Each individual study used its own final test, so there was no single instrument. However, the authors extracted effect sizes (Cohen's d) from each study, which standardises the difference between the tested and control groups in standard deviation units.
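To make the effect-size idea concrete, here is a minimal sketch of Cohen's d for two independent groups, using the pooled standard deviation. The scores are invented for illustration and do not come from any included study; within-subjects designs typically need a different formula, which this sketch does not cover.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(treatment, control):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    # statistics.variance is the sample variance (n - 1 denominator)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)

# Hypothetical final-test scores (percent correct), invented for illustration
tested = [78, 85, 72, 90, 81]
restudied = [70, 75, 68, 82, 74]
print(f"d = {cohens_d(tested, restudied):.2f}")
```

A positive value means the tested group scored higher; the unit is standard deviations, which is what allows results from different final tests to be combined.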

**Key variables coded from each study:**

**Study design:** Between-subjects (one group tested, one not) vs. within-subjects (all participants tested on some items but not others).

**Control condition type:** Restudy vs. no-test.

**Feedback:** Whether participants received correct answers after the intermediate test (yes/no).

**Retention interval:** Time between intermediate test and final test (ranged from minutes to weeks).

**Test format:** Multiple-choice, short answer, free recall, etc.

Methodology

**Study design:** This is a **meta-analysis** — a statistical synthesis of existing studies. The authors systematically searched PsycINFO (1941–August 2015) and reference lists of review articles, then applied inclusion criteria. They identified 19 publications meeting their criteria and extracted 72 effect sizes.

**Statistical approach:**

They used a **random-effects model**, which assumes the true effect size varies across studies (rather than one fixed effect). This is appropriate when studies differ in methods, samples, and settings.

They calculated an overall weighted mean effect size (Cohen's d).

They tested for **moderator effects** (study design, control condition type, feedback) using subgroup analyses.

They assessed **publication bias** (the tendency for only positive results to be published) using funnel plots and Egger's test.

They checked for **dependency** between effect sizes (multiple outcomes from the same study) and adjusted for it.
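The pooling step can be sketched as follows. The summary does not say which random-effects estimator the authors used, so the DerSimonian-Laird method here is an assumption, chosen because it is a common default. The function name and input numbers are hypothetical, and the sketch ignores the dependency adjustment mentioned above.

```python
from math import sqrt

def random_effects_pool(effects, variances):
    """Pool effect sizes with DerSimonian-Laird random-effects weights."""
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two effect sizes")
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed_mean = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (d - fixed_mean) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical effect sizes and sampling variances, invented for illustration
pooled, se, tau2 = random_effects_pool([0.2, 0.8], [0.04, 0.04])
print(f"pooled d = {pooled:.2f}, SE = {se:.2f}, tau^2 = {tau2:.2f}")
```

When the studies disagree more than sampling error alone would predict, `tau2` grows, the weights become more equal across studies, and the standard error of the pooled estimate widens.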

**What this design can and cannot prove:**

**Can prove:** That, across many classroom studies, retrieval practice consistently produces better learning outcomes than re-studying or no additional practice. The meta-analytic approach increases statistical power and generalisability beyond any single study.

**Cannot prove:** That the effect will hold for any particular individual; an average difference between groups does not guarantee the same gain for you. It cannot tell you the *optimal* number of practice tests, the best spacing between tests, or whether the effect works equally well for all types of psychology content (e.g., statistics vs. social psychology). It also cannot rule out that some of the benefit comes from increased total study time (since testing takes time) rather than retrieval practice per se.

**Major methodological weaknesses:**

Only **19 publications** were found — a surprisingly small number given the popularity of the testing effect. This suggests that very few studies have tested retrieval practice specifically with psychology content in classroom settings.

Many included studies had **small sample sizes**, which can inflate effect sizes in meta-analyses (small-study effects).

The **control conditions varied** widely: some compared testing to re-studying (a fairer comparison), others to no additional exposure (which inflates the apparent benefit of testing).

**Publication bias** was detected (see Limitations), meaning the true effect may be smaller than reported.

Key findings

**Primary outcome — overall testing effect:**

The overall weighted mean effect size was **Cohen's d = 0.56** (p < 0.001; the abstract does not report a 95% CI).

This is considered a **moderate effect** in educational research. For context, this means the average student in the testing condition scored about half a standard deviation higher than the average student in the control condition.
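Another way to read "half a standard deviation" is as a percentile. The sketch below computes Cohen's U3, the share of the control group scoring below the average tested student. It assumes both groups are normally distributed with equal variance, which the source does not state, and the function names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def u3_percentile(d):
    """Cohen's U3: percent of the control group below the average treated score."""
    return 100.0 * normal_cdf(d)

print(round(u3_percentile(0.56)))  # about 71
```

Under those assumptions, the average student in the testing condition would sit at roughly the 71st percentile of the control group, rather than the 50th.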

**Moderator analyses:**

**Study design:** Between-subjects designs (d = 0.65) showed larger effects than within-subjects designs (d = 0.42). This makes sense because within-subjects designs control for individual differences and are more conservative.

**Control condition type:** Studies using a no-test control (d = 0.72) showed larger effects than those using a restudy control (d = 0.44). This is expected because re-studying provides additional exposure, narrowing the gap.

**Feedback:** Studies that provided feedback after the intermediate test (d = 0.67) showed larger effects than those without feedback (d = 0.42). Feedback helps correct errors and reinforces correct answers.

**Retention interval:** The effect persisted across intervals from minutes to weeks, though the authors note that longer intervals (days to weeks) tend to produce larger testing effects in the broader literature.

**Publication bias:**

The funnel plot and Egger's test indicated **significant publication bias** — smaller studies with negative or null results were likely missing from the literature. After adjusting for this bias (using trim-and-fill method), the adjusted effect size was smaller (estimated d ≈ 0.40–0.45), though still statistically significant.

**Secondary findings:**

The testing effect was robust across different test formats (multiple-choice, short answer, free recall).

The effect was present in both real classroom settings and lab-based studies using psychology content.

Effect magnitude

**In plain English:** If you take practice tests while studying psychology, you can expect to improve your final exam score by roughly **half a standard deviation** compared to just re-reading your notes or doing nothing extra.

**Concrete translation:** Imagine a class where the average exam score is 70% with a standard deviation of 15 percentage points. A student who uses retrieval practice would be expected to score around **78%**, about 8 percentage points higher (0.56 × 15 ≈ 8.4). That's roughly the difference between a B- and a B+, or between passing and failing in some courses.

**Comparison to other learning strategies:** For reference, the same authors cite Dunlosky et al. (2013), who rated practice testing as having "high utility" — among the most effective learning techniques, comparable to distributed practice and superior to re-reading, highlighting, or summarising.

**Important caveat:** The adjusted effect (after correcting for publication bias) is smaller: around d = 0.40–0.45, or about a 6–7 percentage point improvement in a class with a standard deviation of 15 points. Still meaningful, but not as dramatic as the raw number suggests.
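The conversion from effect size to score points is a single multiplication, sketched below for the illustrative class (mean 70%, standard deviation 15 points). The helper name is hypothetical, and the result scales directly with whatever standard deviation your own course has.

```python
def expected_gain(d, sd_points):
    """Convert a standardised effect size into raw score points."""
    return d * sd_points

class_mean, class_sd = 70.0, 15.0  # the illustrative class from the text
for label, d in [("reported", 0.56), ("bias-adjusted", 0.40)]:
    gain = expected_gain(d, class_sd)
    print(f"{label}: +{gain:.1f} points -> {class_mean + gain:.1f}%")
# reported: +8.4 points -> 78.4%
# bias-adjusted: +6.0 points -> 76.0%
```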

Limitations

**What the authors acknowledge:**

**Small number of studies (k = 19):** This limits the power of moderator analyses and generalisability.

**Publication bias:** The detected bias means the literature likely overestimates the true effect.

**Heterogeneity:** Studies varied widely in design, materials, and outcome measures, making it difficult to pinpoint exactly which conditions produce the largest effects.

**Lack of long-term follow-up:** Most studies tested retention over days or weeks, not months or years. The testing effect is known to grow with longer intervals, but this wasn't systematically tested here.

**No analysis of individual differences:** The meta-analysis couldn't tell us whether retrieval practice works better for some students (e.g., high vs. low prior knowledge) than others.

**Additional critical notes:**

**Confound with total study time:** In many studies, the testing group spent more total time on the material (because testing takes time). The restudy control partially addresses this, but not perfectly.

**Psychology-specific content:** The findings may not generalise to other subjects (e.g., mathematics, languages, or skills-based learning).

**Classroom vs. self-study:** Most studies were in instructor-led classrooms. The effect might differ when you're studying alone without external deadlines or accountability.

**No blinding:** In classroom studies, neither students nor instructors were blind to condition, which could introduce expectancy effects.

**Industry funding:** Not applicable here; this is university-based research. Separately, the small literature base means many of the included studies were likely underpowered.

Practical takeaways

For someone running their own n=1 experiment:

### What to test

**Intervention:** Replace one study session (e.g., re-reading notes or textbook) with a **self-administered practice test** on the same material. Use free recall (write down everything you remember), short-answer questions, or multiple-choice quizzes.

**Dose:** One practice test per study session, lasting 10–20 minutes. The meta-analysis suggests even a single test produces benefits, but multiple tests over time likely yield larger effects.

**Feedback:** After the test, check your answers against your notes or an answer key. The meta-analysis found larger effects with feedback (d = 0.67 vs. 0.42 without).

### Minimum meaningful duration

**Run the experiment for at least 2–3 weeks** to allow for a meaningful retention interval. The testing effect is stronger after longer delays (days to weeks) than immediately after learning.

**Test yourself at least 24 hours after initial learning** — the effect is more robust when there's a delay between study and practice test.

### What to measure

**Primary metric:** Score on a final test (e.g., end-of-chapter quiz, exam, or self-designed test). Measure percentage correct.

**Secondary metrics:**

- Time spent studying (to check if testing just adds more study time).

- Confidence ratings before and after testing (to track metacognitive gains).

- Retention at multiple time points (e.g., 1 day, 1 week, 2 weeks after learning).

**Control condition:** For a fair comparison, use a **restudy control** — spend the same amount of time re-reading or reviewing notes instead of testing.

### Key confounds to control for

**Total study time:** Keep total time spent on the material equal between conditions. If you test for 15 minutes, spend 15 minutes re-studying in the control condition.

**Material difficulty:** Use the same type of content (e.g., textbook chapters of similar length and complexity) for both conditions.

**Order effects:** Alternate which condition comes first across weeks, and assign each topic to only one condition so that no chapter is both tested and re-studied. For example, Week 1: test on Chapter 1, restudy Chapter 2. Week 2: restudy Chapter 3, test on Chapter 4.

**Prior knowledge:** Pre-test your knowledge of each topic before starting, so you can check that the two conditions are balanced.

**Motivation:** Testing can feel more effortful than re-reading. Track your motivation and effort ratings to see if this confounds results.

**Sleep and time of day:** Study and test at the same time of day, and ensure similar sleep quality before each session.
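One way to keep material difficulty and order from lining up with condition is to assign topics at random. The sketch below does that instead of strict alternation, which is a swapped-in design choice rather than anything the paper prescribes. The function name and topic labels are hypothetical.

```python
import random

def assign_conditions(topics, seed=0):
    """Randomly split topics into equal-sized 'test' and 'restudy' sets."""
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    shuffled = list(topics)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    plan = {t: "test" for t in shuffled[:half]}
    plan.update({t: "restudy" for t in shuffled[half:2 * half]})
    return plan  # with an odd number of topics, one topic is left out

# Hypothetical chapter labels
for topic, condition in sorted(assign_conditions([f"ch{i}" for i in range(1, 7)]).items()):
    print(topic, condition)
```

Fixing the plan before you start also removes the temptation to test yourself only on the chapters that feel easier.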

### What a positive result would look like

**Score difference:** You score **5–8 percentage points higher** on the final test for topics you practised with retrieval vs. topics you re-studied.

**Consistency:** The effect appears across multiple topics or weeks, not just once.

**Retention:** The benefit is larger after a delay (e.g., 1 week) than immediately after studying — this is a hallmark of the testing effect.

**Effort:** You may find testing more mentally demanding, but if the payoff is a meaningful score boost, it's worth it.
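The criteria above can be checked with a few lines of arithmetic. The sketch below assumes you have one tested score and one re-studied score per week (or per matched topic pair); the data and function name are invented for illustration. The standardised paired difference it reports (often written d_z) is not directly comparable to the meta-analytic d of 0.56, because it is scaled by the variability of your own differences.

```python
from statistics import mean, stdev

def summarise(pairs):
    """pairs: list of (tested_score, restudied_score), one pair per week."""
    diffs = [t - r for t, r in pairs]
    avg = mean(diffs)
    wins = sum(1 for x in diffs if x > 0)  # weeks where testing came out ahead
    # Standardised mean of paired differences; needs >= 2 pairs and some spread
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    d_z = avg / spread if spread > 0 else float("nan")
    return {"mean_diff": avg, "wins": wins, "n": len(diffs), "d_z": d_z}

# Hypothetical percent-correct scores for four weeks
results = summarise([(80, 72), (75, 70), (68, 70), (85, 76)])
print(results)
```

With only a handful of weeks, treat the output as descriptive: a mean difference in the 5–8 point range that shows up in most weeks matches the pattern described above, but it is not a significance test.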

**Bottom line:** The evidence is clear — self-testing is one of the most effective, low-cost ways to improve long-term retention. For your n=1 experiment, replace passive re-reading with active recall, and expect a moderate but reliable boost in exam performance.
