What is the replication crisis in psychology?

The replication crisis is a methodological reckoning that came to a head when large-scale audits revealed that only about 36 percent of a sample of 100 published psychological findings could be successfully reproduced by independent researchers.

What were the main findings regarding ego depletion and power posing?

Rigorous, multi-laboratory replication attempts failed to find evidence for the biological or behavioral effects of both ego depletion and power posing, exposing them as likely false positives.

What questionable research practices contributed to these failed replications?

The failures were driven by questionable practices such as using very small sample sizes (the N-heuristic), 'p-hacking' data until finding significant results, and the selective publication of positive findings.

How did the psychological community respond to the replication crisis?

The crisis catalyzed a global credibility revolution that championed the Open Science movement, introducing reform tools like pre-registering hypotheses and methodologies to reduce bias.

Updated 2026-06-14

Key takeaways

Landmark psychological concepts like power posing and ego depletion failed rigorous replication tests, triggering a field-wide replication crisis.
These initial failures were largely driven by questionable research practices, including extremely small sample sizes, hidden data, and p-hacking.
In response, the scientific community launched a credibility revolution using open science methods such as pre-registration and Registered Reports to curb p-hacking and publication bias.
One large multi-laboratory study reported an 86 percent replicability rate when researchers used strict open science practices and adequately large sample sizes.
This push for transparency has expanded globally, aided by free Diamond Open Access publishing in Latin America and grassroots reproducibility networks in Africa.

Psychology experienced a profound reckoning when famous concepts like power posing and ego depletion failed rigorous replication tests. Investigators discovered these original studies were heavily distorted by flawed methods, including tiny sample sizes, hidden data, and selective analysis. In response, the scientific community launched a credibility revolution focused on transparent data sharing and pre-registered methodologies. Ultimately, by prioritizing rigorous open science over flashy results, the field is successfully rebuilding itself into a highly reliable discipline.

How the Replication Crisis Reshaped Psychology

When landmark behavioral studies like "power posing" and "ego depletion" failed to replicate in rigorous follow-up experiments, it triggered a painful reckoning known as the replication crisis. Uncovering these flawed research practices did not destroy the behavioral sciences, but rather exposed the dangers of small sample sizes and selective reporting. Ultimately, this crisis catalyzed a global "credibility revolution" focused on open data, pre-registered methodologies, and transparent international collaboration.

The Allure of Simple Solutions to Complex Behaviors

From the late 1990s through the early 2010s, behavioral science experienced a golden age of public attention and media saturation. Researchers were discovering seemingly profound, highly intuitive insights about human nature that promised easy interventions for everyday problems. The appeal was obvious: if complex human struggles with self-control, confidence, and success could be hacked with minor behavioral tweaks, the implications for self-improvement were limitless. Two of the most famous phenomena to emerge from this era were ego depletion and power posing. Both concepts were backed by peer-reviewed literature, both made intuitive sense, and both promised massive personal benefits from minimal effort.

Ego Depletion: Is Willpower a Muscle?

The concept of ego depletion was introduced to the world in a 1998 paper by social psychologists Roy Baumeister, Ellen Bratslavsky, Mark Muraven, and Dianne Tice ¹². The researchers proposed a highly relatable "strength model" of self-control. They posited that willpower operates much like a physical muscle: it draws upon a limited pool of conscious mental energy or resources ¹. When that energy is exhausted by repeated use - a state they termed "ego depletion" - the muscle fatigues, and an individual's capacity to exert self-control on subsequent, unrelated tasks is severely impaired ¹².

In a now-famous experiment, Baumeister and colleagues brought hungry participants into a laboratory that smelled of freshly baked chocolate chip cookies. Some participants were allowed to eat the cookies, while others were forced to exert extreme self-control by eating raw radishes instead ². Afterward, all participants were asked to solve a puzzle that, unbeknownst to them, was actually impossible. The researchers found that the participants who had exhausted their willpower resisting the cookies quit the puzzle much faster than those who had been allowed to indulge ².

The idea that self-control is a finite, depletable resource resonated deeply with both the public and the academic community. It seemingly explained everything from why we break our diets at the end of a stressful workday to why consumers make impulsive purchases ¹. The study initiated a massive wave of research spanning consumer behavior, dieting, and athletic performance, and for over a decade, the ego-depletion effect was considered a foundational truth of human psychology ¹³.

Power Posing: Faking It Till You Make It

A little over a decade later, another blockbuster psychological concept arrived: power posing. In a 2010 paper published in the journal Psychological Science, researchers Dana Carney, Amy Cuddy, and Andy Yap claimed that briefly adopting expansive, "high-power" physical postures could fundamentally alter a person's neuroendocrinology and behavior ⁴⁵.

The study, based on a remarkably small sample of 42 participants, reported that individuals who held expansive poses (like leaning back with hands behind the head, or standing with hands on hips like Wonder Woman) for just two minutes showed measurable physiological changes. Specifically, they exhibited an increase in testosterone (a hormone associated with dominance), a decrease in cortisol (a hormone associated with stress), and an increased willingness to take risks in a gambling task compared to participants who held contractive, "low-power" poses ⁵.

The researchers framed this as a powerful "life hack" with immediate real-world applications for high-stakes situations like job interviews and public speaking ⁴⁵. The concept exploded into the mainstream following a 2012 TED talk by Amy Cuddy. Driven by her passionate delivery and the memorable mantra "fake it till you become it," the presentation became one of the most viewed in TED's history ⁴⁶.

For years, both theories were treated as established scientific fact. Hundreds of subsequent studies appeared to build upon these foundations, creating vast literatures of conceptual replications and extensions ¹⁴. However, beneath the surface of these celebrated findings, a methodological crisis was quietly brewing, one that would soon threaten to upend the entire discipline.

The Replication Crisis Hits Psychology

Historically, the reproducibility of empirical results has been the ultimate cornerstone of the scientific method ⁶. If a phenomenon is real and robust, an independent laboratory following the exact same procedures should be able to observe it. In the early 2010s, facing growing skepticism about the statistical validity of "flashy" behavioral research, psychologists began undertaking large-scale, systematic efforts to replicate classic studies ⁶⁷.

The results of these audits were deeply alarming. In 2015, the Open Science Collaboration published a landmark project attempting to replicate 100 published psychological studies. They found that only 36 percent of the replications reproduced the original significant finding ⁸⁹. Furthermore, the effect sizes in the replications were, on average, about half the magnitude of the originals ⁶. This widespread failure to reproduce published scientific results became known as the "replication crisis" ⁶¹¹.

The pillars of both ego depletion and power posing quickly buckled under this new wave of rigorous scrutiny.

The Collapse of the Willpower Muscle

For ego depletion, the first major blow came in the form of independent scrutiny of the underlying statistics. While a 2010 meta-analysis of 83 studies by Martin Hagger and Nikos Chatzisarantis had previously reported a moderate effect size (d = 0.62) for ego depletion, researchers like Evan Carter and Michael McCullough soon pointed out that this literature was likely plagued by severe publication bias ¹³.

To definitively test the effect, Hagger and Chatzisarantis subsequently organized a high-profile, pre-registered replication study involving 23 independent laboratories. The results were devastating: across the massive combined sample, they found no reliable evidence of an ego-depletion effect ¹³. Hoping to settle the debate with an even more robust test, a subsequent multi-lab replication project led by Kathleen Vohs involved 36 distinct laboratories and tested 3,531 participants ¹. This massive undertaking also failed to find a meaningful ego-depletion effect, returning an effect size of d = 0.06 - a result an order of magnitude smaller than the original estimates and practically indistinguishable from zero ¹.

In light of the sheer volume of failed replications, many researchers concluded that the ego-depletion effect as a universal physiological phenomenon might simply not exist, representing a massive false-positive generated by a flawed scientific culture ³¹².

The Deflation of Power Posing

Power posing faced a nearly identical trajectory. In 2015, a research team led by Eva Ranehill attempted a conceptual replication of the original 2010 Carney, Cuddy, and Yap study ⁸¹⁰. Ranehill's team used a significantly larger sample size of 200 participants and utilized rigorous, computerized procedures to eliminate any potential experimenter bias or demand characteristics ¹⁰¹¹.

The Ranehill replication found absolutely no support for the original biological or behavioral effects ⁸¹⁰. While participants did self-report feeling more powerful after striking a pose - a subjective measure highly susceptible to placebo effects and the participants' own assumptions about the experiment - there were no significant changes in testosterone, cortisol, or objective risk-taking behavior ¹⁰¹².

The evidence against the physiological claims of power posing quickly compounded as multiple other labs failed to replicate the hormonal shifts ⁸¹³. In 2016, Dana Carney, the lead author of the original 2010 study, posted a public statement completely abandoning the theory. She explicitly stated she no longer believed that power poses produced the hormonal or behavioral effects they had originally claimed, leaving Amy Cuddy as the sole visible proponent of the theory in the public sphere ⁴⁵.

How Did the Original Studies Get It So Wrong?

For the general public, the abrupt reversal of seemingly established science was profoundly confusing. If the original papers were peer-reviewed and published in prestigious journals by researchers at Ivy League universities, how could they be entirely wrong?

The answer lies in the statistical and structural norms of psychological research during that era. In the vast majority of cases, researchers were not committing intentional, malicious fraud. Rather, they were engaging in a suite of widely accepted but mathematically flawed methodologies known collectively as Questionable Research Practices (QRPs) ¹⁷¹⁸. The replication crisis was essentially the field waking up to the mathematical consequences of its own leniency.

The Danger of the "N-Heuristic" and Small Sample Sizes

A primary structural driver of the replication crisis was the field's heavy reliance on extremely small sample sizes ¹⁴¹⁵. The original power posing study relied on just 42 participants divided into two conditions ⁵. In behavioral science, small samples suffer from drastically low statistical power, making them highly susceptible to random noise and natural human variation ¹²¹⁴.
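To give a rough sense of what "low statistical power" means for a study of this size, the sketch below estimates power with a normal approximation. The equal split into two groups of 21, the two-sided .05 threshold, and the true effect sizes are all assumptions chosen for illustration; only the total of 42 participants comes from the text above.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation to the t-test: slightly optimistic when
    groups are small, but close enough to show the pattern.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    expected_z = d * (n_per_group / 2) ** 0.5
    # Probability the test statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - expected_z)) + z.cdf(-z_crit - expected_z)

# Hypothetical true effect sizes (Cohen's d); 21 per group mirrors N = 42.
for d in (0.2, 0.5, 0.8):
    print(f"d={d}: n=21 -> {approx_power(d, 21):.2f}, n=100 -> {approx_power(d, 100):.2f}")
```

Under this approximation, a study with 21 people per group has well under a 50 percent chance of detecting a medium-sized effect (d = 0.5), while 100 per group would detect it more than 90 percent of the time.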

While researchers and the public often mistakenly assume that finding a significant result in a small sample means the effect must be incredibly strong, meta-scientific analyses reveal the exact opposite. Small samples tend to wildly exaggerate effect sizes due to sampling error ¹⁴²¹. In a sample of 40 people, only a very large observed difference can cross the threshold for statistical significance, so any small study that does reach significance, even by pure chance, will necessarily report an effect that looks massive. However, when that same study is run with 2,000 people, the natural variation smooths out, and the true - often negligible - effect emerges ¹⁵. This phenomenon highlights the danger of what methodologists call the "N-Heuristic," where researchers historically prioritized quick, low-cost studies over adequately powered investigations ¹⁵.
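The inflation described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of any real study: the true effect (d = 0.2), the group sizes, the number of simulated studies, and the use of a normal approximation in place of a t-test are all hypothetical choices.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05

def observed_d(true_d, n_per_group, rng):
    """Simulate one two-group study and return its observed Cohen's d."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    mc = sum(control) / n_per_group
    mt = sum(treated) / n_per_group
    vc = sum((x - mc) ** 2 for x in control) / (n_per_group - 1)
    vt = sum((x - mt) ** 2 for x in treated) / (n_per_group - 1)
    return (mt - mc) / ((vc + vt) / 2) ** 0.5

def mean_significant_d(true_d, n_per_group, n_studies=500, seed=1):
    """Average observed d among studies that reach significance in the
    predicted direction (the ones most likely to be published)."""
    rng = random.Random(seed)
    # Smallest observed d that counts as significant at this sample size.
    threshold = Z_CRIT / (n_per_group / 2) ** 0.5
    ds = [observed_d(true_d, n_per_group, rng) for _ in range(n_studies)]
    published = [d for d in ds if d > threshold]
    return sum(published) / len(published)

TRUE_D = 0.2  # hypothetical small true effect
print("n=20 per group: ", round(mean_significant_d(TRUE_D, 20), 2))
print("n=500 per group:", round(mean_significant_d(TRUE_D, 500), 2))
```

With 20 people per group, an observed effect must exceed roughly d = 0.62 to be significant at all, so every "successful" small study overstates a true effect of 0.2 at least threefold. With 500 per group, the significant studies average close to the true value.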

The Multiverse of P-Hacking

Perhaps the most destructive practice uncovered during the credibility revolution is "p-hacking," the exploitation of what methodologists call "researcher degrees of freedom." This occurs when researchers consciously or unconsciously make flexible data analysis decisions until their data yields a statistically significant result, traditionally denoted by a p-value of less than .05 ¹⁰¹⁷¹⁶.

If a study doesn't immediately yield a significant finding, a researcher might exclude certain participants as "outliers," control for different demographic variables, look at a different dependent variable, or collect a few more data points and check again ¹⁶¹⁷. By exploring many analytical pathways, researchers inadvertently capitalize on chance, making it far more likely that something will eventually look significant even when no real effect exists.
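A short simulation can show how two of these choices, trying several dependent variables and checking the data repeatedly while adding participants, inflate the false-positive rate. It is a sketch only: the number of outcomes, the sample sizes at each look, and the normal approximation to the test are hypothetical, and the data are pure noise with no true effect.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05

def significant(a, b):
    """z-approximation to a two-sample test on equal-sized groups."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return abs(ma - mb) / ((va + vb) / n) ** 0.5 > Z_CRIT

def flexible_study(rng, n_outcomes, looks):
    """Analyse pure noise flexibly; True if any test comes out significant."""
    max_n = max(looks)
    # No true effect: both groups are drawn from the same distribution.
    a = [[rng.gauss(0, 1) for _ in range(max_n)] for _ in range(n_outcomes)]
    b = [[rng.gauss(0, 1) for _ in range(max_n)] for _ in range(n_outcomes)]
    for n in looks:                    # check, then add more participants
        for k in range(n_outcomes):    # try each dependent variable
            if significant(a[k][:n], b[k][:n]):
                return True
    return False

def false_positive_rate(n_outcomes, looks, n_sims=2000, seed=7):
    rng = random.Random(seed)
    hits = sum(flexible_study(rng, n_outcomes, looks) for _ in range(n_sims))
    return hits / n_sims

print("one outcome, one look:     ", false_positive_rate(1, [50]))
print("three outcomes, four looks:", false_positive_rate(3, [20, 30, 40, 50]))
```

The single planned test should stay near the nominal 5 percent error rate, while the flexible version reports a "finding" several times more often, even though nothing is there to find.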

To demonstrate just how prevalent and impactful p-hacking can be, methodologists Marcus Credé and Leigh Phillips conducted a "multiverse analysis" of the original Carney, Cuddy, and Yap power posing data ¹⁶¹⁸. A multiverse analysis looks at every single plausible way a dataset could be analyzed. Credé and Phillips demonstrated that there were 54 different ways to analyze the power pose hormone data depending on how outliers were identified, how the dependent variable was specified (e.g., final hormone level vs. change in hormone level), and whether gender was controlled for ¹⁶¹⁸.

Depending on which specific combination of analytical choices a researcher made, the effect of power posing on testosterone ranged from a massive, highly significant effect to absolutely zero ¹⁶. The original authors had simply reported the one specific, optimistic pathway through the data that yielded a significant result, ignoring the vast "multiverse" of analytical pathways that showed no effect ¹⁶¹⁸.
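The mechanics of a multiverse analysis can be sketched in a few lines. Everything here is illustrative: the data are synthetic noise with no true effect, the three analytic choices and their options are hypothetical stand-ins that yield 12 specifications rather than the 54 examined by Credé and Phillips, and centring scores within gender is only a crude substitute for controlling for gender.

```python
import itertools
import random
from statistics import NormalDist

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def p_value(a, b):
    """Two-sided p-value from a z-approximation to the two-sample test."""
    (ma, va), (mb, vb) = mean_var(a), mean_var(b)
    z = (ma - mb) / (va / len(a) + vb / len(b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def make_null_data(n=42, seed=3):
    """Synthetic hormone-like scores with NO true effect of condition."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        pre = rng.gauss(50, 10)
        rows.append({"high_power": i % 2 == 0, "female": i % 4 < 2,
                     "pre": pre, "post": pre + rng.gauss(0, 5)})
    return rows

def analyse(rows, outlier_sd, outcome, centre_by_gender):
    """Run one analysis specification and return its p-value."""
    scores = [r["post"] - r["pre"] if outcome == "change" else r["post"]
              for r in rows]
    if centre_by_gender:  # crude stand-in for "controlling for gender"
        for flag in (True, False):
            grp = [s for s, r in zip(scores, rows) if r["female"] == flag]
            gm = sum(grp) / len(grp)
            scores = [s - gm if r["female"] == flag else s
                      for s, r in zip(scores, rows)]
    keep = list(zip(scores, rows))
    if outlier_sd is not None:  # drop scores far from the overall mean
        m, v = mean_var(scores)
        keep = [(s, r) for s, r in keep if abs(s - m) <= outlier_sd * v ** 0.5]
    high = [s for s, r in keep if r["high_power"]]
    low = [s for s, r in keep if not r["high_power"]]
    return p_value(high, low)

# Every combination of: outlier rule x outcome definition x gender handling.
SPECS = list(itertools.product([None, 2.0, 3.0], ["post", "change"], [False, True]))
rows = make_null_data()
results = {spec: analyse(rows, *spec) for spec in SPECS}
for spec, p in sorted(results.items(), key=lambda kv: kv[1]):
    print(spec, round(p, 3))
```

Reporting the full table of p-values, rather than only the smallest one, is what distinguishes a multiverse analysis from selective reporting.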

Feature	Original Studies (Pre-2015 Norms)	Rigorous Replications (Post-2015 Norms)
Sample Size (N)	Typically underpowered (e.g., N=42)	Massive, highly powered (e.g., N=200 to N=3,500+)
Data Transparency	Data held privately by researchers	Open datasets, shared materials, and open code
Analysis Plan	Flexible, determined after data collection (P-hacking)	Pre-registered publicly before data collection begins
Publication Bias	"File-drawer" effect; primarily positive results published	Registered Reports; published regardless of outcome
Result Replicability	Low (estimated ~36% success rate)	High (up to 86% success with rigorous methods)

The File Drawer Problem and Flat P-Curves

Finally, the scientific literature was heavily distorted by severe publication bias, commonly referred to as the "file drawer problem" ¹⁷. Academic journals have historically favored publishing novel, surprising, and statistically significant results ¹⁷. Consequently, if a researcher ran an ego depletion study and found nothing, that study was relegated to a filing cabinet, never to be published or shared.

Roy Baumeister, the chief architect of ego depletion, readily admitted to this practice. In personal communications regarding his research, he stated that his laboratory ran multiple studies, acknowledging that "some of which did not work, and some of which worked better than others." He defended dropping the insignificant results by stating, "You may think that not reporting the less successful studies is wrong, but that is how the field works" ¹². By hiding the failed experiments, the published literature created a powerful illusion of overwhelming, uniform evidence for a phenomenon that may have simply been the result of random statistical chance ³¹².

To quantify the scale of the file drawer problem, researchers Uri Simonsohn and Joe Simmons applied a statistical tool called a "p-curve analysis" to the power posing literature. A p-curve looks at the distribution of significant p-values across a body of literature to determine if the findings possess actual "evidential value" or are merely the result of selective reporting ¹²²⁵.

If an effect is real, there should be vastly more studies with highly significant p-values (e.g., p < .01) than barely significant ones (e.g., p = .04). When Simonsohn and Simmons analyzed the 33 supportive studies frequently cited by power posing defenders, they found the p-curve was completely flat ²⁵¹⁹. A flat p-curve indicates that the entire body of literature is statistically indistinguishable from a scenario where the true effect size is zero and the published results exist solely due to selective reporting and p-hacking ¹²¹⁹.
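The logic of the p-curve can be seen in a simple simulation that keeps only the significant results from many simulated studies and checks where their p-values fall. This is a sketch, not the procedure Simonsohn and Simmons used: the effect size, sample size, number of studies, and normal approximation to the test are assumptions, and the "no effect" case models selective reporting of chance results only.

```python
import random
from statistics import NormalDist

NORMAL = NormalDist()

def study_p_value(true_d, n_per_group, rng):
    """Two-sided p-value (z-approximation) for one simulated two-group study."""
    a = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    ma, mb = sum(a) / n_per_group, sum(b) / n_per_group
    va = sum((x - ma) ** 2 for x in a) / (n_per_group - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n_per_group - 1)
    z = (ma - mb) / ((va + vb) / n_per_group) ** 0.5
    return 2 * (1 - NORMAL.cdf(abs(z)))

def p_curve(true_d, n_per_group=50, n_studies=6000, seed=11):
    """Share of significant p-values in each bin: <.01, .01-.02, ..., .04-.05."""
    rng = random.Random(seed)
    bins = [0] * 5
    for _ in range(n_studies):
        p = study_p_value(true_d, n_per_group, rng)
        if p < 0.05:  # only "publishable" results enter the curve
            bins[min(int(p * 100), 4)] += 1
    total = sum(bins)
    return [count / total for count in bins]

print("true effect (d = 0.5):", [round(s, 2) for s in p_curve(0.5)])
print("no effect   (d = 0.0):", [round(s, 2) for s in p_curve(0.0)])
```

When the effect is real, most significant p-values pile up below .01. When the true effect is zero, the significant results spread roughly evenly across the five bins, which is the flat shape described above.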

The Defenders and the Culture War

The revelation that massive swaths of textbook psychology might be false did not go over smoothly. Instead, a bitter debate erupted within the scientific community, taking on the characteristics of an academic culture war.

Researchers whose entire careers, TED talks, and book deals were built on phenomena like ego depletion and power posing reacted defensively. Baumeister and other proponents of the willpower muscle argued that the replication failures did not invalidate their theory. Instead, they argued that the replication teams lacked the "expertise" to properly execute the psychological manipulations, failing to perfectly recreate the delicate psychological conditions required to elicit the effect ¹²⁷. They suggested that subtle differences in the tasks used, the instructions given, or the context of the laboratory had destroyed the effect ¹³.

Critics quickly pointed out the logical inconsistency in this defense: if a psychological effect is supposedly robust enough to dictate human behavior in chaotic, everyday life - such as deciding whether to buy a car or break a diet - it should not completely disappear simply because a laboratory used a slightly different computer task ²⁷. Furthermore, methodologists noted a conceptual crisis: many of the tasks used in ego depletion research had never been independently validated as actual measures of self-control, making it impossible to derive unambiguous predictions ³.

Cuddy similarly defended power posing, arguing that critics were ignoring the evidence and focusing too heavily on physiological markers like hormones rather than the subjective, self-reported feelings of power ⁴⁵. She also argued that holding a pose for three minutes, as some replications required, was too long and uncomfortable, somehow reversing the confidence-boosting effects seen at two minutes ²⁸.

The discourse occasionally turned toxic. In a widely circulated draft of a column, Princeton psychologist Susan Fiske delivered a scathing critique of the reform movement, describing the conduct of independent researchers and statisticians who pointed out anomalies as "methodological terrorism" and referring to them as the "self-appointed data police," accusing them of bullying researchers and undermining the public's trust in science ⁶.

However, the reformers pressed on. Independent watchdogs, such as the organization Retraction Watch, began meticulously tracking scientific retractions and fraudulent data. Founded in 2010 when journal retractions were incredibly rare, Retraction Watch grew into a massive database that, by 2024, had cataloged over 50,000 retracted papers globally ²⁰³⁰. These data showed that the problems in the scientific literature were not isolated to a few quirky social psychology studies. They extended into medicine, biology, and computer science, fueled by paper mills, forged data, and an academic culture that prioritized publication quantity over accuracy ²⁰³⁰.

The Credibility Revolution: Rebuilding Science

Far from destroying the field, the replication crisis catalyzed what is now known as the "credibility revolution" ¹¹. The painful realization that standard, unquestioned practices were producing a literature littered with false positives led to a systemic, community-driven overhaul of how behavioral science is conducted and evaluated.

The cornerstone of this revolution is the Open Science movement. Spearheaded by organizations like the Center for Open Science (COS), the movement advocates for total transparency across the entire research lifecycle ²¹²².

Pre-registration and Registered Reports

One of the most powerful methodological tools to emerge from this era is the practice of "pre-registration." Before collecting a single data point, researchers publicly log their exact hypothesis, intended sample size, and data analysis plan on platforms like the Open Science Framework (OSF) ¹⁷²¹. This sharply limits the room for undisclosed p-hacking or selective reporting, because peer reviewers and the scientific community can compare the final published paper against the original, time-stamped registered plan and spot any deviations ¹⁷³³.

Journals have also introduced a revolutionary publication model known as "Registered Reports." Under this model, a journal reviews the introduction and methodology of a proposed study before it is actually conducted. If the methodology is sound and the question is important, the journal guarantees publication in advance, regardless of whether the final result is positive, negative, or completely null ¹⁷²³. This elegant solution directly neutralizes the file-drawer problem, realigning incentives so that researchers are rewarded for asking good questions rigorously, rather than merely finding shiny, statistically significant anomalies.

Rigor Yields Replicability

There is strong empirical evidence that these new methods actually work. In a massive six-year study published in late 2023 in the journal Nature Human Behaviour, a coalition of top laboratories from institutions like UC Berkeley and Stanford attempted to discover and replicate 16 novel psychological findings ⁷. They did not use the old playbook. Instead, they used strict open science best practices, including massive sample sizes and rigorous pre-registration ⁷.

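The reported result was an 86 percent replicability rate ⁷²⁷. The authors argued that this was close to the maximum achievable given the studies' effect sizes and sample sizes, suggesting that when researchers abandon questionable shortcuts and adhere to rigorous methodological standards, psychological science can be highly reliable ⁷. The paper was retracted in 2024 after concerns were raised about how its pre-registration had been described, so the figure is best read as encouraging rather than settled.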

A Global Shift Toward Open Science

The shockwaves of the replication crisis have extended far beyond Western psychology departments, sparking a global policy shift toward transparent research infrastructure ²⁴²⁵. In 2021, UNESCO published a landmark recommendation endorsing open science, which all member states accepted, recognizing that the democratization of scientific knowledge is critical for accelerating innovation and solving global crises ²⁶²⁷²⁸.

However, the transition to open science has occurred unevenly around the world, hindered by stark disparities in internet connectivity, funding, and institutional support ²⁴²⁶²⁷. While the United States and Western Europe still account for roughly 85 percent of open publication and data repositories, other regions are rapidly pioneering their own unique models to circumvent the paywalls of traditional commercial publishers ²⁷²⁹.

Latin America's Diamond Open Access Leadership

Latin America has long been a global pioneer in open science, establishing robust, non-commercial infrastructures decades before the replication crisis forced the issue in the Global North ²⁹³⁰. Initiatives like the Scientific Electronic Library Online (SciELO), created in Brazil in 1996, and Redalyc in Mexico (2003), provide vast, interconnected digital libraries of open-access journals ³⁰⁴².

These platforms rely overwhelmingly on the "Diamond Open Access" model. Under the standard open access models used in the US and Europe, researchers or their grants must pay high Article Processing Charges (APCs) to publish their work, a heavy and often exclusionary burden for scientists in developing nations. Diamond Open Access, by contrast, is completely free for both the reader and the author ²⁷³¹³². Analyses of open science publishing reveal that nearly 90 percent of Latin American journals indexed in SciELO use the Diamond model ³¹. Because these journals are subsidized by academic institutions, government funding, and university presses, Latin America has insulated a large portion of its scientific output from the commercial logic and profit motives of Western publishing oligopolies ²⁹³¹.

Open Access Model	Who Pays to Read?	Who Pays to Publish?	Regional Dominance
Traditional Subscription	Reader / Institution	Free for Author	Global North (Historically)
Gold Open Access	Free for Reader	Author pays APCs	Global North (Increasingly)
Diamond Open Access	Free for Reader	Free for Author (Institution-funded)	Latin America

Grassroots Networks in Africa

In Africa, where scientific funding and infrastructure face significant constraints, the push for open science and replicability is being driven by powerful grassroots community organizing ³³⁴⁶. Organizations like the African Reproducibility Network (AREN), officially established in 2022, are actively working to bridge the gap in open science advocacy ⁴⁶³⁴.

AREN operates a comprehensive, tiered training program to develop Local Network Leads (LNLs) across the continent ⁴⁶³⁵. Rather than relying on top-down mandates, the program trains grassroots researchers in practical open science tools, such as how to properly pre-register studies, share data transparently, and conduct rigorous power analyses ³⁵³⁶. By the end of 2024, the program successfully trained 28 researchers representing 15 different African countries ³⁵. These highly trained champions return to their home institutions to establish local communities of practice, teaching their peers how to navigate the shifting requirements of global research standards ³⁵³⁷. By fostering local expertise and acknowledging regional challenges, these networks are building sustainable, culturally relevant open science ecosystems that elevate the quality of global research ³³³⁷.

Bottom line

The spectacular collapse of blockbuster theories like ego depletion and power posing served as a painful but profoundly necessary reckoning for the behavioral sciences. By exposing the invisible dangers of small sample sizes, selective reporting, and p-hacking, the replication crisis forced the academic community to abandon a culture that rewarded flashy, fragile findings in favor of strict methodological rigor. Today, propelled by the rise of pre-registration, Registered Reports, and equitable global publishing models, the scientific enterprise is steadily rebuilding its foundation to ensure that the discoveries of tomorrow are built on robust, reproducible facts rather than statistical illusions.

About this research

This article was produced with AI-assisted research using mmresearch.app and reviewed by a human. (NobleWeasel_97)