Before we get started, do me a favor and grab a pen or a pencil. Now hold it between your teeth, as if you were about to try to write with it. Don’t let your lips touch it. Sit with it, and pay attention to how you feel. Are you glum? Cheerful? Confused? Is that any different from how you felt before? Do you feel like this weird smile tricked your brain into a slight jump in happiness?
For a long time, psychologists thought exercises like this one did make us happier. If that were true, it would have implications for what emotion is, how we experience it and where emotions come from. Psychologists have believed that “facial feedback” from emotional expressions like smiling (or frowning) gives the brain information that heightens, or even sparks, an emotional experience.
It made so much sense that it was almost too good to check.
But then scientists did check. What they found poked holes in one of psychology’s textbook findings — which raised a whole new set of questions. Now, a huge group of scientists has banded together to try to get to the bottom of smiles, even if it means working with people who think they’re wrong.
The idea that smiling can make you feel happier has a long history. In 1872, Darwin mused about whether an emotion that was expressed would be felt more intensely than one that was repressed. Early psychologists took up the question in the 1880s. More than a hundred studies have been published on the topic. And it’s a trope of pop wisdom: “Smile, though your heart is aching,” sang Nat King Cole in 1954. “You’ll find that life is still worthwhile, if you’ll just smile.”
In 1988, social psychologist Fritz Strack published a study that seemed to confirm that facial feedback was real. The researchers asked participants to do more or less what I asked you to do earlier: hold a pen in their mouths in a position that forced them either to bare their teeth in a facsimile of a smile or to purse their lips around the pen. To ensure that no one was clued in to the researchers’ interest in smiles, the experimenters told participants that they were exploring how people with physical disabilities might write or perform other ordinary tasks.
When both groups were shown a set of newspaper comics — specifically, illustrations from Gary Larson’s The Far Side — the teeth-barers rated the images as funnier than the lip-pursers did. This was a big deal for the facial feedback hypothesis: Even though participants weren’t thinking about smiling or their mood, just moving their face into a smile-like shape seemed to affect their emotions. And so the finding made its way into psychology textbooks and countless news headlines. Decades of corroboration followed, as researchers published other experiments that also showed support for the facial feedback hypothesis.
But in 2016, all at once, 17 labs failed to replicate the pen study.
Those 17 studies, coordinated by Dutch psychologist E.J. Wagenmakers, repeated the original study as closely as possible to see if its result held up, with just a few changes. They found a new set of cartoons and pre-tested them to check they were about as funny as the old set. They also changed how they checked up on the participants’ pen technique: The original had an experimenter watching over things, but Wagenmakers and his team filmed participants instead.
When all 17 studies failed to replicate the original result, the effect was “devastating for the emotion literature,” said Nicholas Coles, a psychology grad student whose research focuses on the facial feedback effect. “Almost all emotion theories suggest that facial feedback should influence emotions.” While there are plenty of other methods for looking at facial feedback, many of them are more likely to make participants figure out the real purpose of the experiment, which makes their results trickier to interpret. The pen study had been solid — until it wasn’t.
These kinds of failed attempts to replicate other researchers’ results have been piling up in psychology’s “replication crisis,” which has called the reliability of psychology’s back catalogue into question. Past experiments may be unreliable because they relied on small sample sizes, buried boring or inconclusive results, or used statistical practices that make chance findings look like meaningful signals in what is really random noise. The result has been a morass of uncertainty: Which findings will hold up? And when one doesn’t, what precisely does that mean?
Wagenmakers and his team are just one of the many collaborations hoping to reshape psychology in the image of more established sciences like physics and genetics, where huge international consortia are already commonplace. Some collaborations, like the “Many Labs” projects, conduct multi-lab replications similar to the attempt to confirm the pen study and cover a broad swath of famous psychology studies. Others — like the ManyBabies Consortium, which conducts infant research — concentrate on a niche.
Then there’s the Psychological Science Accelerator, which is more focused on creating the infrastructure for collaboration, allowing its members to democratically elect studies to be run across its network of 548 labs in 72 countries. A recent paper by a group of reforming researchers called this kind of crowdsourced science one of the routes to “scientific utopia.”
Across six multi-lab replication projects, each trying to replicate multiple studies, only 47 percent of the 190 original results were successfully replicated. The failed attempt to replicate the pen study is in good company.
But as powerful as multi-lab replication efforts like these are, they aren’t necessarily the last word. When psychology tries to solve its replication crisis, it can sometimes create a crisis of a different kind, opening up a knowledge vacuum where an apparently reliable finding had previously stood.
Fritz Strack, the lead researcher on the original pen-in-mouth study from 1988, doesn’t think that Wagenmakers’s study tells us all that much. The world is constantly changing, he argues, and re-running an old experiment could produce new results not because the idea being tested is flawed but because the experiment itself is now out of step with the times. Although he suggested the replication effort himself, and advised on the design and the materials of the study, he refused to be fully involved. Instead, he said, he wanted the freedom to comment on the problems as he saw them without pulling any punches.
When the results were released, Strack found plenty of things to critique. He was concerned that newspaper cartoons no longer pack the same comedic punch that they did in the Midwest of the 1980s. The filming, he said, was another problem: It could be that filming made participants unusually self-conscious, affecting their experience of the task.
Strack thinks that it’s a mistake to focus on testing a method rather than a hypothesis. A method that fails might have been a bad test of the hypothesis, but the hypothesis is really what counts.
In this case, the hypothesis was that facial feedback can create an emotional effect even when people aren’t aware that their facial expression is an emotional one. Perhaps, Strack argued, his exact methods from the 1980s are no longer the best way to test that.
“Exact” replications are impossible, he said. “Things are changing — times are changing, the zeitgeist is changing, the culture is changing, the participants are changing. It’s not under your control.” What if you did the pen study with memes instead of cartoons? What if you didn’t use cameras? What would the differences tell us about facial feedback and when it comes into play?
Strack has been vocally critical of the credibility revolution, arguing that the term “replication crisis” is overblown. He says he prefers to focus on arguments about the quality of the research methods, rather than the statistical framework that is at the core of the credibility revolution’s concerns.
But similar critiques of massive replications come from inside the movement. Psychologist Tal Yarkoni, an ardent reformer, thinks that large-scale research efforts would do more good if they were used to test a huge array of different ways of getting at a question. A failed attempt to replicate a particular experiment doesn’t really tell you anything about the underlying theory, he said; all it tells you is that one particular design works or doesn’t work.
Wagenmakers doesn’t think his team’s replication is the final word on the facial feedback theory, either. “It’s a sign of good research that additional questions are raised,” he said. But he does think a failed replication like the one he led shifts the burden of proof. Now, he says, proponents of the facial feedback hypothesis should be the ones coming to the table with new evidence. Otherwise, “the replicating team will be like a dog playing fetch,” he said. “A person throws a ball and the [replication] team brings it back, but oh, it’s not quite right! I’m going to throw it in another direction. … It could go on forever. It’s clearly not a solution to the problem.”
Multi-lab studies can look large and impressive, said psychologist Charles Ebersole, who coordinated two Many Labs projects in grad school. Even so, it’s not clear how much confidence people should have in their results — the studies are big, which can improve confidence in their outcomes, but they’re subject to flaws and limitations just like smaller studies are. “Some people do an excellent job of not listening to [multi-lab studies] at all; maybe that’s the right answer? Some people bet a lot on them; maybe that’s the right answer? I don’t know.”
The way out of the replication crisis clearly isn’t brute replication alone.
When Wagenmakers and his colleagues published their replication study in 2016, Coles was digging deeply into the facial feedback literature. He planned to combine all of the existing literature into a giant analysis that could give a picture of the whole field. Was there really something promising going on with the facial feedback hypothesis? Or did the experiments that found a big fat zero cancel out the exciting findings? He was thrilled to be able to throw so much new data from 17 replication efforts into the pot.
He came up from his deep dive with intriguing findings: Overall, across hundreds of results, there was a small but reliable facial feedback effect. This left a new uncertainty hanging over the facial feedback hypothesis. Might there still be something going on — something that Wagenmakers’s replication attempt had missed?
Coles didn’t think that either Wagenmakers’s replication or his own study could put the matter to rest. The technique he used, called a meta-analysis, comes with its own problems. Specifically, if the studies thrown into the mix aren’t great to start with, the result isn’t particularly reliable — or, as Coles put it, “crap in, crap out.”
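To see why the quality of the inputs matters so much, here is a minimal sketch of one common way to pool study results: an inverse-variance weighted average, in which more precise studies count for more. This is not Coles’s actual analysis, which the article does not describe in detail, and every number below is hypothetical and invented purely for illustration. The function name `pooled_effect` is likewise an assumption, not something taken from any particular library.

```python
import math


def pooled_effect(effects, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    # Each study is weighted by 1 / SE^2, so precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


# Hypothetical inputs: two sound studies that find no effect at all...
sound_effects = [0.0, 0.0]
sound_errors = [0.1, 0.1]

# ...and one flawed study (also hypothetical) that reports a large effect
# with the same apparent precision.
flawed_effect, flawed_error = 0.6, 0.1

clean, _ = pooled_effect(sound_effects, sound_errors)
mixed, _ = pooled_effect(
    sound_effects + [flawed_effect],
    sound_errors + [flawed_error],
)

print(round(clean, 3), round(mixed, 3))
```

With these made-up numbers, adding a single flawed study moves the pooled estimate from zero to about 0.2. The averaging procedure has no way of knowing which inputs deserve trust, so a meta-analysis is only as good as the studies that go into it.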
So he set about designing a different kind of multi-lab collaboration. He wanted not just to replicate the original study, but to test it in a new way. And he wanted to test it in a way that would convince both the skeptics and those who still stood by the original result. He started to pull together a large team of researchers that included Strack. He also asked Phoebe Ellsworth, a researcher who was testing the facial feedback effect as far back as the 1970s, to come on board as a critic.
This partnership founded in disagreement is meant to get the game of fetch out of the way before the study even gets off the ground. Coles’s group, called the Many Smiles Collaboration, is far from the only one using this tactic; although some massive collaborations try to replicate old studies as closely as possible, others choose to workshop a new experimental method in excruciating detail before pulling the trigger. Ideally, this means that everyone will be convinced by the results, regardless of what they were personally rooting for or expecting. “It isn’t groupthink,” said Coles. “We’re actually trying to get at the truth.”
The Many Smiles Collaboration is based on the pen study from 1988, but with considerable tweaking. Through a lengthy back-and-forth between collaborators, peer reviewers and the journal editor, the team has refined the original plan, eventually arriving at a method that everyone agrees is a good test of the hypothesis. If it finds no effect, said Strack, “that would be a strong argument that maybe the facial feedback hypothesis is not true.”
An early pilot of the Many Smiles study indicated that the hypothesis might not be on its last legs just yet: The results suggested that smiling can affect feelings of happiness. Later this year, all the collaborators will kick into gear to see if the pilot’s findings can be repeated across 21 labs in 19 countries. If they find the same results, will that be enough to convince even the skeptics that it’s not just a fluke?
Well … maybe. A study like Wagenmakers’s sounds, in principle, like enough to lay a scientific question to rest, but it wasn’t. A study like Coles’s sounds like it could be definitive too, but it probably won’t be. Even Big Science can’t make science simple. “I’m still a little unsure, even though I’ve now replicated the effects successfully in my own labs,” said Coles. “I’ll hold my breath until the full data set comes in.”
Cathleen O’Grady is a South African freelance science journalist based in Scotland. She writes about people, animals and statistics. @cathleenogrady