Links & What I've Been Reading Q4 2020

Forecasting

Arpit Gupta on prediction markets vs 538 in the 2020 election: "Betting markets are pretty well calibrated—state markets that have an estimate of 50% are, in fact, tossups in the election. 538 is at least 20 points off—if 538 says that a state has a ~74% chance of going for Democrats, it really is a tossup." Also, In Defense of Polling: How I earned $50,000 on election night using polling data and some Python code. Here is a giant spreadsheet that scores 538/Economist vs markets. And here is a literal banana arguing against the very idea of polls.

Markets vs. polls as election predictors: An historical assessment (2012). Election prediction markets stretch back to the 19th century, and they used to be heavily traded and remarkably accurate despite the lack of any systematic polling information. Once polling was invented, volumes dropped and prediction markets lost their edge. Perhaps things are swinging in the other direction again?

Metaculus is organizing a "large-scale, comprehensive forecasting tournament dedicated to predicting advances in artificial intelligence" with $50k in prize money.

Covid

Philippe Lemoine critiques Flaxman et al.'s "Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe" in Nature.

However, as far as I can tell, Flaxman et al. don’t say what the country-specific effect was for Sweden either in the paper or in the supplementary materials. This immediately triggered my bullshit detector, so I went and downloaded the code of their paper to take a closer look at the results and, lo and behold, my suspicion was confirmed. In this chart, I have plotted the country-specific effect of the last intervention in each country:

Flaxman responds on Andrew Gelman's blog. Lemoine responds to the response.

Alex Tabarrok has been beating the drum for delaying the second dose and vaccinating more people with a single dose instead.

Twitter thread on the new, potentially more infectious B.1.1.7 variant.

Sound pollution decreased due to COVID-19, and "birds responded by producing higher performance songs at lower amplitudes, effectively maximizing communication distance and salience".

An ancient coronavirus-like epidemic drove adaptation in East Asians from 25,000 to 5,000 years ago. Plus Razib Khan commentary.

Scent dog identification of samples from COVID-19 patients: "The dogs were able to discriminate between samples of infected (positive) and non-infected (negative) individuals with average diagnostic sensitivity of 82.63% and specificity of 96.35%." [N=1012] Unfortunately they didn't try it on asymptomatic/pre-symptomatic cases.

And a massive update on everything Covid from Zvi Mowshowitz, much of it infuriating. Do approach with caution though, the auction argument in particular seems questionable.

Innovations and Innovation

Peer Rejection in Science: a collection of cases in which "key discoveries have been at some point rejected, mocked, or ignored by leading scientists and expert commissions."

Somewhat related: if AGI is close, why aren't large companies investing in it? NunoSempere comments with some interesting historical examples of breakthrough technologies that received very little investment or were believed to be impossible before they were realized.

DeepMind solves protein folding. And a couple of great blog posts by Mohammed AlQuraishi, where he talks about why AlphaFold is important, why this innovation didn't come from pharmaceutical companies or the academy, and more:

  • First, from 2018, AlphaFold @ CASP13: “What just happened?”:

    I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science. [...] What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet.

  • Then, in 2020, AlphaFold2 @ CASP14: “It feels like one’s child has left home.”:

    Once a problem is solved in any way, it becomes hard to justify solving it another way, especially from a publication standpoint.

"These improvements drop the turn-around time from days to twelve hours and the cost for whole genome sequencing (WGS) from about $1000 to $15, as well as increase data production by several orders of magnitude." If this is real (and keep in mind $15 is not the actual price end-users would pay) we can expect universal whole-genome sequencing, vast improvements in PGSs, and pervasive usage of genetics in medicine in the near future.

Extrapolating GPT-N performance: "Close-to-optimal performance on these benchmarks seems like it’s at least ~3 orders of magnitude compute away [...] Taking into account both software improvements and potential bottlenecks like data, I’d be inclined to update that downwards, maybe an order of magnitude or so (for a total cost of ~$10-100B). Given hardware improvements in the next 5-10 years, I would expect that to fall further to ~$1-10B."

Fund people, not projects I: The HHMI and the NIH Director's Pioneer Award. "Ultimately it's hard to disagree with Azoulay & Li (2020), we need a better science of science! The scientific method needs to examine the social practice of science as well, and this should involve funders doing more experiments to see what works. Rather than doing whatever it is that they are doing now, funders should introduce an element of explicit randomization into their process."

It will take more than a few high-profile innovations to end the great stagnation. "And if you sincerely believe that we are in a new era of progress, then argue for it rigorously! Show it in the data. Revisit the papers that were so convincing to you a year ago, and go refute them directly."

The Rest

Why Are Some Bilingual People Dyslexic in English but Not Their Other Language? I'm not entirely sure about the explanations proposed in the article, but it's fascinating nonetheless.

The Centre for Applied Eschatology: "CAE is an interdisciplinary research center dedicated to practical solutions for existential or global catastrophe. We partner with government, private enterprise, and academia to leverage knowledge, resources, and diverse interests in creative fusion to bring enduring and universal transformation. We unite our age’s greatest expertise to accomplish history’s greatest task."

Labor share has been decreasing over the past decades, but without a corresponding increase in the capital share of income. Where does the money go? This paper suggests: housing costs. Home ownership as investment may have seemed like a great idea in the past, but now we're stuck in this terrible equilibrium where spiraling housing costs are causing huge problems but it would be political suicide to do anything about it. It's easy to say "LVT now!" but good luck formulating a real plan to make it reality.

1/4 of animals used in research are included in published papers. Someone told me this figure is surprisingly high. Unfortunately there's no data in the paper breaking down unpublished null results vs bad data/failed experiments/etc.

@Evolving_Moloch reviews Rutger Bregman's Humankind. "Bregman presents hunter-gatherer societies as being inherently peaceful, antiwar, equal, and feminist likely because these are commonly expressed social values among educated people in his own society today. This is not history but mythology."

@ArtirKel reviews Vinay Prasad's Malignant, with some comments on progress in cancer therapy and the design of clinical trials. "The whole system is permeated by industry-money, with the concomitant perverse incentives that generates."

@Cerebralab2 reviews Nick Lane's Power, Sex, Suicide: Mitochondria and the meaning of life. "The eukaryotic cell appeared much later (according to the mainstream view) and in the space of just a few hundred million years—a fraction of the time available to bacteria—gave rise to the great fountain of life we see all around us."

Is the great filter behind us? The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare. "Together with the dispersed timing of key evolutionary transitions and plausible priors, one can conclude that the expected transition times likely exceed the lifetime of Earth, perhaps by many orders of magnitude. In turn, this suggests that intelligent life is likely to be exceptionally rare." (Highly speculative, and there are some assumptions one might reasonably disagree with.)

How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI. "Companies [...] manage the sentiment and tone of their disclosures to induce algorithmic readers to draw favorable conclusions about the content."

On the Lambda School stats. "If their outcomes are actually good, why do they have to constantly lie?"

On the rationalism-to-trad pipeline. (Does such a pipeline actually exist?) "That "choice" as a guiding principle is suspect in itself. It's downstream from hundreds of factors that have nothing to do with reason. Anything from a leg cramp to an insult at work can alter the "rational" substrate significantly. Building civilizations on the quicksand of human whim is hubris defined."

There is a wikipedia article titled List of nicknames used by Donald Trump.

Why Is There a Full-Scale Replica of the Parthenon in Nashville, Tennessee?

We Are What We Watch: Movie Plots Predict the Personalities of Those who “Like” Them. An amusing confirmation of stereotypes: low extraversion people are anime fanatics, low agreeableness people like Hannibal, and low openness people just have terrible taste (under a more benevolent regime they might perhaps be prohibited from consuming media).

A short film based on Blindsight. Won't make any sense if you haven't read the book, but it looks great.

12 Hours of Powerline Noise from Serial Experiments Lain.

And here is a Japanese idol shoegaze group. They're called ・・・・・・・・・ and their debut album is 「」.

What I've Been Reading

  • The History of the Decline and Fall of the Roman Empire by Edward Gibbon. Fantastic. Consistently entertaining over almost 4k pages. Gibbon's style is perfect. I took it slow, reading it over 364 days...and I would gladly keep going for another year. Full review forthcoming.

  • The Adventures and Misadventures of Maqroll by Álvaro Mutis. A lovely collection of 7 picaresque novellas that revolve around Maqroll, a cosmopolitan vagabond of the seas. Stylistically rich and sumptuous. It's set in a kind of parallel maritime world, the world of traders and seamen and port city whores. Very melancholy, with doomed business ventures at the edge of civilization, doomed loves, doomed lives, and so on. While the reader takes vicarious pleasure in the "nomadic mania" of Maqroll, the underlying feeling is one of fundamental dissatisfaction with what life has to offer—ultimately the book is about our attempts to overcome it. Reminiscent of Conrad, but also Herzog's diaries plus a bit of Borges in the style.

  • Pandora’s Box: A History of the First World War by Jörn Leonhard. A comprehensive, single-volume history of WWI from a German author. It goes far beyond military history: besides the battles and armaments it covers geopolitics and diplomacy, national politics, economics, public opinion, morale. All fronts and combatants are explored, and all this squeezed into just 900 pages (some things are inevitably left out - for example no mention is made of Hoover's famine relief efforts). Its approach is rather abstract, so if you're looking for a visceral description of the trenches this isn't the book for you. The translation isn't great, and it can get a bit dry and repetitive, but overall it's a very impressive tome. n.b. the hardcover edition from HUP is astonishingly bad and started falling apart immediately. (Slightly longer review on goodreads.)

  • To Hold Up the Sky by Liu Cixin. A new short story collection. Not quite at the same level as the Three Body Trilogy, but there are some good pieces. I particularly enjoyed two stories about strange and destructive alien artists: Sea of Dreams (in which an alien steals all the water on earth for an orbital artwork), and Cloud of Poems (in which a poetry contest ultimately destroys the solar system, sort of a humanistic scifi take on The Library of Babel).

  • Creating Future People: The Ethics of Genetic Enhancement by Jonathan Anomaly. A concise work on the ethical dilemmas posed by genetic enhancement technology. It's written by a philosopher, but uses a lot of ideas from game theory and economics to work out the implications of genetic enhancement. Despite its short length, it goes into remarkable practical detail on things like how oxytocin affects behavior, the causes of global wealth inequality, and the potential of genetic editing to decrease the demand for plastic surgery. On the other hand, I did find it somewhat lacking (if not evasive) in its treatment of more general and abstract philosophical questions, such as: under what conditions is it acceptable to hurt people today in order to help future people?

  • The Life and Opinions of Tristram Shandy, Gentleman by Laurence Sterne. Famously the "first postmodern novel", this fictional biography from the 1760s is inventive, bawdy, and generally really strange and crazy. Heavily influenced by Don Quixote, it parodies various famous writers of the time. Schopenhauer loved it and a young Karl Marx drew inspiration from it when writing Scorpion and Felix! I admire its ethos, and it's sometimes very funny. But ironic shitposting is still shitposting, and 700 pages of shitposting is a bit...pleonastic. At one point the narrator explains his digressions in the form of a line, one for each volume:

  • Illuminations by Walter Benjamin. Doesn't really live up to the hype. The essays on Proust and Baudelaire are fine, the hagiography of Brecht feels extremely silly in retrospect. The myopia of extremist 1930s politics prevents him from seeing very far.

  • Omensetter's Luck by William Gass. An experimental novel that does a great job of evoking 19th century rural America. Omensetter is a beguiling, larger-than-life figure, a kind of natural animal of a man. Nowhere near as good as The Tunnel and not much easier to read either. It's clearly an early piece, before Gass had fully developed his style.

  • The Silence: A Novel by Don DeLillo. Not so much a novel as a sketch of one. Not even undercooked, completely raw. It's about a sudden shutdown of all technology. Stylistically uninteresting compared to his other work. Here's a good review.

  • Little Science, Big Science by Derek John de Solla Price. Purports to be about the science of science, but really mostly an exploration of descriptive statistics over time - number of scientists, the distribution of their productivity and intelligence, distribution across countries, citations, and so on. Should have been a blog post. Nice charts, worth skimming just for them. (Very much out of print, but you can grab a pdf from the internet archive or libgen).

  • Experiment and the Making of Meaning: Human Agency in Scientific Observation and Experiment by David Gooding. Skimmed it. Written in 1990 but feels very outdated, nobody cares about observation sentences any more and they didn't in 1990 either. Some interesting points about the importance of experiment (as opposed to theory) in scientific progress. On the other hand all the fluffy stuff about "meaning" left me completely cold.

  • The Subjective Side of Science: A Philosophical Inquiry into the Psychology of the Apollo Moon Scientists by Ian Mitroff. Based on a series of structured interviews with geologists working on the Apollo project. Remarkably raw. Mitroff argues in favor of the subjective side, the biased side, of how scientists actually perform science in the real world. On personality types, relations between scientists, etc. There are some silly parts, like an attempt to tie Jungian psychology with the psychological clusters in science, and a very strange typology of scientific approaches toward the end, but overall it's above-average for the genre.

  • On Writing: A Memoir of the Craft by Stephen King. A pleasant autobiography combined with some tips on writing. The two parts don't really fit together very well. This was my first King book, I imagine it's much better if you're a fan (he talks about his own novels quite a lot).

  • The Lord Chandos Letter And Other Writings by Hugo von Hofmannsthal. A collection of short stories plus the titular essay. If David Lynch had been a symbolist writer, these are the kinds of stories he would have produced. Vague, mystical, dreamlike, impressionistic. I found them unsatisfying, and they never captured my interest enough to try to disentangle the symbols and allegories. The final essay about the limitations of language is worth reading, however.

  • Ghost Soldiers: The Forgotten Epic Story of World War II's Most Dramatic Mission by Hampton Sides. An account of a daring mission to rescue POWs held by the Japanese in the Philippines. The mission itself is fascinating but fairly short, and the book is padded with a lot of background info that is nowhere near as interesting (though it does set the scene). Parts of it are brutal and revolting beyond belief.

  • We Are Legion (We Are Bob) by Dennis Taylor. An interesting premise: a story told from the perspective of a sentient von Neumann probe. That premise is sort-of squandered by a juvenile approach filled to the brim with plot holes, and an inconclusive story arc: it's just setting up the sequels. Still, it's pretty entertaining. If you want a goofy space opera audiobook to listen to while doing other stuff, I'd recommend it.

  • Rocket Men: The Daring Odyssey of Apollo 8 and the Astronauts Who Made Man's First Journey to the Moon by Robert Kurson. Focused on the personalities, personal lives, and families of the three astronauts on Apollo 8: Frank Borman, William Anders, and James Lovell, set against the tumultuous political situation of late 1960s America. Written in a cinematic style, there's little on the technical/organizational aspects of Apollo 8. Its treatment of the cold war is rather naïve. The book achieves its goals, but I was looking for something different. Ray Porter (who I usually like) screws up the narration of the audiobook with an over-emotive approach, often emphasizing the wrong words. Really strange.




Book Review: The Idiot

In 1969, Alfred Appel declared that Ada or Ardor was "the last 19th-century Russian novel". Now we have in our hands a new last 19th-century Russian novel—perhaps even the final one. And while Nabokov selected the obvious and trivial task of combining the Russian and American novels, our "Dostoyevsky" (an obvious pseudonym) has given himself the unparalleled and interminably heroic mission of combining the Russian novel with the Mexican soap opera. I am pleased to report that he has succeeded in producing a daring postmodern pastiche that truly evokes the 19th century.

The basic premise of The Idiot is lifted straight from Nietzsche's Antichrist 29-31:

To make a hero of Jesus! And even more, what a misunderstanding is the word 'genius'! Our whole concept, our cultural concept, of 'spirit' has no meaning whatever in the world in which Jesus lives. Spoken with the precision of a physiologist, even an entirely different word would be yet more fitting here—the word idiot. [...] That strange and sick world to which the Gospels introduce us — a world like that of a Russian novel, in which refuse of society, neurosis and ‘childlike’ idiocy seem to make a rendezvous.

The novel opens with Prince Lev Nikolayevich Myshkin, the titular idiot, returning by train to Russia after many years in the hands of a Swiss psychiatrist. He is a Christlike figure, as Nietzsche puts it "a mixture of the sublime, the sickly, and the childlike", a naïf beset on all sides by the iniquities of the selfish and the corruptive influence of society. Being penniless, he seeks out a distant relative and quickly becomes entangled in St. Petersburg society.

If this were really a 19th century novel, it would follow a predictable course from this point: the "idiot" would turn out to be secretly wiser than everyone, the "holy fool" would speak truths inaccessible to normal people, his purity would highlight the corruption of the world around him, his naiveté would ultimately be a form of nobility, and so on.

Instead, Myshkin finds himself starring in a preposterous telenovela populated by a vast cast of absurdly melodramatic characters. He quickly receives an unexpected inheritance that makes him wealthy, and is then embroiled in a web of love and intrigue. As in any good soap opera, everything is raised to extremes in this book: there are no love triangles because three vertices would not be nearly enough; instead there are love polyhedrons, possibly in four or seven dimensions.

Myshkin's first love interest is the intimidating, dark, and self-destructive Nastasya Filippovna. An orphan exploited by her guardian, she is the talk of the town and chased by multiple suitors, including the violent Rogozhin and the greedy Ganya. Myshkin thinks she's insane but pities her so intensely that they have an endless and tempestuous on-again off-again relationship, which includes Nastasya skipping out on multiple weddings. In the construction of this character I believe I detect the subtle influence of the yandere archetype from Japanese manga.

The second woman in Myshkin's life is the young and wealthy Aglaya Ivanovna: proud, snobbish, and innocent, she cannot resist mocking Myshkin, but at the same time is deeply attracted to him. Whereas Nastasya loves Myshkin but thinks she's not good enough for him, Aglaya loves him but thinks she's too good for him.

The main cast is rounded off by a bunch of colorful characters, including the senile general Epanchin, various aristocrats, a boxer, a religious maniac, and Ippolit the nihilist who spends 600 pages in a permanent state of almost-dying (consumption, of course) and even gets a suicide fakeout scene that would make the producers of The Young and the Restless blush.

As Myshkin's relationships develop, he is always kind, non-judgmental, honest and open with his views. But this is not the story of a good man taken advantage of, but rather the story of a man who is simply incapable of living in the real world. Norm Macdonald, after seeing the musical Cats, exclaimed: "it's about actual cats!" The reader of The Idiot will inevitably experience the same shock of recognition, muttering "Mein Gott, it's about an actual idiot!" His behavior ends up hurting not only himself, but also the people around him, the people he loves. In the climax, Nastasya and Aglaya battle for Myshkin's heart, but it's a disaster as he makes all the wrong choices.

That's not to say that it's all serious; the drama is occasionally broken up by absurdist humor straight out of Monty Python:

Everyone realized that the resolution of all their bewilderment had begun.

‘Did you receive my hedgehog?’ she asked firmly and almost angrily.

Postmodern games permeate the entire novel: for example, what initially appears to be an omniscient narrator is revealed in the second half to simply be another character (a deeply unreliable one at that); one who sees himself as an objective reporter of the facts, but is really a gossip and rumourmonger. Toward the end he breaks the fourth wall and starts going on bizarre digressions that recall Tristram Shandy: at one point he excuses himself to the reader for digressing too far, then digresses even further to complain about the quality of the Russian civil service. The shifts in point of view become disorienting and call attention to the artificial nature of the novel. Critically, he never really warms up to Myshkin:

In presenting all these facts and refusing to explain them, we do not in the least mean to justify our hero in the eyes of our readers. More than that, we are quite prepared to share the indignation he aroused even in his friends.

Double Thoughts and Evolutionary Psychology

The entire novel revolves around the idea of the "double thought", an action with two motives: one pure and conscious, the other corrupt and hidden. Keller comes to Myshkin in order to confess his misdeeds, but also to use the opportunity to borrow money. Awareness of the base motive inevitably leads to guilt and in some cases self-destructive behavior. This is how Myshkin responds:

You have confused your motives and ideas, as I need scarcely say too often happens to myself. I can assure you, Keller, I reproach myself bitterly for it sometimes. When you were talking just now I seemed to be listening to something about myself. At times I have imagined that all men were the same,’ he continued earnestly, for he appeared to be much interested in the conversation, ‘and that consoled me in a certain degree, for a DOUBLE motive is a thing most difficult to fight against. I have tried, and I know. God knows whence they arise, these ideas that you speak of as base. I fear these double motives more than ever just now, but I am not your judge, and in my opinion it is going too far to give the name of baseness to it—what do you think? You were going to employ your tears as a ruse in order to borrow money, but you also say—in fact, you have sworn to the fact— that independently of this your confession was made with an honourable motive.1

The "double thought" is an extension of the concept of self-deception invented by evolutionary psychologist Robert Trivers, and, simply put, this book could not have existed without his work. Trivers has been writing about self-deception since the 70s in academic journals and books (including his 2011 book The Folly of Fools). The basic idea is that people subconsciously deceive themselves about the true motives of their actions, because it's easier to convince others when you don't have to lie.

Dostoyevsky's innovation lies in examining what happens when someone becomes aware of their subconscious motives and inevitably feels guilty. There is empirical evidence that inhibition of guilt makes deception more effective, but this novel inverts that problem and asks the question: what happens when that inhibition fails and guilt takes over? The author's penetrating psychological analysis finds a perfect home in the soap opera setting, as the opposition of extreme emotions engendered by the double thought complements the melodrama. Dostoyevsky even goes a step further, and argues that self-consciousness of the double thought is a double thought in itself: "I couldn't help thinking ... that everyone is like that, so that I even began patting myself on the back". There is no escape from the signaling games we play. The complexity of unconscious motives is a recurring theme:

Don't let us forget that the causes of human actions are usually immeasurably more complex and varied than our subsequent explanations of them.

In a move of pure genius, Dostoyevsky plays with this idea on three levels in parallel: first, the internal contrast between pure and corrupt motives within each person; second, the external contrast between the pure Idiot Myshkin and the corrupt society around him; third, on the philosophical level of Dionysus versus The Crucified. And in the end he comes down squarely in the camp of Dionysus and against Myshkin. Just as the Idiot is not ultimately good, so consciousness and the innocent motivations are not good either: the novel decides the issue strongly in favor of the corrupt motive, in favor of instinct over ratiocination, in favor of Dionysus over Apollo, in favor of the earthly over Christianity. We must live our lives in this world and deal with it as it is.

Double Anachronism

In the brilliant essay The Argentine Writer and Tradition, Borges writes that "what is truly native can and often does dispense with local color". Unfortunately Mssr. Dostoyevsky overloads his novel with local color, which in the end only highlights its artificiality. The lengths to which he has gone to make this novel appear as if it were a real product of the 19th century are admirable, but by overextending himself, he undermines the convincing (though fantastic) anachronism; like a double thought, the underlying deception leaks out and ruins everything. In a transparent and desperate reach for verisimilitude, he has included a series of references to real crimes from the 1860s. One cannot help but imagine the author bent over some dusty newspaper archive in the bowels of the National Library on Nevsky Prospekt, mining for details of grisly murders and executions.

Unfortunately The Idiot is anachronistic in more ways than one: as the juvenile influence of Der Antichrist hints, Dostoyevsky is a fervent anti-Christian who epitomizes the worst excesses of early-2000s New Atheism. Trivers wrote the foreword to Dawkins's The Selfish Gene, so it is no surprise that Dostoyevsky would be part of that intellectual tradition. But the heavy-handed anti-religious moralizing lacks nuance and gets old fast. His judgment of Myshkin, the representative of ideal Christianity, is heavy, but on top of that he also rants about the Catholic church, the representative of practical Christianity. He leaves no wiggle room in his condemnations.

And the lesson he wants to impart is clear: that Christianity is not only impractical and hypocritical, but actively ruins the lives of the people it touches. But these views hardly pass the smell test. While Dostoyevsky has mastered evolutionary psychology, he seems to have ignored cultural evolution. As Joe Henrich lays out in his latest book, The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous, the real-world influence of Christianity as a pro-social institution is a matter that deserves a far more nuanced examination. So if I could give Mssr. Dostoyevsky one piece of advice it would be this: less Dawkins, more Henrich please. After all, a devotee of Nietzsche should have a more subtle view of these things.


  1. See also Daybreak 523: "With everything that a person allows to become visible one can ask: What is it supposed to hide?"



Metascience and Philosophy

It has been said that philosophy of science is as useful to scientists as ornithology is to birds. But perhaps it can be useful to metascientists?

State of Play

Philosophy

In the 20th century, philosophy of science attracted first-rate minds: scientists like Henri Poincaré, Pierre Duhem, and Michael Polanyi, as well as philosophers like Popper, Quine, Carnap, Kuhn, and Lakatos. Today the field is a backwater, lost in endless debates about scientific realism which evoke the malaise of medieval angelology.1 Despite being part of philosophy, however, the field made actual progress, abandoning simplistic early models for more sophisticated approaches with greater explanatory power. Ultimately, philosophers reached one of two endpoints: some went full relativist,2 while others (like Quine and Laudan) bit the bullet of naturalism and left the matter to metascientists and psychologists.3 "It is an empirical question which means promote which ends".

Metascience

Did the metascientists actually pick up the torch? Sort of. There is some overlap, but (with the exception of the great Paul Meehl) they tend to focus on different problems. The current crop of metascientists is drawn, like sharks to blood, to easily quantifiable questions about the recent past (with all those p-values sitting around, how could you resist analyzing them?). They focus on different fields, and therefore different problems. They seem hesitant to make normative claims. Less tractable questions about forms of progress, norms, theory selection, etc. have fallen by the wayside. Overall I think they underrate the problems posed by philosophers.

Rational Reconstruction

In The History of Science and Its Rational Reconstructions Lakatos proposed that theories of scientific methodology function as historiographical theories and can be criticized or compared to each other by using the theories to create "rational historical reconstructions" of scientific progress. The idea is simple: if a theory fails to rationally explain the past successes of science, it's probably not a good theory, and we should not adopt its normative tenets. As Lakatos puts it, "if the rationality of science is inductive, actual science is not rational; if it is rational, it is not inductive." He applied this "Pyrrhonian machine de guerre" not only to inductivism and confirmationism, but also to Popper.

The main issue with falsification boils down to the problem of auxiliary hypotheses. On the one hand you have underdetermination (the Duhem-Quine thesis): testing hypotheses in isolation is not possible, so when a falsifying result comes out it's not clear where the modus tollens should be directed. On the other hand there is the possibility of introducing new auxiliary hypotheses to "protect" an existing theory from falsification. These are not merely abstract games for philosophers, but very real problems that scientists have to deal with. Let's take a look at a couple of historical examples from the perspective of naïve falsificationism.

First, Newton's laws. They were already falsified at the time of publication: they failed to correctly predict the motion of the moon. In the words of Newton, "the apse of the Moon is about twice as swift" as his predictions. Despite this falsification, the Principia attracted followers who worked to improve the theory. The moon was no small problem and took two decades to solve with the introduction of new auxiliary hypotheses.

A later episode involving Newton's laws illustrates how treacherous these auxiliary hypotheses can be. In 1846 Le Verrier (I have written about him before) solved an anomaly in the orbit of Uranus by hypothesizing the existence of a new planet. That planet was Neptune and its discovery was a wonderful confirmation of Newton's laws. A decade later Le Verrier tried to solve an anomaly in the orbit of Mercury using the same method. The hypothesized new planet was never found and Newton's laws remained at odds with the data for decades (yet nobody abandoned them). The solution was only found in 1915 with Einstein's general relativity: Newton should have been abandoned this time!

Second, Prout's hypothesis: in 1815 William Prout proposed that the atomic weights of all elements were multiples of the atomic weight of hydrogen. A decade later, chemists measured the atomic weight of chlorine at 35.45x that of hydrogen and Prout's hypothesis was clearly falsified. Except, a century after that, isotopes were discovered: variants of chemical elements with different neutron numbers. Turns out that natural chlorine is composed of 76% 35Cl and 24% 37Cl, hence the atomic weight of 35.45. Whoops! So here we have a case where falsification depends on an auxiliary hypothesis (no isotopes) which the experimenters had no way of knowing about.4
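
A quick sanity check on that weighted average (using the standard isotopic masses of roughly 34.97 u for 35Cl and 36.97 u for 37Cl, which are not given in the passage above): 0.76 × 34.97 + 0.24 × 36.97 ≈ 26.58 + 8.87 ≈ 35.45, exactly the "anomalous" value that seemed to falsify Prout.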

Popper tried to rescue falsificationism through a series of unsatisfying ad-hoc fixes: exhorting scientists not to be naughty when introducing auxiliary hypotheses, and saying falsification only applies to "serious anomalies". When asked what a serious anomaly is, he replied: "if an object were to move around the Sun in a square"!5

Problem, officer?

There are a few problems with rational reconstruction, and while I don't think any of them are fatal, they do mean we have to tread carefully.

External factors: no internal history of science can explain the popularity of Lysenkoism in the USSR—sometimes we have to appeal to external factors. But the line between internal and external history is unclear, and can even depend on your methodology of choice.

Meta-criterion choice: what criteria do you use to evaluate the quality of a rational reconstruction? Lakatos suggested using the criteria of each theory (eg use falsificationism to judge falsificationism) but he never makes a good case for that vs a standardized set of meta-criteria.

Case studies: philosophers tend to argue using case studies and it's easy to find one to support virtually any position, even if its normative suggestions are suboptimal. Lots of confirmation bias here. The illustrious Paul Meehl correctly argues for the use of "actuarial methods" instead. "Absent representative sampling, one lacks the database needed to best answer or resolve these types of inherently statistical questions." The metascientists obviously have a great methodological advantage here.

Fake history: the history of science as we read it today is sanitized if not fabricated.6 Successes are remembered and failures thrown aside; chaotic processes of discovery are cleaned up for presentation. As Peter Medawar noted in Is the scientific paper a fraud?, the "official record" of scientific progress contains few traces of the messy process that actually generated said progress.7 He further argues that there is a desire to conform to a particular ideal of induction which creates a biased picture of how scientific discovery works.

Falsification in Metascience

Now, let's shift our gaze to metascience. There's a fascinating subgenre of psychology in which researchers create elaborate scientific simulations and observe subjects as they try to make "scientific discoveries". The results can help us understand how scientific reasoning actually happens, how people search for hypotheses, design experiments, create new concepts, and so on. My favorite of these is Dunbar (1993), which involved a bunch of undergraduate students trying to recreate a Nobel-winning discovery in biochemistry.8

Reading these papers one gets the sense that there is a falsificationist background radiation permeating everything. When the subjects don't behave like falsificationists, it's simply treated as an error or a bias. Klahr & Dunbar scold their subjects: "our subjects frequently maintained their current hypotheses in the face of negative information". And within the tight confines of these experiments it's usually true that it is an error. But this reflects the design of the experiment rather than any inherent property of scientific reasoning or progress, and extrapolating these results to real-world science in general would be a mistake.

Sociology offers a cautionary tale about what happens when you take this kind of reasoning to an extreme: the strong programme people started with an idealistic (and wrong) philosophy of science, they then observed that real-world science does not actually operate like that, and concluded that it's all based on social forces and power relations, descending into an abyss of epistemological relativism. To reasonable people like you and me this looks like an excellent reductio ad absurdum, but sociologists are a special breed and one man’s modus ponens is another man’s modus tollens. The same applies to over-extensions of falsificationism. Lakatos:

...those trendy 'sociologists of knowledge' who try to explain the further (possibly unsuccessful) development of a theory 'falsified' by a 'crucial experiment' as the manifestation of the irrational, wicked, reactionary resistance by established authority to enlightened revolutionary innovation.

One could also argue that the current focus on replication is too narrow. The issue is obscured by the fact that in the current state of things the original studies tend to be very weak, the "theories" do not have track records of success, and the replications tend to be very strong, so the decision is fairly easy. But one can imagine a future scenario in which failed replications should be treated with far more skepticism.

There are also some empirical questions in this area that are ripe for the picking: at which point do scientists shift their beliefs to the replication over the original? What factors do they use? What do they view a falsification as actually refuting (ie where do they direct the modus tollens)? Longitudinal surveys, especially in the current climate of the social sciences, would be incredibly interesting.

Unit of Progress

One of the things philosophers of science are in agreement about is that individual scientists cannot be expected to behave rationally. Recall the example of Prout and the atomic weight of chlorine above: Prout simply didn't accept the falsifying results, and having obtained a value of 35.83 by experiment, rounded it to 36. To work around this problem, philosophers instead treated wider social or conceptual structures as the relevant unit of progress: "thinking style groups" (Fleck), "paradigms" (Kuhn), "research programmes" (Lakatos), "research traditions" (Laudan), etc. When a theory is tested, the implications of the result depend on the broader structure that theory is embedded in. Lakatos:

We have to study not the mind of the individual scientist but the mind of the Scientific Community. [...] Kuhn certainly showed that psychology of science can reveal important - and indeed sad - truths. But psychology of science is not autonomous; for the - rationally reconstructed - growth of science takes place essentially in the world of ideas, in Plato's and Popper's 'third world'.

Psychologists are temperamentally attracted to the individual, and this is reflected in their metascientific research methods which tend to focus on individual scientists' thinking, or isolated papers. Meehl, for example, simply views this as an opportunity to optimize individuals' cognitive performance:

The thinking of scientists, especially during the controversy or theoretical crises preceding Kuhnian revolutions, is often not rigorous, deep, incisive, or even fair-minded; and it is not "objective" in the sense of interjudge reliability. Studies of resistance to scientific discovery, poor agreement in peer review, negligible impact of most published papers, retrospective interpretations of error and conflict all suggest suboptimal cognitive performance.

Given the importance of broader structures however, things that seem irrational from the individual perspective might make sense collectively. Institutional design is criminally under-explored, and the differences in attitudes both over time and over the cross section of scientists are underrated objects of study.

You might retort that this is a job for the sociologists, but look at what they have produced: on the one hand they gave us Robert Merton, and on the other hand the strong programme. They don't strike me as particularly reliable.

Fields & Theories

Almost all the scientists doing philosophy of science were physicists or chemists, and the philosophers stuck to those disciplines in their analyses. Today's metascientists on the other hand mostly come from psychology and medicine. Not coincidentally, they tend to focus on psychology and medicine. These fields tend to have different kinds of challenges compared to the harder sciences: the relative lack of theory, for example, means that today's metascientists tend to ignore some of the most central parts of philosophy of science, such as questions about Lakatos's "positive heuristic" and how to judge auxiliary hypotheses, questions about whether the logical or empirical content of theories is preserved during progress, questions about how principles of theory evaluation change over time, and so on.

That's not to say no work at all has been done in this area; for example, Paul Meehl9 tried to construct a quantitative index of a theory's track record that could then be used to determine how to respond to a falsifying result. There's also some similar work from a Bayesian POV. But much more could be done in this direction, and much of it depends on going beyond medicine and the social sciences. "But Alvaro, I barely understand p-values, I could never do the math needed to understand physics!" If the philosophers could do it then so can the psychologists. But perhaps these problems require broader interdisciplinary involvement: not only specialists from other fields, but also input from neuroscience, computational science, etc.
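
To make the flavor of this concrete, here is a minimal toy sketch in Python (my own illustration, not Meehl's actual index or formula): score a theory's track record by its precision-weighted hit rate on risky predictions, and let a strong score buy the theory some patience when a falsifying result comes in.

    # Toy illustration only -- not Meehl's corroboration index.
    # Each prediction is (precision, confirmed): precision in (0, 1] measures
    # how narrow/risky the prediction was; confirmed says whether it panned out.

    def track_record(predictions):
        """Crude track-record score: precision-weighted hit rate."""
        weight = sum(p for p, _ in predictions)
        hits = sum(p for p, ok in predictions if ok)
        return hits / weight if weight else 0.0

    def respond_to_anomaly(score, threshold=0.7):
        """Toy decision rule: a strong track record buys patience with anomalies."""
        return ("blame the auxiliaries first" if score >= threshold
                else "start taking the falsification seriously")

    # Hypothetical theory with several narrow, successful predictions.
    theory = [(0.9, True), (0.8, True), (0.6, True), (0.7, False)]
    print(track_record(theory))                      # ~0.77
    print(respond_to_anomaly(track_record(theory)))  # blame the auxiliaries first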

What is progress?

One of the biggest questions the philosophers tried to answer was how progress is made, and how to even define it. Notions of progress as strictly cumulative (ie the new theory has to explain everything explained by the old one) inevitably lead to relativism, because theories are sometimes widely accepted at an "early" stage when they have limitations relative to established ones. But what is the actual process of consensus formation? What principles do scientists actually use? What principles should they use? Mertonian theories about agreement on standards/aims are clearly false, but we don't have anything better to replace them. This is another question that depends on looking beyond psychology, toward more theory-oriented fields.

Looking Ahead

Metascience can continue the work and actually solve important questions posed by philosophers:

  • Is there a difference between mature and immature fields? Should there be?
  • What guiding assumptions are used for theory choice? Do they change over time, and if yes how are they accepted/rejected? What is the best set of rules? Meehl's suggestions are a good starting point: "We can construct other indexes of qualitative diversity, formal simplicity, novel fact predictivity, deductive rigor, and so on. Multiple indexes of theoretical merit could then be plotted over time, intercorrelated, and related to the long-term fate of theories."
  • Can we tell, in real time, which fields are progressing and which are degenerating? If not, is this an opening for irrationalism? What factors should we use to decide whether to stick with a theory on shaky ground? What factors should we use to judge auxiliary hypotheses?10 Meehl started doing good work in this area, let's build on it.
  • Does null hypothesis testing undermine progress in social sciences by focusing on stats rather than the building of solid theories as Meehl thought?
  • Is it actually useful, as Mitroff suggests, to have a wide array of differently-biased scientists working on the same problems? (At least when there's lots of uncertainty?)
  • Gholson & Barker 1985 applied Lakatos and Laudan's theories to progress in physics and psychology (arguing that some areas of psychology do have a strong theoretical grounding), but this should be taken beyond case studies: comparative approaches with normative conclusions. Do strong theories really help with progress in the social sciences? Protzko et al 2020 offer some great data with direct normative applications, much more could be done in this direction.
  • And hell, while I'm writing this absurd Christmas list let me add a cherry on top: give me a good explanation of how abduction works!

Recommended reading:

  • Imre Lakatos, The Methodology of Scientific Research Programmes [PDF] [Amazon]

  1. Scientific realism is the view that the entities described by successful scientific theories are real.
  2. Never go full relativist.
  3. Quine abandoned the entirety of epistemology, "as a chapter of psychology".
  4. Prout's hypothesis ultimately turned out to be wrong for other reasons, but it was much closer to the truth than initially suggested by chlorine.
  5. The end-point of this line is the naked appeal to authority for deciding what is a serious anomaly and what is not.
  6. Fictions like the idea that Newton's laws were derived from and compatible with Kepler's laws abound. Even in a popular contemporary textbook for undergrads you can find statements like "Newton demonstrated that [Kepler's] laws are a consequence of the gravitational force that exists between any two masses." But of course the planets do not follow perfect elliptical orbits in Newtonian physics, and empirical deviations from Kepler were already known in Newton's time.
  7. Fleck is also good on this point.
  8. Klahr & Dunbar (1988) and Mynatt, Doherty & Tweney (1978) are also worth checking out. Also, these experiments could definitely be taken further, as a way of rationally reconstructing past advances in the lab.
  9. Did I mention how great he is?
  10. Lakatos: "It is very difficult to decide, especially since one must not demand progress at each single step, when a research programme has degenerated hopelessly or when one of two rival programmes has achieved a decisive advantage over the other."



The Riddle of Sweden's COVID-19 Numbers

Comparing Sweden's COVID-19 statistics to other European countries, two peculiar features emerge:

  1. Despite very different policies, Sweden has a similar pattern of cases.
  2. Despite a similar pattern of cases, Sweden has a very different pattern of deaths.

Sweden's Strategy

What exactly has Sweden done (and not done) in response to COVID-19?

  • The government has banned large public gatherings.
  • The government has partially closed schools and universities: lower secondary schools remained open while older students stayed at home.
  • The government recommends voluntary social distancing. High-risk groups are encouraged to isolate.
  • Those with symptoms are encouraged to stay at home.
  • The government does not recommend the use of masks, and surveys confirm that very few people use them (79% of Swedes report wearing a mask "not at all", vs 2% in France, 0% in Italy, and 11% in the UK).
  • There was a ban on visits to care homes which was lifted in September.
  • There have been no lockdowns.

How has it worked? Well, Sweden is roughly at the same level as other western European countries in terms of per capita mortality, but it's also doing much worse than its Nordic neighbors. Early apocalyptic predictions have not materialized. Economically it doesn't seem to have gained much, as its Q2 GDP drop was more or less the same as that of Norway and Denmark.1

Case Counts

Sweden has followed a trajectory similar to other Western countries with the first wave in April, a pause during the summer (Sweden took longer to settle down, however), and now a second wave in autumn.2

The fact that the summer drop-off in cases happened in Sweden without lockdowns and without masks suggests that perhaps those were not the determining factors? It doesn't necessarily mean that lockdowns are ineffective in general, just that in this particular case the no-lockdown counterfactual probably looks similar.

The similarity of the trajectories plus the timing points to a common factor: climate.

Seasonality?

This sure looks like a seasonal pattern, right? And there are good a priori reasons to think COVID-19 will be slow to spread in summer: the majority of respiratory diseases all but disappear during the warmer months. This chart from Li, Wang & Nair (2020) shows the monthly activity of various viruses sorted by latitude:

The exact reasons are unclear, but it's probably a mix of temperature, humidity,3 behavioral factors, UV radiation, and possibly vitamin D.

However, when it comes to COVID-19 specifically there are reasons to be skeptical. The US did not have a strong seasonal pattern:

And in the southern hemisphere, Australia's two waves don't really fit a clear seasonal pattern. [Edit: or perhaps it does fit? Their second wave was the winter wave; climate differences and lockdowns could explain the differences from the European pattern?]

The WHO (yes, yes, I know) says it's all one big wave and covid-19 has no seasonal pattern like influenza. A report from the National Academy of Sciences is also very skeptical about seasonality, making comparisons to SARS and MERS which do not exhibit seasonal patterns.

A review of 122 papers on the seasonality of COVID-19 is mostly inconclusive, citing lack of data and problems with confounding from control measures, social, economic, and cultural conditions. The results in the papers themselves "offer mixed statistical support (none, weak, or strong relationships) for the influence of environmental drivers." Overall I don't think there's compelling evidence in favor of climatic variables explaining a large percentage of variation in COVID-19 deaths. So if we can't attribute the summer "pause" and autumn "second wave" in Europe to seasonality, what is the underlying cause?

Schools?

If not the climate, then I would suggest schools, but the evidence suggests they play a very small role. I like this study from Germany which uses variation in the timing of summer breaks across states, finding no evidence for an effect on new cases. This paper utilizes the partial school closures in Sweden and finds open schools had only "minor consequences". Looking at school closures during the SARS epidemic, the results are similar. The ECDC is not particularly worried about schools, arguing that outbreaks in educational facilities are "exceptional events" that are "limited in number and size".

So what are we left with? Confusion.

Deaths

This chart shows daily new cases and new deaths for all of Europe:

There's a clear relationship between cases & deaths, with a lag of a few weeks as you would expect (and a change in magnitude due to increased testing and decreasing mortality rates). Here's what Sweden's chart looks like:

What is going on here? Fatality rates have been dropping everywhere, but cases and deaths appear to be completely disconnected in Sweden. Even the first death peak doesn't coincide with the first case peak, but that's probably because of early spread in nursing homes.
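
As an aside, here is one way to make "disconnected" precise, as a minimal sketch in Python with made-up numbers (the function and the toy data are my own illustration, not anything from the post or its charts): slide the deaths series back day by day and find the lag at which it lines up best with cases. On the Europe-wide series this kind of exercise should recover the few-week lag described above; on Sweden's autumn data it would presumably find no stable alignment at all, which is exactly the puzzle.

    import numpy as np

    def best_lag(cases, deaths, max_lag=42):
        """Return (lag, correlation): the shift of deaths relative to cases
        that maximizes their Pearson correlation."""
        cases = np.asarray(cases, dtype=float)
        deaths = np.asarray(deaths, dtype=float)
        best = (0, -1.0)
        for k in range(max_lag + 1):
            a, b = cases[:len(cases) - k], deaths[k:]
            if len(a) < 10:
                break
            r = np.corrcoef(a, b)[0, 1]
            if r > best[1]:
                best = (k, r)
        return best

    # Hypothetical data: deaths are exactly 1% of cases, three weeks later.
    days = np.arange(200)
    cases = 500 + 400 * np.sin(days / 20)
    deaths = np.concatenate([np.full(21, 5.0), 0.01 * cases[:-21]])
    print(best_lag(cases, deaths))  # finds a lag of 21 days, correlation ~1.0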

Are they undercounting deaths? I don't think so; total deaths seem to be below normal levels (data from euromomo):

So how do we explain the lack of deaths in Sweden?

Age?

Could it be that only young people are catching it in Sweden? I haven't found any up to date, day-by-day breakdowns by age, but comparing broad statistics for Sweden and Europe as a whole, they look fairly similar. Even if age could explain it, why would that be the case in Sweden and not in other countries? Why aren't the young people transmitting it to vulnerable old people? Perhaps it's happening and the lag is long enough that it's just not reflected in the data yet?

[Edit: thanks to commenter Frank Suozzo for pointing out that cases are concentrated in lower ages. I have found data from July 31 on the internet archive; comparing it to the latest figures, it appears that old people have managed to avoid getting covid in Sweden! Here's the chart showing total case counts:]

Improved Treatment?

Mortality has declined everywhere, and part of that is probably down to improved treatment. But I don't see Sweden doing anything unique which could explain the wild discrepancy.

Again I'm left confused about these cross-country differences. If you have any good theories I would love to hear them. [Edit: Looks like age is the answer.]

  1. I think the right way to look at this is to say that Sweden has underperformed given its cultural advantages. The differences between Italian-, French-, and German-speaking cantons in Switzerland suggest a large role for cultural factors. Sweden should've followed a trajectory similar to its neighbors rather than one similar to Central/Southern Europe. Of course it's hard to say how things will play out in the long run.
  2. Could this all be just because of increased testing? No. While testing has increased, the rate of positive tests has also risen dramatically. The second wave is not a statistical artifact.
  3. Humidity seems very important, at least when it comes to influenza. See eg Absolute Humidity and the Seasonal Onset of Influenza in the Continental United States and Absolute humidity modulates influenza survival, transmission, and seasonality. There's even experimental evidence here; some papers: High Humidity Leads to Loss of Infectious Influenza Virus from Simulated Coughs, Humidity as a non-pharmaceutical intervention for influenza A.



When the Worst Man in the World Writes a Masterpiece

Boswell's Life of Johnson is not just one of my favorite books, it also engendered some of my favorite book reviews. While praise for the work is universal, the main question commentators try to answer is this: how did the worst man in the world manage to write the best biography?

The Man

Who was James Boswell? He was a perpetual drunk, a degenerate gambler, a sex addict, whoremonger, exhibitionist, and rapist. He gave his wife an STD he caught from a prostitute.

Selfish, servile and self-indulgent, lazy and lecherous, vain, proud, obsessed with his aristocratic status, yet with no sense of propriety whatsoever, he frequently fantasized about the feudal affection of serfs for their lords. He loved to watch executions and was a proud supporter of slavery.

“Where ordinary bad taste leaves off,” John Wain comments, “Boswell began.” The Thrales were long-time friends and patrons of Johnson; a single day after Henry Thrale died, Boswell wrote a poem fantasizing about the elderly Johnson and the just-widowed Hester: "Convuls'd in love's tumultuous throws, / We feel the aphrodisian spasm". The rest of his verse is of a similar quality; naturally he considered himself a great poet.

Boswell combined his terrible behavior with a complete lack of shame, faithfully reporting every transgression, every moronic ejaculation, every faux pas. The first time he visited London he went to see a play and, as he happily tells us himself, he "entertained the audience prodigiously by imitating the lowing of a cow."

By all accounts, including his own, he was an idiot. On a tour of Europe, his tutor said to him: "of young men who have studied I have never found one who had so few ideas as you."

As a lawyer he was a perpetual failure, especially when he couldn't get Johnson to write his arguments for him. As a politician he didn't even get the chance to be a failure despite decades of trying.

His correspondence with Johnson mostly consists of Boswell whining pathetically and Johnson telling him to get his shit together.

He commissioned a portrait from his friend Joshua Reynolds and stiffed him on the payment. His descendants hid the portrait in the attic because they were ashamed of being related to him.

Desperate for fame, he kept trying to attach himself to important people, mostly through sycophancy. In Geneva he pestered Rousseau,1 leading to this conversation:

Rousseau: You are irksome to me. It’s my nature. I cannot help it.
Boswell: Do not stand on ceremony with me.
Rousseau: Go away.

Later, Boswell was given the task of escorting Rousseau's mistress Thérèse Le Vasseur to England—they had an affair on the way.

When Adam Smith and Edward Gibbon were elected to The Literary Club, Boswell considered leaving because he thought the club had now "lost its select merit"!

On the positive side, his humor and whimsy made for good conversation; he put people at ease; he gave his children all the love his own father had denied him; and, somehow, he wrote one of the great works of English literature.

The Masterpiece

The Life of Samuel Johnson, LL.D. was an instant sensation. While the works of Johnson were quickly forgotten,2 his biography has never been out of print in the 229 years since its initial publication. It went through 41 editions just in the 19th century.

Burke told King George III that he had never read anything more entertaining. Coleridge said "it is impossible not to be amused with such a book." George Bernard Shaw compared Boswell's dramatization of Johnson to Plato's dramatization of Socrates, and placed old Bozzy in the middle of an "apostolic succession of dramatists" from the Greek tragedians through Shakespeare and ending, of course, with Shaw himself.

It is a strange work, an experimental collage of different modes: part traditional biography, part collection of letters, and part direct reports of Johnson's life as observed by Boswell.3 His inspiration came not from literature, but from the minute naturalistic detail of Flemish paintings. It is difficult to convey its greatness in compressed form: Boswell is not a great writer at the sentence level, and all the famous quotes are (hilarious) Johnsonian bon mots. The book succeeds through a cumulative effect.

Johnson was 54 years old when he first met Boswell, and most of his major accomplishments (the poetry, the dictionary, The Rambler) were behind him; his wife had already died; he was already the recipient of a £300 pension from the King; his edition of Shakespeare was almost complete. All in all they spent no more than 400 days together. Boswell had limited material to work with, but what he doesn't capture in fact, he captures in feeling. An entire life is contained in this book: love and friendship, taverns and work, the glory of success and recognition, the depressive bouts of failure and penury, the inevitable tortures of aging and death.

Out of a person, Boswell created a literary personality. His powers of characterization are positively Shakespearean, and his Johnson resembles none other than the bard's greatest creation: Sir John Falstaff. Big, brash, and deeply flawed, but also lovable. He would "laugh like a rhinoceros":

Johnson could not stop his merriment, but continued it all the way till he got without the Temple-gate. He then burst into such a fit of laughter that he appeared to be almost in a convulsion; and in order to support himself, laid hold of one of the posts at the side of the foot pavement, and sent forth peals so loud, that in the silence of the night his voice seemed to resound from Temple-bar to Fleet ditch.

And around Johnson he painted an entire dramatic cast, bringing 18th century London to life: Garrick the great actor, Reynolds the painter, Beauclerk with his banter, Goldsmith with his insecurities. Monboddo and Burke, Henry and Hester Thrale, the blind Mrs Williams and the Jamaican freedman Francis Barber.

Borges (who was also a big fan) finds his parallels not in Shakespeare and Falstaff, but in Cervantes and Don Quixote. He (rather implausibly) suggests that every Quixote needs his Sancho, and "Boswell appears as a despicable character" deliberately to create a contrast.4

And in the 1830s, two brilliant and influential reviews were written by two polar opposites: arch-progressive Thomas Babington Macaulay and radical reactionary Thomas Carlyle. The first thing you'll notice is their sheer magnitude: Macaulay's is 55 pages long, while Carlyle's review in Fraser's Magazine reaches 74 pages!5 And while they both agree that it's a great book and that Boswell was a scoundrel, they have very different theories about what happened.

Macaulay

Never in history, Macaulay says, has there been "so strange a phænomenon as this book". On the one hand he has effusive praise:

Homer is not more decidedly the first of heroic poets, Shakspeare is not more decidedly the first of dramatists, Demosthenes is not more decidedly the first of orators, than Boswell is the first of biographers. He has no second. He has distanced all his competitors so decidedly that it is not worth while to place them.

On the other hand, he spends several paragraphs laying into Boswell with gusto:

He was, if we are to give any credit to his own account or to the united testimony of all who knew him, a man of the meanest and feeblest intellect. [...] He was the laughing-stock of the whole of that brilliant society which has owed to him the greater part of its fame. He was always laying himself at the feet of some eminent man, and begging to be spit upon and trampled upon. [...] Servile and impertinent, shallow and pedantic, a bigot and a sot, bloated with family pride, and eternally blustering about the dignity of a born gentleman, yet stooping to be a talebearer, an eavesdropper, a common butt in the taverns of London.

Macaulay's theory is that while Homer and Shakespeare and all the other greats owe their eminence to their virtues, Boswell is unique in that he owes his success to his vices.

He was a slave, proud of his servitude, a Paul Pry, convinced that his own curiosity and garrulity were virtues, an unsafe companion who never scrupled to repay the most liberal hospitality by the basest violation of confidence, a man without delicacy, without shame, without sense enough to know when he was hurting the feelings of others or when he was exposing himself to derision; and because he was all this, he has, in an important department of literature, immeasurably surpassed such writers as Tacitus, Clarendon, Alfieri, and his own idol Johnson.

Of the talents which ordinarily raise men to eminence as writers, Boswell had absolutely none. There is not in all his books a single remark of his own on literature, politics, religion, or society, which is not either commonplace or absurd. [...] Logic, eloquence, wit, taste, all those things which are generally considered as making a book valuable, were utterly wanting to him. He had, indeed, a quick observation and a retentive memory. These qualities, if he had been a man of sense and virtue, would scarcely of themselves have sufficed to make him conspicuous; but, because he was a dunce, a parasite, and a coxcomb, they have made him immortal.

The work succeeds partly because of its subject: if Johnson had not been so extraordinary, then airing all his dirty laundry would have just made him look bad.

No man, surely, ever published such stories respecting persons whom he professed to love and revere. He would infallibly have made his hero as contemptible as he has made himself, had not his hero really possessed some moral and intellectual qualities of a very high order. The best proof that Johnson was really an extraordinary man is that his character, instead of being degraded, has, on the whole, been decidedly raised by a work in which all his vices and weaknesses are exposed.

And finally, Boswell provided Johnson with a curious form of literary fame:

The reputation of [Johnson's] writings, which he probably expected to be immortal, is every day fading; while those peculiarities of manner and that careless table-talk the memory of which, he probably thought, would die with him, are likely to be remembered as long as the English language is spoken in any quarter of the globe.

Carlyle

Carlyle rates Johnson's biography as the greatest work of the 18th century. In a sublime passage that brings tears to my eyes, he credits the Life with the power of halting the inexorable passage of time:

Rough Samuel and sleek wheedling James were, and are not. [...] The Bottles they drank out of are all broken, the Chairs they sat on all rotted and burnt; the very Knives and Forks they ate with have rusted to the heart, and become brown oxide of iron, and mingled with the indiscriminate clay. All, all has vanished; in every deed and truth, like that baseless fabric of Prospero's air-vision. Of the Mitre Tavern nothing but the bare walls remain there: of London, of England, of the World, nothing but the bare walls remain; and these also decaying (were they of adamant), only slower. The mysterious River of Existence rushes on: a new Billow thereof has arrived, and lashes wildly as ever round the old embankments; but the former Billow with its loud, mad eddyings, where is it? Where! Now this Book of Boswell's, this is precisely a revocation of the edict of Destiny; so that Time shall not utterly, not so soon by several centuries, have dominion over us. A little row of Naphtha-lamps, with its line of Naphtha-light, burns clear and holy through the dead Night of the Past: they who are gone are still here; though hidden they are revealed, though dead they yet speak. There it shines, that little miraculously lamplit Pathway; shedding its feebler and feebler twilight into the boundless dark Oblivion, for all that our Johnson touched has become illuminated for us: on which miraculous little Pathway we can still travel, and see wonders.

Carlyle disagrees completely with Macaulay: it is not because of his vices that Boswell could write this book, but rather because he managed to overcome them. He sees in Boswell a hopeful symbol for humanity as a whole, a victory in the war between the base and the divine in our souls.

In fact, the so copious terrestrial dross that welters chaotically, as the outer sphere of this man's character, does but render for us more remarkable, more touching, the celestial spark of goodness, of light, and Reverence for Wisdom, which dwelt in the interior, and could struggle through such encumbrances, and in some degree illuminate and beautify them.

Boswell's shortcomings were visible: he was "vain, heedless, a babbler". But if that was the whole story, would he really have chosen Johnson? He could have picked more illustrious targets, richer ones, perhaps some powerful statesman or an aristocrat with a distinguished lineage. "Doubtless the man was laughed at, and often heard himself laughed at for his Johnsonism". Boswell must have been attracted to Johnson by nobler motives. And to do that he would have to "hurl mountains of impediment aside" in order to overcome his nature.

The plate-licker and wine-bibber dives into Bolt Court, to sip muddy coffee with a cynical old man, and a sour-tempered blind old woman (feeling the cups, whether they are full, with her finger); and patiently endures contradictions without end; too happy so he may but be allowed to listen and live.

The Life is not great because of Boswell's foolishness, but because of his love and his admiration, an admiration that Macaulay considered a disease. Boswell wrote that in Johnson's company he "felt elevated as if brought into another state of being".

His sneaking sycophancies, his greediness and forwardness, whatever was bestial and earthy in him, are so many blemishes in his Book, which still disturb us in its clearness; wholly hindrances, not helps. Towards Johnson, however, his feeling was not Sycophancy, which is the lowest, but Reverence, which is the highest of human feelings.

On Johnson's personality, Carlyle writes: "seldom, for any man, has the contrast between the ethereal heavenward side of things, and the dark sordid earthward, been more glaring". And this is what Johnson wrote about Falstaff in his Shakespeare commentary:

Falstaff is a character loaded with faults, and with those faults which naturally produce contempt. [...] the man thus corrupt, thus despicable, makes himself necessary to the prince that despises him, by the most pleasing of all qualities, perpetual gaiety, by an unfailing power of exciting laughter, which is the more freely indulged, as his wit is not of the splendid or ambitious kind, but consists in easy escapes and sallies of levity, which make sport but raise no envy.

Johnson obviously enjoyed the comparison to Falstaff, but would it be crazy to also see Boswell in there? The Johnson presented to us in the Life is a man who had to overcome poverty, disease, depression, and a constant fear of death, but never let those things poison his character. Perhaps Boswell crafted the character he wished he could become: Johnson was his Beatrice—a dream, an aspiration, an ideal outside his grasp that nonetheless thrust him toward greatness. Through a process of self-overcoming Boswell wrote a great book on self-overcoming.

Mediocrities Everywhere...I Absolve You

The story of Boswell is basically the plot of Amadeus, with the role of Salieri being played by Macaulay, by Carlyle, by me, and—perhaps even by yourself, dear reader. The line between admiration, envy, and resentment is thin, and crossing it is easier when the subject is a scoundrel. But if Bozzy could set aside resentment for genuine reverence, perhaps there is hope for us all. And yet...it would be an error to see in Boswell the Platonic Form of Mankind.

Shaffer and Forman's film portrays Mozart as vulgar, arrogant, a womanizer, bad with money—but, like Bozzy, still somehow quite likable. In one of the best scenes of the film, we see Mozart transform the screeching of his mother-in-law into the Queen of the Night Aria; thus Boswell transformed his embarrassments into literary gold. He may be vulgar, but his productions are not. He may be vulgar, but he is not ordinary.

Perhaps it is in vain that we seek correlations between virtues and talents: perhaps genius is ineffable. Perhaps it's Ramanujans all the way down. You can't even say that genius goes with independence: there's nothing Boswell wanted more than social approval. I won't tire you with clichés about the Margulises and the Musks.

Would Johnson have guessed that he would be the mediocrity, and Bozzy the genius? Would he have felt envy and resentment? What would he say, had he been given the chance to read in Carlyle that Johnson's own writings "are becoming obsolete for this generation; and for some future generation may be valuable chiefly as Prolegomena and expository Scholia to this Johnsoniad of Boswell"?


If you want to read The Life of Johnson, I recommend a second-hand copy of the Everyman's Library edition: cheap, reasonably sized, and the paper & binding are great.


  1. In the very first letter Boswell wrote to Rousseau, he described himself as "a man of singular merit".
  2. They were "rediscovered" in the early 1900s.
  3. While some are quick to dismiss the non-direct parts, I think they're necessary, especially the letters which illuminate a different side of Johnson's character.
  4. Lecture #10 in Professor Borges: A Course on English Literature.
  5. What happened to the novella-length book review? Anyway, many of those pages are taken up by criticism of John Wilson Croker's incompetent editorial efforts.



Links & What I've Been Reading Q3 2020

High Replicability of Newly-Discovered Social-behavioral Findings is Achievable: a replication of 16 papers that followed "optimal practices" finds a high rate of replicability and virtually identical effect sizes as the original studies.

How do you decide what to replicate? This paper attempts to build a model that can be used to pick studies to maximize utility gained from replications.

Guzey on that deworming study: he tracks which variables are reported across 5 different drafts of the paper, starting in 2011. "But then you find that these variables didn’t move in the right direction. What do you do? Do you have to show these variables? Or can you drop them?"

I've been enjoying the NunoSempere forecasting newsletter, a monthly collection of links on forecasting.

COVID-19 made weather forecasts worse by limiting the meteorological data coming from airplanes.

The 16th paragraph in this piece on the long-term effects of coronavirus mentions that 2 out of 3 people with "long-lasting" COVID-19 symptoms never had COVID to begin with.

An experiment with working 120 hours in a week goes surprisingly well.

Gwern's giant GPT-3 page. The Zizek Navy Seal Copypasta is incredible, as are the poetic imitations.

Ethereum is a Dark Forest. "In the Ethereum mempool, these apex predators take the form of “arbitrage bots.” Arbitrage bots monitor pending transactions and attempt to exploit profitable opportunities created by them."

Tyler Cowen in conversation with Nicholas Bloom, lots of fascinating stuff on innovation and progress. "Just in economics — when I first started in economics, it was standard to do a four-year PhD. It’s now a six-year PhD, plus many of the PhD students have done a pre-doc, so they’ve done an extra two years. We’re taking three or four years longer just to get to the research frontier." Immediately made me think of Scott Alexander's Ars Longa, Vita Brevis.

The Progress Studies for Young Scholars YouTube channel has a bunch of interesting interviews, including Cowen, Collison, McCloskey, and Mokyr.

From the promising new Works in Progress magazine, Progress studies: the hard question.

I've written a parser for your Kindle's My Clippings.txt file. It removes duplicates, splits them up by book, and outputs them in convenient formats. Works cross-platform.
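For the curious, here's a minimal sketch of the core idea (not the actual parser; it just relies on the standard My Clippings.txt layout, where entries are separated by "==========" lines):

```python
# Minimal sketch (not the actual parser): split My Clippings.txt into per-book
# highlights and drop exact duplicates. Each entry has a title line, a metadata
# line, and then the highlight text.
from collections import defaultdict
from pathlib import Path

def parse_clippings(path="My Clippings.txt"):
    entries = Path(path).read_text(encoding="utf-8-sig").split("==========")
    books, seen = defaultdict(list), set()
    for entry in entries:
        lines = [l.strip() for l in entry.strip().splitlines() if l.strip()]
        if len(lines) < 3:
            continue  # skip empty or malformed entries
        title, text = lines[0], "\n".join(lines[2:])
        if (title, text) not in seen:  # drop exact duplicates
            seen.add((title, text))
            books[title].append(text)
    return books

for book, highlights in parse_clippings().items():
    print(f"{book}: {len(highlights)} highlights")
```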

Generative bad handwriting in 280 characters. You can find a lot more of that sort of thing by searching for #つぶやきProcessing on twitter.

A new ZeroHPLovecraft short story, Key Performance Indicators. Black Mirror-esque.

A great skit about Ecclesiastes from Israeli sketch show The Jews Are Coming. Turn on the subs.

And here's some sweet Dutch prog-rock/jazz funk from the 70s.

What I've Been Reading

  • Piranesi by Susanna Clarke. 16 years after Jonathan Strange & Mr Norrell, a new novel from Susanna Clarke! It's short and not particularly ambitious, but I enjoyed it a lot. A tight fantastical mystery that starts out similar to The Library of Babel but then goes off in a different direction.

  • The Poems of T. S. Eliot: the great ones are great, and there's a lot of mediocre stuff in between. Ultimately a bit too grey and resigned and pessimistic for my taste. I got the Faber & Faber hardcover edition and would not recommend it: it's unwieldy and the notes are mostly useless.

  • Antkind by Charlie Kaufman. A typically Kaufmanesque work about a neurotic film critic and his discovery of an astonishing piece of outsider art. Memory, consciousness, time, doubles, etc. Extremely good and laugh-out-loud funny for the first half, but the final 300-400 pages were a boring, incoherent psychedelic smudge.

  • Under the Volcano by Malcolm Lowry. Very similar to another book I read recently, Lawrence Durrell's Alexandria Quartet. I prefer Durrell. Lowry doesn't have the stylistic ability to make the endless internal monologues interesting (as eg Gass does in The Tunnel), and I find the central allegory deeply misguided. Also, it's the kind of book that has a "central allegory".

  • Less than One by Joseph Brodsky. A collection of essays, mostly on Russian poetry. If I knew more about that subject I think I would have enjoyed the book more. The essays on his life in Soviet Russia are good.

  • Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science by Stuart Ritchie. Very good, esp. if you are not familiar with the replication crisis. Some quibbles about the timing and causes of the problems. Full review here.

  • The Idiot by "Dostoyevsky". Review forthcoming.

  • Borges and His Successors: The Borgesian Impact on Literature and the Arts: a collection of fairly dull essays with little to no insight.

  • Samuel Johnson: Literature, Religion and English Cultural Politics from the Restoration to Romanticism by J.C.D. Clark: a dry but well-researched study on an extraordinarily narrow slice of cultural politics. Not really aimed at a general audience.

  • Dhalgren by Samuel R. Delany. A wild semi-autobiographical semi-post-apocalyptic semi-science fiction monster. It's a 900 page slog, it's puerile, the endless sex scenes (including with minors) are pointless at best, the characters are uninteresting, there's barely any plot, the 70s counterculture stuff is just comical, and stylistically it can't reach the works it's aping. So I can see why some people hate it. But I actually enjoyed it, it has a compelling strangeness to it that is difficult to put into words (or perhaps I was just taken in by all the unresolved plot points?). Its sheer size is a quality in itself, too. Was it worth the effort? Could I recommend it? Probably not.

  • Novum Organum by Francis Bacon. While he did not actually invent the scientific method, his discussion of empiricism, experiments, and induction was clearly a step in that direction. The first part deals with science and empiricism and induction from an abstract perspective and it feels almost contemporary, like it was written by a time traveling 19th century scientist or something like that. The quarrel between the ancients and the moderns is already in full swing here, Bacon dunks on the Greeks constantly and upbraids people for blindly listening to Aristotle. Question received dogma and popular opinions, he says. He points to inventions like gunpowder and the compass and printing and paper and says that surely these indicate that there's a ton of undiscovered ideas out there, we should go looking for them. He talks about cognitive biases and scientific progress:

    we are laying the foundations not of a sect or of a dogma, but of human progress and empowerment.

    Then you get to the second part and the middle ages hit you like a freight train, you suddenly realize this is no contemporary man at all and his conception of how the world works is completely alien. Ideas that to us seem bizarre and just intuitively nonsensical (about gravity, heat, light, biology, etc.) are only common sense to him. He repeats absurdities about worms and flies arising spontaneously out of putrefaction, that light objects are pulled to the heavens while heavy objects are pulled to the earth, and so on. Not just surface-level opinions, but fundamental things that you wouldn't even think someone else could possibly perceive differently.

    You won't learn anything new from Bacon, but it's a fascinating historical document.

  • The Book of Marvels and Travels by John Mandeville. This medieval bestseller (published around 1360) combines elements of travelogue, ethnography, and fantasy. It's unclear how much of it people believed, but there was huge demand for information about far-off lands and marvelous stories. Mostly compiled from other works, it was incredibly popular for centuries. In the age of exploration (Columbus took it with him on his trip) people were shocked when some of the fantastical stories (eg about cannibals) actually turned out to be true. The tricks the author uses to generate verisimilitude are fascinating: he adds small personal touches about people he met, sometimes says that he doesn't know anything about a particular region because he hasn't been there, etc.




What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers

I've seen things you people wouldn't believe.

Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA's SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds. In total, about $200,000 in prize money will be awarded.

The studies were sourced from all social science disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication crisis era).

The average replication probability in the market was 54%; while the replication results are not out yet (250 of the 3000 papers will be replicated), previous experiments have shown that prediction markets work well.1

This is what the distribution of my own predictions looks like:2

My average forecast was in line with the market. A quarter of the claims were above 76%. And a quarter of them were below 33%: we're talking hundreds upon hundreds of terrible papers, and this is just a tiny sample of the annual academic production.

Criticizing bad science from an abstract, 10000-foot view is pleasant: you hear about some stuff that doesn't replicate, some methodologies that seem a bit silly. "They should improve their methods", "p-hacking is bad", "we must change the incentives", you declare Zeuslike from your throne in the clouds, and then go on with your day.

But actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: "come on then, jump in".

They Know What They're Doing

Prediction markets work well because predicting replication is easy.3 There's no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors—these papers have obvious, surface-level problems.

There's a popular belief that weak studies are the result of unconscious biases leading researchers down a "garden of forking paths". Given enough "researcher degrees of freedom" even the most punctilious investigator can be misled.

I find this belief impossible to accept. The brain is a credulous piece of meat4 but there are limits to self-delusion. Most of them have to know. It's understandable to be led down the garden of forking paths while producing the research, but when the paper is done and you give it a final read-over you will surely notice that all you have is an n=23, p=0.049 three-way interaction effect (one of dozens you tested, and with no multiple testing adjustments of course). At that point it takes more than a subtle unconscious bias to believe you have found something real. And even if the authors really are misled by the forking paths, what are the editors and reviewers doing? Are we supposed to believe they are all gullible rubes?

People within the academy don't want to rock the boat. They still have to attend the conferences, secure the grants, publish in the journals, show up at the faculty meetings: all these things depend on their peers. When criticising bad research it's easier for everyone to blame the forking paths rather than the person walking them. No need for uncomfortable unpleasantries. The fraudster can admit, without much of a hit to their reputation, that indeed they were misled by that dastardly garden, really through no fault of their own whatsoever, at which point their colleagues on twitter will applaud and say "ah, good on you, you handled this tough situation with such exquisite virtue, this is how progress happens! hip, hip, hurrah!" What a ridiculous charade.

Even when they do accuse someone of wrongdoing they use terms like "Questionable Research Practices" (QRP). How about Questionable Euphemism Practices?

  • When they measure a dozen things and only pick their outcome variable at the end, that's not the garden of forking paths but the greenhouse of fraud.
  • When they do a correlational analysis but give "policy implications" as if they were doing a causal one, they're not walking around the garden, they're doing the landscaping of forking paths.
  • When they take a continuous variable and arbitrarily bin it to do subgroup analysis or when they add an ad hoc quadratic term to their regression, they're...fertilizing the garden of forking paths? (Look, there's only so many horticultural metaphors, ok?)

The bottom line is this: if a random schmuck with zero domain expertise like me can predict what will replicate, then so can scientists who have spent half their lives studying this stuff. But they sure don't act like it.

...or Maybe They Don't?

The horror! The horror!

Check out this crazy chart from Yang et al. (2020):

Yes, you're reading that right: studies that replicate are cited at the same rate as studies that do not. Publishing your own weak papers is one thing, but citing other people's weak papers? This seemed implausible, so I decided to do my own analysis with a sample of 250 articles from the Replication Markets project. The correlation between citations per year and (market-estimated) probability of replication was -0.05!
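The check itself is a one-liner; here's a hedged sketch of the kind of computation involved, with hypothetical file and column names rather than the actual Replication Markets export:

```python
# Hypothetical sketch: correlate citations per year with the market-estimated
# probability of replication for a sample of articles.
import pandas as pd

df = pd.read_csv("rm_sample.csv")  # hypothetical file, one row per article
r = df["citations_per_year"].corr(df["replication_prob"])  # Pearson correlation
print(f"correlation: {r:.2f}")
```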

You might hypothesize that the citations of non-replicating papers are negative, but negative citations are extremely rare.5 One study puts the rate at 2.4%. Astonishingly, even after retraction the vast majority of citations are positive, and those positive citations continue for decades after retraction.6

As in all affairs of man, it once again comes down to Hanlon's Razor. Either:

  1. Malice: they know which results are likely false but cite them anyway.
  2. or, Stupidity: they can't tell which papers will replicate even though it's quite easy.

Accepting the first option would require a level of cynicism that even I struggle to muster. But the alternative doesn't seem much better: how can they not know? I, an idiot with no relevant credentials or knowledge, can fairly accurately tell good research from bad, but all the tenured experts cannot? How can they not tell which papers have been retracted?

I think the most plausible explanation is that scientists don't read the papers they cite, which I suppose involves both malice and stupidity.7 Gwern has a nice write-up on this question citing some ingenious analyses based on the proliferation of misprints: "Simkin & Roychowdhury venture a guess that as many as 80% of authors citing a paper have not actually read the original". Once a paper is out there nobody bothers to check it, even though they know there's a 50-50 chance it's false!

Whatever the explanation might be, the fact is that the academic system does not allocate citations to true claims.8 This is bad not only for the direct effect of basing further research on false results, but also because it distorts the incentives scientists face. If nobody cited weak studies, we wouldn't have so many of them. Rewarding impact without regard for the truth inevitably leads to disaster.

There Are No Journals With Strict Quality Standards

Naïvely you might expect that the top-ranking journals would be full of studies that are highly likely to replicate, and the low-ranking journals would be full of p<0.1 studies based on five undergraduates. Not so! Like citations, journal status and quality are not very well correlated: there is no association between statistical power and impact factor, and journals with higher impact factor have more papers with erroneous p-values.

This pattern is repeated in the Replication Markets data. As you can see in the chart below, there's no relationship between h-index (a measure of impact) and average expected replication rates. There's also no relationship between h-index and expected replication within fields.

Even the crème de la crème of economics journals barely manage a ⅔ expected replication rate. 1 in 5 articles in QJE scores below 50%, and this is a journal that accepts just 1 out of every 30 submissions. Perhaps this (partially) explains why scientists are undiscerning: journal reputation acts as a cloak for bad research. It would be fun to test this idea empirically.

Here you can see the distribution of replication estimates for every journal in the RM sample:

As far as I can tell, for most journals the question of whether the results in a paper are true is a matter of secondary importance. If we model journals as wanting to maximize "impact", then this is hardly surprising: as we saw above, citation counts are unrelated to truth. If scientists were more careful about what they cited, then journals would in turn be more careful about what they publish.

Things Are Not Getting Better

Before we got to see any of the actual Replication Markets studies, we voted on the expected replication rates by year. Gordon et al. (2020) has that data: replication rates were expected to steadily increase from 43% in 2009/2010 to 55% in 2017/2018.

This is what the average predictions looked like after seeing the papers: from 53.4% in 2009 to 55.8% in 2018 (difference not statistically significant; black dots are means).

I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn't even dream of publishing studies based on 23 undergraduates any more (I actually saw plenty of those), etc. Stuart Ritchie's new book praises psychologists for developing "systematic ways to address" the flaws in their discipline. In reality there has been no discernible improvement.

The results aren't out yet, so it's possible that the studies have improved in subtle ways which the forecasters have not been able to detect. Perhaps the actual replication rates will be higher. But I doubt it. Looking at the distribution of p-values over time, there's a small increase in the proportion of p<.001 results, but nothing like the huge improvement that was expected.

Everyone is Complicit

Authors are just one small cog in the vast machine of scientific production. For this stuff to be financed, generated, published, and eventually rewarded requires the complicity of funding agencies, journal editors, peer reviewers, and hiring/tenure committees. Given the current structure of the machine, ultimately the funding agencies are to blame.9 But "I was just following the incentives" only goes so far. Editors and reviewers don't actually need to accept these blatantly bad papers.

Journals and universities certainly can't blame the incentives when they stand behind fraudsters to the bitter end. Paolo Macchiarini "left a trail of dead patients" but was protected for years by his university. Andrew Wakefield's famously fraudulent autism-MMR study took 12 years to retract. Even when the author of a paper admits the results were entirely based on an error, journals still won't retract.

Elisabeth Bik documents her attempts to report fraud to journals. It looks like this:

The Editor in Chief of Neuroscience Letters [Yale's Stephen G. Waxman] never replied to my email. The APJTM journal had a new publisher, so I wrote to both current Editors in Chief, but they never replied to my email.

Two papers from this set had been published in Wiley journals, Gerodontology and J Periodontology. The EiC of the Journal of Periodontology never replied to my email. None of the four Associate Editors of that journal replied to my email either. The EiC of Gerodontology never replied to my email.

Even when they do take action, journals will often let scientists "correct" faked figures instead of retracting the paper! The rate of retraction is about 0.04%; it ought to be much higher.

And even after being caught for outright fraud, about half of the offenders are allowed to keep working: they "have received over $123 million in federal funding for their post-misconduct research efforts".

Just Because a Paper Replicates Doesn't Mean It's Good

First: a replication of a badly designed study is still badly designed. Suppose you are a social scientist, and you notice that wet pavements tend to be related to umbrella usage. You do a little study and find the correlation is bulletproof. You publish the paper and try to sneak in some causal language when the editors/reviewers aren't paying attention. Rain is never even mentioned. Of course if someone repeats your study, they will get a significant result every time. This may sound absurd, but it describes a large proportion of the papers that successfully replicate.

Economists and education researchers tend to be relatively good with this stuff, but as far as I can tell most social scientists go through 4 years of undergrad and 4-6 years of PhD studies without ever encountering ideas like "identification strategy", "model misspecification", "omitted variable", "reverse causality", or "third-cause". Or maybe they know and deliberately publish crap. Fields like nutrition and epidemiology are in an even worse state, but let's not get into that right now.

"But Alvaro, correlational studies can be usef-" Spare me.

Second: the choice of claim for replication. For some papers it's clear (eg math educational intervention → math scores), but other papers make dozens of different claims which are all equally important. Sometimes the Replication Markets organisers picked an uncontroversial claim from a paper whose central experiment was actually highly questionable. In this way a study can get the "successfully replicates" label without its most contentious claim being tested.

Third: effect size. Should we interpret claims in social science as being about the magnitude of an effect, or only about its direction? If the original study says an intervention raises math scores by .5 standard deviations and the replication finds that the effect is .2 standard deviations (though still significant), that is considered a success that vindicates the original study! This is one area in which we absolutely have to abandon the binary replicates/doesn't replicate approach and start thinking more like Bayesians.

Fourth: external validity. A replicated lab experiment is still a lab experiment. While some replications try to address aspects of external validity (such as generalizability across different cultures), the question of whether these effects are relevant in the real world is generally not addressed.

Fifth: triviality. A lot of the papers in the 85%+ chance-to-replicate range are just really obvious. "Homeless students have lower test scores", "parent wealth predicts their children's wealth", that sort of thing. These are not worthless, but they're also not really expanding the frontiers of science.

So: while about half the papers will replicate, I would estimate that only half of those are actually worthwhile.

Lack of Theory

The majority of journal articles are almost completely atheoretical. Even if all the statistical, p-hacking, publication bias, etc. issues were fixed, we'd still be left with a ton of ad-hoc hypotheses based, at best, on (WEIRD) folk intuitions. But how can science advance if there's no theoretical grounding, nothing that can be refuted or refined? A pile of "facts" does not a progressive scientific field make.

Michael Muthukrishna and the superhuman Joe Henrich have written a paper called A Problem in Theory which covers the issue better than I ever could. I highly recommend checking it out.

Rather than building up principles that flow from overarching theoretical frameworks, psychology textbooks are largely a potpourri of disconnected empirical findings.

There's Probably a Ton of Uncaught Frauds

This is a fairly lengthy topic, so I made a separate post for it. tl;dr: I believe about 1% of falsified/fabricated papers are retracted, but overall they represent a very small portion of non-replicating research.

Power: Not That Bad

[Warning: technical section. Skip ahead if bored.]

A quick refresher on hypothesis testing:

  • α, the significance level, is the probability of a false positive.
  • β, or type II error, is the probability of a false negative.
  • Power is (1-β): if a study has 90% power, there's a 90% chance of successfully detecting the effect being studied. Power increases with sample size and effect size.
  • The probability that a significant p-value indicates a true effect is not 1-α. It is called the positive predictive value (PPV), and is calculated as follows: $PPV = \frac{prior \cdot power}{prior \cdot power + (1 - prior) \cdot \alpha}$

This great diagram by Felix Schönbrodt gives the intuition behind PPV:

This model makes the assumption that effects can be neatly split into two categories: those that are "real" and those that are not. But is this accurate? In the opposite extreme you have the "crud factor": everything is correlated so if your sample is big enough you will always find a real effect.10 As Bakan puts it: "there is really no good reason to expect the null hypothesis to be true in any population". If you look at the universe of educational interventions, for example, are they going to be neatly split into two groups of "real" and "fake" or is it going to be one continuous distribution? What does "false positive" even mean if there are no "fake" effects, unless it refers purely to the direction of the effect? Perhaps the crud factor is wrong, at least when it comes to causal effects? Perhaps the pragmatic solution is to declare that all effects with, say, d<.1 are fake and the rest are real? Or maybe we should just go full Bayesian?

Anyway, let's pretend the previous paragraph never happened. Where do we find the prior? There are a few different approaches, and they're all problematic.11

The exact number doesn't really matter that much (there's nothing we can do about it), so I'm going to go ahead and use a prior of 25% for the calculations below. The main takeaways don't change with a different prior value.

Now the only thing we're missing is the power of the typical social science study. To determine that we need to know 1) sample sizes (easy), and 2) the effect size of true effects (not so easy).14 I'm going to use the results of extremely high-powered, large-scale replication efforts:

Surprisingly large, right? We can then use the power estimates in Szucs & Ioannidis (2017): they give an average power of .49 for "medium effects" (d=.5) and .71 for "large effects" (d=.8). Let's be conservative and split the difference.

With a prior of 25%, power of 60%, and α=5%, PPV is equal to 80%. Assuming no fraud and no QRPs, 20% of positive findings will be false.
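The arithmetic, in case you want to plug in your own numbers (a tiny sketch, not part of the original analysis):

```python
# Positive predictive value: P(effect is real | significant result).
def ppv(prior, power, alpha):
    return prior * power / (prior * power + (1 - prior) * alpha)

print(f"{ppv(prior=0.25, power=0.60, alpha=0.05):.0%}")  # 80%
print(f"{ppv(prior=0.25, power=0.90, alpha=0.05):.1%}")  # 85.7%, see below
```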

These averages hide a lot of heterogeneity: it's well-established that studies of large effects are adequately powered whereas studies of small effects are underpowered, so the PPV is going to be smaller for small effects. There are also large differences depending on the field you're looking at. The lower the power the bigger the gains to be had from increasing sample sizes.

This is what PPV looks like for the full range of prior/power values, with α=5%:

At the current prior/power levels, PPV is more sensitive to the prior: we can only squeeze small gains out of increasing power. That's a bit of a problem given the fact that increasing power is relatively easy, whereas increasing the chance that the effect you're investigating actually exists is tricky, if not impossible. Ultimately scientists want to discover surprising results—in other words, results with a low prior.

I made a little widget so you can play around with the values:

[Interactive calculator: adjust alpha, power, and the prior to see the resulting counts of false/true positives, false/true negatives, and the PPV. Defaults: α=0.05, power=0.5, prior=0.25.]

Assuming a 25% prior, increasing power from 60% to 90% would require more than twice the sample size and would only increase PPV by 5.7 percentage points. It's something, but it's no panacea. However, there is something else we could do: sample size is a budget, and we can allocate that budget either to higher power or to a lower significance cutoff. Lowering alpha is far more effective at reducing the false discovery rate.15

Let's take a look at 4 different power/alpha scenarios, assuming a 25% prior and d=0.5 effect size.16 The required sample sizes are for a one-sided t-test.

False Discovery Rate

| | α = 0.05 | α = 0.005 |
|---|---|---|
| Power = 0.5 | 23.1% | 2.9% |
| Power = 0.8 | 15.8% | 1.8% |

Required Sample Size

| | α = 0.05 | α = 0.005 |
|---|---|---|
| Power = 0.5 | 45 | 110 |
| Power = 0.8 | 100 | 190 |
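If you want to reproduce these numbers approximately, here's a sketch using statsmodels; the sample sizes are totals across both groups and may differ slightly from the table due to rounding:

```python
# Approximate reconstruction of the table: false discovery rate from the
# prior/power/alpha model above, and total sample size for a one-sided
# two-sample t-test with d = 0.5 (statsmodels returns n per group).
from statsmodels.stats.power import TTestIndPower

def fdr(prior, power, alpha):
    return (1 - prior) * alpha / (prior * power + (1 - prior) * alpha)

solver = TTestIndPower()
for power in (0.5, 0.8):
    for alpha in (0.05, 0.005):
        n_per_group = solver.solve_power(effect_size=0.5, alpha=alpha,
                                         power=power, alternative="larger")
        print(f"power={power}, alpha={alpha}: "
              f"FDR={fdr(0.25, power, alpha):.1%}, total n≈{2 * n_per_group:.0f}")
```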

To sum things up: power levels are decent on average and improving them wouldn't do much. Power increases should be focused on studies of small effects. Lowering the significance cutoff achieves much more for the same increase in sample size.

Field of Dreams

Before we got to see any of the actual Replication Markets studies, we voted on the expected replication rates by field. Gordon et al. (2020) has that data:

This is what the predictions looked like after seeing the papers:

Economics is Predictably Good

Economics topped the charts in terms of expectations, and it was by far the strongest field. There are certainly large improvements to be made—a 2/3 replication rate is not something to be proud of. But reading their papers you get the sense that at least they're trying, which is more than can be said of some other fields. 6 of the top 10 economics journals participated, and they did quite well: QJE is the behemoth of the field and it managed to finish very close to the top. A unique weakness of economics is the frequent use of absurd instrumental variables. I doubt there's anyone (including the authors) who is convinced by that stuff, so let's cut it out.

EvoPsych is Surprisingly Bad

You were supposed to destroy the Sith, not join them!

Going into this, my view of evolutionary psychology was shaped by people like Cosmides, Tooby, DeVore, Boehm, and so on. You know, evolutionary psychology! But the studies I skimmed from evopsych journals were mostly just weak social psychology papers with an infinitesimally thin layer of evolutionary paint on top. Few people seem to take the "evolutionary" aspect really seriously.

Underdetermination problems are also particularly difficult in this field, and nobody seems to care.

Education is Surprisingly Good

Education was expected to be the worst field, but it ended up being almost as strong as economics. When it came to interventions there were lots of RCTs with fairly large samples, which made their claims believable. I also got the sense that p-hacking is more difficult in education: there's usually only one math score which measures the impact of a math intervention, there's no early stopping, etc.

However, many of the top-scoring papers were trivial (eg "there are race differences in science scores"), and the field has a unique problem which is not addressed by replication: educational intervention effects are notorious for fading out after a few years. If the replications waited 5 years to follow up on the students, things would look much, much worse.

Demography is Good

Who even knew these people existed? Yet it seems they do (relatively) competent work. [googles some of the authors] Ah, they're economists. Well.

Criminology Should Just Be Scrapped

If you thought social psychology was bad, you ain't seen nothin' yet. Other fields have a mix of good and bad papers, but criminology is a shocking outlier. Almost every single paper I read was awful. Even among the papers that are highly likely to replicate, it's de rigueur to confuse correlation for causation.

If we compare criminology to, say, education, the headline replication rates look similar-ish. But the designs used in education (typically RCT, diff-in-diff, or regression discontinuity) are at least in principle capable of detecting the effects they're looking for. That's not really the case for criminology. Perhaps this is an effect of the (small number of) specific journals selected for RM, and there is more rigorous work published elsewhere.

There's no doubt in my mind that the net effect of criminology as a discipline is negative: to the extent that public policy is guided by these people, it is worse. Just shameful.

Marketing/Management

In their current state these are a bit of a joke, but I don't think there's anything fundamentally wrong with them. Sure, some of the variables they use are a bit fluffy, and of course there's a lack of theory. But the things they study are a good fit for RCTs, and if they just quintupled their sample sizes they would see massive improvements.

Cognitive Psychology

Much worse than expected: the field generally has a reputation as one of the more solid subdisciplines of psychology, and it has done well in previous replication projects. Not sure what went wrong here. It's only 50 papers and they're all from the same journal, so perhaps it's simply an unrepresentative sample.

Social Psychology

More or less as expected. All the silly stuff you've heard about is still going on.

Limited Political Hackery

Some of the most highly publicized social science controversies of the last decade happened at the intersection between political activism and low scientific standards: the implicit association test,17 stereotype threat, racial resentment, etc. I thought these were representative of a wider phenomenon, but in reality they are exceptions. The vast majority of work is done in good faith.

While blatant activism is rare, there is a more subtle background ideological influence which affects the assumptions scientists make, the types of questions they ask, and how they go about testing them. It's difficult to say how things would be different under the counterfactual of a more politically balanced professoriate, though.

Interaction Effects Bad

A paper whose main finding is an interaction effect is about 10 percentage points less likely to replicate. Interaction effects aren't inherently wrong; sometimes they're theoretically justified. But all too often you'll see blatant fishing expeditions with a dozen double and triple ad hoc interactions thrown into the regression. They make it easy to do naughty things and tend to be underpowered.
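To get some intuition for the power problem, here's a toy simulation of my own (not based on the Replication Markets data): two binary factors, a main effect and an interaction with the same nominal coefficient, and the same sample size; the interaction is detected far less often.

```python
# Toy simulation: with equal coefficient sizes and the same sample, an
# interaction term is detected much less often than a main effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, sims = 500, 2000
main_hits = inter_hits = 0
for _ in range(sims):
    x1 = rng.integers(0, 2, n)
    x2 = rng.integers(0, 2, n)
    y = 0.3 * x1 + 0.3 * x1 * x2 + rng.normal(size=n)
    X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
    p = sm.OLS(y, X).fit().pvalues  # [const, x1, x2, x1*x2]
    main_hits += p[1] < 0.05
    inter_hits += p[3] < 0.05
print(f"detection rate, main effect:  {main_hits / sims:.0%}")
print(f"detection rate, interaction:  {inter_hits / sims:.0%}")
```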

Nothing New Under the Sun

All is mere breath, and herding the wind.

The replication crisis did not begin in 2010, it began in the 1950s. All the things I've written above have been written before, by respected and influential scientists. They made no difference whatsoever. Let's take a stroll through the museum of metascience.

Sterling (1959) analyzed psychology articles published in 1955-56 and noted that 97% of them rejected their null hypothesis. He found evidence of a huge publication bias, and a serious problem with false positives which was compounded by the fact that results are "seldom verified by independent replication".

Nunnally (1960) noted various problems with null hypothesis testing, underpowered studies, over-reliance on student samples (it doesn't take Joe Henrich to notice that using Western undergrads for every experiment might be a bad idea), and much more. The problem (or excuse) of publish-or-perish, which some portray as a recent development, was already in place by this time.18

The "reprint race" in our universities induces us to publish hastily-done, small studies and to be content with inexact estimates of relationships.

Jacob Cohen (of Cohen's d fame) in a 1962 study analyzed the statistical power of 70 psychology papers: he found that underpowered studies were a huge problem, especially for those investigating small effects. Successive studies by Sedlmeier & Gigerenzer in 1989 and Szucs & Ioannidis in 2017 found no improvement in power.

If we then accept the diagnosis of general weakness of the studies, what treatment can be prescribed? Formally, at least, the answer is simple: increase sample sizes.

Paul Meehl (1967) is highly insightful on problems with null hypothesis testing in the social sciences, the "crud factor", lack of theory, etc. Meehl (1970) brilliantly skewers the erroneous (and still common) tactic of automatically controlling for "confounders" in observational designs without understanding the causal relations between the variables. Meehl (1990) is downright brutal: he highlights a series of issues which, he argues, make psychological theories "uninterpretable". He covers low standards, pressure to publish, low power, low prior probabilities, and so on.

I am prepared to argue that a tremendous amount of taxpayer money goes down the drain in research that pseudotests theories in soft psychology and that it would be a material social advance as well as a reduction in what Lakatos has called “intellectual pollution” if we would quit engaging in this feckless enterprise.

Rosenthal (1979) covers publication bias and the problems it poses for meta-analyses: "only a few studies filed away could change the combined significant result to a nonsignificant one". Cole, Cole & Simon (1981) present experimental evidence on the evaluation of NSF grant proposals: they find that luck plays a huge factor as there is little agreement between reviewers.

I could keep going to the present day with the work of Goodman, Gelman, Nosek, and many others. There are many within the academy who are actively working on these issues: the CASBS Group on Best Practices in Science, the Meta-Research Innovation Center at Stanford, the Peer Review Congress, the Center for Open Science. If you click those links you will find a ton of papers on metascientific issues. But there seems to be a gap between awareness of the problem and implementing policy to fix it. You've got tons of people doing all this research and trying to repair the broken scientific process, while at the same time journal editors won't even retract blatantly fraudulent research.

There is even a history of government involvement. In the 70s there were battles in Congress over questionable NSF grants, and in the 80s Congress (led by Al Gore) was concerned about scientific integrity, which eventually led to the establishment of the Office of Scientific Integrity. (It then took the federal government another 11 years to come up with a decent definition of scientific misconduct.) After a couple of embarrassing high-profile prosecutorial failures they more or less gave up, but they still exist today and prosecute about a dozen people per year.

Generations of psychologists have come and gone and nothing has been done. The only difference is that today we have a better sense of the scale of the problem. The one ray of hope is that at least we have started doing a few replications, but I don't see that fundamentally changing things: replications reveal false positives, but they do nothing to prevent those false positives from being published in the first place.

What To Do

The reason nothing has been done since the 50s, despite everyone knowing about the problems, is simple: bad incentives. The best cases for government intervention are collective action problems: situations where the incentives for each actor cause suboptimal outcomes for the group as a whole, and it's difficult to coordinate bottom-up solutions. In this case the negative effects are not confined to academia, but overflow to society as a whole when these false results are used to inform business and policy.

Nobody actually benefits from the present state of affairs, but you can't ask isolated individuals to sacrifice their careers for the "greater good": the only viable solutions are top-down, which means either the granting agencies or Congress (or, as Scott Alexander has suggested, a Science Czar). You need a power that sits above the system and has its own incentives in order: this approach has already had success with requirements for pre-registration and publication of clinical trials. Right now I believe the most valuable activity in metascience is not replication or open science initiatives but political lobbying.19

  • Earmark 60% of funding for registered reports (ie accepted for publication based on the preregistered design only, not results). For some types of work this isn't feasible, but for ¾ of the papers I skimmed it's possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.20
  • Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.
  • Earmark 1% of funding for progress studies. Including metascientific research that can be used to develop a serious science policy in the future.
  • Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so (see the power-calculation sketch after this list). Lowering alpha is generally preferable to increasing power. "But Alvaro, doesn't that mean that fewer grants would be funded?" Yes.
  • Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.
  • Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let's have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.
  • Financial incentives for universities and journals to police fraud. It's not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart's law!
  • Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway; it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you're golden. Without the crutch of "high-ranked journals" maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can't shift the incentives: academics want to publish in "high-impact" journals, and journals want to selectively publish "high-impact" research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.
  • Have authors bet on replication of their research. Give them fixed odds, say 1:4—if it's good work, it's +EV for them (at 1:4 the bet has positive expected value whenever the replication probability exceeds 20%). This sounds a bit distasteful, so we could structure the same cashflows as a "bonus grant" from the NSF when a paper you wrote replicates successfully.22
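
To make the sample-size and alpha arithmetic above concrete, here is a minimal power-calculation sketch for a two-sample t-test using statsmodels; the effect sizes, group sizes, and the 80% power target are illustrative choices, not figures taken from the studies discussed above.

    from statsmodels.stats.power import TTestIndPower

    power_calc = TTestIndPower()

    # Subjects per group needed for 80% power, at the usual and the proposed alpha
    for d in (0.2, 0.5):  # "small" and "medium" standardized effect sizes
        for alpha in (0.05, 0.005):
            n = power_calc.solve_power(effect_size=d, alpha=alpha, power=0.8)
            print(f"d={d}, alpha={alpha}: ~{n:.0f} subjects per group")

    # Power of a typical small study chasing a small effect, before and after quadrupling N
    print(power_calc.power(effect_size=0.2, nobs1=50, alpha=0.05))   # roughly 0.17
    print(power_calc.power(effect_size=0.2, nobs1=200, alpha=0.05))  # roughly 0.5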

And a couple of points that individuals can implement today:

  • Just stop citing bad research, I shouldn't need to tell you this, jesus christ what the fuck is wrong with you people.
  • Read the papers you cite. Or at least make your grad students do it for you. It doesn't need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they're doing something unusual. It won't take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.23
  • When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.).24 If we're going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don't need to be complicit in the publication of false claims.
  • Stop assuming good faith. I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

...My Only Friend, The End

The first draft of this post had a section titled "Some of My Favorites", where I listed the silliest studies in the sample. But I removed it because I don't want to give the impression that the problem lies with a few comically bad papers in the far left tail of the distribution. The real problem is the median.

It is difficult to convey just how low the standards are. The marginal researcher is a hack and the marginal paper should not exist. There's a general lack of seriousness hanging over everything—if an undergrad cites a retracted paper in an essay, whatever; but if this is your life's work, surely you ought to treat the matter with some care and respect.

Why is the Replication Markets project funded by the Department of Defense? If you look at the NSF's 2019 Performance Highlights, you'll find items such as "Foster a culture of inclusion through change management efforts" (Status: "Achieved") and "Inform applicants whether their proposals have been declined or recommended for funding in a timely manner" (Status: "Not Achieved"). Pusillanimous reports repeat tired clichés about "training", "transparency", and a "culture of openness" while downplaying the scale of the problem and ignoring the incentives. No serious actions have followed from their recommendations.

It's not that they're trying and failing—they appear to be completely oblivious. We're talking about an organization with an 8 billion dollar budget that is responsible for a huge part of social science funding, and they can't manage to inform people that their grant was declined! These are the people we must depend on to fix everything.

When it comes to giant bureaucracies it can be difficult to know where (if anywhere) the actual power lies. But a good start would be at the top: NSF director Sethuraman Panchanathan, SES division director Daniel L. Goroff, NIH director Francis S. Collins, and the members of the National Science Board. The broken incentives of the academy did not appear out of nowhere, they are the result of grant agency policies. Scientists and the organizations that represent them (like the AEA and APA) should be putting pressure on them to fix this ridiculous situation.

The importance of metascience is inversely proportional to how well normal science is working, and right now it could use some improvement. The federal government spends about $100b per year on research, but we lack a systematic understanding of scientific progress, we lack insight into the forces that underlie the upward trajectory of our civilization. Let's take 1% of that money and invest it wisely so that the other 99% will not be pointlessly wasted. Let's invest it in a robust understanding of science, let's invest it in progress studies, let's invest it in—the future.


Thanks to Alexey Guzey and Dormin for their feedback. And thanks to the people at SCORE and the Replication Markets team for letting me use their data and for running this unparalleled program.


  1. Dreber et al. (2015), Using prediction markets to estimate the reproducibility of scientific research.
    Camerer et al. (2018), Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015.
  2. The distribution is bimodal because of the way p-values are typically reported: there's a huge difference between p<.01 and p<.001. If actual p-values were reported instead of cutoffs, the distribution would be unimodal.
  3. Even laypeople are half-decent at it.
  4. Ludwik Fleck has an amusing bit on the development of anatomy: "Simple lack of 'direct contact with nature' during experimental dissection cannot explain the frequency of the phrase "which becomes visible during autopsy" often accompanying what to us seem the most absurd assertions."
  5. Another possible explanation is that importance is inversely related to replication probability. In my experience that is not the case, however. If anything it's the opposite: important effects tend to be large effects, and large effects tend to replicate. In general, any "conditioning on a collider"-type explanation doesn't work here because these citations also continue post-retraction.
  6. Some more:
  7. Though I must admit that after reading the papers myself I understand why they would shy away from the task.
  8. I can tell you what is rewarded with citations though: papers in which the authors find support for their hypothesis.
  9. Perhaps I don't understand the situation at places like the NSF or the ESRC but the problem seems to be incompetence (or a broken bureaucracy?) rather than misaligned incentives.
  10. Theoretically there's the possibility of overpowered studies being a problem. Meehl (1967) argues that 1) everything in psychology is correlated (the "crud factor"), and 2) theories only make directional predictions (as opposed to point predictions in eg physics). So as power increases the probability of finding a significant result for a directional prediction approaches 50% regardless of what you're studying.
  11. In medicine there are plenty of cohort-based publication bias analyses, but I don't think we can generalize from those to the social sciences.
  12. But RRs are probably not representative of the literature, so this is an overestimate. And who knows how many unpublished pilot studies are behind every RR?
  13. Dreber et al. (2015) use prediction market probabilities and work backward to get a prior of 9%, but this number is based on unreasonable assumptions about false positives: they don't take into account fraud and QRPs. If priors were really that low, the entire replication crisis would be explained purely by normal sampling error: no QRPs!
  14. Part of the issue is that the literature is polluted with a ton of false results, which actually pushes estimates of true effect sizes downwards. There's an unfortunate tendency to lump together effect sizes of real and non-existent effects (eg Many Labs 2: "ds were 0.60 for the original findings and 0.15 for the replications"), but that's a meaningless number.
  15. False negatives are bad too, but they're not as harmful as false positives. Especially since they're almost never published. Also, there's been a ton of stuff written on lowering alpha; a good starting point is Redefine Statistical Significance.
  16. These figures actually understate the benefit of a lower alpha, because it would also change the calculus around p-hacking. With an alpha of 5%, getting a false positive is quite easy. Simply stopping data collection once you have a significant result has a hit rate of over 20%! Add some dredging and HARKing to that and you can squeeze a result out of anything. With a lower alpha, the chances of p-hacking success will be vastly lower and some researchers won't even bother trying (a quick simulation of the stopping tactic is sketched after these notes).
  17. The original IAT paper is worth revisiting. You only really need to read page 1475. The construct validity evidence is laughable. The whole thing is based on N=26 and they find no significant correlation between the IAT and explicit measures of racism. But that's OK, Greenwald says, because the IAT is meant to find secret racists ("reveal explicitly disavowed prejudice")! The question of why a null correlation between implicit and explicit racial attitudes is to be expected is left as an exercise to the reader. The correlation between two racial IATs (male and female names) is .46 and they conveniently forget to mention the comically low test-retest reliability. That's all you need for 13k citations and a consulting industry selling implicit bias to the government for millions of dollars.
  18. I suspect psychologists today would laugh at the idea of the 1960s being an over-competitive environment. Personally I highly doubt that this situation can be blamed on high (or increasing) productivity.
  19. You might ask: well, why haven't the independent grant agencies already fixed the problem then? I'm not sure if it's a lack of competence, or caring, or power, or something else. But I find Garrett Jones' arguments on the efficacy of independent government agencies convincing: this model works well in other areas.
  20. "But Alvaro, what if I make an unexpected discovery during my investigation?" Well, you start writing a new registered report, and perhaps publish it as an exploratory result. You may not like it, but that's how we protect against false positives. In cases where only one dataset is available (eg historical data) we must rely on even stricter standards of evidence, to protect against multiple testing.
  21. Another idea to steal from the SEC: whistleblower rewards.
  22. This would be immediately exploited by publishing a bunch of trivial results. But that's a solvable problem. In any case, it's much better to have systematic, automatic mechanisms instead of relying on subjective factors and the prosecution of individual cases.
  23. I believe the SCORE program intends to use the data from Replication Markets to train an ML model that predicts replicability. If scientists had the ability to just run that on every reference in their papers, perhaps they could go back to not reading what they cite.
  24. Looking at Replication Markets data, about 1 in 4 studies with p>.001 had more than a 50% chance to replicate. Of course I'd consider 50-50 odds far too low a threshold, but you have to start somewhere. "But Alvaro, science is not done paper by paper, it is a cumulative enterprise. We should publish marginal results, even if they're probably not true. They are pieces of evidence that, brick by brick, raise the vast edifice that we call scientific knowledge". In principle this is a good argument: publish everything and let the meta-analyses sort it out. But given the reality of publication bias we must be selective. If registered reports became the standard, this problem would not exist.
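
To make footnote 16 concrete, here is a minimal simulation of the stopping tactic under a true null: the researcher runs a t-test after every added observation and stops as soon as p < alpha. The sample-size range and number of simulations are arbitrary choices, not parameters from any of the studies above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def peeking_false_positive_rate(n_min=10, n_max=100, alpha=0.05, n_sims=1000):
        """False positive rate for a researcher who tests after every new observation
        and stops as soon as p < alpha, even though there is no real effect."""
        hits = 0
        for _ in range(n_sims):
            a = rng.normal(size=n_max)
            b = rng.normal(size=n_max)  # same distribution as a: the null is true
            for n in range(n_min, n_max + 1):
                if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
                    hits += 1
                    break
        return hits / n_sims

    print(peeking_false_positive_rate(alpha=0.05))   # far above the nominal 5%
    print(peeking_false_positive_rate(alpha=0.005))  # the absolute hit rate drops sharply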



How Many Undetected Frauds in Science?

0.04% of papers are retracted. At least 1.9% of papers have duplicate images "suggestive of deliberate manipulation". About 2.5% of scientists admit to fraud, and they estimate that 10% of other scientists have committed fraud. 27% of postdocs said they were willing to select or omit data to improve their results. More than 50% of published findings in psychology are false. The ORI, which makes about 13 misconduct findings per year, gives a conservative estimate of over 2000 misconduct incidents per year.

That's a wide range of figures, and all of them suffer from problems if we try to use them as estimates of the real rate of fraud. While the vast majority of false published claims are not due to fabrication, it's clear that there is a huge iceberg of undiscovered fraud hiding underneath the surface.

Part of the issue is that the limits of fraud are unclear. While fabrication/falsification are easy to adjudicate, there's a wide range of quasi-fraudulent but quasi-acceptable "Questionable Research Practices" (QRPs) such as HARKing which result in false claims being presented as true. Publishing a claim that has a ~0%1 chance of being true is the worst thing in the world, but publishing a claim that has a 15% chance of being true is a totally normal thing that perfectly upstanding scientists do. Thus the literature is inundated by false results that are nonetheless not "fraudulent". Personally I don't think there's much of a difference.

There are two main issues with QRPs: first, there's no clear line in the sand, which makes it difficult to single out individuals for punishment. Second, the majority of scientists engage in QRPs. In fact they have been steeped in an environment full of bad practices for so long that they are no longer capable of understanding that they are behaving badly:

Let him who is without QRPs cast the first stone.

The case of Brian Wansink (who committed both clear fraud and QRPs) is revealing: in the infamous post that set off his fall from grace, he brazenly admitted to extreme p-hacking. The notion that any of this was wrong had clearly never crossed his mind: he genuinely believed he was giving useful advice to grad students. When commenters pushed back, he justified himself by writing that "P-hacking shouldn’t be confused with deep data dives".

Anyway, here are some questions that might help us determine the size of the iceberg:

  • Are uncovered frauds high-quality, or do we only have the ability to find low-hanging fruit?
  • Are frauds caught quickly, or do they have long careers before anyone finds out?
  • Are scientists capable of detecting fraud or false results in general (regardless of whether they are produced by fraud, QRPs, or just bad luck)?
  • How much can we rely on whistleblowers?

Quality

Here's an interesting case recently uncovered by Elisabeth Bik: 8 different published, peer-reviewed papers, by different authors, on different subjects, with literally identical graphs. The laziness is astonishing! It would take just a few minutes to write an R script that generates random data so that each fake paper could at least have unique charts. But the paper mill that wrote these articles won't even do that. This kind of extreme sloppiness is a recurring theme when it comes to frauds that have been caught.

In general the image duplication that Bik uncovers tends to be rather lazy: people just copy paste to their heart's content and hope nobody will notice (and peer reviewers and editors almost certainly won't notice).

The Bell Labs physicist Jan Hendrik Schön was found out because he used identical graphs for multiple, completely different experiments.

This guy not only copy-pasted a ton of observations, he forgot to delete the Excel sheet he used to fake the data! Managed to get three publications out of it.

Back to Wansink again: he was smart enough not to copy-paste charts, but he made other stupid mistakes. For example in one paper (The office candy dish) he reported impossible means and test statistics (detected through granularity testing). If he had just bothered to create a plausible sample instead of directly fiddling with summary statistics, there's a good chance he would not have been detected. (By the way, the paper has not been retracted, and continues to be cited. I Fucking Love Science!)
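
For readers unfamiliar with granularity testing, here is a minimal sketch of the idea behind the GRIM test: when the underlying responses are integers (counts, single Likert items), the reported mean times the sample size must be consistent with an integer sum. The function and the example numbers are hypothetical, not taken from the candy dish paper.

    import math

    def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
        """Can a mean reported to `decimals` places arise from n integer-valued observations?"""
        implied_total = reported_mean * n
        # the true sum of integer responses must itself be an integer,
        # so check the two integers bracketing the implied total
        for candidate_sum in (math.floor(implied_total), math.ceil(implied_total)):
            if round(candidate_sum / n, decimals) == round(reported_mean, decimals):
                return True
        return False

    # Hypothetical report: a mean of 4.53 from n=20 integer responses is impossible,
    # since 90/20 = 4.50 and 91/20 = 4.55.
    print(grim_consistent(4.53, 20))  # False
    print(grim_consistent(4.55, 20))  # True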

In general Wansink comes across as a moron, yet he managed to amass hundreds of publications, 30k+ citations, and half a dozen books. What percentile of fraud competence do you think Wansink represents?

The point is this: generating plausible random numbers is not that difficult! Especially considering the fact that these are intelligent people with extensive training in science and statistics. It seems highly likely that there are more sophisticated frauds out there.

Speed

Do frauds manage to have long careers before they get caught? I don't think there's any hard data on this (though someone could probably compile it with the Retraction Watch database). Obviously the highest-profile frauds are going to be those with a long history, so we have to be careful not to be misled. Perhaps there's a vast number of fraudsters who are caught immediately.

Overall the evidence is mixed. On the one hand, a relatively small number of researchers account for a fairly large proportion of all retractions. So while these individuals managed to evade detection for a long time (Yoshitaka Fujii published close to 200 papers over a 25-year career), most frauds do not have such vast track records.

On the other hand just because we haven't detected fraudulent papers doesn't necessarily mean they don't exist. And repeat fraud seems fairly common: simple image duplication checks reveal that "in nearly 40% of the instances in which a problematic paper was identified, screening of other papers from the same authors revealed additional problematic papers in the literature."

Even when fraud is clearly present, it can take ages for the relevant authorities to take action. The infamous Andrew Wakefield vaccine autism paper, for example, took 12 years to retract.

Detection Ability

I've been reading a lot of social science papers lately and a thought keeps coming up: "this paper seems unlikely to replicate, but how can I tell if it's due to fraud or just bad methods?" And the answer is that in general we can't tell. In fact things are even worse, as scientists seem to be incapable of detecting even really obviously weak papers (more on this in the next post).

In cases such as Wansink's, people went over his work with a fine-tooth comb after the infamous blogpost and discovered all sorts of irregularities. But nobody caught those signs earlier. Part of the issue is that nobody's really looking for fraud when they casually read a paper. Science tends to work on a kind of honor system where everyone just assumes the best. Even if you are looking for fraud, it's time-consuming, difficult, and in many cases unclear. The evidence tends to be indirect: noticing that two subgroups are a bit too similar, or that the effects of an intervention are a bit too consistent. But these can be explained away fairly easily. So unless you have a whistleblower it's often difficult to make an accusation.

The case of the 5-HTTLPR gene is instructive: as Scott Alexander explains in his fantastic literature review, a huge academic industry was built up around what should have been a null result. There are literally hundreds of non-replicating papers on 5-HTTLPR—suppose there was one fraudulent article in this haystack, how would you go about finding it?

Some frauds (or are they simply errors?) are detected using statistical methods such as the granularity testing mentioned above, or with statcheck. But any sophisticated fraud would simply check their own numbers using statcheck before submitting, and correct any irregularities.
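
As a rough illustration of what statcheck-style checking does (this is a sketch of the general idea, not statcheck's actual code): recompute the p-value from the reported test statistic and degrees of freedom, and flag reports where it disagrees with the stated p-value.

    from scipy import stats

    def t_report_consistent(t_value: float, df: int, reported_p: float, tol: float = 0.005) -> bool:
        """Recompute a two-sided p-value from a reported t statistic and compare it to the reported p."""
        recomputed = 2 * stats.t.sf(abs(t_value), df)
        return abs(recomputed - reported_p) <= tol

    # Hypothetical report: "t(28) = 2.10, p = .02". The recomputed two-sided p is about .045,
    # so the reported value is inconsistent (maybe a typo, maybe something worse).
    print(t_report_consistent(2.10, 28, 0.02))   # False
    print(t_report_consistent(2.10, 28, 0.045))  # True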

Detecting weak research is easy. Detecting fraud and then prosecuting it is extremely difficult.

Whistleblowers

Some cases are brought to light by whistleblowers, but we can't rely on them for a variety of reasons. A survey of scientists finds that potential whistleblowers, especially those without job security, tend not to report fraud due to the potential career consequences. They understand that institutions will go to great lengths to protect frauds—do you want a career, or do you want to do the right thing?

Often there simply is no whistleblower available. Scientists are trusted to collect data on their own, and they often collaborate with people in other countries or continents who never have any contact with the data-gathering process. Under such circumstances we must rely on indirect means of detection.

South Korean celebrity scientist Hwang Woo-suk was uncovered as a fraud by a television program which used two whistleblower sources. But things only got rolling when image duplication was detected in one of his papers. Both whistleblowers lost their jobs and were unable to find other employment.

In some cases people blow the whistle and nothing happens. The report from the investigation into Diederik Stapel, for example, notes that "on three occasions in 2010 and 2011, the attention of members of the academic staff in psychology was drawn to this matter. The first two signals were not followed up in the first or second instance." By the way, these people simply noticed statistical irregularities; they never had direct evidence.

And let's turn back to Wansink once again: in the blog post that sank him, he recounted tales of instructing students to p-hack data until they found a result. Did those grad students ever blow the whistle on him? Of course not.

This is the End...

Let's say that about half of all published research findings are false. How many of those are due to fraud? As a very rough guess I'd say that for every 100 papers that don't replicate, 2.5 are due to fabrication/falsification, and 85 are due to lighter forms of methodological fraud. This would imply that about 1% of fraudulent papers are retracted.
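
A quick back-of-the-envelope check of that last implication, using only the figures quoted in these posts (the 0.04% retraction rate, roughly half of findings being false, and the guessed 2.5-in-100 fabrication share). The fraction of retractions that actually involve fabrication is an assumption being varied here, not a number from the text.

    # Rough arithmetic behind "about 1% of fraudulent papers are retracted"
    false_share = 0.50             # share of published findings that are false (assumed above)
    fabricated_among_false = 0.025 # fabrication/falsification share among non-replicating papers (guess above)
    retracted_share = 0.0004       # 0.04% of papers are retracted (figure quoted earlier)

    fabricated_papers = false_share * fabricated_among_false  # ~1.25% of all papers

    # How many retractions actually involve fabrication is unknown; vary the assumption:
    for fraud_fraction_of_retractions in (1.0, 0.5, 0.25):
        detected = retracted_share * fraud_fraction_of_retractions / fabricated_papers
        print(f"{fraud_fraction_of_retractions:.0%} of retractions fraud-related -> "
              f"{detected:.1%} of fabricated papers ever retracted")
    # Even the most generous case stays in the low single digits, in line with the ~1% ballpark.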

This is both good and bad news. On the one hand, while most fraud goes unpunished, it only represents a small portion of published research. On the other hand, it means that we can't fix reproducibility problems by going after fabrication/falsification: if outright fraud completely disappeared tomorrow, it would be no more than an imperceptible blip in the replication crisis. A real solution needs to address the "questionable" methods used by the median scientist, not the fabrication used by the very worst of them.




Book Review: Science Fictions by Stuart Ritchie

In 1945, Robert Merton wrote:

There is only this to be said: the sociology of knowledge is fast outgrowing a prior tendency to confuse provisional hypothesis with unimpeachable dogma; the plenitude of speculative insights which marked its early stages are now being subjected to increasingly rigorous test.

Then, 16 years later:

After enjoying more than two generations of scholarly interest, the sociology of knowledge remains largely a subject for meditation rather than a field of sustained and methodical investigation. [...] these authors tell us that they have been forced to resort to loose generalities rather than being in a position to report firmly grounded generalizations.

In 2020, the sociology of science is stuck more or less in the same place. I am being unfair to Ritchie (who is a Merton fanboy), because he has not set out to write a systematic account of scientific production—he has set out to present a series of captivating anecdotes, and in those terms he has succeeded admirably. And yet, in the age of progress studies surely one is allowed to hope for more.

If you've never heard of Daryl Bem, Brian Wansink, Andrew Wakefield, John Ioannidis, or Elisabeth Bik, then this book is an excellent introduction to the scientific misconduct that is plaguing our universities. The stories will blow your mind. For example you'll learn about Paolo Macchiarini, who left a trail of dead patients, published fake research saying he healed them, and was then protected by his university and the journal Nature for years. However, if you have been following the replication crisis, you will find nothing new here. The incidents are well-known, and the analysis Ritchie adds on top of them is limited in ambition.

The book begins with a quick summary of how science funding and research work, and a short chapter on the replication crisis. After that we get to the juicy bits as Ritchie describes exactly how all this bad research is produced. He starts with outright fraud, and then moves on to the gray areas of bias, negligence, and hype: it's an engaging and often funny catalogue of misdeeds and misaligned incentives. The final two chapters address the causes behind these problems, and how to fix them.

The biggest weakness is that the vast majority of the incidents presented (with the notable exception of the Stanford prison experiment) occurred in the last 20 years or so. And Ritchie's analysis of the causes behind these failures also depends on recent developments: his main argument is that intense competition and pressure to publish large quantities of papers are harming their quality.

Not only has there been a huge increase in the rate of publication, there’s evidence that the selection for productivity among scientists is getting stronger. A French study found that young evolutionary biologists hired in 2013 had nearly twice as many publications as those hired in 2005, implying that the hiring criteria had crept upwards year-on-year. [...] as the number of PhDs awarded has increased (another consequence, we should note, of universities looking to their bottom line, since PhD and other students also bring in vast amounts of money), the increase in university jobs for those newly minted PhD scientists to fill hasn’t kept pace.

By only focusing on recent examples, Ritchie gives the impression that the problem is new. But that's not really the case. One can go back to the 60s and 70s and find people railing against low standards, underpowered studies, lack of theory, publication bias, and so on. Imre Lakatos, in an amusing series of lectures at the London School of Economics in 1973, said that "the social sciences are on a par with astrology, it is no use beating about the bush."

Let's play a little game. Go to the Journal of Personality and Social Psychology (one of the top social psych journals) and look up a few random papers from the 60s. Are you going to find rigorous, replicable science from a mythical era when valiant scientists followed Mertonian norms and were not incentivized to spew out dozens of mediocre papers every year? No, you're going to find exactly the same p<.05, tiny N, interaction effect, atheoretical bullshit. The only difference being the questionable virtue of low productivity.

If the problem isn't new, then we can't look for the causes in recent developments. If Ritchie had moved beyond "loose generalities" to a more systematic analysis of scientific production I think he would have presented a very different picture. The proposals at the end mostly consist of solutions that are supposed to originate from within the academy. But they've had more than half a century to do that—it feels a bit naive to think that this time it's different.

Finally, is there light at the end of the tunnel?

...after the Bem and Stapel affairs (among many others), psychologists have begun to engage in some intense soul-searching. More than perhaps any other field, we’ve begun to recognise our deep-seated flaws and to develop systematic ways to address them – ways that are beginning to be adopted across many different disciplines of science.

Again, the book is missing hard data and analysis. I used to share his view (surely after all the publicity of the replication crisis, all the open science initiatives, all the "intense soul searching", surely things must change!) but I have now seen some data which makes me lean in the opposite direction. Check back toward the end of August for a post on this issue.

Ritchie's view of science is almost romantic: he goes on about the "nobility" of research and the virtues of Mertonian norms. But the question of how conditions, incentives, competition, and even the Mertonian norms themselves actually affect scientific production is an empirical matter that can and should be investigated systematically. It is time to move beyond "speculative insights" and onto "rigorous testing", exactly in the way that Merton failed to do.




Links Q2 2020

Tyler Cowen reviews Status and Beauty in the Global Party Circuit. "In this world, girls function as a form of capital." The podcast is good too.

Lots of good info on education: Why Conventional Wisdom on Education Reform is Wrong (a primer)

Scott Alexander on the life of Herbert Hoover.

Longer-Run Economic Consequences of Pandemics [speculative]:

Measured by deviations in a benchmark economic statistic, the real natural rate of interest, these responses indicate that pandemics are followed by sustained periods—over multiple decades—with depressed investment opportunities, possibly due to excess capital per unit of surviving labor, and/or heightened desires to save, possibly due to an increase in precautionary saving or a rebuilding of depleted wealth.

Do cognitive biases go away when the stakes are high? A large pre-registered study with very high stakes finds that effort increases significantly but performance does not.

Disco Elysium painting turned into video using AI.

Long-run consequences of the pirate attacks on the coasts of Italy: "in 1951 Rome would have been 15% more populous without piracy."

“A” Business by Any Other Name: Firm Name Choice as a Signal of Firm Quality (2014): "The average plumbing firm whose name begins with A or a number receives five times more service complaints than other firms and also charges higher prices."

Yarkoni: The Generalizability Crisis [in psychology].

Lakens: Review of "The Generalizability Crisis" by Tal Yarkoni.

Yarkoni: Induction is not optional (if you’re using inferential statistics): reply to Lakens.

Estimating the deep replicability of scientific findings using human and artificial intelligence - ML model does about as well as prediction markets when it comes to predicting replication success. "the model’s accuracy is higher when trained on a paper’s text rather than its reported statistics and that n-grams, higher order word combinations that humans have difficulty processing, correlate with replication." Also check out the horrific Fig 1.

Wearing a weight vest leads to weight loss, fairly huge (suspiciously huge?) effect size. The hypothesized mechanism is the "gravitostat": your body senses how heavy you are and adjusts accordingly.

Tyler Cowen on uni- vs multi-disciplinary policy advice in the time of Corona

...and here's Señor Coconut, "A Latin Tribute to Kraftwerk". Who knew "Autobahn" needed a marimba?