Metascience and Philosophy

It has been said that philosophy of science is as useful to scientists as ornithology is to birds. But perhaps it can be useful to metascientists?

State of Play

Philosophy

In the 20th century, philosophy of science attracted first-rate minds: scientists like Henri Poincaré, Pierre Duhem, and Michael Polanyi, as well as philosophers like Popper, Quine, Carnap, Kuhn, and Lakatos. Today the field is a backwater, lost in endless debates about scientific realism which evoke the malaise of medieval angelology.1 Despite being part of philosophy, however, the field made actual progress, abandoning simplistic early models for more sophisticated approaches with greater explanatory power. Ultimately, philosophers reached one of two endpoints: some went full relativist,2 while others (like Quine and Laudan) bit the bullet of naturalism and left the matter to metascientists and psychologists.3 "It is an empirical question, which means promote which ends".

Metascience

Did the metascientists actually pick up the torch? Sort of. There is some overlap, but (with the exception of the great Paul Meehl) they tend to focus on different problems. The current crop of metascientists is drawn, like sharks to blood, to easily quantifiable questions about the recent past (with all those p-values sitting around, how could you resist analyzing them?). They work in different fields, and therefore face different problems. They seem hesitant to make normative claims. Less tractable questions about forms of progress, norms, theory selection, etc. have fallen by the wayside. Overall I think they underrate the problems posed by philosophers.

Rational Reconstruction

In The History of Science and Its Rational Reconstructions, Lakatos proposed that theories of scientific methodology function as historiographical theories, and can be criticized or compared by using them to create "rational historical reconstructions" of scientific progress. The idea is simple: if a theory fails to rationally explain the past successes of science, it's probably not a good theory, and we should not adopt its normative tenets. As Lakatos puts it, "if the rationality of science is inductive, actual science is not rational; if it is rational, it is not inductive." He applied this "Pyrrhonian machine de guerre" not only to inductivism and confirmationism, but also to Popper.

The main issue with falsification boils down to the problem of auxiliary hypotheses. On the one hand you have underdetermination (the Duhem-Quine thesis): testing hypotheses in isolation is not possible, so when a falsifying result comes out it's not clear where the modus tollens should be directed. On the other hand there is the possibility of introducing new auxiliary hypotheses to "protect" an existing theory from falsification. These are not merely abstract games for philosophers, but very real problems that scientists have to deal with. Let's take a look at a couple of historical examples from the perspective of naïve falsificationism.

First, Newton's laws. They were already falsified at the time of publication: they failed to correctly predict the motion of the moon. In the words of Newton, "the apse of the Moon is about twice as swift" as his predictions. Despite this falsification, the Principia attracted followers who worked to improve the theory. The moon was no small problem and took two decades to solve with the introduction of new auxiliary hypotheses.

A later episode involving Newton's laws illustrates how treacherous these auxiliary hypotheses can be. In 1846 Le Verrier (I have written about him before) solved an anomaly in the orbit of Uranus by hypothesizing the existence of a new planet. That planet was Neptune and its discovery was a wonderful confirmation of Newton's laws. A decade later Le Verrier tried to solve an anomaly in the orbit of Mercury using the same method. The hypothesized new planet was never found and Newton's laws remained at odds with the data for decades (yet nobody abandoned them). The solution was only found in 1915 with Einstein's general relativity: Newton should have been abandoned this time!

Second, Prout's hypothesis: in 1815 William Prout proposed that the atomic weights of all elements were multiples of the atomic weight of hydrogen. A decade later, chemists measured the atomic weight of chlorine at 35.45 times that of hydrogen and Prout's hypothesis was clearly falsified. Except, a century after that, isotopes were discovered: variants of chemical elements with different neutron numbers. Turns out that natural chlorine is composed of 76% 35Cl and 24% 37Cl, hence the atomic weight of 35.45. Whoops! So here we have a case where falsification depends on an auxiliary hypothesis (no isotopes) which the experimenters had no way of checking.4
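
(A quick check of the arithmetic, using the rounded isotope masses: 0.76 × 35 + 0.24 × 37 ≈ 35.5, which lines up with the measured 35.45 once the exact isotopic masses and abundances are used.)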

Popper tried to rescue falsificationism through a series of unsatisfying ad-hoc fixes: exhorting scientists not to be naughty when introducing auxiliary hypotheses, and saying falsification only applies to "serious anomalies". When asked what a serious anomaly is, he replied: "if an object were to move around the Sun in a square"!5

Problem, officer?

There are a few problems with rational reconstruction, and while I don't think any of them are fatal, they do mean we have to tread carefully.

External factors: no internal history of science can explain the popularity of Lysenkoism in the USSR—sometimes we have to appeal to external factors. But the line between internal and external history is unclear, and can even depend on your methodology of choice.

Meta-criterion choice: what criteria do you use to evaluate the quality of a rational reconstruction? Lakatos suggested using the criteria of each theory (eg use falsificationism to judge falsificationism) but he never made a good case for that over a standardized set of meta-criteria.

Case studies: philosophers tend to argue using case studies and it's easy to find one to support virtually any position, even if its normative suggestions are suboptimal. Lots of confirmation bias here. The illustrious Paul Meehl correctly argues for the use of "actuarial methods" instead. "Absent representative sampling, one lacks the database needed to best answer or resolve these types of inherently statistical questions." The metascientists obviously have a great methodological advantage here.

Fake history: the history of science as we read it today is sanitized if not fabricated.6 Successes are remembered and failures thrown aside; chaotic processes of discovery are cleaned up for presentation. As Peter Medawar noted in Is the scientific paper a fraud?, the "official record" of scientific progress contains few traces of the messy process that actually generated said progress.7 He further argues that there is a desire to conform to a particular ideal of induction which creates a biased picture of how scientific discovery works.

Falsification in Metascience

Now, let's shift our gaze to metascience. There's a fascinating subgenre of psychology in which researchers create elaborate scientific simulations and observe subjects as they try to make "scientific discoveries". The results can help us understand how scientific reasoning actually happens, how people search for hypotheses, design experiments, create new concepts, and so on. My favorite of these is Dunbar (1993), which involved a bunch of undergraduate students trying to recreate a Nobel-winning discovery in biochemistry.8

Reading these papers one gets the sense that there is a falsificationist background radiation permeating everything. When the subjects don't behave like falsificationists, it's simply treated as an error or a bias. Klahr & Dunbar scold their subjects: "our subjects frequently maintained their current hypotheses in the face of negative information". And within the tight confines of these experiments, it usually is an error. But this reflects the design of the experiment rather than any inherent property of scientific reasoning or progress, and extrapolating these results to real-world science in general would be a mistake.

Sociology offers a cautionary tale about what happens when you take this kind of reasoning to an extreme: the strong programme people started with an idealistic (and wrong) philosophy of science, then observed that real-world science does not actually operate like that, and concluded that it's all based on social forces and power relations, descending into an abyss of epistemological relativism. To reasonable people like you and me this looks like an excellent reductio ad absurdum, but sociologists are a special breed and one man’s modus ponens is another man’s modus tollens. The same applies to over-extensions of falsificationism. Lakatos:

...those trendy 'sociologists of knowledge' who try to explain the further (possibly unsuccessful) development of a theory 'falsified' by a 'crucial experiment' as the manifestation of the irrational, wicked, reactionary resistance by established authority to enlightened revolutionary innovation.

One could also argue that the current focus on replication is too narrow. The issue is obscured by the fact that in the current state of things the original studies tend to be very weak, the "theories" do not have track records of success, and the replications tend to be very strong, so the decision is fairly easy. But one can imagine a future scenario in which failed replications should be treated with far more skepticism.

There are also some empirical questions in this area that are ripe for the picking: at which point do scientists shift their beliefs to the replication over the original? What factors do they use? What do they view a falsification as actually refuting (ie where do they direct the modus tollens)? Longitudinal surveys, especially in the current climate of the social sciences, would be incredibly interesting.

Unit of Progress

One of the things philosophers of science are in agreement about is that individual scientists cannot be expected to behave rationally. Recall the example of Prout and the atomic weight of chlorine above: Prout simply didn't accept the falsifying results, and having obtained a value of 35.83 by experiment, rounded it to 36. To work around this problem, philosophers instead treated wider social or conceptual structures as the relevant unit of progress: "thinking style groups" (Fleck), "paradigms" (Kuhn), "research programmes" (Lakatos), "research traditions" (Laudan), etc. When a theory is tested, the implications of the result depend on the broader structure that theory is embedded in. Lakatos:

We have to study not the mind of the individual scientist but the mind of the Scientific Community. [...] Kuhn certainly showed that psychology of science can reveal important-and indeed sad-truths. But psychology of science is not autonomous; for the-rationally reconstructed-growth of science takes place essentially in the world of ideas, in Plato's and Popper's 'third world'.

Psychologists are temperamentally attracted to the individual, and this is reflected in their metascientific research methods which tend to focus on individual scientists' thinking, or isolated papers. Meehl, for example, simply views this as an opportunity to optimize individuals' cognitive performance:

The thinking of scientists, especially during the controversy or theoretical crises preceding Kuhnian revolutions, is often not rigorous, deep, incisive, or even fair-minded; and it is not "objective" in the sense of interjudge reliability. Studies of resistance to scientific discovery, poor agreement in peer review, negligible impact of most published papers, retrospective interpretations of error and conflict all suggest suboptimal cognitive performance.

Given the importance of broader structures, however, things that seem irrational from the individual perspective might make sense collectively. Institutional design is criminally under-explored, and the differences in attitudes both over time and across the cross-section of scientists are underrated objects of study.

You might retort that this is a job for the sociologists, but look at what they have produced: on the one hand they gave us Robert Merton, and on the other hand the strong programme. They don't strike me as particularly reliable.

Fields & Theories

Almost all the scientists doing philosophy of science were physicists or chemists, and the philosophers stuck to those disciplines in their analyses. Today's metascientists on the other hand mostly come from psychology and medicine. Not coincidentally, they tend to focus on psychology and medicine. These fields tend to have different kinds of challenges compared to the harder sciences: the relative lack of theory, for example, means that today's metascientists tend to ignore some of the most central parts of philosophy of science, such as questions about Lakatos's "positive heuristic" and how to judge auxiliary hypotheses, questions about whether the logical or empirical content of theories is preserved during progress, questions about how principles of theory evaluation change over time, and so on.

That's not to say no work at all has been done in this area: for example, Paul Meehl9 tried to construct a quantitative index of a theory's track record that could then be used to determine how to respond to a falsifying result. There's also some similar work from a Bayesian POV. But much more could be done in this direction, and much of it depends on going beyond medicine and the social sciences. "But Alvaro, I barely understand p-values, I could never do the math needed to understand physics!" If the philosophers could do it then so can the psychologists. But perhaps these problems require broader interdisciplinary involvement: not only specialists from other fields, but also contributions from neuroscience, computational science, etc.

What is progress?

One of the biggest questions the philosophers tried to answer was how progress is made, and how to even define it. Notions of progress as strictly cumulative (ie the new theory has to explain everything explained by the old one) inevitably lead to relativism, because theories are sometimes widely accepted at an "early" stage when they still have limitations relative to established ones. But what is the actual process of consensus formation? What principles do scientists actually use? What principles should they use? Mertonian theories about agreement on standards/aims are clearly false, but we don't have anything better to replace them. This is another question that depends on looking beyond psychology, toward more theory-oriented fields.

Looking Ahead

Metascience can continue the work and actually solve important questions posed by philosophers:

  • Is there a difference between mature and immature fields? Should there be?
  • What guiding assumptions are used for theory choice? Do they change over time, and if yes how are they accepted/rejected? What is the best set of rules? Meehl's suggestions are a good starting point: "We can construct other indexes of qualitative diversity, formal simplicity, novel fact predictivity, deductive rigor, and so on. Multiple indexes of theoretical merit could then be plotted over time, intercorrelated, and related to the long-term fate of theories."
  • Can we tell, in real time, which fields are progressing and which are degenerating? If not, is this an opening for irrationalism? What factors should we use to decide whether to stick with a theory on shaky ground? What factors should we use to judge auxiliary hypotheses?10 Meehl started doing good work in this area, let's build on it.
  • Does null hypothesis testing undermine progress in social sciences by focusing on stats rather than the building of solid theories as Meehl thought?
  • Is it actually useful, as Mitroff suggests, to have a wide array of differently-biased scientists working on the same problems? (At least when there's lots of uncertainty?)
  • Gholson & Barker 1985 applied Lakatos and Laudan's theories to progress in physics and psychology (arguing that some areas of psychology do have a strong theoretical grounding), but this should be taken beyond case studies: comparative approaches with normative conclusions. Do strong theories really help with progress in the social sciences? Protzko et al 2020 offer some great data with direct normative applications, much more could be done in this direction.
  • And hell, while I'm writing this absurd Christmas list let me add a cherry on top: give me a good explanation of how abduction works!

Recommended reading:

  • Imre Lakatos, The Methodology of Scientific Research Programmes [PDF] [Amazon]

  1. Scientific realism is the view that the entities described by successful scientific theories are real.
  2. Never go full relativist.
  3. Quine abandoned the entirety of epistemology, "as a chapter of psychology".
  4. Prout's hypothesis ultimately turned out to be wrong for other reasons, but it was much closer to the truth than initially suggested by chlorine.
  5. The end-point of this line is the naked appeal to authority for deciding what is a serious anomaly and what is not.
  6. Fictions like the idea that Newton's laws were derived from and compatible with Kepler's laws abound. Even in a popular contemporary textbook for undergrads you can find statements like "Newton demonstrated that [Kepler's] laws are a consequence of the gravitational force that exists between any two masses." But of course the planets do not follow perfect elliptical orbits in Newtonian physics, and empirical deviations from Kepler were already known in Newton's time.
  7. Fleck is also good on this point.
  8. Klahr & Dunbar (1988) and Mynatt, Doherty & Tweney (1978) are also worth checking out. Also, these experiments could definitely be taken further, as a way of rationally reconstructing past advances in the lab.
  9. Did I mention how great he is?
  10. Lakatos: "It is very difficult to decide, especially since one must not demand progress at each single step, when a research programme has degenerated hopelessly or when one of two rival programmes has achieved a decisive advantage over the other."



The Riddle of Sweden's COVID-19 Numbers

Comparing Sweden's COVID-19 statistics to other European countries, two peculiar features emerge:

  1. Despite very different policies, Sweden has a similar pattern of cases.
  2. Despite a similar pattern of cases, Sweden has a very different pattern of deaths.

Sweden's Strategy

What exactly has Sweden done (and not done) in response to COVID-19?

  • The government has banned large public gatherings.
  • The government has partially closed schools and universities: schools through lower secondary remained open, while upper secondary schools and universities switched to distance learning.
  • The government recommends voluntary social distancing. High-risk groups are encouraged to isolate.
  • Those with symptoms are encouraged to stay at home.
  • The government does not recommend the use of masks, and surveys confirm that very few people use them: 79% of Swedes report wearing one "not at all", compared to 2% in France, 0% in Italy, and 11% in the UK.
  • There was a ban on visits to care homes which was lifted in September.
  • There have been no lockdowns.

How has it worked? Well, Sweden is roughly at the same level as other western European countries in terms of per capita mortality, but it's also doing much worse than its Nordic neighbors. Early apocalyptic predictions have not materialized. Economically it doesn't seem to have gained much, as its Q2 GDP drop was more or less the same as that of Norway and Denmark.1

Case Counts

Sweden has followed a trajectory similar to that of other Western countries: a first wave in April, a pause during the summer (Sweden took longer to settle down, however), and now a second wave in autumn.2

The fact that the summer drop-off in cases happened in Sweden without lockdowns and without masks suggests that perhaps those were not the determining factors? It doesn't necessarily mean that lockdowns are ineffective in general, just that in this particular case the no-lockdown counterfactual probably looks similar.

The similarity of the trajectories plus the timing points to a common factor: climate.

Seasonality?

This sure looks like a seasonal pattern, right? And there are good a priori reasons to think COVID-19 will be slow to spread in summer: the majority of respiratory diseases all but disappear during the warmer months. This chart from Li, Wang & Nair (2020) shows the monthly activity of various viruses sorted by latitude:

The exact reasons are unclear, but it's probably a mix of temperature, humidity,3 behavioral factors, UV radiation, and possibly vitamin D.

However, when it comes to COVID-19 specifically there are reasons to be skeptical. The US did not have a strong seasonal pattern:

And in the southern hemisphere, Australia's two waves don't really fit a clear seasonal pattern. [Edit: or perhaps they do? Their second wave was the winter wave; climate differences and lockdowns could explain the divergence from the European pattern.]

The WHO (yes, yes, I know) says it's all one big wave and that COVID-19, unlike influenza, has no seasonal pattern. A report from the National Academy of Sciences is also very skeptical about seasonality, making comparisons to SARS and MERS, which do not exhibit seasonal patterns.

A review of 122 papers on the seasonality of COVID-19 is mostly inconclusive, citing lack of data and confounding from control measures and from social, economic, and cultural conditions. The papers themselves "offer mixed statistical support (none, weak, or strong relationships) for the influence of environmental drivers." Overall I don't think there's compelling evidence in favor of climatic variables explaining a large percentage of the variation in COVID-19 deaths. So if we can't attribute the summer "pause" and autumn "second wave" in Europe to seasonality, what is the underlying cause?

Schools?

If not the climate, then schools would be my next guess, but the evidence suggests they play a very small role. I like this study from Germany, which uses variation in the timing of summer breaks across states and finds no evidence for an effect on new cases. This paper utilizes the partial school closures in Sweden and finds that open schools had only "minor consequences". Looking at school closures during the SARS epidemic, the results are similar. The ECDC is not particularly worried about schools either, arguing that outbreaks in educational facilities are "exceptional events" that are "limited in number and size".

So what are we left with? Confusion.

Deaths

This chart shows daily new cases and new deaths for all of Europe:

There's a clear relationship between cases & deaths, with a lag of a few weeks as you would expect (and a change in magnitude due to increased testing and decreasing mortality rates). Here's what Sweden's chart looks like:

What is going on here? Fatality rates have been dropping everywhere, but cases and deaths appear to be completely disconnected in Sweden. Even the first death peak doesn't coincide with the first case peak, but that's probably because of early spread in nursing homes.

Are they undercounting deaths? I don't think so: total deaths seem to be below normal levels (data from euromomo):

So how do we explain the lack of deaths in Sweden?

Age?

Could it be that only young people are catching it in Sweden? I haven't found any up to date, day-by-day breakdowns by age, but comparing broad statistics for Sweden and Europe as a whole, they look fairly similar. Even if age could explain it, why would that be the case in Sweden and not in other countries? Why aren't the young people transmitting it to vulnerable old people? Perhaps it's happening and the lag is long enough that it's just not reflected in the data yet?

[Edit: thanks to commenter Frank Suozzo for pointing out that cases are concentrated in lower ages. I have found data from July 31 on the internet archive; comparing it to the latest figures, it appears that old people have managed to avoid getting covid in Sweden! Here's the chart showing total case counts:]

Improved Treatment?

Mortality has declined everywhere, and part of that is probably down to improved treatment. But I don't see Sweden doing anything unique which could explain the wild discrepancy.

Again I'm left confused about these cross-country differences. If you have any good theories I would love to hear them. Looks like age is the answer.

  1. I think the right way to look at this is to say that Sweden has underperformed given its cultural advantages. The differences between Italian-, French-, and German-speaking cantons in Switzerland suggest a large role for cultural factors. Sweden should've followed a trajectory similar to its neighbors rather than one similar to Central/Southern Europe. Of course it's hard to say how things will play out in the long run.
  2. Could this all be just because of increased testing? No. While testing has increased, the rate of positive tests has also risen dramatically. The second wave is not a statistical artifact.
  3. Humidity seems very important, at least when it comes to influenza. See eg Absolute Humidity and the Seasonal Onset of Influenza in the Continental United States and Absolute humidity modulates influenza survival, transmission, and seasonality. There's even experimental evidence, eg High Humidity Leads to Loss of Infectious Influenza Virus from Simulated Coughs and Humidity as a non-pharmaceutical intervention for influenza A.



When the Worst Man in the World Writes a Masterpiece

Boswell's Life of Johnson is not just one of my favorite books, it also engendered some of my favorite book reviews. While praise for the work is universal, the main question commentators try to answer is this: how did the worst man in the world manage to write the best biography?

The Man

Who was James Boswell? He was a perpetual drunk, a degenerate gambler, a sex addict, whoremonger, exhibitionist, and rapist. He gave his wife an STD he caught from a prostitute.

Selfish, servile and self-indulgent, lazy and lecherous, vain, proud, obsessed with his aristocratic status, yet with no sense of propriety whatsoever, he frequently fantasized about the feudal affection of serfs for their lords. He loved to watch executions and was a proud supporter of slavery.

“Where ordinary bad taste leaves off,” John Wain comments, “Boswell began.” The Thrales were long-time friends and patrons of Johnson; a single day after Henry Thrale died, Boswell wrote a poem fantasizing about the elderly Johnson and the just-widowed Hester: "Convuls'd in love's tumultuous throws, / We feel the aphrodisian spasm". The rest of his verse is of a similar quality; naturally he considered himself a great poet.

Boswell combined his terrible behavior with a complete lack of shame, faithfully reporting every transgression, every moronic ejaculation, every faux pas. The first time he visited London he went to see a play and, as he happily tells us himself, he "entertained the audience prodigiously by imitating the lowing of a cow."

By all accounts, including his own, he was an idiot. On a tour of Europe, his tutor said to him: "of young men who have studied I have never found one who had so few ideas as you."

As a lawyer he was a perpetual failure, especially when he couldn't get Johnson to write his arguments for him. As a politician he didn't even get the chance to be a failure despite decades of trying.

His correspondence with Johnson mostly consists of Boswell whining pathetically and Johnson telling him to get his shit together.

He commissioned a portrait from his friend Joshua Reynolds and stiffed him on the payment. His descendants hid the portrait in the attic because they were ashamed of being related to him.

Desperate for fame, he kept trying to attach himself to important people, mostly through sycophancy. In Geneva he pestered Rousseau,1 leading to this conversation:

Rousseau: You are irksome to me. It’s my nature. I cannot help it.
Boswell: Do not stand on ceremony with me.
Rousseau: Go away.

Later, Boswell was given the task of escorting Rousseau's mistress Thérèse Le Vasseur to England—they had an affair on the way.

When Adam Smith and Edward Gibbon were elected to The Literary Club, Boswell considered leaving because he thought the club had now "lost its select merit"!

On the positive side, his humor and whimsy made for good conversation; he put people at ease; he gave his children all the love his own father had denied him; and, somehow, he wrote one of the great works of English literature.

The Masterpiece

The Life of Samuel Johnson, LL.D. was an instant sensation. While the works of Johnson were quickly forgotten,2 his biography has never been out of print in the 229 years since its initial publication. It went through 41 editions just in the 19th century.

Burke told King George III that he had never read anything more entertaining. Coleridge said "it is impossible not to be amused with such a book." George Bernard Shaw compared Boswell's dramatization of Johnson to Plato's dramatization of Socrates, and placed old Bozzy in the middle of an "apostolic succession of dramatists" from the Greek tragedians through Shakespeare and ending, of course, with Shaw himself.

It is a strange work, an experimental collage of different modes: part traditional biography, part collection of letters, and part direct reports of Johnson's life as observed by Boswell.3 His inspiration came not from literature, but from the minute naturalistic detail of Flemish paintings. It is difficult to convey its greatness in compressed form: Boswell is not a great writer at the sentence level, and all the famous quotes are (hilarious) Johnsonian bon mots. The book succeeds through a cumulative effect.

Johnson was 54 years old when he first met Boswell, and most of his major accomplishments (the poetry, the dictionary, The Rambler) were behind him; his wife had already died; he was already the recipient of a £300 pension from the King; his edition of Shakespeare was almost complete. All in all they spent no more than 400 days together. Boswell had limited material to work with, but what he doesn't capture in fact, he captures in feeling. An entire life is contained in this book: love and friendship, taverns and work, the glory of success and recognition, the depressive bouts of failure and penury, the inevitable tortures of aging and death.

Out of a person, Boswell created a literary personality. His powers of characterization are positively Shakespearean, and his Johnson resembles none other than the bard's greatest creation: Sir John Falstaff. Big, brash, and deeply flawed, but also lovable. He would "laugh like a rhinoceros":

Johnson could not stop his merriment, but continued it all the way till he got without the Temple-gate. He then burst into such a fit of laughter that he appeared to be almost in a convulsion; and in order to support himself, laid hold of one of the posts at the side of the foot pavement, and sent forth peals so loud, that in the silence of the night his voice seemed to resound from Temple-bar to Fleet ditch.

And around Johnson he painted an entire dramatic cast, bringing 18th century London to life: Garrick the great actor, Reynolds the painter, Beauclerk with his banter, Goldsmith with his insecurities. Monboddo and Burke, Henry and Hester Thrale, the blind Mrs Williams and the Jamaican freedman Francis Barber.

Borges (who was also a big fan) finds his parallels not in Shakespeare and Falstaff, but in Cervantes and Don Quixote. He (rather implausibly) suggests that every Quixote needs his Sancho, and "Boswell appears as a despicable character" deliberately to create a contrast.4

And in the 1830s, two brilliant and influential reviews were written by two polar opposites: arch-progressive Thomas Babington Macaulay and radical reactionary Thomas Carlyle. The first thing you'll notice is their sheer magnitude: Macaulay's is 55 pages long, while Carlyle's review in Fraser's Magazine reaches 74 pages!5 And while they both agree that it's a great book and that Boswell was a scoundrel, they have very different theories about what happened.

Macaulay

Never in history, Macaulay says, has there been "so strange a phænomenon as this book". On the one hand he has effusive praise:

Homer is not more decidedly the first of heroic poets, Shakspeare is not more decidedly the first of dramatists, Demosthenes is not more decidedly the first of orators, than Boswell is the first of biographers. He has no second. He has distanced all his competitors so decidedly that it is not worth while to place them.

On the other hand, he spends several paragraphs laying into Boswell with gusto:

He was, if we are to give any credit to his own account or to the united testimony of all who knew him, a man of the meanest and feeblest intellect. [...] He was the laughing-stock of the whole of that brilliant society which has owed to him the greater part of its fame. He was always laying himself at the feet of some eminent man, and begging to be spit upon and trampled upon. [...] Servile and impertinent, shallow and pedantic, a bigot and a sot, bloated with family pride, and eternally blustering about the dignity of a born gentleman, yet stooping to be a talebearer, an eavesdropper, a common butt in the taverns of London.

Macaulay's theory is that while Homer and Shakespeare and all the other greats owe their eminence to their virtues, Boswell is unique in that he owes his success to his vices.

He was a slave, proud of his servitude, a Paul Pry, convinced that his own curiosity and garrulity were virtues, an unsafe companion who never scrupled to repay the most liberal hospitality by the basest violation of confidence, a man without delicacy, without shame, without sense enough to know when he was hurting the feelings of others or when he was exposing himself to derision; and because he was all this, he has, in an important department of literature, immeasurably surpassed such writers as Tacitus, Clarendon, Alfieri, and his own idol Johnson.

Of the talents which ordinarily raise men to eminence as writers, Boswell had absolutely none. There is not in all his books a single remark of his own on literature, politics, religion, or society, which is not either commonplace or absurd. [...] Logic, eloquence, wit, taste, all those things which are generally considered as making a book valuable, were utterly wanting to him. He had, indeed, a quick observation and a retentive memory. These qualities, if he had been a man of sense and virtue, would scarcely of themselves have sufficed to make him conspicuous; but, because he was a dunce, a parasite, and a coxcomb, they have made him immortal.

The work succeeds partly because of its subject: if Johnson had not been so extraordinary, then airing all his dirty laundry would have just made him look bad.

No man, surely, ever published such stories respecting persons whom he professed to love and revere. He would infallibly have made his hero as contemptible as he has made himself, had not his hero really possessed some moral and intellectual qualities of a very high order. The best proof that Johnson was really an extraordinary man is that his character, instead of being degraded, has, on the whole, been decidedly raised by a work in which all his vices and weaknesses are exposed.

And finally, Boswell provided Johnson with a curious form of literary fame:

The reputation of [Johnson's] writings, which he probably expected to be immortal, is every day fading; while those peculiarities of manner and that careless table-talk the memory of which, he probably thought, would die with him, are likely to be remembered as long as the English language is spoken in any quarter of the globe.

Carlyle

Carlyle rates Johnson's biography as the greatest work of the 18th century. In a sublime passage that brings tears to my eyes, he credits the Life with the power of halting the inexorable passage of time:

Rough Samuel and sleek wheedling James were, and are not. [...] The Bottles they drank out of are all broken, the Chairs they sat on all rotted and burnt; the very Knives and Forks they ate with have rusted to the heart, and become brown oxide of iron, and mingled with the indiscriminate clay. All, all has vanished; in every deed and truth, like that baseless fabric of Prospero's air-vision. Of the Mitre Tavern nothing but the bare walls remain there: of London, of England, of the World, nothing but the bare walls remain; and these also decaying (were they of adamant), only slower. The mysterious River of Existence rushes on: a new Billow thereof has arrived, and lashes wildly as ever round the old embankments; but the former Billow with its loud, mad eddyings, where is it? Where! Now this Book of Boswell's, this is precisely a revocation of the edict of Destiny; so that Time shall not utterly, not so soon by several centuries, have dominion over us. A little row of Naphtha-lamps, with its line of Naphtha-light, burns clear and holy through the dead Night of the Past: they who are gone are still here; though hidden they are revealed, though dead they yet speak. There it shines, that little miraculously lamplit Pathway; shedding its feebler and feebler twilight into the boundless dark Oblivion, for all that our Johnson touched has become illuminated for us: on which miraculous little Pathway we can still travel, and see wonders.

Carlyle disagrees completely with Macaulay: it is not because of his vices that Boswell could write this book, but rather because he managed to overcome them. He sees in Boswell a hopeful symbol for humanity as a whole, a victory in the war between the base and the divine in our souls.

In fact, the so copious terrestrial dross that welters chaotically, as the outer sphere of this man's character, does but render for us more remarkable, more touching, the celestial spark of goodness, of light, and Reverence for Wisdom, which dwelt in the interior, and could struggle through such encumbrances, and in some degree illuminate and beautify them.

Boswell's shortcomings were visible: he was "vain, heedless, a babbler". But if that was the whole story, would he really have chosen Johnson? He could have picked more illustrious targets, richer ones, perhaps some powerful statesman or an aristocrat with a distinguished lineage. "Doubtless the man was laughed at, and often heard himself laughed at for his Johnsonism". Boswell must have been attracted to Johnson by nobler motives, and to act on them he would have to "hurl mountains of impediment aside" and overcome his nature.

The plate-licker and wine-bibber dives into Bolt Court, to sip muddy coffee with a cynical old man, and a sour-tempered blind old woman (feeling the cups, whether they are full, with her finger); and patiently endures contradictions without end; too happy so he may but be allowed to listen and live.

The Life is not great because of Boswell's foolishness, but because of his love and his admiration, an admiration that Macaulay considered a disease. Boswell wrote that in Johnson's company he "felt elevated as if brought into another state of being".

His sneaking sycophancies, his greediness and forwardness, whatever was bestial and earthy in him, are so many blemishes in his Book, which still disturb us in its clearness; wholly hindrances, not helps. Towards Johnson, however, his feeling was not Sycophancy, which is the lowest, but Reverence, which is the highest of human feelings.

On Johnson's personality, Carlyle writes: "seldom, for any man, has the contrast between the ethereal heavenward side of things, and the dark sordid earthward, been more glaring". And this is what Johnson wrote about Falstaff in his Shakespeare commentary:

Falstaff is a character loaded with faults, and with those faults which naturally produce contempt. [...] the man thus corrupt, thus despicable, makes himself necessary to the prince that despises him, by the most pleasing of all qualities, perpetual gaiety, by an unfailing power of exciting laughter, which is the more freely indulged, as his wit is not of the splendid or ambitious kind, but consists in easy escapes and sallies of levity, which make sport but raise no envy.

Johnson obviously enjoyed the comparison to Falstaff, but would it be crazy to also see Boswell in there? The Johnson presented to us in the Life is a man who had to overcome poverty, disease, depression, and a constant fear of death, but never let those things poison his character. Perhaps Boswell crafted the character he wished he could become: Johnson was his Beatrice—a dream, an aspiration, an ideal outside his grasp that nonetheless thrust him toward greatness. Through a process of self-overcoming Boswell wrote a great book on self-overcoming.

Mediocrities Everywhere...I Absolve You

The story of Boswell is basically the plot of Amadeus, with the role of Salieri being played by Macaulay, by Carlyle, by me, and perhaps even by yourself, dear reader. The line between admiration, envy, and resentment is thin, and crossing it is easier when the subject is a scoundrel. But if Bozzy could set aside resentment for genuine reverence, perhaps there is hope for us all. And yet...it would be an error to see in Boswell the Platonic Form of Mankind.

Shaffer and Forman's film portrays Mozart as vulgar, arrogant, a womanizer, bad with money—but, like Bozzy, still somehow quite likable. In one of the best scenes of the film, we see Mozart transform the screeching of his mother-in-law into the Queen of the Night Aria; thus Boswell transformed his embarrassments into literary gold. He may be vulgar, but his productions are not. He may be vulgar, but he is not ordinary.

Perhaps it is in vain that we seek correlations among virtues and talents: perhaps genius is ineffable. Perhaps it's Ramanujans all the way down. You can't even say that genius goes with independence: there's nothing Boswell wanted more than social approval. I won't tire you with clichés about the Margulises and the Musks.

Would Johnson have guessed that he would be the mediocrity, and Bozzy the genius? Would he have felt envy and resentment? What would he say, had he been given the chance to read in Carlyle that Johnson's own writings "are becoming obsolete for this generation; and for some future generation may be valuable chiefly as Prolegomena and expository Scholia to this Johnsoniad of Boswell"?


If you want to read The Life of Johnson, I recommend a second-hand copy of the Everyman's Library edition: cheap, reasonably sized, and the paper & binding are great.


  1. In the very first letter Boswell wrote to Rousseau, he described himself as "a man of singular merit".
  2. They were "rediscovered" in the early 1900s.
  3. While some are quick to dismiss the non-direct parts, I think they're necessary, especially the letters which illuminate a different side of Johnson's character.
  4. Lecture #10 in Professor Borges: A Course on English Literature.
  5. What happened to the novella-length book review? Anyway, many of those pages are taken up by criticism of John Wilson Croker's incompetent editorial efforts.



Links & What I've Been Reading Q3 2020

High Replicability of Newly-Discovered Social-behavioral Findings is Achievable: a replication of 16 papers that followed "optimal practices" finds a high rate of replicability and effect sizes virtually identical to those of the original studies.

How do you decide what to replicate? This paper attempts to build a model that can be used to pick studies to maximize utility gained from replications.

Guzey on that deworming study: he tracks which variables are reported across 5 different drafts of the paper, starting in 2011. "But then you find that these variables didn’t move in the right direction. What do you do? Do you have to show these variables? Or can you drop them?"

I've been enjoying the NunoSempre forecasting newsletter, a monthly collection of links on forecasting.

COVID-19 made weather forecasts worse by limiting the meteorological data coming from airplanes.

The 16th paragraph in this piece on the long-term effects of coronavirus mentions that 2 out of 3 people with "long-lasting" COVID-19 symptoms never had COVID to begin with.

An experiment with working 120 hours in a week goes surprisingly well.

Gwern's giant GPT-3 page. The Zizek Navy Seal Copypasta is incredible, as are the poetic imitations.

Ethereum is a Dark Forest. "In the Ethereum mempool, these apex predators take the form of “arbitrage bots.” Arbitrage bots monitor pending transactions and attempt to exploit profitable opportunities created by them."

Tyler Cowen in conversation with Nicholas Bloom, lots of fascinating stuff on innovation and progress. "Just in economics — when I first started in economics, it was standard to do a four-year PhD. It’s now a six-year PhD, plus many of the PhD students have done a pre-doc, so they’ve done an extra two years. We’re taking three or four years longer just to get to the research frontier." Immediately made me think of Scott Alexander's Ars Longa, Vita Brevis.

The Progress Studies for Young Scholars youtube channel has a bunch of interesting interviews, including Cowen, Collison, McCloskey, and Mokyr.

From the promising new Works in Progress magazine, Progress studies: the hard question.

I've written a parser for your Kindle's My Clippings.txt file. It removes duplicates, splits them up by book, and outputs them in convenient formats. Works cross-platform.
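
If you want to roll your own, the file format is simple: clippings are separated by a line of equals signs, and each entry starts with the book title followed by a metadata line. A minimal sketch of the idea in Python (an illustration under those assumptions, not the actual tool):

    from collections import defaultdict

    # Split My Clippings.txt into entries, group highlights by book, drop duplicates.
    with open("My Clippings.txt", encoding="utf-8-sig") as f:
        entries = [e.strip() for e in f.read().split("==========") if e.strip()]

    by_book = defaultdict(list)
    for entry in entries:
        lines = entry.splitlines()
        title = lines[0].strip()
        text = "\n".join(lines[2:]).strip()  # skip the "- Your Highlight..." metadata line
        if text and text not in by_book[title]:
            by_book[title].append(text)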

Generative bad handwriting in 280 characters. You can find a lot more of that sort of thing by searching for #つぶやきProcessing on twitter.

A new ZeroHPLovecraft short story, Key Performance Indicators. Black Mirror-esque.

A great skit about Ecclesiastes from Israeli sketch show The Jews Are Coming. Turn on the subs.

And here's some sweet Dutch prog-rock/jazz funk from the 70s.

What I've Been Reading

  • Piranesi by Susanna Clarke. 16 years after Jonathan Strange & Mr Norrell, a new novel from Susanna Clarke! It's short and not particularly ambitious, but I enjoyed it a lot. A tight fantastical mystery that starts out similar to The Library of Babel but then goes off in a different direction.

  • The Poems of T. S. Eliot: the great ones are great, and there's a lot of mediocre stuff in between. Ultimately a bit too grey and resigned and pessimistic for my taste. I got the Faber & Faber hardcover edition and would not recommend it: it's unwieldy and the notes are mostly useless.

  • Antkind by Charlie Kaufman. A typically Kaufmanesque work about a neurotic film critic and his discovery of an astonishing piece of outsider art. Memory, consciousness, time, doubles, etc. Extremely good and laugh-out-loud funny for the first half, but the final 300-400 pages were a boring, incoherent psychedelic smudge.

  • Under the Volcano by Malcolm Lowry. Very similar to another book I read recently, Lawrence Durrell's Alexandria Quartet. I prefer Durrell. Lowry doesn't have the stylistic ability to make the endless internal monologues interesting (as eg Gass does in The Tunnel), and I find the central allegory deeply misguided. Also, it's the kind of book that has a "central allegory".

  • Less than One by Joseph Brodsky. A collection of essays, mostly on Russian poetry. If I knew more about that subject I think I would have enjoyed the book more. The essays on his life in Soviet Russia are good.

  • Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science by Stuart Ritchie. Very good, esp. if you are not familiar with the replication crisis. Some quibbles about the timing and causes of the problems. Full review here.

  • The Idiot by "Dostoyevsky". Review forthcoming.

  • Borges and His Successors: The Borgesian Impact on Literature and the Arts: a collection of fairly dull essays with little to no insight.

  • Samuel Johnson: Literature, Religion and English Cultural Politics from the Restoration to Romanticism by J.C.D. Clark: a dry but well-researched study on an extraordinarily narrow slice of cultural politics. Not really aimed at a general audience.

  • Dhalgren by Samuel R. Delany. A wild semi-autobiographical semi-post-apocalyptic semi-science fiction monster. It's a 900 page slog, it's puerile, the endless sex scenes (including with minors) are pointless at best, the characters are uninteresting, there's barely any plot, the 70s counterculture stuff is just comical, and stylistically it can't reach the works it's aping. So I can see why some people hate it. But I actually enjoyed it, it has a compelling strangeness to it that is difficult to put into words (or perhaps I was just taken in by all the unresolved plot points?). Its sheer size is a quality in itself, too. Was it worth the effort? Could I recommend it? Probably not.

  • Novum Organum by Francis Bacon. While he did not actually invent the scientific method, his discussion of empiricism, experiments, and induction was clearly a step in that direction. The first part deals with science and empiricism and induction from an abstract perspective and it feels almost contemporary, like it was written by a time traveling 19th century scientist or something like that. The quarrel between the ancients and the moderns is already in full swing here, Bacon dunks on the Greeks constantly and upbraids people for blindly listening to Aristotle. Question received dogma and popular opinions, he says. He points to inventions like gunpowder and the compass and printing and paper and says that surely these indicate that there's a ton of undiscovered ideas out there, we should go looking for them. He talks about cognitive biases and scientific progress:

    we are laying the foundations not of a sect or of a dogma, but of human progress and empowerment.

    Then you get to the second part and the middle ages hit you like a freight train, you suddenly realize this is no contemporary man at all and his conception of how the world works is completely alien. Ideas that to us seem bizarre and just intuitively nonsensical (about gravity, heat, light, biology, etc.) are only common sense to him. He repeats absurdities about worms and flies arising spontaneously out of putrefaction, that light objects are pulled to the heavens while heavy objects are pulled to the earth, and so on. Not just surface-level opinions, but fundamental things that you wouldn't even think someone else could possibly perceive differently.

    You won't learn anything new from Bacon, but it's a fascinating historical document.

  • The Book of Marvels and Travels by John Mandeville. This medieval bestseller (published around 1360) combines elements of travelogue, ethnography, and fantasy. It's unclear how much of it people believed, but there was huge demand for information about far-off lands and marvelous stories. Mostly compiled from other works, it was incredibly popular for centuries. In the age of exploration (Columbus took it with him on his trip) people were shocked when some of the fantastical stories (eg about cannibals) actually turned out to be true. The tricks the author uses to generate verisimilitude are fascinating: he adds small personal touches about people he met, sometimes says that he doesn't know anything about a particular region because he hasn't been there, etc.




What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers

I've seen things you people wouldn't believe.

Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA's SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds. In total, about $200,000 in prize money will be awarded.

The studies were sourced from all social science disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication crisis era).

The average replication probability in the market was 54%; while the replication results are not out yet (250 of the 3000 papers will be replicated), previous experiments have shown that prediction markets work well.1

This is what the distribution of my own predictions looks like:2

My average forecast was in line with the market. A quarter of the claims were above 76%. And a quarter of them were below 33%: we're talking hundreds upon hundreds of terrible papers, and this is just a tiny sample of the annual academic production.

Criticizing bad science from an abstract, 10000-foot view is pleasant: you hear about some stuff that doesn't replicate, some methodologies that seem a bit silly. "They should improve their methods", "p-hacking is bad", "we must change the incentives", you declare Zeuslike from your throne in the clouds, and then go on with your day.

But actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: "come on then, jump in".

They Know What They're Doing

Prediction markets work well because predicting replication is easy.3 There's no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors—these papers have obvious, surface-level problems.

There's a popular belief that weak studies are the result of unconscious biases leading researchers down a "garden of forking paths". Given enough "researcher degrees of freedom" even the most punctilious investigator can be misled.

I find this belief impossible to accept. The brain is a credulous piece of meat4 but there are limits to self-delusion. Most of them have to know. It's understandable to be led down the garden of forking paths while producing the research, but when the paper is done and you give it a final read-over you will surely notice that all you have is an n=23, p=0.049 three-way interaction effect (one of dozens you tested, and with no multiple testing adjustments of course). At that point it takes more than a subtle unconscious bias to believe you have found something real. And even if the authors really are misled by the forking paths, what are the editors and reviewers doing? Are we supposed to believe they are all gullible rubes?
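
To get a sense of the scale of the problem: under the null hypothesis each test has a 5% chance of coming up "significant", so with, say, 20 independent shots at p < 0.05 the probability of at least one hit on pure noise is 1 - 0.95^20 ≈ 64%. The real tests are correlated, so this is only an illustration, but it shows how little evidential weight such a result carries on its own.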

People within the academy don't want to rock the boat. They still have to attend the conferences, secure the grants, publish in the journals, show up at the faculty meetings: all these things depend on their peers. When criticising bad research it's easier for everyone to blame the forking paths rather than the person walking them. No need for uncomfortable unpleasantries. The fraudster can admit, without much of a hit to their reputation, that indeed they were misled by that dastardly garden, really through no fault of their own whatsoever, at which point their colleagues on twitter will applaud and say "ah, good on you, you handled this tough situation with such exquisite virtue, this is how progress happens! hip, hip, hurrah!" What a ridiculous charade.

Even when they do accuse someone of wrongdoing they use terms like "Questionable Research Practices" (QRP). How about Questionable Euphemism Practices?

  • When they measure a dozen things and only pick their outcome variable at the end, that's not the garden of forking paths but the greenhouse of fraud.
  • When they do a correlational analysis but give "policy implications" as if they were doing a causal one, they're not walking around the garden, they're doing the landscaping of forking paths.
  • When they take a continuous variable and arbitrarily bin it to do subgroup analysis or when they add an ad hoc quadratic term to their regression, they're...fertilizing the garden of forking paths? (Look, there's only so many horticultural metaphors, ok?)

The bottom line is this: if a random schmuck with zero domain expertise like me can predict what will replicate, then so can scientists who have spent half their lives studying this stuff. But they sure don't act like it.

...or Maybe They Don't?

The horror! The horror!

Check out this crazy chart from Yang et al. (2020):

Yes, you're reading that right: studies that replicate are cited at the same rate as studies that do not. Publishing your own weak papers is one thing, but citing other people's weak papers? This seemed implausible, so I decided to do my own analysis with a sample of 250 articles from the Replication Markets project. The correlation between citations per year and (market-estimated) probability of replication was -0.05!
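The analysis itself is nothing fancy, just a correlation. Here's a minimal sketch in Python; the file and column names are hypothetical, since the Replication Markets sample isn't bundled with this post:

```python
# Hypothetical file and column names -- one row per article in the RM sample.
import pandas as pd

df = pd.read_csv("rm_sample.csv")
df["citations_per_year"] = df["citations"] / df["years_since_publication"]

# Pearson correlation between citation rate and market-estimated P(replication);
# on my sample of 250 articles this came out to about -0.05.
print(df["citations_per_year"].corr(df["market_p_replicate"]))
```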

You might hypothesize that the citations of non-replicating papers are negative, but negative citations are extremely rare.5 One study puts the rate at 2.4%. Astonishingly, even after retraction the vast majority of citations are positive, and those positive citations continue for decades after retraction.6

As in all affairs of man, it once again comes down to Hanlon's Razor. Either:

  1. Malice: they know which results are likely false but cite them anyway.
  2. or, Stupidity: they can't tell which papers will replicate even though it's quite easy.

Accepting the first option would require a level of cynicism that even I struggle to muster. But the alternative doesn't seem much better: how can they not know? I, an idiot with no relevant credentials or knowledge, can fairly accurately determine good research from bad, but all the tenured experts cannot? How can they not tell which papers are retracted?

I think the most plausible explanation is that scientists don't read the papers they cite, which I suppose involves both malice and stupidity.7 Gwern has a nice write-up on this question citing some ingenious analyses based on the proliferation of misprints: "Simkin & Roychowdhury venture a guess that as many as 80% of authors citing a paper have not actually read the original". Once a paper is out there nobody bothers to check it, even though they know there's a 50-50 chance it's false!

Whatever the explanation might be, the fact is that the academic system does not allocate citations to true claims.8 This is bad not only for the direct effect of basing further research on false results, but also because it distorts the incentives scientists face. If nobody cited weak studies, we wouldn't have so many of them. Rewarding impact without regard for the truth inevitably leads to disaster.

There Are No Journals With Strict Quality Standards

Naïvely you might expect that the top-ranking journals would be full of studies that are highly likely to replicate, and the low-ranking journals would be full of p<0.1 studies based on five undergraduates. Not so! Like citations, journal status and quality are not very well correlated: there is no association between statistical power and impact factor, and journals with higher impact factor have more papers with erroneous p-values.

This pattern is repeated in the Replication Markets data. As you can see in the chart below, there's no relationship between h-index (a measure of impact) and average expected replication rates. There's also no relationship between h-index and expected replication within fields.

Even the crème de la crème of economics journals barely manage a ⅔ expected replication rate. 1 in 5 articles in QJE scores below 50%, and this is a journal that accepts just 1 out of every 30 submissions. Perhaps this (partially) explains why scientists are undiscerning: journal reputation acts as a cloak for bad research. It would be fun to test this idea empirically.

Here you can see the distribution of replication estimates for every journal in the RM sample:

As far as I can tell, for most journals the question of whether the results in a paper are true is a matter of secondary importance. If we model journals as wanting to maximize "impact", then this is hardly surprising: as we saw above, citation counts are unrelated to truth. If scientists were more careful about what they cited, then journals would in turn be more careful about what they publish.

Things Are Not Getting Better

Before we got to see any of the actual Replication Markets studies, we voted on the expected replication rates by year. Gordon et al. (2020) has that data: replication rates were expected to steadily increase from 43% in 2009/2010 to 55% in 2017/2018.

This is what the average predictions looked like after seeing the papers: from 53.4% in 2009 to 55.8% in 2018 (difference not statistically significant; black dots are means).

I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn't even dream of publishing studies based on 23 undergraduates any more (I actually saw plenty of those), etc. Stuart Ritchie's new book praises psychologists for developing "systematic ways to address" the flaws in their discipline. In reality there has been no discernible improvement.

The results aren't out yet, so it's possible that the studies have improved in subtle ways which the forecasters have not been able to detect. Perhaps the actual replication rates will be higher. But I doubt it. Looking at the distribution of p-values over time, there's a small increase in the proportion of p<.001 results, but nothing like the huge improvement that was expected.

Everyone is Complicit

Authors are just one small cog in the vast machine of scientific production. For this stuff to be financed, generated, published, and eventually rewarded requires the complicity of funding agencies, journal editors, peer reviewers, and hiring/tenure committees. Given the current structure of the machine, ultimately the funding agencies are to blame.9 But "I was just following the incentives" only goes so far. Editors and reviewers don't actually need to accept these blatantly bad papers.

Journals and universities certainly can't blame the incentives when they stand behind fraudsters to the bitter end. Paolo Macchiarini "left a trail of dead patients" but was protected for years by his university. Andrew Wakefield's famously fraudulent autism-MMR study took 12 years to retract. Even when the author of a paper admits the results were entirely based on an error, journals still won't retract.

Elisabeth Bik documents her attempts to report fraud to journals. It looks like this:

The Editor in Chief of Neuroscience Letters [Yale's Stephen G. Waxman] never replied to my email. The APJTM journal had a new publisher, so I wrote to both current Editors in Chief, but they never replied to my email.

Two papers from this set had been published in Wiley journals, Gerodontology and J Periodontology. The EiC of the Journal of Periodontology never replied to my email. None of the four Associate Editors of that journal replied to my email either. The EiC of Gerodontology never replied to my email.

Even when they do take action, journals will often let scientists "correct" faked figures instead of retracting the paper! The rate of retraction is about 0.04%; it ought to be much higher.

And even after being caught for outright fraud, about half of the offenders are allowed to keep working: they "have received over $123 million in federal funding for their post-misconduct research efforts".

Just Because a Paper Replicates Doesn't Mean it's Good

First: a replication of a badly designed study is still badly designed. Suppose you are a social scientist, and you notice that wet pavements tend to be related to umbrella usage. You do a little study and find the correlation is bulletproof. You publish the paper and try to sneak in some causal language when the editors/reviewers aren't paying attention. Rain is never even mentioned. Of course if someone repeats your study, they will get a significant result every time. This may sound absurd, but it describes a large proportion of the papers that successfully replicate.

Economists and education researchers tend to be relatively good with this stuff, but as far as I can tell most social scientists go through 4 years of undergrad and 4-6 years of PhD studies without ever encountering ideas like "identification strategy", "model misspecification", "omitted variable", "reverse causality", or "third-cause". Or maybe they know and deliberately publish crap. Fields like nutrition and epidemiology are in an even worse state, but let's not get into that right now.

"But Alvaro, correlational studies can be usef-" Spare me.

Second: the choice of claim for replication. For some papers it's clear (eg math educational intervention → math scores), but other papers make dozens of different claims which are all equally important. Sometimes the Replication Markets organisers picked an uncontroversial claim from a paper whose central experiment was actually highly questionable. In this way a study can get the "successfully replicates" label without its most contentious claim being tested.

Third: effect size. Should we interpret claims in social science as being about the magnitude of an effect, or only about its direction? If the original study says an intervention raises math scores by .5 standard deviations and the replication finds that the effect is .2 standard deviations (though still significant), that is considered a success that vindicates the original study! This is one area in which we absolutely have to abandon the binary replicates/doesn't replicate approach and start thinking more like Bayesians.

Fourth: external validity. A replicated lab experiment is still a lab experiment. While some replications try to address aspects of external validity (such as generalizability across different cultures), the question of whether these effects are relevant in the real world is generally not addressed.

Fifth: triviality. A lot of the papers in the 85%+ chance-to-replicate range are just really obvious. "Homeless students have lower test scores", "parent wealth predicts their children's wealth", that sort of thing. These are not worthless, but they're also not really expanding the frontiers of science.

So: while about half the papers will replicate, I would estimate that only half of those are actually worthwhile.

Lack of Theory

The majority of journal articles are almost completely atheoretical. Even if all the statistical, p-hacking, publication bias, etc. issues were fixed, we'd still be left with a ton of ad-hoc hypotheses based, at best, on (WEIRD) folk intuitions. But how can science advance if there's no theoretical grounding, nothing that can be refuted or refined? A pile of "facts" does not a progressive scientific field make.

Michael Muthukrishna and the superhuman Joe Henrich have written a paper called A Problem in Theory which covers the issue better than I ever could. I highly recommend checking it out.

Rather than building up principles that flow from overarching theoretical frameworks, psychology textbooks are largely a potpourri of disconnected empirical findings.

There's Probably a Ton of Uncaught Frauds

This is a fairly lengthy topic, so I made a separate post for it. tl;dr: I believe about 1% of falsified/fabricated papers are retracted, but overall they represent a very small portion of non-replicating research.

Power: Not That Bad

[Warning: technical section. Skip ahead if bored.]

A quick refresher on hypothesis testing:

  • α, the significance level, is the probability of a false positive (rejecting the null when it is true).
  • β, or type II error, is the probability of a false negative (failing to detect a real effect).
  • Power is (1-β): if a study has 90% power, there's a 90% chance of successfully detecting the effect being studied. Power increases with sample size and effect size.
  • The probability that a significant p-value indicates a true effect is not 1-α. It is called the positive predictive value (PPV), and is calculated as follows: PPV = \frac{prior \cdot power}{prior \cdot power + (1-prior) \cdot \alpha}
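Here is the same arithmetic as a minimal Python sketch (essentially what the widget further down computes):

```python
def ppv(prior, power, alpha):
    """Probability that a statistically significant result reflects a true effect."""
    true_pos = prior * power          # true effects that reach significance
    false_pos = (1 - prior) * alpha   # null effects that reach significance anyway
    return true_pos / (true_pos + false_pos)

# The numbers used later in this section: prior = 25%, power = 60%, alpha = 5%.
print(ppv(prior=0.25, power=0.60, alpha=0.05))  # ~0.80, i.e. ~20% of significant findings are false
```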

This great diagram by Felix Schönbrodt gives the intuition behind PPV:

This model makes the assumption that effects can be neatly split into two categories: those that are "real" and those that are not. But is this accurate? At the opposite extreme you have the "crud factor": everything is correlated, so if your sample is big enough you will always find a real effect.10 As Bakan puts it: "there is really no good reason to expect the null hypothesis to be true in any population". If you look at the universe of educational interventions, for example, are they going to be neatly split into two groups of "real" and "fake" or is it going to be one continuous distribution? What does "false positive" even mean if there are no "fake" effects, unless it refers purely to the direction of the effect? Perhaps the crud factor is wrong, at least when it comes to causal effects? Perhaps the pragmatic solution is to declare that all effects with, say, d<.1 are fake and the rest are real? Or maybe we should just go full Bayesian?

Anyway, let's pretend the previous paragraph never happened. Where do we find the prior? There are a few different approaches, and they're all problematic.11

The exact number doesn't really matter that much (there's nothing we can do about it), so I'm going to go ahead and use a prior of 25% for the calculations below. The main takeaways don't change with a different prior value.

Now the only thing we're missing is the power of the typical social science study. To determine that we need to know 1) sample sizes (easy), and 2) the effect size of true effects (not so easy).14 I'm going to use the results of extremely high-powered, large-scale replication efforts:

Surprisingly large, right? We can then use the power estimates in Szucs & Ioannidis (2017): they give an average power of .49 for "medium effects" (d=.5) and .71 for "large effects" (d=.8). Let's be conservative and split the difference.

With a prior of 25%, power of 60%, and α=5%, PPV is equal to 80%. Assuming no fraud and no QRPs, 20% of positive findings will be false.

These averages hide a lot of heterogeneity: it's well-established that studies of large effects are adequately powered whereas studies of small effects are underpowered, so the PPV is going to be smaller for small effects. There are also large differences depending on the field you're looking at. The lower the power the bigger the gains to be had from increasing sample sizes.

This is what PPV looks like for the full range of prior/power values, with α=5%:

At the current prior/power levels, PPV is more sensitive to the prior: we can only squeeze small gains out of increasing power. That's a bit of a problem given the fact that increasing power is relatively easy, whereas increasing the chance that the effect you're investigating actually exists is tricky, if not impossible. Ultimately scientists want to discover surprising results—in other words, results with a low prior.

I made a little widget so you can play around with the values:

[Interactive widget: sliders for alpha (0.05), power (0.5), and prior (0.25); displays false/true positives, false/true negatives, and the resulting PPV.]

Assuming a 25% prior, increasing power from 60% to 90% would require more than twice the sample size and would only increase PPV by 5.7 percentage points. It's something, but it's no panacea. However, there is something else we could do: sample size is a budget, and we can allocate that budget either to higher power or to a lower significance cutoff. Lowering alpha is far more effective at reducing the false discovery rate.15

Let's take a look at four different power/alpha scenarios, assuming a 25% prior and d=0.5 effect size.16 The required sample sizes are for a one-sided t-test.

False Discovery Rate (α = 0.05 vs α = 0.005):

  • Power 0.5: 23.1% vs 2.9%
  • Power 0.8: 15.8% vs 1.8%

Required Sample Size (α = 0.05 vs α = 0.005):

  • Power 0.5: 45 vs 110
  • Power 0.8: 100 vs 190
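These tables can be roughly reproduced in a few lines. A sketch assuming a 25% prior and d=0.5, using statsmodels' power solver for a one-sided two-sample t-test; sample sizes are totals across both groups and should land close to the table's values:

```python
# Sketch of the scenario tables: prior = 25%, d = 0.5, one-sided two-sample t-test.
from statsmodels.stats.power import TTestIndPower

def fdr(prior, power, alpha):
    """Share of significant findings that are false positives (i.e. 1 - PPV)."""
    false_pos = (1 - prior) * alpha
    true_pos = prior * power
    return false_pos / (false_pos + true_pos)

solver = TTestIndPower()
for power in (0.5, 0.8):
    for alpha in (0.05, 0.005):
        n_per_group = solver.solve_power(effect_size=0.5, alpha=alpha,
                                         power=power, alternative="larger")
        print(f"power={power}, alpha={alpha}: FDR={fdr(0.25, power, alpha):.1%}, "
              f"total N~{round(2 * n_per_group)}")
```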

To sum things up: power levels are decent on average and improving them wouldn't do much. Power increases should be focused on studies of small effects. Lowering the significance cutoff achieves much more for the same increase in sample size.

Field of Dreams

Before we got to see any of the actual Replication Markets studies, we voted on the expected replication rates by field. Gordon et al. (2020) has that data:

This is what the predictions looked like after seeing the papers:

Economics is Predictably Good

Economics topped the charts in terms of expectations, and it was by far the strongest field. There are certainly large improvements to be made—a 2/3 replication rate is not something to be proud of. But reading their papers you get the sense that at least they're trying, which is more than can be said of some other fields. 6 of the top 10 economics journals participated, and they did quite well: QJE is the behemoth of the field and it managed to finish very close to the top. A unique weakness of economics is the frequent use of absurd instrumental variables. I doubt there's anyone (including the authors) who is convinced by that stuff, so let's cut it out.

EvoPsych is Surprisingly Bad

You were supposed to destroy the Sith, not join them!

Going into this, my view of evolutionary psychology was shaped by people like Cosmides, Tooby, DeVore, Boehm, and so on. You know, evolutionary psychology! But the studies I skimmed from evopsych journals were mostly just weak social psychology papers with an infinitesimally thin layer of evolutionary paint on top. Few people seem to take the "evolutionary" aspect really seriously.

Also underdetermination problems are particularly difficult in this field and nobody seems to care.

Education is Surprisingly Good

Education was expected to be the worst field, but it ended up being almost as strong as economics. When it came to interventions there were lots of RCTs with fairly large samples, which made their claims believable. I also got the sense that p-hacking is more difficult in education: there's usually only one math score which measures the impact of a math intervention, there's no early stopping, etc.

However, many of the top-scoring papers were trivial (eg "there are race differences in science scores"), and the field has a unique problem which is not addressed by replication: educational intervention effects are notorious for fading out after a few years. If the replications waited 5 years to follow up on the students, things would look much, much worse.

Demography is Good

Who even knew these people existed? Yet it seems they do (relatively) competent work. googles some of the authors Ah, they're economists. Well.

Criminology Should Just Be Scrapped

If you thought social psychology was bad, you ain't seen nothin' yet. Other fields have a mix of good and bad papers, but criminology is a shocking outlier. Almost every single paper I read was awful. Even among the papers that are highly likely to replicate, it's de rigueur to confuse correlation for causation.

If we compare criminology to, say, education, the headline replication rates look similar-ish. But the designs used in education (typically RCT, diff-in-diff, or regression discontinuity) are at least in principle capable of detecting the effects they're looking for. That's not really the case for criminology. Perhaps this is an effect of the (small number of) specific journals selected for RM, and there is more rigorous work published elsewhere.

There's no doubt in my mind that the net effect of criminology as a discipline is negative: to the extent that public policy is guided by these people, it is worse. Just shameful.

Marketing/Management

In their current state these are a bit of a joke, but I don't think there's anything fundamentally wrong with them. Sure, some of the variables they use are a bit fluffy, and of course there's a lack of theory. But the things they study are a good fit for RCTs, and if they just quintupled their sample sizes they would see massive improvements.

Cognitive Psychology

Much worse than expected. Cognitive psychology generally has a reputation as one of the more solid subdisciplines of psychology, and it has done well in previous replication projects. Not sure what went wrong here. It's only 50 papers and they're all from the same journal, so perhaps it's simply an unrepresentative sample.

Social Psychology

More or less as expected. All the silly stuff you've heard about is still going on.

Limited Political Hackery

Some of the most highly publicized social science controversies of the last decade happened at the intersection between political activism and low scientific standards: the implicit association test,17 stereotype threat, racial resentment, etc. I thought these were representative of a wider phenomenon, but in reality they are exceptions. The vast majority of work is done in good faith.

While blatant activism is rare, there is a more subtle background ideological influence which affects the assumptions scientists make, the types of questions they ask, and how they go about testing them. It's difficult to say how things would be different under the counterfactual of a more politically balanced professoriate, though.

Interaction Effects Bad

A paper whose main finding is an interaction effect is about 10 percentage points less likely to replicate. Interaction effects are not inherently wrong; sometimes they're theoretically justified. But all too often you'll see blatant fishing expeditions with a dozen double and triple ad hoc interactions thrown into the regression. They make it easy to do naughty things and tend to be underpowered.

Nothing New Under the Sun

All is mere breath, and herding the wind.

The replication crisis did not begin in 2010, it began in the 1950s. All the things I've written above have been written before, by respected and influential scientists. They made no difference whatsoever. Let's take a stroll through the museum of metascience.

Sterling (1959) analyzed psychology articles published in 1955-56 and noted that 97% of them rejected their null hypothesis. He found evidence of a huge publication bias, and a serious problem with false positives which was compounded by the fact that results are "seldom verified by independent replication".

Nunnally (1960) noted various problems with null hypothesis testing, underpowered studies, over-reliance on student samples (it doesn't take Joe Henrich to notice that using Western undergrads for every experiment might be a bad idea), and much more. The problem (or excuse) of publish-or-perish, which some portray as a recent development, was already in place by this time.18

The "reprint race" in our universities induces us to publish hastily-done, small studies and to be content with inexact estimates of relationships.

Jacob Cohen (of Cohen's d fame) in a 1962 study analyzed the statistical power of 70 psychology papers: he found that underpowered studies were a huge problem, especially for those investigating small effects. Successive studies by Sedlmeier & Gigerenzer in 1989 and Szucs & Ioannidis in 2017 found no improvement in power.

If we then accept the diagnosis of general weakness of the studies, what treatment can be prescribed? Formally, at least, the answer is simple: increase sample sizes.

Paul Meehl (1967) is highly insightful on problems with null hypothesis testing in the social sciences, the "crud factor", lack of theory, etc. Meehl (1970) brilliantly skewers the erroneous (and still common) tactic of automatically controlling for "confounders" in observational designs without understanding the causal relations between the variables. Meehl (1990) is downright brutal: he highlights a series of issues which, he argues, make psychological theories "uninterpretable". He covers low standards, pressure to publish, low power, low prior probabilities, and so on.

I am prepared to argue that a tremendous amount of taxpayer money goes down the drain in research that pseudotests theories in soft psychology and that it would be a material social advance as well as a reduction in what Lakatos has called “intellectual pollution” if we would quit engaging in this feckless enterprise.

Rosenthal (1979) covers publication bias and the problems it poses for meta-analyses: "only a few studies filed away could change the combined significant result to a nonsignificant one". Cole, Cole & Simon (1981) present experimental evidence on the evaluation of NSF grant proposals: they find that luck plays a huge factor as there is little agreement between reviewers.

I could keep going to the present day with the work of Goodman, Gelman, Nosek, and many others. There are many within the academy who are actively working on these issues: the CASBS Group on Best Practices in Science, the Meta-Research Innovation Center at Stanford, the Peer Review Congress, the Center for Open Science. If you click those links you will find a ton of papers on metascientific issues. But there seems to be a gap between awareness of the problem and implementing policy to fix it. You've got tons of people doing all this research and trying to repair the broken scientific process, while at the same time journal editors won't even retract blatantly fraudulent research.

There is even a history of government involvement. In the 70s there were battles in Congress over questionable NSF grants, and in the 80s Congress (led by Al Gore) was concerned about scientific integrity, which eventually led to the establishment of the Office of Scientific Integrity. (It then took the federal government another 11 years to come up with a decent definition of scientific misconduct.) After a couple of embarrassing high-profile prosecutorial failures they more or less gave up, but they still exist today and prosecute about a dozen people per year.

Generations of psychologists have come and gone and nothing has been done. The only difference is that today we have a better sense of the scale of the problem. The one ray of hope is that at least we have started doing a few replications, but I don't see that fundamentally changing things: replications reveal false positives, but they do nothing to prevent those false positives from being published in the first place.

What To Do

The reason nothing has been done since the 50s, despite everyone knowing about the problems, is simple: bad incentives. The best cases for government intervention are collective action problems: situations where the incentives for each actor cause suboptimal outcomes for the group as a whole, and it's difficult to coordinate bottom-up solutions. In this case the negative effects are not confined to academia, but overflow to society as a whole when these false results are used to inform business and policy.

Nobody actually benefits from the present state of affairs, but you can't ask isolated individuals to sacrifice their careers for the "greater good": the only viable solutions are top-down, which means either the granting agencies or Congress (or, as Scott Alexander has suggested, a Science Czar). You need a power that sits above the system and has its own incentives in order: this approach has already had success with requirements for pre-registration and publication of clinical trials. Right now I believe the most valuable activity in metascience is not replication or open science initiatives but political lobbying.19

  • Earmark 60% of funding for registered reports (ie accepted for publication based on the preregistered design only, not results). For some types of work this isn't feasible, but for ¾ of the papers I skimmed it's possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.20
  • Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.
  • Earmark 1% of funding for progress studies. Including metascientific research that can be used to develop a serious science policy in the future.
  • Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so. Lowering alpha is generally preferable to increasing power. "But Alvaro, doesn't that mean that fewer grants would be funded?" Yes.
  • Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.
  • Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let's have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.
  • Financial incentives for universities and journals to police fraud. It's not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart's law!
  • Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway, it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you're golden. Without the crutch of "high ranked journals" maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can't shift the incentives: academics want to publish in "high-impact" journals, and journals want to selectively publish "high-impact" research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.
  • Have authors bet on replication of their research. Give them fixed odds, say 1:4—if it's good work, it's +EV for them. This sounds a bit distasteful, so we could structure the same cashflows as a "bonus grant" from the NSF when a paper you wrote replicates successfully.22

And a couple of points that individuals can implement today:

  • Just stop citing bad research, I shouldn't need to tell you this, jesus christ what the fuck is wrong with you people.
  • Read the papers you cite. Or at least make your grad students do it for you. It doesn't need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they're doing something unusual. It won't take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.23
  • When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.)24 If we're going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don't need to be complicit in the publication of false claims.
  • Stop assuming good faith. I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

...My Only Friend, The End

The first draft of this post had a section titled "Some of My Favorites", where I listed the silliest studies in the sample. But I removed it because I don't want to give the impression that the problem lies with a few comically bad papers in the far left tail of the distribution. The real problem is the median.

It is difficult to convey just how low the standards are. The marginal researcher is a hack and the marginal paper should not exist. There's a general lack of seriousness hanging over everything—if an undergrad cites a retracted paper in an essay, whatever; but if this is your life's work, surely you ought to treat the matter with some care and respect.

Why is the Replication Markets project funded by the Department of Defense? If you look at the NSF's 2019 Performance Highlights, you'll find items such as "Foster a culture of inclusion through change management efforts" (Status: "Achieved") and "Inform applicants whether their proposals have been declined or recommended for funding in a timely manner" (Status: "Not Achieved"). Pusillanimous reports repeat tired clichés about "training", "transparency", and a "culture of openness" while downplaying the scale of the problem and ignoring the incentives. No serious actions have followed from their recommendations.

It's not that they're trying and failing—they appear to be completely oblivious. We're talking about an organization with an 8 billion dollar budget that is responsible for a huge part of social science funding, and they can't manage to inform people that their grant was declined! These are the people we must depend on to fix everything.

When it comes to giant bureaucracies it can be difficult to know where (if anywhere) the actual power lies. But a good start would be at the top: NSF director Sethuraman Panchanathan, SES division director Daniel L. Goroff, NIH director Francis S. Collins, and the members of the National Science Board. The broken incentives of the academy did not appear out of nowhere, they are the result of grant agency policies. Scientists and the organizations that represent them (like the AEA and APA) should be putting pressure on them to fix this ridiculous situation.

The importance of metascience is inversely proportional to how well normal science is working, and right now it could use some improvement. The federal government spends about $100b per year on research, but we lack a systematic understanding of scientific progress, we lack insight into the forces that underlie the upward trajectory of our civilization. Let's take 1% of that money and invest it wisely so that the other 99% will not be pointlessly wasted. Let's invest it in a robust understanding of science, let's invest it in progress studies, let's invest it in—the future.


Thanks to Alexey Guzey and Dormin for their feedback. And thanks to the people at SCORE and the Replication Markets team for letting me use their data and for running this unparalleled program.


  1. Dreber et al. (2015), Using prediction markets to estimate the reproducibility of scientific research.
    Camerer et al. (2018), Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015.
  2. The distribution is bimodal because of the way p-values are typically reported: there's a huge difference between p<.01 and p<.001. If actual p-values were reported instead of cutoffs, the distribution would be unimodal.
  3. Even laypeople are half-decent at it.
  4. Ludwik Fleck has an amusing bit on the development of anatomy: "Simple lack of 'direct contact with nature' during experimental dissection cannot explain the frequency of the phrase "which becomes visible during autopsy" often accompanying what to us seem the most absurd assertions."
  5. Another possible explanation is that importance is inversely related to replication probability. In my experience that is not the case, however. If anything it's the opposite: important effects tend to be large effects, and large effects tend to replicate. In general, any "conditioning on a collider"-type explanation doesn't work here because these citations also continue post-retraction.
  6. Some more:
  7. Though I must admit that after reading the papers myself I understand why they would shy away from the task.
  8. I can tell you what is rewarded with citations though: papers in which the authors find support for their hypothesis.
  9. Perhaps I don't understand the situation at places like the NSF or the ESRC but the problem seems to be incompetence (or a broken bureaucracy?) rather than misaligned incentives.
  10. Theoretically there's the possibility of overpowered studies being a problem. Meehl (1967) argues that 1) everything in psychology is correlated (the "crud factor"), and 2) theories only make directional predictions (as opposed to point predictions in eg physics). So as power increases the probability of finding a significant result for a directional prediction approaches 50% regardless of what you're studying.
  11. In medicine there are plenty of cohort-based publication bias analyses, but I don't think we can generalize from those to the social sciences.
  12. But RRs are probably not representative of the literature, so this is an overestimate. And who knows how many unpublished pilot studies are behind every RR?
  13. Dreber et al. (2015) use prediction market probabilities and work backward to get a prior of 9%, but this number is based on unreasonable assumptions about false positives: they don't take into account fraud and QRPs. If priors were really that low, the entire replication crisis would be explained purely by normal sampling error: no QRPs!
  14. Part of the issue is that the literature is polluted with a ton of false results, which actually pushes estimates of true effect sizes downwards. There's an unfortunate tendency to lump together effect sizes of real and non-existent effects (eg Many Labs 2: "ds were 0.60 for the original findings and 0.15 for the replications"), but that's a meaningless number.
  15. False negatives are bad too, but they're not as harmful as false positives. Especially since they're almost never published. Also, there's been a ton of stuff written on lowering alpha; a good starting point is Redefine Statistical Significance.
  16. These figures actually understate the benefit of a lower alpha, because it would also change the calculus around p-hacking. With an alpha of 5%, getting a false positive is quite easy. Simply stopping data collection once you have a significant result has a hit rate of over 20%! Add some dredging and HARKing to that and you can squeeze a result out of anything. With a lower alpha, the chances of p-hacking success will be vastly lower and some researchers won't even bother trying.
  17. The original IAT paper is worth revisiting. You only really need to read page 1475. The construct validity evidence is laughable. The whole thing is based on N=26 and they find no significant correlation between the IAT and explicit measures of racism. But that's OK, Greenwald says, because the IAT is meant to find secret racists ("reveal explicitly disavowed prejudice")! The question of why a null correlation between implicit and explicit racial attitudes is to be expected is left as an exercise to the reader. The correlation between two racial IATs (male and female names) is .46 and they conveniently forget to mention the comically low test-retest reliability. That's all you need for 13k citations and a consulting industry selling implicit bias to the government for millions of dollars.
  18. I suspect psychologists today would laugh at the idea of the 1960s being an over-competitive environment. Personally I highly doubt that this situation can be blamed on high (or increasing) productivity.
  19. You might ask: well, why haven't the independent grant agencies already fixed the problem then? I'm not sure if it's a lack of competence, or caring, or power, or something else. But I find Garrett Jones' arguments on the efficacy of independent government agencies convincing: this model works well in other areas.
  20. "But Alvaro, what if I make an unexpected discovery during my investigation?" Well, you start writing a new registered report, and perhaps publish it as an exploratory result. You may not like it, but that's how we protect against false positives. In cases where only one dataset is available (eg historical data) we must rely on even stricter standards of evidence, to protect against multiple testing.
  21. Another idea to steal from the SEC: whistleblower rewards.
  22. This would be immediately exploited by publishing a bunch of trivial results. But that's a solvable problem. In any case, it's much better to have systematic, automatic mechanisms instead of relying on subjective factors and prosecution of individual cases.
  23. I believe the SCORE program intends to use the data from Replication Markets to train an ML model that predicts replicability. If scientists had the ability to just run that on every reference in their papers, perhaps they could go back to not reading what they cite.
  24. Looking at Replication Markets data, about 1 in 4 studies with p>.001 had more than a 50% chance to replicate. Of course I'd consider 50-50 odds far too low a threshold, but you have to start somewhere. "But Alvaro, science is not done paper by paper, it is a cumulative enterprise. We should publish marginal results, even if they're probably not true. They are pieces of evidence that, brick by brick, raise the vast edifice that we call scientific knowledge". In principle this is a good argument: publish everything and let the meta-analyses sort it out. But given the reality of publication bias we must be selective. If registered reports became the standard, this problem would not exist.



How Many Undetected Frauds in Science?

0.04% of papers are retracted. At least 1.9% of papers have duplicate images "suggestive of deliberate manipulation". About 2.5% of scientists admit to fraud, and they estimate that 10% of other scientists have committed fraud. 27% of postdocs said they were willing to select or omit data to improve their results. More than 50% of published findings in psychology are false. The ORI, which makes about 13 misconduct findings per year, gives a conservative estimate of over 2000 misconduct incidents per year.

That's a wide range of figures, and all of them suffer from problems if we try to use them as estimates of the real rate of fraud. While the vast majority of false published claims are not due to fabrication, it's clear that there is a huge iceberg of undiscovered fraud hiding underneath the surface.

Part of the issue is that the limits of fraud are unclear. While fabrication/falsification are easy to adjudicate, there's a wide range of quasi-fraudulent but quasi-acceptable "Questionable Research Practices" (QRPs) such as HARKing which result in false claims being presented as true. Publishing a claim that has a ~0%1 chance of being true is the worst thing in the world, but publishing a claim that has a 15% chance of being true is a totally normal thing that perfectly upstanding scientists do. Thus the literature is inundated by false results that are nonetheless not "fraudulent". Personally I don't think there's much of a difference.

There are two main issues with QRPs: first, there's no clear line in the sand, which makes it difficult to single out individuals for punishment. Second, the majority of scientists engage in QRPs. In fact they have been steeped in an environment full of bad practices for so long that they are no longer capable of understanding that they are behaving badly:

Let him who is without QRPs cast the first stone.

The case of Brian Wansink (who committed both clear fraud and QRPs) is revealing: in the infamous post that set off his fall from grace, he brazenly admitted to extreme p-hacking. The notion that any of this was wrong had clearly never crossed his mind: he genuinely believed he was giving useful advice to grad students. When commenters pushed back, he justified himself by writing that "P-hacking shouldn’t be confused with deep data dives".

Anyway, here are some questions that might help us determine the size of the iceberg:

  • Are uncovered frauds high-quality, or do we only have the ability to find low-hanging fruit?
  • Are frauds caught quickly, or do they have long careers before anyone finds out?
  • Are scientists capable of detecting fraud or false results in general (regardless of whether they are produced by fraud, QRPs, or just bad luck)?
  • How much can we rely on whistleblowers?

Quality

Here's an interesting case recently uncovered by Elisabeth Bik: 8 different published, peer-reviewed papers, by different authors, on different subjects, with literally identical graphs. The laziness is astonishing! It would take just a few minutes to write an R script that generates random data so that each fake paper could at least have unique charts. But the paper mill that wrote these articles won't even do that. This kind of extreme sloppiness is a recurring theme when it comes to frauds that have been caught.

In general the image duplication that Bik uncovers tends to be rather lazy: people just copy paste to their heart's content and hope nobody will notice (and peer reviewers and editors almost certainly won't notice).

The Bell Labs physicist Jan Hendrik Schön was found out because he used identical graphs for multiple, completely different experiments.

This guy not only copy-pasted a ton of observations, he forgot to delete the Excel sheet he used to fake the data! Managed to get three publications out of it.

Back to Wansink again: he was smart enough not to copy-paste charts, but he made other stupid mistakes. For example in one paper (The office candy dish) he reported impossible means and test statistics (detected through granularity testing). If he had just bothered to create a plausible sample instead of directly fiddling with summary statistics, there's a good chance he would not have been detected. (By the way, the paper has not been retracted, and continues to be cited. I Fucking Love Science!)
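Granularity testing rests on a simple observation: with n participants answering on an integer scale, the mean can only take values of the form k/n, so some reported means are simply impossible for the stated sample size. A minimal sketch of the idea (the numbers below are illustrative, not Wansink's):

```python
def grim_consistent(reported_mean, n, decimals=2, max_per_item=100):
    """True if some integer total of n integer responses rounds to the reported mean."""
    # Assumes standard rounding of k/n; published papers sometimes round half-up,
    # which can differ slightly from Python's round().
    return any(round(k / n, decimals) == round(reported_mean, decimals)
               for k in range(0, n * max_per_item + 1))

print(grim_consistent(3.44, 18))  # True: 62/18 = 3.444...
print(grim_consistent(3.41, 18))  # False: no integer total of 18 responses rounds to 3.41
```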

In general Wansink comes across as a moron, yet he managed to amass hundreds of publications, 30k+ citations, and half a dozen books. What percentile of fraud competence do you think Wansink represents?

The point is this: generating plausible random numbers is not that difficult! Especially considering the fact that these are intelligent people with extensive training in science and statistics. It seems highly likely that there are more sophisticated frauds out there.

Speed

Do frauds manage to have long careers before they get caught? I don't think there's any hard data on this (though someone could probably compile it with the Retraction Watch database). Obviously the highest-profile frauds are going to be those with a long history, so we have to be careful not to be misled. Perhaps there's a vast number of fraudsters who are caught immediately.

Overall the evidence is mixed. On the one hand, a relatively small number of researchers account for a fairly large proportion of all retractions. So while these individuals managed to evade detection for a long time (Yoshitaka Fujii published close to 200 papers over a 25 year career), most frauds do not have such vast track records.

On the other hand just because we haven't detected fraudulent papers doesn't necessarily mean they don't exist. And repeat fraud seems fairly common: simple image duplication checks reveal that "in nearly 40% of the instances in which a problematic paper was identified, screening of other papers from the same authors revealed additional problematic papers in the literature."

Even when fraud is clearly present, it can take ages for the relevant authorities to take action. The infamous Andrew Wakefield vaccine autism paper, for example, took 12 years to retract.

Detection Ability

I've been reading a lot of social science papers lately and a thought keeps coming up: "this paper seems unlikely to replicate, but how can I tell if it's due to fraud or just bad methods?" And the answer is that in general we can't tell. In fact things are even worse, as scientists seem to be incapable of detecting even really obviously weak papers (more on this in the next post).

In cases such as Wansink's, people went over his work with a fine-tooth comb after the infamous blogpost and discovered all sorts of irregularities. But nobody caught those signs earlier. Part of the issue is that nobody's really looking for fraud when they casually read a paper. Science tends to work on a kind of honor system where everyone just assumes the best. Even if you are looking for fraud, it's time-consuming, difficult, and in many cases unclear. The evidence tends to be indirect: noticing that two subgroups are a bit too similar, or that the effects of an intervention are a bit too consistent. But these can be explained away fairly easily. So unless you have a whistleblower it's often difficult to make an accusation.

The case of the 5-HTTLPR gene is instructive: as Scott Alexander explains in his fantastic literature review, a huge academic industry was built up around what should have been a null result. There are literally hundreds of non-replicating papers on 5-HTTLPR—suppose there was one fraudulent article in this haystack, how would you go about finding it?

Some frauds (or are they simply errors?) are detected using statistical methods such as the granularity testing mentioned above, or with statcheck. But any sophisticated fraud would simply check their own numbers using statcheck before submitting, and correct any irregularities.

Detecting weak research is easy. Detecting fraud and then prosecuting it is extremely difficult.

Whistleblowers

Some cases are brought to light by whistleblowers, but we can't rely on them for a variety of reasons. A survey of scientists finds that potential whistleblowers, especially those without job security, tend not to report fraud due to the potential career consequences. They understand that institutions will go to great lengths to protect frauds—do you want a career, or do you want to do the right thing?

Often there simply is no whistleblower available. Scientists are trusted to collect data on their own, and they often collaborate with people in other countries or continents who never have any contact with the data-gathering process. Under such circumstances we must rely on indirect means of detection.

South Korean celebrity scientist Hwang Woo-suk was uncovered as a fraud by a television program which used two whistleblower sources. But things only got rolling when image duplication was detected in one of his papers. Both whistleblowers lost their jobs and were unable to find other employment.

In some cases people blow the whistle and nothing happens. The report from the investigation into Diederik Stapel, for example, notes that "on three occasions in 2010 and 2011, the attention of members of the academic staff in psychology was drawn to this matter. The first two signals were not followed up in the first or second instance." By the way, these people simply noticed statistical irregularities, they never had direct evidence.

And let's turn back to Wansink once again: in the blog post that sank him, he recounted tales of instructing students to p-hack data until they found a result. Did those grad students ever blow the whistle on him? Of course not.

This is the End...

Let's say that about half of all published research findings are false. How many of those are due to fraud? As a very rough guess I'd say that for every 100 papers that don't replicate, 2.5 are due to fabrication/falsification, and 85 are due to lighter forms of methodological fraud. This would imply that about 1% of fraudulent papers are retracted.

This is both good and bad news. On the one hand, while most fraud goes unpunished, it only represents a small portion of published research. On the other hand, it means that we can't fix reproducibility problems by going after fabrication/falsification: if outright fraud completely disappeared tomorrow, it would be no more than an imperceptible blip in the replication crisis. A real solution needs to address the "questionable" methods used by the median scientist, not the fabrication used by the very worst of them.




Book Review: Science Fictions by Stuart Ritchie

In 1945, Robert Merton wrote:

There is only this to be said: the sociology of knowledge is fast outgrowing a prior tendency to confuse provisional hypothesis with unimpeachable dogma; the plenitude of speculative insights which marked its early stages are now being subjected to increasingly rigorous test.

Then, 16 years later:

After enjoying more than two generations of scholarly interest, the sociology of knowledge remains largely a subject for meditation rather than a field of sustained and methodical investigation. [...] these authors tell us that they have been forced to resort to loose generalities rather than being in a position to report firmly grounded generalizations.

In 2020, the sociology of science is stuck more or less in the same place. I am being unfair to Ritchie (who is a Merton fanboy), because he has not set out to write a systematic account of scientific production—he has set out to present a series of captivating anecdotes, and in those terms he has succeeded admirably. And yet, in the age of progress studies surely one is allowed to hope for more.

If you've never heard of Daryl Bem, Brian Wansink, Andrew Wakefield, John Ioannidis, or Elisabeth Bik, then this book is an excellent introduction to the scientific misconduct that is plaguing our universities. The stories will blow your mind. For example you'll learn about Paolo Macchiarini, who left a trail of dead patients, published fake research saying he healed them, and was then protected by his university and the journal Nature for years. However, if you have been following the replication crisis, you will find nothing new here. The incidents are well-known, and the analysis Ritchie adds on top of them is limited in ambition.

The book begins with a quick summary of how science funding and research work, and a short chapter on the replication crisis. After that we get to the juicy bits as Ritchie describes exactly how all this bad research is produced. He starts with outright fraud, and then moves on to the gray areas of bias, negligence, and hype: it's an engaging and often funny catalogue of misdeeds and misaligned incentives. The final two chapters address the causes behind these problems, and how to fix them.

The biggest weakness is that the vast majority of the incidents presented (with the notable exception of the Stanford prison experiment) occurred in the last 20 years or so. And Ritchie's analysis of the causes behind these failures also depends on recent developments: his main argument is that intense competition and pressure to publish large quantities of papers is harming their quality.

Not only has there been a huge increase in the rate of publication, there’s evidence that the selection for productivity among scientists is getting stronger. A French study found that young evolutionary biologists hired in 2013 had nearly twice as many publications as those hired in 2005, implying that the hiring criteria had crept upwards year-on-year. [...] as the number of PhDs awarded has increased (another consequence, we should note, of universities looking to their bottom line, since PhD and other students also bring in vast amounts of money), the increase in university jobs for those newly minted PhD scientists to fill hasn’t kept pace.

By only focusing on recent examples, Ritchie gives the impression that the problem is new. But that's not really the case. One can go back to the 60s and 70s and find people railing against low standards, underpowered studies, lack of theory, publication bias, and so on. Imre Lakatos, in an amusing series of lectures at the London School of Economics in 1973, said that "the social sciences are on a par with astrology, it is no use beating about the bush."

Let's play a little game. Go to the Journal of Personality and Social Psychology (one of the top social psych journals) and look up a few random papers from the 60s. Are you going to find rigorous, replicable science from a mythical era when valiant scientists followed Mertonian norms and were not incentivized to spew out dozens of mediocre papers every year? No, you're going to find exactly the same p<.05, tiny N, interaction effect, atheoretical bullshit. The only difference being the questionable virtue of low productivity.

If the problem isn't new, then we can't look for the causes in recent developments. If Ritchie had moved beyond "loose generalities" to a more systematic analysis of scientific production I think he would have presented a very different picture. The proposals at the end mostly consist of solutions that are supposed to originate from within the academy. But they've had more than half a century to do that—it feels a bit naive to think that this time it's different.

Finally, is there light at the end of the tunnel?

...after the Bem and Stapel affairs (among many others), psychologists have begun to engage in some intense soul-searching. More than perhaps any other field, we’ve begun to recognise our deep-seated flaws and to develop systematic ways to address them – ways that are beginning to be adopted across many different disciplines of science.

Again, the book is missing hard data and analysis. I used to share his view (surely after all the publicity of the replication crisis, all the open science initiatives, all the "intense soul searching", surely things must change!) but I have now seen some data which makes me lean in the opposite direction. Check back toward the end of August for a post on this issue.

Ritchie's view of science is almost romantic: he goes on about the "nobility" of research and the virtues of Mertonian norms. But the question of how conditions, incentives, competition, and even the Mertonian norms themselves actually affect scientific production is an empirical matter that can and should be investigated systematically. It is time to move beyond "speculative insights" and on to "rigorous testing", exactly in the way that Merton failed to do.




Links Q2 2020

Tyler Cowen reviews Status and Beauty in the Global Party Circuit. "In this world, girls function as a form of capital." The podcast is good too.

Lots of good info on education: Why Conventional Wisdom on Education Reform is Wrong (a primer)

Scott Alexander on the life of Herbert Hoover.

Longer-Run Economic Consequences of Pandemics [speculative]:

Measured by deviations in a benchmark economic statistic, the real natural rate of interest, these responses indicate that pandemics are followed by sustained periods—over multiple decades—with depressed investment opportunities, possibly due to excess capital per unit of surviving labor, and/or heightened desires to save, possibly due to an increase in precautionary saving or a rebuilding of depleted wealth.

Do cognitive biases go away when the stakes are high? A large pre-registered study with very high stakes finds that effort increases significantly but performance does not.

Disco Elysium painting turned into video using AI.

Long-run consequences of the pirate attacks on the coasts of Italy: "in 1951 Rome would have been 15% more populous without piracy."

“A” Business by Any Other Name: Firm Name Choice as a Signal of Firm Quality (2014): "The average plumbing firm whose name begins with A or a number receives five times more service complaints than other firms and also charges higher prices."

Yarkoni: The Generalizability Crisis [in psychology].

Lakens: Review of "The Generalizability Crisis" by Tal Yarkoni.

Yarkoni: Induction is not optional (if you’re using inferential statistics): reply to Lakens.

Estimating the deep replicability of scientific findings using human and artificial intelligence - ML model does about as well as prediction markets when it comes to predicting replication success. "the model’s accuracy is higher when trained on a paper’s text rather than its reported statistics and that n-grams, higher order word combinations that humans have difficulty processing, correlate with replication." Also check out the horrific Fig 1.

Wearing a weight vest leads to weight loss, fairly huge (suspiciously huge?) effect size. The hypothesized mechanism is the "gravitostat": your body senses how heavy you are and adjusts accordingly.

Tyler Cowen on uni- vs multi-disciplinary policy advice in the time of Corona.

...and here's Señor Coconut, "A Latin Tribute to Kraftwerk". Who knew "Autobahn" needed a marimba?




Memetic Defoundation

The bunny ears sign used to be a way of calling someone a cuck. In fact they're not bunny ears at all, they're cuckold horns. The original meaning has been lost, and today clueless children across the world use it as nothing more than a vaguely teasing gesture. This is an amusing case of a wider phenomenon I like to call memetic defoundation.

A general formulation would look something like this:

  • Start with a couple of ideas of the form "[foundation] therefore [meme]"1
  • [foundation] is forgotten, disproved, or rendered obsolete
  • [meme] persists regardless

Dead beliefs

Organizational decay is a hotspot for memetic defoundation. Luttwak tells us of a unit in the Rhine legions led by a Praefectus Militum Balistariorum long after the Roman army had lost the ability to construct and use ballistae. Gene Wolfe uses this effect in The Book of the New Sun to evoke the image of an ancient, ossified, slowly crumbling civilization: my favorite example is a prison called the "antechamber" where the inmates are still served coffee and pastries every morning.

E. R. Dodds offers another example in The Greeks and the Irrational, where he describes the decline of religion in Hellenistic times:

Gods withdraw, but their rituals live on, and no one except a few intellectuals notices that they have ceased to mean anything.

Scott Alexander comments on the relation between science and policy: "The science did a 180, but the political implications stayed exactly the same."

John Stuart Mill writes that memetic defoundation "is illustrated in the experience of almost all ethical doctrines and religious creeds" and argues that free speech is necessary to prevent it, as open debate preserves the arguments behind ideas:2

If, however, the mischievous operation of the absence of free discussion, when the received opinions are true, were confined to leaving men ignorant of the grounds of those opinions, it might be thought that this, if an intellectual, is no moral evil, and does not affect the worth of the opinions, regarded in their influence on the character. The fact, however, is, that not only the grounds of the opinion are forgotten in the absence of discussion, but too often the meaning of the opinion itself. The words which convey it, cease to suggest ideas, or suggest only a small portion of those they were originally employed to communicate. Instead of a vivid conception and a living belief, there remain only a few phrases retained by rote; or, if any part, the shell and husk only of the meaning is retained, the finer essence being lost. [...] Truth, thus held, is but one superstition the more, accidentally clinging to the words which enunciate a truth.

Sometimes a meme will spread because it captures a true relation, but will use an unrelated foundation to do so. Greg Cochran suggests that Christian Science (a sect that avoids all medical care) developed as a response to the high fatality rates of pre-modern medicine. But the meme only spread when the foundation was put in theological rather than medical terms. What really matters for defoundation is the implicit relation that is captured (pseudoscientific medicine → avoid medical care) rather than the explicit one (sickness results from spiritual error → avoid medical care). When medicine improved, the true basis of the meme was gone, but of course that did nothing to change people's religious beliefs.

Finally, many (including Schumpeter,3 Santayana,4 and Saint Max5) have identified an instance of memetic defoundation in the relation between Protestantism and political liberalism (in the most general sense of the word). In broad strokes, the argument is that liberalism dropped God but kept the Protestant morality. Moldbug6 erroneously places this transition after WWII, while Barzun argues it happened 300 years earlier.7 Tom Holland thinks this is an awesome development,8 while others are more skeptical. My old buddy Freddie makes the same diagnosis in Twilight of the Idols:

In England, in response to every little emancipation from theology one has to reassert one’s position in a fear-inspiring manner as a moral fanatic. That is the penance one pays there. – With us it is different. When one gives up Christian belief one thereby deprives oneself of the right to Christian morality. For the latter is absolutely not self-evident: one must make this point clear again and again, in spite of English shallowpates. Christianity is a system, a consistently thought out and complete view of things. If one breaks out of it a fundamental idea, the belief in God, one thereby breaks the whole thing to pieces: one has nothing of any consequence left in one’s hands. Christianity presupposes that man does not know, cannot know what is good for him and what evil: he believes in God, who alone knows. Christian morality is a command: its origin is transcendental; it is beyond all criticism, all right to criticize; it possesses truth only if God is truth – it stands or falls with the belief in God. – If the English really do believe they will know, of their own accord, ‘intuitively’, what is good and evil; if they consequently think they no longer have need of Christianity as a guarantee of morality; that is merely the consequence of the ascendancy of Christian evaluation and an expression of the strength and depth of this ascendancy: so that the origin of English morality has been forgotten, so that the highly conditional nature of its right to exist is no longer felt.

Things are in the saddle

Which brings us to the question of how memetic defoundation happens. In Nietzsche's model you start with the foundation and the meme is derived from it, but once the ideas have been entrenched deeply enough, the foundation can evaporate without affecting the meme. Just as a fish doesn't notice water, people no longer notice the assumptions behind their beliefs. I call this the foundation-first model.

But I think he's wrong: in some cases, including the question of Christianity, the correct approach is a meme-first model. In this view, the foundation is simply a post-hoc justification (or a spandrel) glued onto a preëxisting meme. That is not to say the foundation is irrelevant, just that its role in supporting the meme is viral rather than logical.

Where did the meme come from? In his brilliant essay The Three Sources of Human Values, Hayek argues that ideas come from three sources:

  1. Consciously directed rational thought
  2. Biology
  3. Cultural evolution

We can use this classification to look at memetic defoundation. The first case is the easiest: the Roman army uses siege weapons, so someone in charge creates a siege unit and a Praefectus to lead it (a clear foundation-first instance). Eventually it loses those capabilities, but the structure remains.

Biologically instilled tendencies and values are more challenging to analyze: their aims tend to be inaccessible to introspection or hidden through self-deception. And they are not necessarily moral judgements: it could be something as simple as folkbiological classifications predisposed to certain patterns, which then influence values.9

Behaviors and social structures generated by cultural evolution also tend to be opaque: they were created by a process of random variation and selection, then sustained by a distributed system of knowledge accumulation and replication—no individual understands how they work (and they generally don't even try to, simply attributing them to custom or one's ancestors). Henrich details how the tendency of modern westerners to search for causal, explicable reasons is an anomaly.

Even when we try, we don't always succeed: the age of reason didn't necessarily make culturally evolved behaviors transparent. For example, traditional societies in the New World had various processes for nixtamalizing corn before eating it, which makes the niacin nutritionally available and prevents the disease of pellagra. It took until the 1940s(!) and hundreds of thousands of deaths before scientists finally understood the problem. And that's a simple nutritional issue rather than a question of complex social organization. As Scott Alexander puts it:

Our ancestors lived in Epistemic Hell, where they had to constantly rely on causally opaque processes with justifications that couldn’t possibly be true, and if they ever questioned them then they might die.

In a world filled with vital customs and weak explanations it's important to make sure nobody ever questions tradition—thus it is safeguarded by indoctrination, preference falsification,10 ostracism, or the promise of divine punishment. And now we have a second level of selective forces which are shaped by the needs of the memes: they mould their biological and social substrate to maximize their spread. And what are the traits they select for? Conformity, homogeneity, mimesis, self-ignorance, lack of critical thought: the herd-instinct. An overbearing society for a myopic, servile species domesticated under the yoke of ideas. That is the price we pay for the "secret of our success".11

Now consider what happens after a rapid shift in our environment (such as the introduction of agriculture, large-scale hierarchical societies, or the industrial revolution): both biological and cultural evolution are slow processes, and the latter has built-in safeguards to prevent modification. That is how we end up with a lag of ideas: baseless memes designed for a different habitat. Like a saltwater fish thrown in a lake, modern man depends on ideas he thinks are universal when they are really made for a different time and place. Hayek:

The gravest deficiency of the older prophets was their belief that the intuitively perceived ethical values, divined out of the depth of man's breast, were immutable and eternal.

What kind of ideas are most likely to take hold? "Doctrines intrinsically fitted to make the deepest impression upon the mind"12 that also increase fitness. Successful cultural adaptations tend to capture true relations, in false yet convincing ways. This is why religious memes are particularly susceptible to defoundation, and why most defoundation is meme-first. While many of these ideas may appear altruistic, they are really "subtly selfish" as George Williams put it—otherwise they would not have survived.

For example, G. E. Pugh in The Biological Origin of Human Values talks about the ubiquitous sharing norms in primitive human societies. Christopher Boehm in Hierarchy in the Forest (a work that blatantly plagiarizes Nietzsche) discusses the "egalitarian ethos" of primitive societies and its evolutionary origin. This ethos expresses itself as a "drive to parity", which became possible to enforce with the evolution of tool use and greater coordination abilities:

Because the united subordinates are constantly putting down the more assertive alpha types in their midst, egalitarianism is in effect a bizarre type of political hierarchy.

The collective power of resentful subordinates is at the base of human egalitarian society, and we can see important traces of this group approach in chimpanzee behavior. [...] It is obvious that symbolic communication and possession of an ethos make a very large difference for humans. Yet it would appear that the underlying emotions and behavioral orientations are similar to those of chimpanzees, as are group intimidation strategies that have the effect of terminating resented behaviors of aggressors.

To re-work Nietzsche's argument into a more plausible form: the drive to parity came first. Christian morality is simply a post-hoc justification of this innate tendency, in a highly contagious and highly effective prosocial package. God is now dead, but that does nothing to change our evolved moral intuitions, so this drive simply finds new outlets: humanism, democracy, liberalism, socialism, etc. As this shift of ideas happens, we inevitably bring along some old baggage.

The sentiments necessary to thrive in a band or a tribe are not those that we need today, but they are largely those we are stuck with. Modern civilization and its markets are inhuman and unintuitive (if not actively repulsive) and exist largely because we are able to suspend, disregard, and master our innate impulses. Seemingly new ideologies directed against the market are nothing but an atavism: the incompatibility between our innate tendencies and the external environment explains their peculiar combination of perpetual failure and perpetual popularity.

Clean sweep

Counterintuitively, the memes can be strengthened by abandoning the thing they're (supposedly) based on. You can attack Christianity-the-religion-and-ethical-system by attacking God: if morality comes from God, when you take down God you also take down his morality. But it didn't work out that way in practice: people dropped God but kept his system; where do you attack now? In theory, "that which can be asserted without evidence can be dismissed without evidence." In reality, that which is asserted without evidence is difficult to refute regardless of the evidence.13

Another issue, as I argued above, is that we don't comprehend them, either because of self-deception, limited introspection, or the blind forces of cultural evolution. The solution to both of these problems is the genealogical method. The ultimate aims of our values and customs lie in their (genetic or cultural) evolutionary history; by understanding their development we can understand their purpose and the selective forces that shaped them. Through genealogy we can reach truths we have been designed not to see.14

Which brings us back to Nietzsche. How should one argue against God? Forget the old debate tactics, he says in Daybreak 95, and just treat it as an anthropological problem:

In former times, one sought to prove that there is no God – today one indicates how the belief that there is a God arose and how this belief acquired its weight and importance: a counter-proof that there is no God thereby becomes superfluous. – When in former times one had refuted the 'proofs of the existence of God' put forward, there always remained the doubt whether better proofs might not be adduced than those just refuted: in those days atheists did not know how to make a clean sweep.

It is this approach that we should deploy against foundationless memes. Don't bother with arguments attacking the foundation or the meme itself, rather go for a "clean sweep". The case of Christian Science mentioned above is a perfect example: providing theological arguments against it is futile (and fundamentally aiming at the wrong target). But understanding how it came to be makes the situation crystal clear.

The Hansonian approach of noticing a disconnect between stated and revealed preferences is also useful for spotting these memes in the first place. Hanson combines both techniques in his analysis of The Evolution of Health Altruism.

What if some lies are useful and life-preserving? What if such lies are fundamentally necessary for societies to work well? Isn't this just a naïve overexpression of the drive to truth? That may well be the case, but just because some lies are useful does not mean that the particular lies we live by right now are the best ones. In fact the tyranny of mediocrity that flourished in our recent evolutionary past appears to be fundamentally incompatible with the modern world (not to mention the world of tomorrow). Understanding is a precondition for designing superior replacements, or as Nietzsche put it "we must become physicists in order to be able to be creators".

Genealogy allows us to understand the selective forces at play, and once we understand that we (and by we I refer to a tiny minority) have the power to overcome our self-ignorance and ingrained limitations in order to choose from a higher point of view. Not a position of "transcendent leverage", but at least an informed valuing of values, consistent with the world as it is.


  1. I deliberately avoid the use of "assumptions" and "conclusion" because they're not always assumptions and/or conclusions.
  2. He also supports an early version of steelmanning for the same purpose: "So essential is this discipline to a real understanding of moral and human subjects, that if opponents of all important truths do not exist, it is indispensable to imagine them, and supply them with the strongest arguments which the most skilful devil’s advocate can conjure up."
  3. "Though the classical doctrine of collective action may not be supported by the results of empirical analysis, it is powerfully supported by that association with religious belief to which I have adverted already. This may not be obvious at first sight. The utilitarian leaders were anything but religious in the ordinary sense of the term. In fact they believed themselves to be anti-religious and they were so considered almost universally. They took pride in what they thought was precisely an unmetaphysical attitude and they were quite out of sympathy with the religious institutions and the religious movements of their time. But we need only cast another glance at the picture they drew of the social process in order to discover that it embodied essential features of the faith of protestant Christianity and was in fact derived from that faith. For the intellectual who had cast off his religion the utilitarian creed provided a substitute for it.", Capitalism, Socialism, and Democracy
  4. "The chief fountains of this [genteel] tradition were Calvinism and transcendentalism. Both were living fountains; but to keep them alive they required, one an agonised conscience, and the other a radical subjective criticism of knowledge. When these rare metaphysical preoccupations disappeared—and the American atmosphere is not favourable to either of them—the two systems ceased to be inwardly understood; they subsisted as sacred mysteries only; and the combination of the two in some transcendental system of the universe (a contradiction in principle) was doubly artificial.", The Genteel Tradition in American Philosophy
  5. "Take notice how a “moral man” behaves, who today often thinks he is through with God and throws off Christianity as a bygone thing. [...] Much as he rages against the pious Christians, he himself has nevertheless as thoroughly remained a Christian — to wit, a moral Christian.", The Ego and His Own
  6. "Progressive Christianity, through secular theologians such as Harvey Cox, abandoned the last shreds of Biblical theology and completed the long transformation into mere socialism. [...] Creedal declarations of Universalism are not hard to find. I am fond of the Humanist Manifestos, which pretty much say it all. The UN Declaration of Human Rights is good as well. No mainline Protestant will find anything morally objectionable in any of these documents."
  7. "The outcome of what has been reviewed here—late 17C critical thought, the events of 1688, and the writings of Locke, Voltaire, and Montesquieu— may be summed up in a few points [...] the political ideas of the English Puritans aiming at equality and democracy were now in the main stream of thought, minus the religious component.", From Dawn to Decadence
  8. His book Dominion: How the Christian Revolution Remade the World is all about this topic. "If secular humanism derives not from reason or from science, but from the distinctive course of Christianity’s evolution – a course that, in the opinion of growing numbers in Europe and America, has left God dead – then how are its values anything more than the shadow of a corpse? What are the foundations of its morality, if not a myth?" Holland also likes to quote the Indian historian S. N. Balagangadhara: "Christianity spreads in two ways: through conversion and through secularisation."
  9. Henrich has a very interesting paper with Scott Atran: The Evolution of Religion: How Cognitive By-Products, Adaptive Learning Heuristics, Ritual Displays, and Group Competition Generate Deep Commitments to Prosocial Religions. "Most religious beliefs minimally violate the expectations created by our intuitive ontology and these modes of construal, thus creating cognitively manageable and memorable supernatural worlds."
  10. I highly recommend Timur Kuran's Private Truths, Public Lies, his analysis of how social pressures cause people to display and sustain false beliefs is brilliant.
  11. Nietzsche also brings up another related issue: the incompatibility between the older animalistic values and the new ones imposed by selective forces downstream of cultural accumulation, turning man into a "sick animal". But that's a story for another day.
  12. Mill, On Liberty.
  13. It might be interesting to approach this from the POV of Zizekian "ideology". Perhaps the issue is a kind of a-priori faith (because belief by conviction isn't really—it has already been mediated through our subjectivity) which disintegrates once you instrumentalize the idea. Of course people are resistant to instrumentalizing sacred values. From The Sublime Object of Ideology: "Pascal's final answer, then, is: leave rational argumentation and submit yourself simply to ideological ritual".
  14. There's a Chesterton's Fence aspect to all of this: you need to understand the lie before you try to tear it down.



Links Jan-Feb 2020

Word2vec: fish + music = bass

fish + music = bass
fish + friend = chum
fish + hair = mullet
fish + struggle = flounder
oink - pig + bro = wassup
yeti – snow + economics = homo economicus
music – soul = downloadable ringtones
good : bad :: rock : nü metal

Related, The (Too Many) Problems of Analogical Reasoning with Word Vectors.
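If you want to play with this yourself, the arithmetic is easy to reproduce with off-the-shelf embeddings. Here is a minimal sketch using gensim and a pretrained GloVe model (my own choice of library and model, not necessarily whatever produced the examples above); the nearest neighbours you get will vary with the embedding:

```python
# Word-vector arithmetic: a minimal sketch, assuming gensim and a pretrained GloVe model.
# pip install gensim
import gensim.downloader as api

# Any pretrained embedding works; results depend heavily on which one you pick.
vectors = api.load("glove-wiki-gigaword-100")

# "fish + music": nearest neighbours of the summed vectors.
print(vectors.most_similar(positive=["fish", "music"], topn=5))

# "good : bad :: rock : ?": the classic king - man + woman = queen construction.
print(vectors.most_similar(positive=["rock", "bad"], negative=["good"], topn=5))
```

Don't expect "bass" or "nü metal" to come out on top; the fragility of these analogies is part of what the linked paper is about.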

We always knew meta-analyses are somewhat flawed because of publication bias and the "file drawer problem", but exactly how bad is it? A new paper compares meta-analyses to pre-registered replications and finds that meta-analyses overstate effect sizes by 3x.

In related news, registered reports in psychology have 44% positive results vs 96% in the standard literature.

Female orgasm frequency by male income quartile. Obviously confounded in all sorts of ways, but still.

Effective Altruists tackle the problem of tfw no gf. h/t @SilverVVulpes

Mark Koyama reviews Scheidel's Escape from Rome, with some very interesting comments on the use of counterfactuals by historians vs economists doing history. "There is no control group for Europe had Archduke Ferdinand not been assassinated."

A review of Dietz Vollrath's new book, Fully Grown:

Vollrath’s preferred decomposition of the causes of the 1.25% annual slowdown in real GDP per capita growth is:

  • 0.80pp - Declining growth in human capital
  • 0.20pp - The shift of spending from goods to services
  • 0.15pp - Declining reallocation of workers and firms
  • 0.10pp - Declining geographic mobility
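(As a check, the four components account for the entire headline figure: 0.80 + 0.20 + 0.15 + 0.10 = 1.25pp.)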

Pay-as-you-go pension systems are going to have serious trouble in countries with rapidly aging populations. Just how bad is it going to be? If you're a <40 yo worker today, it's probably safe to assume you won't be getting much out of the money you're paying into the pension system.

RCA summarizes his views on US healthcare costs with a ton of great charts: Why conventional wisdom on health care is wrong (a primer).

Should we be worrying about automation in the near future? Scholl and Hanson argue no.

Disco Elysium (which I highly recommend) lead designer and writer Robert Kurvitz talks about the development process and how twitter inspired their dialogue engine: The Feature That Almost Sank Disco Elysium.

It has long been established that asking the same question twice in the same questionnaire will often result in the same person giving two different responses. But what happens if you place the repeated questions right next to each other?

Human-cat co-evolution: "We found that the population density of free-ranging cats is linearly related to the proportion of female students in the university. [...] suggests that the cats may have the ability to distinguish human sex and adopt a sociable skill to human females."

The dril Turing test.

And here's some sweet Afro-Cuban jazz fusion.