## The randomistas revolution

7 Jul, 2018 at 10:54 | Posted in Statistics & Econometrics | 2 Comments

In his new history of experimental social science — *Randomistas: How radical researchers are changing our world* — Andrew Leigh gives an introduction to the RCT (randomized controlled trial) method for conducting experiments in medicine, psychology, development economics, and policy evaluation. Although he mentions critiques that can be levelled against the method, the author does not let them cloud his overwhelmingly enthusiastic view of RCTs.

Among mainstream economists, this uncritical attitude towards RCTs has become standard. Nowadays many mainstream economists maintain that ‘imaginative empirical methods’ — such as natural experiments, field experiments, lab experiments, and RCTs — can help us answer questions concerning the external validity of economic models. In their view, such methods are more or less tests of ‘an underlying economic model’ and enable economists to make the right selection from the ever-expanding ‘collection of potentially applicable models.’

When looked at carefully, however, there are in fact few real reasons to share this optimism about the alleged ‘empirical turn’ in economics.

If we see experiments or field studies as theory tests or models that ultimately aspire to say something about the real ‘target system,’ then the problem of external validity is central (and was for a long time also a key reason why behavioural economists had trouble getting their research results published).

Assume that you have examined how the performance of a group of people (A) is affected by a specific ‘treatment’ (B). How can we extrapolate/generalize to new samples outside the original population? How do we know that any replication attempt ‘succeeds’? How do we know when these replicated experimental results can be said to justify inferences made in samples from the original population? If, for example, P(A|B) is the conditional density function for the original sample, and we are interested in making an extrapolative prediction of E[P(A|B)], how can we know that the new sample’s density function is identical with the original? Unless we can give some really good argument for this being the case, inferences built on P(A|B) say nothing about the target system’s P′(A|B).

External validity/extrapolation/generalization is founded on the assumption that we can make inferences based on P(A|B) that are exportable to other populations for which P′(A|B) applies. Sure, if one can convincingly show that P and P′ are similar enough, the problems are perhaps surmountable. But arbitrarily introducing functional specification restrictions of the type invariance/stability/homogeneity is, at least for an epistemological realist, far from satisfactory. And often it is – unfortunately – exactly this that I see when I examine mainstream economists’ RCTs and ‘experiments.’
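To make the point concrete, here is a minimal simulation sketch (with made-up numbers, not from any actual study) of what goes wrong when P(A|B) and P′(A|B) are silently assumed identical: the effect estimated in the original sample licenses no inference about a target population whose conditional density differs.

```python
# A minimal sketch of the extrapolation problem, assuming two hypothetical
# populations whose conditional densities P(A|B) and P'(A|B) differ.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Original population: the 'treatment' B shifts performance A by +2 on average.
b_orig = rng.integers(0, 2, n)
a_orig = 2.0 * b_orig + rng.normal(0, 1, n)

# Target population: the same 'treatment' shifts A by -1 on average,
# i.e. P'(A|B) is not identical to P(A|B).
b_target = rng.integers(0, 2, n)
a_target = -1.0 * b_target + rng.normal(0, 1, n)

def effect(a, b):
    """Difference in conditional means, E[A|B=1] - E[A|B=0]."""
    return a[b == 1].mean() - a[b == 0].mean()

print(f"estimated effect in the original sample: {effect(a_orig, b_orig):+.2f}")
print(f"actual effect in the target population:  {effect(a_target, b_target):+.2f}")
# Without a substantive argument for why the two densities coincide,
# the first number tells us nothing about the second.
```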

Many ‘experimentalists’ claim that it is easy to replicate experiments under different conditions and therefore, a fortiori, easy to test the robustness of experimental results. But is it really that easy? Population selection is almost never simple. Had the problem of external validity only been about inference from sample to population, this would be no critical problem. But the really interesting inferences are those we try to make from specific labs/experiments/fields to the specific real-world situations/institutions/structures that we are interested in understanding or explaining (causally). And then the population problem is more difficult to tackle.

In randomized trials the researchers try to find out the causal effects that different variables of interest may have by changing circumstances randomly — a procedure somewhat (‘on average’) equivalent to the usual ceteris paribus assumption.

Besides the fact that ‘on average’ is not always ‘good enough,’ it amounts to nothing but hand waving to simpliciter assume, without argumentation, that it is tenable to treat social agents and relations as homogeneous and interchangeable entities.

Randomization is used basically to allow the econometrician to treat the population as consisting of interchangeable and homogeneous groups (‘treatment’ and ‘control’). The regression models one arrives at by using randomized trials tell us the average effect that variations in variable X have on the outcome variable Y, without having to explicitly control for the effects of other explanatory variables R, S, T, etc. Everything is assumed to be essentially equal except the values taken by variable X.

In a usual regression context one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:

Y = α + βX + ε,

where α is a constant intercept, β a constant ‘structural’ causal effect and ε an error term.
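A small simulation sketch (made-up variables and coefficients, purely illustrative) shows why randomization licenses the simple regression of Y on X alone: random assignment makes X independent of the omitted covariates, so the uncontrolled OLS slope is still, on average, an unbiased estimate of β.

```python
# A minimal sketch: randomized X is independent of omitted covariates R and S,
# so OLS of Y on X alone still recovers the average effect beta.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, beta = 1.0, 3.0

r = rng.normal(0, 1, n)        # omitted explanatory variable R
s = rng.normal(0, 1, n)        # omitted explanatory variable S
x = rng.integers(0, 2, n)      # randomized treatment, independent of R and S
y = alpha + beta * x + 2 * r - s + rng.normal(0, 1, n)

# For a binary randomized X, the OLS slope equals the difference in means:
beta_hat = y[x == 1].mean() - y[x == 0].mean()
print(f"true beta = {beta}, OLS estimate without controls = {beta_hat:.2f}")
```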

The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. We may get the right answer that the average causal effect is 0, even though those who are ‘treated’ (X = 1) have causal effects equal to −100 and those ‘not treated’ (X = 0) have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
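Here is a minimal sketch (illustrative numbers only, echoing the ±100 example above) of how a near-zero average effect can coexist with large causal effects of opposite sign in two unobserved subgroups:

```python
# A minimal sketch: the average treatment effect is ~0, yet individual
# causal effects are -100 and +100 in two equally large subgroups.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

group = rng.integers(0, 2, n)               # unobserved subgroup membership
tau = np.where(group == 1, 100.0, -100.0)   # heterogeneous causal effects
x = rng.integers(0, 2, n)                   # randomized treatment
y = tau * x + rng.normal(0, 1, n)           # observed outcome

ate = y[x == 1].mean() - y[x == 0].mean()
print(f"estimated average effect: {ate:+.2f}")          # close to 0

for g in (0, 1):
    sub = group == g
    eff = y[(x == 1) & sub].mean() - y[(x == 0) & sub].mean()
    print(f"effect in subgroup {g}: {eff:+.2f}")        # close to -100 / +100
```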

Most ‘randomistas’ underestimate the heterogeneity problem. It does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is often also an internal problem for the millions of regression estimates that economists produce every year.

‘Ideally controlled experiments’ tell us with certainty what causes what effects — but only given the right ‘closures.’ Making appropriate extrapolations from (ideal, accidental, natural or quasi) experiments to different settings, populations or target systems, is not easy. “It works there” is no evidence for “it will work here”. Causes deduced in an experimental setting still have to show that they come with an export-warrant to the target population/system. The causal background assumptions made have to be justified, and without licenses to export, the value of ‘rigorous’ and ‘precise’ methods — and ‘on-average-knowledge’ — is despairingly small.

RCTs have very little reach beyond giving descriptions of what has happened in the past. From the perspective of the future and for policy purposes they are as a rule of limited value since they cannot tell us what background factors were held constant when the trial intervention was being made.

RCTs usually do not provide evidence that the results are exportable to other target systems. RCTs cannot be taken for granted to give generalizable results. That something works *somewhere* for someone is no warranty for us to believe it to work for us *here* or even that it works *generally*. RCTs are simply not the best method for all questions and in all circumstances. And insisting on using only *one* tool often means using the *wrong* tool.

## 2 Comments


Sue Clegg, ‘Evidence-based practice in educational research: a critical realist critique of systematic review,’ https://www.tandfonline.com/doi/abs/10.1080/01425690500128932

Comment by Jan Milch — 7 Jul, 2018

It seems to me that those involved in studies should, and often do, treat the proposition ‘X is a good method’ as uncertain, whereas once a method has been deemed good, it will be applied with less circumspection, even as dogma. Therefore the context of the roll-out is frequently significantly different from that of the study. In so far as a study relies on logic, all one can say is that the study ‘suggests’ that X is good and seems to be a hypothesis worth testing. In principle it only seems possible to establish the actual goodness of X if either:

1) One fools the subjects of the study into thinking that the method is already known to be good (which would be unethical).

2) One combines the roll-out with some degree of monitoring, so that one can check that ‘in practice’ the method really is good.

In both cases, there seems to be no way to know whether the ‘goodness’ will remain good. Even in a physics laboratory one cannot be sure that everything that is actually relevant is constrained enough. So, to a greater or lesser degree, one has to monitor the situation very broadly to make sure that X is still a good method.

RCTs and the like can be the best possible method, but that doesn’t excuse them from the laws of thought.

Comment by Dave Marsay — 9 Jul, 2018