## Regression and heterogeneity

24 Nov, 2021 at 13:05 | Posted in Statistics & Econometrics | 1 Comment

A group of ‘high-performing’ students (Ada, Beda, and Cissi) apply to an independent school (*friskola*). Ada and Beda are admitted and enrol there. Cissi is also admitted but chooses to attend a municipal school instead. Another group of ‘low-performing’ students, Dora and Eva, also apply and are both admitted to an independent school, but Eva chooses to attend a municipal school.

If we now look at how they perform on a knowledge test, we get the following results: Ada 22, Beda 20, Cissi 22, Dora 12, Eva 6. In the first group, the difference in test scores between the students attending the independent school and the student attending the municipal school is **-1** ((22+20)/2 – 22). In the second group, the difference between the student who chooses the independent school and the student who chooses the municipal school is 6 (12 – 6). The average test-score difference for the two groups taken together is **2.5** ((-1+6)/2). If we run an ordinary OLS regression on the data, Estimated test score = α + β*School type + ζ*Group membership (with School type = 1 for the independent school and Group membership = 1 for the high-performing group), we get α = 8, β = **2** and ζ = 12.

The snag with the regression parameter estimate is that the weighted average, 2, says very little about the group-specific effects, where one group shows a negative ‘effect’ of attending an independent school and the other a positive ‘effect.’ Once again we have an example where real-world heterogeneity risks being ‘masked’ when traditional regression analysis is used to estimate ‘causal’ effects.
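The numbers in this example can be checked with a few lines of Python. The sketch below assumes NumPy is available and codes the independent school and the high-performing group as 1 (the variable names are illustrative, not from the original post). It also shows where the "weighted average" comes from: with a group dummy in the regression, each group's gap is weighted by the within-group variation in school type.

```python
import numpy as np

# Test scores from the example, in the order Ada, Beda, Cissi, Dora, Eva.
# school: 1 = independent school, 0 = municipal school (assumed coding)
# group:  1 = 'high-performing', 0 = 'low-performing' (assumed coding)
score = np.array([22.0, 20.0, 22.0, 12.0, 6.0])
school = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
group = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# OLS: score = alpha + beta * school + zeta * group
X = np.column_stack([np.ones_like(score), school, group])
alpha, beta, zeta = np.linalg.lstsq(X, score, rcond=None)[0]


def within_group_gap(g):
    """Mean score gap (independent minus municipal) inside one group."""
    in_g = group == g
    return score[in_g & (school == 1)].mean() - score[in_g & (school == 0)].mean()


def school_ss(g):
    """Within-group sum of squared deviations of the school dummy."""
    s = school[group == g]
    return ((s - s.mean()) ** 2).sum()


gap_high, gap_low = within_group_gap(1), within_group_gap(0)
w_high, w_low = school_ss(1), school_ss(0)

simple_average = (gap_high + gap_low) / 2
weighted_average = (w_high * gap_high + w_low * gap_low) / (w_high + w_low)

print("alpha, beta, zeta:", alpha, beta, zeta)
print("group gaps:", gap_high, gap_low)
print("simple average:", simple_average, "weighted average:", weighted_average)
```

The weights work out to 4/7 for the high-performing group and 3/7 for the low-performing group, so the OLS coefficient of 2 is neither of the group-specific gaps nor their simple average of 2.5.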

## Anti-vaxxers and statistics

22 Nov, 2021 at 17:26 | Posted in Statistics & Econometrics | 17 Comments

If we run a Pearson correlation analysis on the variables in the scatterplot above — the corona vaccination rate and the (7-day) case rate per 100 000 people (data from last week) — we get an r of -0.15. And still, in the YouTube video below, Sahra Wagenknecht says there is “kein Zusammenhang” (no connection) between vaccination rate and case rate for the countries in the plot. Maybe someone ought to teach high-profile anti-vaxxers some basic statistics …

## David Freedman — the conscience of statistics

21 Nov, 2021 at 14:19 | Posted in Statistics & Econometrics | 1 Comment

On the issue of the various shortcomings of statistics, regression analysis, and econometrics, no one sums it up better than David Freedman in his *Statistical Models and Causal Inference*:

In my view, regression models are not a particularly good way of doing empirical work in the social sciences today, because the technique depends on knowledge that we do not have. Investigators who use the technique are not paying adequate attention to the connection — if any — between the models and the phenomena they are studying. Their conclusions may be valid for the computer code they have created, but the claims are hard to transfer from that microcosm to the larger world …

Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time the models are deployed, the scientific position is nearly hopeless. Reliance on models in such cases is Panglossian …

Given the limits to present knowledge, I doubt that models can be rescued by technical fixes. Arguments about the theoretical merit of regression or the asymptotic behavior of specification tests for picking one version of a model over another seem like the arguments about how to build desalination plants with cold fusion as the energy source. The concept may be admirable, the technical details may be fascinating, but thirsty people should look elsewhere …

Causal inference from observational data presents many difficulties, especially when underlying mechanisms are poorly understood. There is a natural desire to substitute intellectual capital for labor, and an equally natural preference for system and rigor over methods that seem more haphazard. These are possible explanations for the current popularity of statistical models.

Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling — by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by the data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance.

## The experimentalist ‘revolution’ in economics

18 Nov, 2021 at 10:34 | Posted in Statistics & Econometrics | Comments Off on The experimentalist ‘revolution’ in economics

What has always bothered me about the “experimentalist” school is the false sense of certainty it conveys. The basic idea is that if we have a “really good instrument” we can come up with “convincing” estimates of “causal effects” that are not “too sensitive to assumptions.” Elsewhere I have written an extensive critique of this experimentalist perspective, arguing it presents a false panacea, and that all statistical inference relies on some untestable assumptions …

Consider Angrist and Lavy (1999), who estimate the effect of class size on student performance by exploiting variation induced by legal limits. It works like this: Let’s say a law prevents class size from exceeding 30. Let’s further assume a particular school has student cohorts that average about 90, but that cohort size fluctuates between, say, 84 and 96. So, if cohort size is 91–96 we end up with four classrooms of size 22 to 24, while if cohort size is 85–90 we end up with three classrooms of size 28 to 30. By comparing test outcomes between students who are randomly assigned to the small vs. large classes (based on their exogenous birth timing), we obtain a credible estimate of the effect of class size on academic performance. Their answer is that a ten-student reduction raises scores by about 0.2 to 0.3 standard deviations.
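The class-size arithmetic in this passage is easy to reproduce. The sketch below assumes a cap of 30 students per class, which is the value the quoted class sizes imply, and assumes that a cohort is split as evenly as possible into the smallest number of classes that respects the cap. The function name is made up for illustration.

```python
import math

CAP = 30  # assumed legal maximum class size (implied by the sizes quoted above)


def classes_and_sizes(cohort, cap=CAP):
    """Split a cohort into the fewest classes that respect the cap."""
    n_classes = math.ceil(cohort / cap)
    base, extra = divmod(cohort, n_classes)
    # 'extra' classes get one additional student so the split is as even as possible
    sizes = [base + 1] * extra + [base] * (n_classes - extra)
    return n_classes, sizes


for cohort in (85, 90, 91, 96):
    n_classes, sizes = classes_and_sizes(cohort)
    print(cohort, n_classes, sizes)
```

One extra student, going from a cohort of 90 to 91, moves every class from 30 students down to 22 or 23. That discontinuity is the variation the design exploits.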

This example shares a common characteristic of natural experiment studies, which I think accounts for much of their popularity: At first blush, the results do seem incredibly persuasive. But if you think for a while, you start to see they rest on a host of assumptions. For example, what if schools that perform well attract more students? In this case, incoming cohort sizes are not random, and the whole logic breaks down. What if parents who care most about education respond to large class sizes by sending their kids to a different school? What if teachers assigned to the extra classes offered in high enrollment years are not a random sample of all teachers?

## The presumed advantage of the experimentalist approach

9 Nov, 2021 at 15:56 | Posted in Statistics & Econometrics | Comments Off on The presumed advantage of the experimentalist approach

Here, I want to challenge the popular view that “natural experiments” offer a simple, robust and relatively “assumption free” way to learn interesting things about economic relationships. Indeed, I will argue that it is not possible to learn anything of interest from data without theoretical assumptions, even when one has available an “ideal instrument”. Data cannot determine interesting economic relationships without *a priori* identifying assumptions, regardless of what sort of idealized experiments, “natural experiments” or “quasi-experiments” are present in that data. Economic models are always needed to provide a window through which we interpret data, and our interpretation will always be subjective, in the sense that it is contingent on our model.

Furthermore, atheoretical “experimentalist” approaches do not rely on fewer or weaker assumptions than do structural approaches. The real distinction is that, in a structural approach, one’s *a priori* assumptions about behavior must be laid out explicitly, while in an experimentalist approach, key assumptions are left implicit …

If one accepts that inferences drawn from experimentalist work are just as contingent on *a priori* assumptions as those from structural work, the key presumed advantage of the experimentalist approach disappears. One is forced to accept that all empirical work in economics, whether “experimentalist” or “structural”, relies critically on *a priori* theoretical assumptions.

In econometrics, it is often said that the error term in the regression model used represents the effect of the variables that are omitted from the model. The error term is somehow thought to be a ‘cover-all’ term representing omitted content in the model and necessary to include to ‘save’ the assumed deterministic relation between the other random variables included in the model. Error terms are usually assumed to be orthogonal (uncorrelated) to the explanatory variables. But since they are unobservable, they are also impossible to empirically test. And without justification of the orthogonality assumption, there is as a rule nothing to ensure identifiability. To me, this only highlights that the important lesson to draw from the debate between ‘structuralist’ and ‘experimentalist’ econometricians is that no matter what set of assumptions you choose to build your analysis on, you will never be able to empirically test them conclusively. Ultimately it always comes down to a question of faith.
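One way to see why orthogonality cannot be checked from the data is that OLS residuals are uncorrelated with the regressors by construction, whatever the true error term looks like. The simulation below is a minimal sketch under invented assumptions: an omitted variable `z` drives both the regressor and the error, the true slope is 1, and NumPy is available. None of these names or numbers come from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

z = rng.normal(size=n)        # omitted variable (hypothetical)
x = z + rng.normal(size=n)    # regressor, correlated with z
u = z + rng.normal(size=n)    # true error term: contains z, so it correlates with x
y = 1.0 * x + u               # true slope is 1

# OLS of y on a constant and x
X = np.column_stack([np.ones(n), x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ coef

corr_true_error = np.corrcoef(x, u)[0, 1]          # unobservable in practice
corr_residuals = np.corrcoef(x, residuals)[0, 1]   # what the analyst can compute

print("estimated slope:", coef[1])
print("corr(x, true error):", corr_true_error)
print("corr(x, residuals):", corr_residuals)
```

In this setup the estimated slope is biased well away from 1, yet the residuals show essentially zero correlation with `x`. The observable residuals therefore say nothing about whether the unobservable error term satisfies the assumption.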

## Science — a messy business

6 Nov, 2021 at 14:24 | Posted in Statistics & Econometrics | Comments Off on Science — a messy business

The obvious response of course, albeit one that econometricians occupied with fitting a line to given sets of data rarely contemplate, is to add to the ‘available data.’ Specifically the aim must be to draw consequences for, and seek out observations on, actual phenomena which allow the causal factor responsible to be identified. If, for example, bird droppings is a relevant causal factor then we could expect higher yields wherever birds roost. Perhaps there is a telegraph wire that crosses the field which is heavily populated by roosting birds, but which provides only negligible shade … Perhaps too there is a plot of land somewhere close to the farm house which is shaded by a protruding iron roof, but which birds avoid because of a patrolling cat … The fact that it is not possible to state categorically at this abstract level the precise conditions under which substantive theories can be selected amongst, i.e. without knowing the contents of the theories themselves or the nature or context of the conditions upon which they bear, is an unfortunate fact of all science. Science is a messy business. It requires an abundance of ingenuity, as well as patience, along with skills that may need to be developed on the job.

## Modelling dangers

29 Oct, 2021 at 11:19 | Posted in Statistics & Econometrics | Comments Off on Modelling dangers

With models, it is easy to lose track of three essential points: (i) results depend on assumptions, (ii) changing the assumptions in apparently innocuous ways can lead to drastic changes in conclusions, and (iii) familiarity with a model’s name is no guarantee of the model’s truth. Under the circumstances, it may be the assumptions behind the model that provide the leverage, not the data fed into the model. This is a danger with experiments, and even more so with observational studies.

## The LATE estimator — a critique (wonkish)

24 Oct, 2021 at 11:58 | Posted in Statistics & Econometrics | Comments Off on The LATE estimator — a critique (wonkish)

One of the reasons Guido Imbens and Joshua Angrist were given this year’s ‘Nobel prize’ in economics is their LATE estimator used in instrumental variables estimation of causal effects. Another prominent ‘Nobel prize’ winner in economics — Angus Deaton — is not overly impressed:

Without explicit prior consideration of the effect of the instrument choice on the parameter being estimated, such a procedure is effectively the opposite of standard statistical practice in which a parameter of interest is defined first, followed by an estimator that delivers that parameter. Instead, we have a procedure in which the choice of the instrument, which is guided by criteria designed for a situation in which there is no heterogeneity, is implicitly allowed to determine the parameter of interest. This goes beyond the old story of looking for an object where the light is strong enough to see; rather, we have at least some control over the light but choose to let it fall where it may and then proclaim that whatever it illuminates is what we were looking for all along …

The LATE may or may not be a parameter of interest to the World Bank or the Chinese government and, in general, there is no reason to suppose that it will be …

I find it hard to make any sense of the LATE. We are unlikely to learn much about the processes at work if we refuse to say anything about what determines (the effect ‘parameter’) θ; heterogeneity is not a technical problem calling for an econometric solution but a reflection of the fact that we have not started on our proper business, which is trying to understand what is going on. Of course, if we are as skeptical of the ability of economic theory to deliver useful models as are many applied economists today, the ability to avoid modeling can be seen as an advantage, though it should not be a surprise when such an approach delivers answers that are hard to interpret.

## How to achieve exchangeability (student stuff)

21 Oct, 2021 at 16:53 | Posted in Statistics & Econometrics | Comments Off on How to achieve exchangeability (student stuff).

## ‘Nobel prize’ econometrics

16 Oct, 2021 at 09:59 | Posted in Statistics & Econometrics | Comments Off on ‘Nobel prize’ econometrics.

Great presentation, but I do think Angrist ought to have also mentioned that although ‘ideally controlled experiments’ may tell us with certainty what causes what effects, this is so only when given the right ‘closures.’ Making appropriate extrapolations from — ideal, accidental, natural or quasi — experiments to different settings, populations or target systems, is not easy. “It works there” is no evidence for “it will work here.” The causal background assumptions made have to be justified, and without licenses to export, the value of ‘rigorous’ and ‘precise’ methods used when analyzing ‘natural experiments’ is often despairingly small. Since the core assumptions on which IV analysis builds are NEVER directly testable, those of us who choose to use instrumental variables to find out about causality ALWAYS have to defend and argue for the validity of the assumptions the causal inferences build on. Especially when dealing with natural experiments, we should be very cautious when being presented with causal conclusions without convincing arguments about the veracity of the assumptions made. If you are out to make causal inferences you have to rely on a trustworthy theory of the data generating process. The empirical results causal analysis supplies us with are only as good as the assumptions we make about the data generating process. Garbage in, garbage out.

## Econometric toolbox developers get this year’s ‘Nobel prize’ in economics

11 Oct, 2021 at 17:46 | Posted in Statistics & Econometrics | 2 Comments

Many of the big questions in the social sciences deal with cause and effect. How does immigration affect pay and employment levels? How does a longer education affect someone’s future income? …

This year’s Laureates have shown that it is possible to answer these and similar questions using *natural experiments*. The key is to use situations in which chance events or policy changes result in groups of people being treated differently, in a way that resembles clinical trials in medicine.

Using natural experiments, **David Card** has analysed the labour market effects of minimum wages, immigration and education …

Data from a natural experiment are difficult to interpret, however … In the mid-1990s, **Joshua Angrist** and **Guido Imbens** solved this methodological problem, demonstrating how precise conclusions about cause and effect can be drawn from natural experiments.

For economists interested in research methodology in general and natural experiments in particular, these three economists are well-known. A central part of their work is based on the idea that random or as-if random assignment in natural experiments obviates the need for controlling potential confounders, and hence this kind of ‘simple and transparent’ design-based research method is preferable to more traditional multivariate regression analysis where the controlling only comes in *ex post* via statistical modelling.

But — there is always a but …

The point of making a randomized experiment is often said to be that it ‘ensures’ that any correlation between a supposed cause and effect indicates a causal relation. This is believed to hold since randomization (allegedly) ensures that a supposed causal variable does not correlate with other variables that may influence the effect.

The problem with that simplistic view on randomization is that the claims made are exaggerated and sometimes even false:

• Even if you manage to make the assignment to treatment and control groups ideally random, the sample selection certainly is — except in extremely rare cases — not random. Even if we make a proper randomized assignment, if we apply the results to a biased sample, there is always the risk that the experimental findings will not apply. What works ‘there,’ does not work ‘here.’ Randomization hence does not ‘guarantee’ or ‘ensure’ making the right causal claim. Although randomization may help us rule out certain possible causal claims, randomization *per se* does not *guarantee* anything!

• Even if both sampling and assignment are made in an ideal random way, performing standard randomized experiments only gives you averages. The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Although we get the right answer of the average causal effect being 0, those who are ‘treated’ may have causal effects equal to -100 and those ‘not treated’ may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the average effect particularly enlightening.

• There is almost always a trade-off between bias and precision. In real-world settings, the gain from greater precision often outweighs the cost of a little bias. And — most importantly — in case we have a population with sizeable heterogeneity, the average treatment effect of the sample may differ substantially from the average treatment effect in the population. If so, the value of any extrapolating inferences made from trial samples to other populations is highly questionable.

• Since most real-world experiments and trials build on performing a single randomization, what *would* happen *if* you kept on randomizing forever, does not help you to ‘ensure’ or ‘guarantee’ that you do not make false causal conclusions in the one particular randomized experiment you actually do perform. It is indeed difficult to see why thinking about what you know you will never do, would make you happy about what you actually do.

• And then there is also the problem that ‘Nature’ may not always supply us with the random experiments we are most interested in. If we are interested in X, why should we study Y only because design dictates that? Method should never be prioritized over substance!
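Two of the points above, that averages can mask heterogeneity and that a single randomization guarantees nothing, can be illustrated with a toy calculation. The sketch below uses entirely hypothetical potential outcomes for eight units, half with a unit-level effect of -100 and half with +100, and enumerates every possible way of assigning four of them to treatment. The numbers and names are invented for illustration only.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for eight units.
y0 = [10, 20, 30, 40, 10, 20, 30, 40]        # outcome if not treated
effect = [-100] * 4 + [100] * 4              # unit-level causal effects
y1 = [a + e for a, e in zip(y0, effect)]     # outcome if treated

true_ate = mean(effect)                      # average causal effect


def diff_in_means(treated):
    """Difference-in-means estimate for one assignment of units to treatment."""
    control = [i for i in range(len(y0)) if i not in treated]
    return mean(y1[i] for i in treated) - mean(y0[i] for i in control)


# Every possible assignment of exactly four units to treatment.
estimates = [diff_in_means(t) for t in combinations(range(len(y0)), 4)]

print("true average effect:", true_ate)
print("number of possible assignments:", len(estimates))
print("mean of estimates over all assignments:", mean(estimates))
print("range of single-experiment estimates:", min(estimates), max(estimates))
```

Averaged over all possible assignments the estimator recovers the average effect of zero, but no unit actually has an effect of zero, and the one experiment that is actually run can land anywhere between the two extremes.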

Randomization is not a panacea. It is not the best method for all questions and circumstances. Proponents of randomization make claims about its ability to deliver causal knowledge that are simply wrong. There are good reasons to be sceptical of this nowadays popular — and ill-informed — view that randomization is the only valid and the best method on the market. It is not.

Trygve Haavelmo — the father of modern probabilistic econometrics — once wrote that he and other econometricians could not build a complete bridge between our models and reality by logical operations alone, but finally had to make “a non-logical jump.” To Haavelmo and his modern followers, econometrics is not really in the truth business. The explanations we can give of economic relations and structures based on econometric models are “not hidden truths to be discovered” but rather our own “artificial inventions”.

Rigour and elegance in the analysis do not make up for the gap between reality and model. A crucial ingredient to any economic theory that wants to use probabilistic models should be a convincing argument for the view that it is harmless to consider economic variables as stochastic variables. In most cases, no such arguments are given.

A rigorous application of econometric methods in economics really presupposes that the phenomena of our real-world economies are ruled by stable causal relations between variables. To warrant this assumption one, however, has to convincingly establish that the targeted acting causes are stable and invariant so that they maintain their parametric status after the bridging. The endemic lack of predictive success of the econometric project indicates that this hope of finding fixed parameters is a hope for which there really is no other ground than hope itself.

Evidence-based theories and policies are highly valued nowadays. Randomization is supposed to control for bias from unknown confounders. The received opinion is that evidence based on randomized experiments therefore is the best.

More and more economists have also lately come to advocate randomization as the principal method for ensuring valid causal inferences.

I would however rather argue that randomization — just as econometrics — promises more than it can deliver, basically because it requires assumptions that in practice are not possible to maintain.

Given the assumptions (such as manipulability, transitivity, separability, additivity, linearity, etc.) econometric methods deliver deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. And although randomization may contribute to controlling for confounding, it does not guarantee it, since genuine randomness presupposes infinite experimentation and we know all real experimentation is finite.

The prize committee says that econometrics and natural experiments “help answer important questions for society.” Maybe so, but it is far from evident to what extent they do so. As a rule, the econometric practitioners of natural experiments have far too inflated hopes about their explanatory potential and value.

## Statistics and econometrics — science building on fantasy worlds

28 Sep, 2021 at 11:04 | Posted in Statistics & Econometrics | 2 Comments

In econometrics one often gets the feeling that many of its practitioners think of it as a kind of automatic inferential machine: input data and out comes causal knowledge. This is like pulling a rabbit from a hat. Great — but first you have to put the rabbit in the hat. And this is where assumptions come into the picture.

The assumption of imaginary ‘super populations’ is one of the many dubious assumptions used in modern econometrics.

As social scientists — and economists — we have to confront the all-important question of how to handle uncertainty and randomness. Should we define randomness with probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of – and actually, to be strict, do not at all exist – without specifying such system-contexts. Accepting a domain of probability theory and sample space of infinite populations also implies that judgments are made on the basis of observations that are actually never made!

Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for a science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.

And as if this wasn’t enough, one could — as we’ve seen — also seriously wonder what kind of ‘populations’ these statistical and econometric models ultimately are based on. Why should we as social scientists — and not as pure mathematicians working with formal-axiomatic systems without the urge to confront our models with real target systems — unquestioningly accept models based on concepts like the ‘infinite super populations’ used in e.g. the ‘potential outcome’ framework that has become so popular lately in social sciences?

The theory requires that the data be embedded in a stochastic framework, complete with random variables, probability distributions, and unknown parameters. However, the data often arrive without benefit of randomness. In such cases, the investigators may still wish to separate effects of “the causes they wish to study or are trying to detect” from “accidental occurrences due to the many other circumstances which they cannot control.” What can they do? Usually, they follow Fisher (1922) into a fantasy world “by constructing a hypothetical infinite population, of which the actual data are regarded as constituting a random sample.” Unfortunately, this fantasy world is often harder to understand than the original problem which led to its invocation.

Of course one could treat observational or experimental data as random samples from real populations. I have no problem with that (although it has to be noted that most ‘natural experiments’ are *not* based on random sampling from some underlying population — which, of course, means that the effect-estimators, strictly seen, only are unbiased for the specific groups studied). But probabilistic econometrics does not content itself with these kinds of populations. Instead, it creates imaginary populations of ‘parallel universes’ and assumes that our data are random samples from that kind of ‘infinite super populations.’

But this is actually nothing else but hand-waving! And it is inadequate for real science. As David Freedman writes:

These are convenient fictions… Nevertheless, reliance on imaginary populations is widespread. Indeed regression models are commonly used to analyze convenience samples… The rhetoric of imaginary populations is seductive because it seems to free the investigator from the necessity of understanding how data were generated.

Modelling assumptions made in statistics and econometrics are more often than not made for mathematical tractability reasons, rather than verisimilitude. That is unfortunately also a reason why the methodological ‘rigour’ encountered when reading statistical and econometric research is to a large degree nothing but a deceptive appearance. The models constructed may seem technically advanced and very ‘sophisticated,’ but that’s usually only because the problems here discussed have been swept under the carpet. Assuming that our data are generated by ‘coin flips’ in an imaginary ‘superpopulation’ only means that we get answers to questions that we are not asking. The inferences made based on imaginary ‘superpopulations,’ well, they too are nothing but imaginary. We should not — as already Aristotle noted — expect more rigour and precision than the object examined allows. And in social sciences — including economics and econometrics — it’s always wise to ponder C. S. Peirce’s remark that universes are not as common as peanuts …

## Why technical fixes will not rescue econometrics

27 Sep, 2021 at 21:35 | Posted in Statistics & Econometrics | 1 Comment

On the issue of the various shortcomings of regression analysis and econometrics, no one sums it up better than David Freedman in his *Statistical Models and Causal Inference*:

In my view, regression models are not a particularly good way of doing empirical work in the social sciences today, because the technique depends on knowledge that we do not have. Investigators who use the technique are not paying adequate attention to the connection — if any — between the models and the phenomena they are studying. Their conclusions may be valid for the computer code they have created, but the claims are hard to transfer from that microcosm to the larger world …

Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time the models are deployed, the scientific position is nearly hopeless. Reliance on models in such cases is Panglossian …

Given the limits to present knowledge, I doubt that models can be rescued by technical fixes. Arguments about the theoretical merit of regression or the asymptotic behavior of specification tests for picking one version of a model over another seem like the arguments about how to build desalination plants with cold fusion as the energy source. The concept may be admirable, the technical details may be fascinating, but thirsty people should look elsewhere …

Causal inference from observational data presents many difficulties, especially when underlying mechanisms are poorly understood. There is a natural desire to substitute intellectual capital for labor, and an equally natural preference for system and rigor over methods that seem more haphazard. These are possible explanations for the current popularity of statistical models.

Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling — by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by the data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance.

## Science before statistics — causal inference

10 Sep, 2021 at 12:20 | Posted in Statistics & Econometrics | Comments Off on Science before statistics — causal inference.

## Probability and rationality — trickier than most people think

26 Aug, 2021 at 18:42 | Posted in Statistics & Econometrics | 5 Comments

**The Coin-tossing Problem**

My friend Ben says that on the first day he got the following sequence of Heads and Tails when tossing a coin:

H H H H H H H H H H

And on the second day he says that he got the following sequence:

H T T H H T T H T H

Which report makes you suspicious?

Most people yours truly asks this question say the first report looks suspicious.

But actually both reports are equally probable! Every time you toss a (fair) coin there is the same probability (50 %) of getting H or T. Both days Ben makes equally many tosses and every sequence is equally probable!
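The claim can be verified directly. The sketch below assumes a fair coin and independent tosses, and multiplies the per-toss probabilities for each of Ben's two reported sequences. The function name is made up for illustration.

```python
from fractions import Fraction


def sequence_probability(seq, p_heads=Fraction(1, 2)):
    """Probability of one exact sequence of independent tosses."""
    prob = Fraction(1)
    for toss in seq:
        prob *= p_heads if toss == "H" else 1 - p_heads
    return prob


day_one = "HHHHHHHHHH"
day_two = "HTTHHTTHTH"

print(sequence_probability(day_one))
print(sequence_probability(day_two))
```

Each exact sequence of ten tosses has probability (1/2)^10 = 1/1024, regardless of how ‘random’ it looks.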

**The Linda Problem**

Linda is 40 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which of the following two alternatives is more probable?

A. Linda is a bank teller.

B. Linda is a bank teller and active in the feminist movement.

‘Rationally,’ alternative B cannot be more likely than alternative A. Nonetheless Amos Tversky and Daniel Kahneman reported — ‘Judgments of and by representativeness.’ In D. Kahneman, P. Slovic & A. Tversky (Eds.), *Judgment under uncertainty: Heuristics and biases.* Cambridge, UK: Cambridge University Press 1982 — that more than 80 per cent of respondents said that it was.

Why do we make such ‘irrational’ judgments in both these cases? Tversky and Kahneman argued that in making this kind of judgment we seek the closest resemblance between causes and effects (in The Linda Problem, between Linda’s personality and her behaviour), rather than calculating probability, and that this makes alternative B seem preferable. By using a heuristic called *representativeness*, statement B in The Linda Problem seems more ‘representative’ of Linda based on the description of her, although from a probabilistic point of view it is clearly less likely.
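The reason alternative B cannot be more probable than alternative A is the conjunction rule. Writing $T$ for "Linda is a bank teller" and $F$ for "Linda is active in the feminist movement" (symbols introduced here for illustration, not taken from Tversky and Kahneman), alternative B is the joint event $T \cap F$:

$$
P(T \cap F) = P(T)\,P(F \mid T) \le P(T), \qquad \text{since } 0 \le P(F \mid T) \le 1 .
$$

However well the description fits an active feminist, that fit can only raise $P(F \mid T)$ toward 1. It can never push the probability of the conjunction above the probability of being a bank teller.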
