## Extreme events and how to live with them

12 Jun, 2021 at 12:00 | Posted in Statistics & Econometrics | Leave a comment

## On the limits of ‘mediation analysis’ and ‘statistical causality’

11 Jun, 2021 at 18:12 | Posted in Statistics & Econometrics | Leave a comment

“Mediation analysis” is this thing where you have a treatment and an outcome and you’re trying to model how the treatment works: how much does it directly affect the outcome, and how much is the effect “mediated” through intermediate variables …

In the real world, it’s my impression that almost all the mediation analyses that people actually fit in the social and medical sciences are misguided: lots of examples where the assumptions aren’t clear and where, in any case, coefficient estimates are hopelessly noisy and where confused people will over-interpret statistical significance …

More and more I’ve been coming to the conclusion that the standard causal inference paradigm is broken … So how to do it? I don’t think traditional path analysis or other multivariate methods of the throw-all-the-data-in-the-blender-and-let-God-sort-em-out variety will do the job. Instead we need some structure and some prior information.

Causality in social sciences — and economics — can never solely be a question of statistical inference. Causality entails more than predictability, and really explaining social phenomena in depth requires theory. Analysis of variation — the foundation of all econometrics — can never in itself reveal how these variations are brought about. Only when we are able to tie actions, processes or structures to the statistical relations detected can we say that we are getting at relevant explanations of causation.

Most facts have many different, possible, alternative explanations, but we want to find the best of all contrastive explanations (since all real explanation takes place relative to a set of alternatives). So which is the best explanation? Many scientists, influenced by statistical reasoning, think that the likeliest explanation is the best explanation. But the likelihood of x is not in itself a strong argument for thinking it explains y. I would rather argue that what makes one explanation better than another are things like aiming for and finding powerful, deep, causal features and mechanisms that we have warranted and justified reasons to believe in. Statistical reasoning — especially the variety based on a Bayesian epistemology — generally has no room for these kinds of explanatory considerations. The only thing that matters is the probabilistic relation between evidence and hypothesis. That is also one of the main reasons I find abduction — inference to the best explanation — a better description and account of what constitutes actual scientific reasoning and inference.

In the social sciences … regression is used to discover relationships or to disentangle cause and effect. However, investigators have only vague ideas as to the relevant variables and their causal order; functional forms are chosen on the basis of convenience or familiarity; serious problems of measurement are often encountered.

Regression may offer useful ways of summarizing the data and making predictions. Investigators may be able to use summaries and predictions to draw substantive conclusions. However, I see no cases in which regression equations, let alone the more complex methods, have succeeded as engines for discovering causal relationships.

Some statisticians and data scientists think that algorithmic formalisms somehow give them access to causality. That is, however, simply not true. Assuming ‘convenient’ things like faithfulness or stability is not to give proofs. It’s to assume what has to be proven. Deductive-axiomatic methods used in statistics do not produce evidence for causal inferences. The real causality we are searching for is the one existing in the real world around us. If there is no warranted connection between axiomatically derived theorems and the real world, well, then we haven’t really obtained the causation we are looking for.

If contributions made by statisticians to the understanding of causation are to be taken over with advantage in any specific field of inquiry, then what is crucial is that the right relationship should exist between statistical and subject-matter concerns …

The idea of causation as consequential manipulation is apt to research that can be undertaken primarily through experimental methods and, especially to ‘practical science’ where the central concern is indeed with ‘the consequences of performing particular acts’. The development of this idea in the context of medical and agricultural research is as understandable as the development of that of causation as robust dependence within applied econometrics. However, the extension of the manipulative approach into sociology would not appear promising, other than in rather special circumstances … The more fundamental difficulty is that under the — highly anthropocentric — principle of ‘no causation without manipulation’, the recognition that can be given to the action of individuals as having causal force is in fact peculiarly limited.

## Causality and the need to reform the teaching of statistics

10 Jun, 2021 at 16:18 | Posted in Statistics & Econometrics | Leave a comment

I will argue that realistic and thus scientifically relevant statistical theory is best viewed as a subdomain of causality theory, not a separate entity or an extension of probability. In particular, the application of statistics (and indeed most technology) must deal with causation if it is to represent adequately the underlying reality of how we came to observe what was seen … The network we deploy for analysis incorporates whatever time-order and independence assumptions we use for interpreting observed associations, whether those assumptions are derived from background (contextual) or design information … Statistics should integrate causal networks into its basic teachings and indeed into its entire theory, starting with the probability and bias models that are used to build up statistical methods and interpret their outputs. Every real data analysis has a causal component comprising the causal network assumed to have created the data set …

Thus, because statistical analyses need a causal skeleton to connect to the world, causality is not extra-statistical but instead is a logical antecedent of real-world inferences. Claims of random or “ignorable” or “unbiased” sampling or allocation are justified by causal actions to block (“control”) unwanted causal effects on the sample patterns. Without such actions of causal blocking, independence can only be treated as a subjective exchangeability assumption whose justification requires detailed contextual information about absence of factors capable of causally influencing both selection (including selection for treatment) and outcomes …

Given the absence of elaborated causality discussions in statistics textbooks and coursework, we should not be surprised at the widespread misuse and misinterpretation of statistical methods and results. This is why incorporation of causality into introductory statistics is needed as urgently as other far more modest yet equally resisted reforms involving shifts in labels and interpretations for p-values and interval estimates.

## The elite illusion

9 Jun, 2021 at 13:13 | Posted in Statistics & Econometrics | Leave a comment

A great set of lectures — but yours truly still warns his students that regression-based averages are something we have reason to be cautious about.

Suppose we want to estimate the average causal effect of a dummy variable (T) on an observed outcome variable (O). In a usual regression context one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:

O = α + βT + ε,

where α is a constant intercept, β a constant ‘structural’ causal effect and ε an error term.

The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Say the average causal effect is 0: although we then get the right answer on average, those who are ‘treated’ (T=1) may have causal effects equal to -100 and those ‘not treated’ (T=0) may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
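A minimal simulation can make this masking concrete. The numbers below are hypothetical, constructed so that the unit-level effects are ±100 while the average causal effect is 0, as in the story above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
T = rng.integers(0, 2, n)  # treatment dummy

# Constructed potential outcomes: treated units have individual causal
# effect -100, untreated units +100, so the average causal effect is 0.
Y0 = np.where(T == 1, 100.0, 0.0) + rng.normal(0, 1, n)
Y1 = np.where(T == 1, Y0 - 100.0, Y0 + 100.0)
O = np.where(T == 1, Y1, Y0)  # we only ever observe one of the two

# With a single dummy regressor, the OLS slope equals the difference
# in group means.
beta_hat = O[T == 1].mean() - O[T == 0].mean()
print(round(beta_hat, 1))        # ~0.0: the 'average causal effect'
print((Y1 - Y0)[T == 1].mean())  # -100.0 among the treated
print((Y1 - Y0)[T == 0].mean())  # +100.0 among the untreated
```

The regression answer (roughly 0) is ‘right’ on average, yet it says nothing about the dramatic unit-level heterogeneity, which is exactly the point made above.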

The heterogeneity problem does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is also often an internal problem to the millions of OLS estimates that economists produce every year.

## Why data alone does not answer counterfactual questions

8 Jun, 2021 at 22:40 | Posted in Statistics & Econometrics | Leave a comment

## What are the key assumptions of linear regression models?

7 Jun, 2021 at 22:08 | Posted in Statistics & Econometrics | 3 Comments

In Andrew Gelman’s and Jennifer Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models the authors list the assumptions of the linear regression model. The assumptions — *in decreasing order of importance* — are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .

3. Independence of errors. . . .

4. Equal variance of errors. . . .

5. Normality of errors. . . .

Further assumptions are necessary if a regression coefficient is to be given a causal interpretation …

Yours truly can’t but concur — especially on the “decreasing order of importance” of the assumptions. But then, of course, one really has to wonder why econometrics textbooks almost invariably turn this order of importance upside-down and don’t have more thorough discussions of the overriding importance of Gelman/Hill’s first two points …
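A small simulation sketch (hypothetical setup, one regressor) of why the ordering matters: violating assumption 5 with heavy-tailed, non-normal errors leaves the OLS slope essentially intact, while violating assumption 2 with a nonlinear true relation yields a ‘slope’ of a relationship that has no constant slope at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.uniform(0, 4, n)

def ols_slope(x, y):
    # OLS slope of y on x (with intercept): Cov(x, y) / Var(x)
    return np.cov(x, y, bias=True)[0, 1] / x.var()

# Assumption 5 violated: heavy-tailed t(3) errors; the true slope is 2.
y_heavy = 2 * x + rng.standard_t(df=3, size=n)

# Assumption 2 violated: the true relation is quadratic, not linear.
y_quad = x**2 + rng.normal(0, 1, n)

print(ols_slope(x, y_heavy))  # still close to 2: little harm done
print(ols_slope(x, y_quad))   # ~4: a 'slope' of a relation with no constant slope
```

Non-normal errors mainly affect small-sample inference, not the estimate itself; a wrong functional form corrupts what the coefficient even means.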

## Does smoking — really — help you fight COVID-19?

6 Jun, 2021 at 11:36 | Posted in Statistics & Econometrics | Leave a comment

## Counterfactual modelling (student stuff)

5 Jun, 2021 at 10:51 | Posted in Statistics & Econometrics | Leave a comment

## Collider bias (student stuff)

3 Jun, 2021 at 18:17 | Posted in Statistics & Econometrics | Leave a comment

## How statistics can be misleading

2 Jun, 2021 at 18:12 | Posted in Statistics & Econometrics | Leave a comment

From a theoretical perspective, Simpson’s paradox importantly shows that causality can never be reduced to a question of statistics or probabilities.

To understand causality we always have to relate it to a specific causal *structure*. Statistical correlations are *never* enough. No structure, no causality.

Simpson’s paradox is an interesting paradox in itself, but it can also highlight a deficiency in the traditional econometric approach towards causality. Say you have 1000 observations on men and an equal number of observations on women applying for admission to university studies, and that 70% of the men are admitted, but only 30% of the women. Running a logistic regression to find out the odds ratios (and probabilities) for men and women on admission, females seem to be in a less favourable position (‘discriminated’ against) compared to males (male odds are 2.33, female odds are 0.43, giving an odds ratio of 5.44). But once we find out that males and females apply to different departments we may well get a Simpson’s paradox result where males turn out to be ‘discriminated’ against: say 800 males apply for economics studies (680 admitted) and 200 for physics studies (20 admitted), while 100 females apply for economics studies (90 admitted) and 900 for physics studies (210 admitted) — giving odds ratios of 0.63 and 0.37.
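The odds ratios quoted in the example can be verified directly from the admission counts given above:

```python
# Admission counts taken from the example in the text.
def odds(admitted, applied):
    # odds = admitted / rejected
    return admitted / (applied - admitted)

# Aggregate: 700 of 1000 men admitted, 300 of 1000 women.
or_total = odds(700, 1000) / odds(300, 1000)

# Department level: economics and physics.
or_econ = odds(680, 800) / odds(90, 100)
or_phys = odds(20, 200) / odds(210, 900)

print(round(or_total, 2))  # 5.44: in aggregate, women look disfavoured
print(round(or_econ, 2))   # 0.63: within economics, men look disfavoured
print(round(or_phys, 2))   # 0.37: within physics, men look disfavoured
```

The sign of the apparent ‘discrimination’ flips once the department structure is conditioned on, which is why no amount of marginal association settles the causal question.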

Econometric patterns should never be seen as anything else than possible clues to follow. From a critical realist perspective, it is obvious that behind observable data there are real structures and mechanisms operating, things that are — if we really want to understand, explain and (possibly) predict things in the real world — more important to get hold of than to simply correlate and regress observable variables.

Math cannot establish the truth value of a fact. Never has. Never will.

## Contaminated data — the case of racial discrimination

31 May, 2021 at 18:26 | Posted in Statistics & Econometrics | Leave a comment

## Exchangeability (student stuff)

30 May, 2021 at 12:48 | Posted in Statistics & Econometrics | 1 Comment

## Econometrics — science built on untestable assumptions

27 May, 2021 at 22:31 | Posted in Statistics & Econometrics | 1 Comment

Just what is the causal content attributed to structural models in econometrics? And what does this imply with respect to the interpretation of the error term? …

Consider briefly the testability of the assumptions brought to light in this section. Given that these assumptions directly involve the factors omitted in the error term, testing them empirically seems impossible without information about what is hidden in the error term. But since the error term is unobservable, this places the modeller in a difficult situation: how is one to know that some important factor has not been left out of the model, undermining the desired inferences in some way? It also shows that there will always be an element of faith in the assumptions about the error term.

In econometrics textbooks it is often said that the error term in the regression models used represents the effect of the variables that were omitted from the model. The error term is somehow thought to be a ‘cover-all’ term representing omitted content in the model, necessary to include in order to ‘save’ the assumed deterministic relation between the other random variables in the model. Error terms are usually assumed to be orthogonal (uncorrelated) to the explanatory variables. But since they are unobservable, they are also impossible to test empirically. And without a justification of the orthogonality assumption, there is as a rule nothing to ensure identifiability.
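As a sketch of the point (hypothetical data-generating process, not from any real study): if an omitted variable z drives both the regressor and the outcome, it sits in the error term, the orthogonality assumption fails, and OLS is biased; nothing in the observed data flags this.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# z is omitted from the fitted model but affects both x and y, so it is
# 'hidden in the error term' and correlated with the regressor.
z = rng.normal(0, 1, n)
x = 0.8 * z + rng.normal(0, 1, n)
y = 1.0 * x + 2.0 * z + rng.normal(0, 1, n)  # true coefficient on x is 1

# OLS of y on x alone; the effective error (2z + noise) violates
# orthogonality, and the slope absorbs part of z's effect.
beta_hat = np.cov(x, y, bias=True)[0, 1] / x.var()
print(round(beta_hat, 2))  # close to 2: nearly double the true value of 1
```

The data on x and y alone look perfectly well-behaved; only knowledge of the data-generating process reveals that the estimate is badly biased.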

Distributional assumptions about error terms are a good place to bury things because hardly anyone pays attention to them. Moreover, if a critic does see that this is the identifying assumption, how can she win an argument about the true expected value of the level of aether? If the author can make up an imaginary variable, “because I say so” seems like a pretty convincing answer to any question about its properties.

## Why econometric models by necessity are endlessly misspecified

27 May, 2021 at 14:39 | Posted in Statistics & Econometrics | 1 Comment

The impossibility of proper specification is true generally in regression analyses across the social sciences, whether we are looking at the factors affecting occupational status, voting behavior, etc. The problem is that, as implied by the three conditions for regression analyses to yield accurate, unbiased estimates, you need to investigate a phenomenon that has underlying mathematical regularities – and, moreover, you need to know what they are. Neither seems true. I have no reason to believe that the way in which multiple factors affect earnings, student achievement, and GNP has some underlying mathematical regularity across individuals or countries. More likely, each individual or country has a different function, and one that changes over time. Even if there were some constancy, the processes are so complex that we have no idea of what the function looks like.

Researchers recognize that they do not know the true function and seem to treat, usually implicitly, their results as a good-enough approximation. But there is no basis for the belief that the results of what is run in practice are anything close to the underlying phenomenon, even if there is an underlying phenomenon. This just seems to be wishful thinking. Most regression analysis research doesn’t even pay lip service to theoretical regularities. But you can’t just regress anything you want and expect the results to approximate reality. And even when researchers take somewhat seriously the need to have an underlying theoretical framework – as they have, at least to some extent, in the examples of studies of earnings, educational achievement, and GNP that I have used to illustrate my argument – they are so far from the conditions necessary for proper specification that one can have no confidence in the validity of the results.

Most work in econometrics and regression analysis is done on the assumption that the researcher has a theoretical model that is ‘true.’ Based on this belief that one has a correct specification for an econometric model or regression, one proceeds as if the only problems remaining to solve have to do with measurement and observation.

The problem is that there is precious little to support the perfect-specification assumption. Looking around in social science and economics, we don’t find a single regression or econometric model that lives up to the standards set by the ‘true’ theoretical model — and there is nothing that gives us reason to believe things will be different in the future.

To think that we are able to construct a model where all relevant variables are included and correctly specify the functional relationships that exist between them is not only a belief with little support, but a belief *impossible* to support.

The theories we work with when building our econometric regression models are insufficient. No matter what we study, there are always some variables missing, and we don’t know the correct way to functionally specify the relationships between the variables.

*Every* regression model constructed is misspecified. There is always an endless list of possible variables to include and endless possible ways to specify the relationships between them. So every applied econometrician comes up with his own specification and ‘parameter’ estimates. The econometric Holy Grail of consistent and stable parameter values is nothing but a dream.

The theoretical conditions that have to be fulfilled for regression analysis and econometrics really to work are nowhere even closely met in reality. Making outlandish statistical assumptions does not provide a solid ground for doing relevant social science and economics. Although regression analysis and econometrics have become the most used quantitative methods in social sciences and economics today, it is still a fact that the inferences made from them are of highly questionable validity.

The econometric art as it is practiced at the computer … involves fitting many, perhaps thousands, of statistical models … There can be no doubt that such a specification search invalidates the traditional theories of inference … All the concepts of traditional theory utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose.

## Do any benefits of alcohol outweigh the risks?

19 May, 2021 at 23:02 | Posted in Statistics & Econometrics | 3 Comments

Identifying the data generating process sure is important if we want to be able to understand data. Finding a correlation between drinking alcohol and mortality does not explain anything, and certainly does not mean that you have been able to identify a causal relation between variables. Regressing on covariates is not enough. There are tons of alternative explanations for the (alleged) causal relationship, and as long as you haven’t been able to convincingly block them all, you haven’t really succeeded with your identification strategy.
