Causal inference in social sciences (student stuff)

8 Jul, 2021 at 11:32 | Posted in Statistics & Econometrics | Comments Off on Causal inference in social sciences (student stuff)

The main ideas behind bootstrapping (student stuff)

6 Jul, 2021 at 10:50 | Posted in Statistics & Econometrics | Comments Off on The main ideas behind bootstrapping (student stuff)

Propensity score matching vs. regression (student stuff)

5 Jul, 2021 at 11:37 | Posted in Statistics & Econometrics | Comments Off on Propensity score matching vs. regression (student stuff)

Questionable research practices

2 Jul, 2021 at 17:14 | Posted in Statistics & Econometrics | Comments Off on Questionable research practices

Bradford Hill: how to find causality in correlations

30 Jun, 2021 at 09:53 | Posted in Statistics & Econometrics | Comments Off on Bradford Hill: how to find causality in correlations

How to achieve ‘external validity’

29 Jun, 2021 at 11:54 | Posted in Statistics & Econometrics | 2 Comments

There is a lot of discussion in the literature on beginning with experiments and then going on to check “external validity”. But to imagine that there is a scientific way to achieve external validity is, for the most part, a delusion … RCTs do not in themselves tell us anything about the traits of populations in other places and at other times. Hence, no matter how large the population from which we draw our random samples is, because it is impossible to draw samples from tomorrow’s population and all policies we craft today are for use tomorrow, there is no “scientific” way to go from RCTs to policy. En route from evidence and experience to policy, we have to rely on intuition, common sense and judgement. It is evidence coupled with intuition and judgement that gives us knowledge. To deny any role to intuition is to fall into total nihilism.

Kaushik Basu

Randomizations creating illusions of knowledge

28 Jun, 2021 at 21:46 | Posted in Statistics & Econometrics | 1 Comment

The advantage of randomised experiments in describing populations creates an illusion of knowledge … This happens because of the propensity of scientific journals to value so-called causal findings and not to value findings where no (so-called) causality is found. In brief, it is arguable that we know less than we think we do.

To see this, suppose, as is indeed the case in reality, that thousands of researchers in thousands of places are conducting experiments to reveal some causal link. Let us in particular suppose that there are numerous researchers in numerous villages carrying out randomised experiments to see whether M causes P. Words being more transparent than symbols, let us assume they want to see whether medicine (M) improves the school participation (P) of school-going children. In each village, 10 randomly selected children are administered M and the school participation rates of those children and also children who were not given M are monitored. Suppose children without M go to school half the time and are out of school the other half. The question is: is there a systematic difference of behaviour among children given M?

I shall now deliberately construct an underlying model whereby there will be no causal link between M and P. Suppose Nature does the following. For each child, whether or not the child has had M, Nature tosses a coin. If it comes out tails the child does not go to school and if it comes out heads, the child goes to school regularly.

Consider a village and an RCT researcher in the village. What is the probability, p, that she will find that all 10 children given M will go to school regularly? The answer is clearly

p = (1/2)^10

because we have to get heads for each of the 10 tosses for the 10 children.

Now consider n researchers in n villages. What is the probability that in none of these villages will a researcher find that all the 10 children given M go to school regularly? Clearly, the answer is (1–p)^n.

Hence, if w(n) is used to denote the probability that among the n villages where the experiment is done, there is at least one village where all 10 tosses come out heads, we have:

w(n) = 1 – (1-p)^n.

It is easy to check the following are true:

w(100) = 0.0931,
w(1000) = 0.6236,
w(10 000) = 0.9999.

Therein lies the catch … If there are 1000 experimenters in 1000 villages doing this, the probability that there will exist one village where it will be found that all 10 children administered M will participate regularly in school is 0.6236. That is, it is more likely that such a village will exist than not. If the experiment is done in 10 000 villages, the probability of there being one village where M always leads to P is a virtual certainty (0.9999).

This is, of course, a specific example. But that this problem will invariably arise follows from the fact that

lim(n → ∞) w(n) = lim(n → ∞) [1 – (1–p)^n] = 1.

Given that those who find such a compelling link between M and P will be able to publish their paper and others will not, we will get the impression that a true causal link has been found, though in this case (since we know the underlying process) we know that that is not the case. With 10 000 experiments, it is close to certainty that someone will find a firm link between M and P. Hence, the finding of such a link shows nothing but the laws of probability being intact. Yet, thanks to the propensity of journals to publish the presence rather than the absence of “causal” links, we get an illusion of knowledge and discovery where there are none.

Kaushik Basu
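The arithmetic in Basu's example is easy to check. What follows is a minimal sketch, not part of the original argument: the function names `w` and `simulate`, the number of simulated worlds and the random seed are all assumptions made for illustration. It evaluates w(n) = 1 – (1–p)^n directly and then compares the closed form with a small coin-tossing simulation in which M has no effect at all.

```python
import random


def w(n, k=10):
    """Probability that at least one of n villages sees all k treated children attend."""
    p = 0.5 ** k                      # chance of k heads in a row in one village
    return 1.0 - (1.0 - p) ** n


for n in (100, 1000, 10_000):
    print(n, round(w(n), 4))          # 0.0931, 0.6236, 0.9999


def simulate(n, k=10, trials=200, seed=0):
    """Share of simulated worlds (hypothetical setup) with at least one 'perfect' village."""
    rng = random.Random(seed)
    all_heads = (1 << k) - 1          # k random bits that are all 1 = k heads
    hits = 0
    for _ in range(trials):
        if any(rng.getrandbits(k) == all_heads for _ in range(n)):
            hits += 1
    return hits / trials


share = simulate(1000)
print("simulated share for n = 1000:", share)
```

With only 200 simulated worlds the simulated share is itself noisy, so it should land near 0.62 rather than on it.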

Why the idea of causation cannot be a purely statistical one

23 Jun, 2021 at 15:32 | Posted in Statistics & Econometrics | 6 Comments

If contributions made by statisticians to the understanding of causation are to be taken over with advantage in any specific field of inquiry, then what is crucial is that the right relationship should exist between statistical and subject-matter concerns …

Where the ultimate aim of research is not prediction per se but rather causal explanation, an idea of causation that is expressed in terms of predictive power (as, for example, ‘Granger’ causation) is likely to be found wanting. Causal explanations cannot be arrived at through statistical methodology alone: a subject-matter input is also required in the form of background knowledge and, crucially, theory …

Likewise, the idea of causation as consequential manipulation is apt to research that can be undertaken primarily through experimental methods and, especially to ‘practical science’ where the central concern is indeed with ‘the consequences of performing particular acts’. The development of this idea in the context of medical and agricultural research is as understandable as the development of that of causation as robust dependence within applied econometrics. However, the extension of the manipulative approach into sociology would not appear promising, other than in rather special circumstances … The more fundamental difficulty is that, under the — highly anthropocentric — principle of ‘no causation without manipulation’, the recognition that can be given to the action of individuals as having causal force is in fact peculiarly limited.

John H. Goldthorpe

Causality in social sciences, and in economics, can never solely be a question of statistical inference. Statistics and data often serve to suggest causal accounts, but causality entails more than predictability, and explaining social phenomena in real depth requires theory. Analysis of variation, the foundation of all econometrics, can never in itself reveal how these variations are brought about. Only when we are able to tie actions, processes or structures to the statistical relations detected can we say that we are getting at relevant explanations of causation.

Most facts have many different, possible, alternative explanations, but we want to find the best of all contrastive (since all real explanation takes place relative to a set of alternatives) explanations. So which is the best explanation? Many scientists, influenced by statistical reasoning, think that the likeliest explanation is the best explanation. But the likelihood of x is not in itself a strong argument for thinking it explains y. I would rather argue that what makes one explanation better than another are things like aiming for and finding powerful, deep, causal features and mechanisms that we have warranted and justified reasons to believe in. Statistical reasoning, especially the variety based on a Bayesian epistemology, generally has no room for these kinds of explanatory considerations. The only thing that matters is the probabilistic relation between evidence and hypothesis. That is also one of the main reasons I find abduction (inference to the best explanation) a better description and account of what constitutes actual scientific reasoning and inference.

In the social sciences … regression is used to discover relationships or to disentangle cause and effect. However, investigators have only vague ideas as to the relevant variables and their causal order; functional forms are chosen on the basis of convenience or familiarity; serious problems of measurement are often encountered.

Regression may offer useful ways of summarizing the data and making predictions. Investigators may be able to use summaries and predictions to draw substantive conclusions. However, I see no cases in which regression equations, let alone the more complex methods, have succeeded as engines for discovering causal relationships.

David Freedman

Some statisticians and data scientists think that algorithmic formalisms somehow give them access to causality. That is, however, simply not true. Assuming ‘convenient’ things like faithfulness or stability is not to give proofs. It’s to assume what has to be proven. Deductive-axiomatic methods used in statistics do not produce evidence for causal inferences. The real causality we are searching for is the one existing in the real world around us. If there is no warranted connection between axiomatically derived theorems and the real world, well, then we haven’t really obtained the causation we are looking for.

Structural equation modelling (student stuff)

15 Jun, 2021 at 19:00 | Posted in Statistics & Econometrics | Comments Off on Structural equation modelling (student stuff)

This is a good introduction to some of the basic thoughts behind the use of SEMs. But on the controversial question of whether SEMs really can be considered causal, yours truly highly recommends reading Kenneth Bollen’s and Judea Pearl’s Eight myths about causality and structural equation models.

Table 2 Fallacy (student stuff)

14 Jun, 2021 at 19:02 | Posted in Statistics & Econometrics | 2 Comments

Discrimination and the use of ‘statistical controls’

14 Jun, 2021 at 12:27 | Posted in Statistics & Econometrics | Comments Off on Discrimination and the use of ‘statistical controls’

The gender pay gap is a fact that, sad to say, to a non-negligible extent is the result of discrimination. And even though many women are not deliberately discriminated against, but rather self-select into lower-wage jobs, this in no way magically explains away the discrimination gap. As decades of socialization research has shown, women may be ‘structural’ victims of impersonal social mechanisms that in different ways aggrieve them. Wage discrimination is unacceptable. Wage discrimination is a shame.

You see it all the time in studies. “We controlled for…” And then the list starts … The more things you can control for, the stronger your study is — or, at least, the stronger your study seems. Controls give the feeling of specificity, of precision. But sometimes, you can control for too much. Sometimes you end up controlling for the thing you’re trying to measure …

An example is research around the gender wage gap, which tries to control for so many things that it ends up controlling for the thing it’s trying to measure …

Take hours worked, which is a standard control in some of the more sophisticated wage gap studies. Women tend to work fewer hours than men. If you control for hours worked, then some of the gender wage gap vanishes. As Yglesias wrote, it’s “silly to act like this is just some crazy coincidence. Women work shorter hours because as a society we hold women to a higher standard of housekeeping, and because they tend to be assigned the bulk of childcare responsibilities.”

Controlling for hours worked, in other words, is at least partly controlling for how gender works in our society. It’s controlling for the thing that you’re trying to isolate.

Ezra Klein
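Klein's point about hours worked can be made concrete with a toy simulation. This is a sketch under loudly stated assumptions: every number in it (the five-hour difference, the pay per hour, the direct penalty of 2, the noise levels) is invented for illustration and is not an estimate of any real wage gap. In this made-up world, gender lowers pay both directly and through hours worked, and "controlling" for hours strips out the second channel.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)


def residual(y, x):
    """What is left of y after regressing it on x."""
    b, my, mx = slope(y, x), mean(y), mean(x)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


def partial_slope(y, x, control):
    """Coefficient on x when y is regressed on x and control (Frisch-Waugh-Lovell)."""
    return slope(residual(y, control), residual(x, control))


rng = random.Random(1)
n = 20_000
woman = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
# Hypothetical mechanism: women work 5 fewer paid hours on average ...
hours = [40 - 5 * g + rng.gauss(0, 3) for g in woman]
# ... and also face a direct pay penalty of 2 at equal hours.
wage = [0.5 * h - 2 * g + rng.gauss(0, 2) for g, h in zip(woman, hours)]

raw_gap = slope(wage, woman)                       # total gap, about -4.5
gap_given_hours = partial_slope(wage, woman, hours)  # about -2.0
print(round(raw_gap, 2), round(gap_given_hours, 2))
```

The total gap in this toy world is the direct penalty plus the part running through hours (-2 + 0.5 × -5 = -4.5). The regression with hours as a control reports only the direct part, which is exactly the "controlling for the thing you're trying to measure" problem.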

Trying to reduce the risk of having established only ‘spurious relations’ when dealing with observational data, statisticians and econometricians standardly add control variables. The hope is that one thereby will be able to make more reliable causal inferences. But — as Keynes showed already back in the 1930s when criticizing statistical-econometric applications of regression analysis — if you do not manage to get hold of all potential confounding factors, the model risks producing estimates of the variable of interest that are even worse than models without any control variables at all. Conclusion: think twice before you simply include ‘control variables’ in your models!

When I present this argument … one or more scholars say, “But shouldn’t I control for everything I can in my regressions? If not, aren’t my coefficients biased due to excluded variables?” … The excluded variable argument only works if you are sure your specification is precisely correct with all variables included. But no one can know that with more than a handful of explanatory variables …

A preferable approach is to separate the observations into meaningful subsets—internally compatible statistical regimes … If this can’t be done, then statistical analysis can’t be done. A researcher claiming that nothing else but the big, messy regression is possible because, after all, some results have to be produced, is like a jury that says, “Well, the evidence was weak, but somebody had to be convicted.”

Christopher H. Achen

Kitchen sink econometric models are often the result of researchers trying to control for confounding. But what they usually haven’t understood is that the confounder problem requires a causal solution and not statistical ‘control.’ Controlling for everything opens up the risk that we control for ‘collider’ variables and thereby create ‘back-door paths’ that give us confounding that wasn’t there to begin with.
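The collider problem is easy to reproduce. The sketch below is illustrative only: the variable names, sample size and seed are assumptions, and the two-regressor coefficient is computed by hand via the Frisch-Waugh-Lovell residual trick rather than with a statistics library. Here x and y are generated independently, while a third variable is caused by both. Regressing y on x alone finds nothing; adding the collider as a 'control' manufactures an association.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)


def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)


def residual(y, x):
    """What is left of y after regressing it on x."""
    b, my, mx = slope(y, x), mean(y), mean(x)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


def partial_slope(y, x, control):
    """Coefficient on x when y is regressed on x and control."""
    return slope(residual(y, control), residual(x, control))


rng = random.Random(42)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]          # independent of x by construction
collider = [a + b + rng.gauss(0, 1) for a, b in zip(x, y)]  # caused by both x and y

naive = slope(y, x)                               # close to 0
with_collider = partial_slope(y, x, collider)     # close to -0.5
print(round(naive, 3), round(with_collider, 3))
```

In this setup the spurious coefficient of about -0.5 comes entirely from conditioning on the common effect; no causal link between x and y was ever put into the data.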

Extreme events and how to live with them

12 Jun, 2021 at 12:00 | Posted in Statistics & Econometrics | Comments Off on Extreme events and how to live with them

On the limits of ‘mediation analysis’ and ‘statistical causality’

11 Jun, 2021 at 18:12 | Posted in Statistics & Econometrics | Comments Off on On the limits of ‘mediation analysis’ and ‘statistical causality’

“Mediation analysis” is this thing where you have a treatment and an outcome and you’re trying to model how the treatment works: how much does it directly affect the outcome, and how much is the effect “mediated” through intermediate variables …

In the real world, it’s my impression that almost all the mediation analyses that people actually fit in the social and medical sciences are misguided: lots of examples where the assumptions aren’t clear and where, in any case, coefficient estimates are hopelessly noisy and where confused people will over-interpret statistical significance …

More and more I’ve been coming to the conclusion that the standard causal inference paradigm is broken … So how to do it? I don’t think traditional path analysis or other multivariate methods of the throw-all-the-data-in-the-blender-and-let-God-sort-em-out variety will do the job. Instead we need some structure and some prior information.

Andrew Gelman

Causality in social sciences, and in economics, can never solely be a question of statistical inference. Causality entails more than predictability, and explaining social phenomena in real depth requires theory. Analysis of variation, the foundation of all econometrics, can never in itself reveal how these variations are brought about. Only when we are able to tie actions, processes or structures to the statistical relations detected can we say that we are getting at relevant explanations of causation.

Most facts have many different, possible, alternative explanations, but we want to find the best of all contrastive (since all real explanation takes place relative to a set of alternatives) explanations. So which is the best explanation? Many scientists, influenced by statistical reasoning, think that the likeliest explanation is the best explanation. But the likelihood of x is not in itself a strong argument for thinking it explains y. I would rather argue that what makes one explanation better than another are things like aiming for and finding powerful, deep, causal features and mechanisms that we have warranted and justified reasons to believe in. Statistical reasoning, especially the variety based on a Bayesian epistemology, generally has no room for these kinds of explanatory considerations. The only thing that matters is the probabilistic relation between evidence and hypothesis. That is also one of the main reasons I find abduction (inference to the best explanation) a better description and account of what constitutes actual scientific reasoning and inference.

In the social sciences … regression is used to discover relationships or to disentangle cause and effect. However, investigators have only vague ideas as to the relevant variables and their causal order; functional forms are chosen on the basis of convenience or familiarity; serious problems of measurement are often encountered.

Regression may offer useful ways of summarizing the data and making predictions. Investigators may be able to use summaries and predictions to draw substantive conclusions. However, I see no cases in which regression equations, let alone the more complex methods, have succeeded as engines for discovering causal relationships.

David Freedman

Some statisticians and data scientists think that algorithmic formalisms somehow give them access to causality. That is, however, simply not true. Assuming ‘convenient’ things like faithfulness or stability is not to give proofs. It’s to assume what has to be proven. Deductive-axiomatic methods used in statistics do not produce evidence for causal inferences. The real causality we are searching for is the one existing in the real world around us. If there is no warranted connection between axiomatically derived theorems and the real world, well, then we haven’t really obtained the causation we are looking for.

If contributions made by statisticians to the understanding of causation are to be taken over with advantage in any specific field of inquiry, then what is crucial is that the right relationship should exist between statistical and subject-matter concerns …

The idea of causation as consequential manipulation is apt to research that can be undertaken primarily through experimental methods and, especially to ‘practical science’ where the central concern is indeed with ‘the consequences of performing particular acts’. The development of this idea in the context of medical and agricultural research is as understandable as the development of that of causation as robust dependence within applied econometrics. However, the extension of the manipulative approach into sociology would not appear promising, other than in rather special circumstances … The more fundamental difficulty is that, under the (highly anthropocentric) principle of ‘no causation without manipulation’, the recognition that can be given to the action of individuals as having causal force is in fact peculiarly limited.

John H. Goldthorpe

Causality and the need to reform the teaching of statistics

10 Jun, 2021 at 16:18 | Posted in Statistics & Econometrics | Comments Off on Causality and the need to reform the teaching of statistics

I will argue that realistic and thus scientifically relevant statistical theory is best viewed as a subdomain of causality theory, not a separate entity or an extension of probability. In particular, the application of statistics (and indeed most technology) must deal with causation if it is to represent adequately the underlying reality of how we came to observe what was seen … The network we deploy for analysis incorporates whatever time-order and independence assumptions we use for interpreting observed associations, whether those assumptions are derived from background (contextual) or design information … Statistics should integrate causal networks into its basic teachings and indeed into its entire theory, starting with the probability and bias models that are used to build up statistical methods and interpret their outputs. Every real data analysis has a causal component comprising the causal network assumed to have created the data set …

Thus, because statistical analyses need a causal skeleton to connect to the world, causality is not extra-statistical but instead is a logical antecedent of real-world inferences. Claims of random or “ignorable” or “unbiased” sampling or allocation are justified by causal actions to block (“control”) unwanted causal effects on the sample patterns. Without such actions of causal blocking, independence can only be treated as a subjective exchangeability assumption whose justification requires detailed contextual information about absence of factors capable of causally influencing both selection (including selection for treatment) and outcomes …

Given the absence of elaborated causality discussions in statistics textbooks and coursework, we should not be surprised at the widespread misuse and misinterpretation of statistical methods and results. This is why incorporation of causality into introductory statistics is needed as urgently as other far more modest yet equally resisted reforms involving shifts in labels and interpretations for p-values and interval estimates.

Sander Greenland

The elite illusion

9 Jun, 2021 at 13:13 | Posted in Statistics & Econometrics | Comments Off on The elite illusion

A great set of lectures, but yours truly still warns his students that regression-based averages are something we have reason to be cautious about.

Suppose we want to estimate the average causal effect of a dummy variable (T) on an observed outcome variable (O). In a usual regression context one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:

O = α + βT + ε,

where α is a constant intercept, β a constant ‘structural’ causal effect and ε an error term.

The problem here is that although we may get an estimate of the ‘true’ average causal effect, this may ‘mask’ important heterogeneous effects of a causal nature. Although we get the right answer of the average causal effect being 0, those who are ‘treated’ (T=1) may have causal effects equal to -100 and those ‘not treated’ (T=0) may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.

The heterogeneity problem does not just turn up as an external validity problem when trying to ‘export’ regression results to different times or different target populations. It is also often an internal problem to the millions of OLS estimates that economists produce every year.
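A small constructed example shows how an average effect of zero can hide large individual effects. This is only a sketch, and it makes assumptions the text does not: the population is split into two equally sized hypothetical types with effects of +100 and -100, treatment is assigned in a balanced way that is unrelated to type, the baseline outcome of 50 is arbitrary, and there is no noise, so the arithmetic is exact. With a single dummy regressor, the OLS slope equals the difference in group means, which is what `diff_in_means` computes.

```python
# Hypothetical population: two equally sized types with opposite causal effects.
units = []
for i in range(1000):
    effect = 100 if i % 2 == 0 else -100   # individual causal effect of T
    treated = (i // 2) % 2                 # balanced assignment, unrelated to type
    outcome = 50 + effect * treated        # baseline of 50 is an arbitrary assumption
    units.append((effect, treated, outcome))


def diff_in_means(rows):
    """Mean outcome of treated minus mean outcome of untreated (the OLS slope on T)."""
    t = [o for _, d, o in rows if d == 1]
    c = [o for _, d, o in rows if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)


beta_hat = diff_in_means(units)
by_type = {e: diff_in_means([u for u in units if u[0] == e]) for e in (100, -100)}

print("pooled estimate:", beta_hat)    # 0.0
print("estimate by type:", by_type)    # {100: 100.0, -100: -100.0}
```

The pooled estimate is a correct average and still tells an individual nothing about which of the two effects applies to them. In real data the types are usually unobserved, so the split by type done here is not available to the analyst.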
