## Proper use of regression analysis

Level I regression analysis does not require any assumptions about how the data were generated. If one wants more from the data analysis, assumptions are required. For a Level II regression analysis, the added feature is statistical inference: estimation, hypothesis tests and confidence intervals. When the data are produced by probability sampling from a well-defined population, estimation, hypothesis tests and confidence intervals are on the table.

A random sample of inmates from the set of all inmates in a state’s prison system might be properly used to estimate, for example, the number of gang members in the state’s overall prison system. Hypothesis tests and confidence intervals might also be usefully employed. In addition, one might estimate, for instance, the distribution of in-prison misconduct committed by men compared to the in-prison misconduct committed by women, holding age fixed. Hypothesis tests or confidence intervals could again follow naturally. The key assumption is that each inmate in the population has a known probability of selection. If the probability sampling is implemented largely as designed, statistical inference can rest on reasonably sound footing. Note that there is no talk of causal effects and no causal model. Description is combined with statistical inference.
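As a rough illustration of the Level II setting in the gang-member example, the sketch below estimates a population total and a confidence interval from a sample. It assumes simple random sampling without replacement, which is only one of the designs that give every inmate a known probability of selection. The function name `estimate_total`, the population size, and the sample values are all hypothetical, and the interval relies on a normal approximation.

```python
import math

def estimate_total(sample, population_size, z=1.96):
    """Estimate a population total from a simple random sample of 0/1 indicators.

    Assumes sampling without replacement with equal selection probabilities.
    Returns the estimated total, its standard error, and an approximate
    confidence interval based on the normal approximation.
    """
    n = len(sample)
    p_hat = sum(sample) / n                      # sample proportion
    total = population_size * p_hat              # estimated population total
    fpc = 1 - n / population_size                # finite population correction
    var_p = fpc * p_hat * (1 - p_hat) / (n - 1)  # estimated variance of p_hat
    se = population_size * math.sqrt(var_p)      # standard error of the total
    return total, se, (total - z * se, total + z * se)

# Hypothetical data: 500 sampled inmates, 100 of them identified as gang members,
# drawn from a made-up prison population of 10,000.
sample = [1] * 100 + [0] * 400
total, se, (low, high) = estimate_total(sample, population_size=10_000)
print(f"estimated total: {total:.0f}, approx. 95% CI: ({low:.0f}, {high:.0f})")
```

The interval is only as credible as the sampling design: if selection probabilities are unknown, the same arithmetic can still be run, but its inferential meaning is unclear.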

In the absence of probability sampling, the case for Level II regression analysis is far more difficult to make …

The goal in a Level III regression analysis is to supplement Level I description and Level II statistical inference with causal inference. In conventional regression, for instance, one needs a nearly right model [a model is the ‘right’ model when it accurately represents how the data on hand were generated], but one must also be able to argue credibly that manipulation of one or more regressors alters the expected conditional distribution of the response. Moreover, any given causal variable must be manipulable independently of any other causal variable and independently of the disturbances. There is nothing in the data themselves that can speak to these requirements. The case will rest on how the data were actually produced. For example, if there was a real intervention, a good argument for manipulability might well be made. Thus, an explicit change in police patrolling practices ordered by the local Chief will perhaps pass the manipulability sniff test. Changes in the demographic mix of a neighborhood will probably not …

Level III regression analysis adds causal inference to description and statistical inference. One requires not just a nearly right model of how the data were generated, but good information justifying any claims that all causal variables are independently manipulable. In the absence of a nearly right model and one or more regressors whose values can be “set” independently of other regressors and the disturbances, causal inferences cannot make much sense.
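The requirement that a regressor be “set” independently of the disturbances can be illustrated with a small simulation. This is a sketch with invented numbers, not an analysis of real data: the true coefficient of 1.0, the way the observed regressor is tied to the disturbance, and the helper name `ols_slope` are all assumptions made for the example.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x (single regressor, with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 20_000
true_effect = 1.0        # assumed causal effect of x on y

# Disturbances, which are never observed in a real study.
u = [rng.gauss(0, 1) for _ in range(n)]

# Case 1: x is not independently manipulable; it moves with the disturbance.
x_observed = [ui + rng.gauss(0, 1) for ui in u]
y_observed = [true_effect * xi + ui for xi, ui in zip(x_observed, u)]

# Case 2: x is "set" by an intervention, independently of the disturbance.
x_set = [rng.gauss(0, 1) for _ in range(n)]
y_set = [true_effect * xi + ui for xi, ui in zip(x_set, u)]

slope_observed = ols_slope(x_observed, y_observed)
slope_set = ols_slope(x_set, y_set)
print(f"slope when x tracks the disturbance: {slope_observed:.2f}")
print(f"slope when x is set independently:   {slope_set:.2f}")
```

Under these assumptions, the first slope settles near 1.5 rather than the true 1.0, while the second settles near 1.0. Both regressions fit their data equally routinely, so nothing in the fitted output signals which case one is in; that information has to come from knowledge of how the data were produced.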

The implications for practice in criminology are clear but somewhat daunting. With rare exceptions, regression analyses of observational data are best undertaken at Level I. With proper sampling, a Level II analysis can be helpful. The goal is to characterize associations in the data, perhaps taking uncertainty into account … Reviewers and journal editors typically equate proper statistical practice with Level III.

Richard Berk