Regression is a broad class of statistical models that is the foundation of data analysis and inference in the social sciences. Moreover, many contemporary statistical methods derive from the linear regression model. At its heart, regression describes systematic relationships between one or more predictor variables with (typically) one outcome. The flexibility of regression and its many extensions make it the primary statistical tool that social scientists use to model their substantive hypotheses with empirical data.
The original application of regression was Sir Francis Galton’s study of the heights of parents and children in the late 1800s. Galton noted that tall parents tended to have somewhat shorter children, and vice versa. He described the relationship between parents’ and children’s heights using a type of regression line and termed the phenomenon regression to mediocrity. Thus, the term regression described a specific finding (i.e., relationship between parents’ and children’s heights) but quickly became attached to the statistical method.
The foundation of regression is the regression equation; for Galton’s study of height, the equation might be: Childi = β0 + β1(Parenti) + εi. Each family provides values for child’s height (i.e., Childi) and parent’s height (i.e., Parenti). The simple regression equation above is identical to the mathematical equation for a straight line, often expressed as y = mx + b. The two regression coefficients (i.e., β0 and β1) represent the y-intercept and slope. The y-intercept estimates the average value of children’s height when parent’s height equals 0, and the slope coefficient estimates the increase in average children’s height for a 1-inch increase in parent’s height, assuming height is measured in inches. The intercept and slope define the regression line, which describes a linear relationship between children’s and parents’ heights. Most data points (i.e., child and parent height pairs) will not lie directly on the regression line; the scatter of the data points around the line is captured by the residual error term εi, which is the vertical displacement of each data point from the regression line.
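The simple regression equation can be sketched numerically. The following is a minimal illustration with simulated (hypothetical) parent and child heights, not Galton’s actual data; the true coefficient values are chosen arbitrarily for the simulation:

```python
# A minimal sketch of the simple regression equation
# Child_i = b0 + b1 * Parent_i + e_i, using simulated (hypothetical) heights.
import numpy as np

rng = np.random.default_rng(0)
parent = rng.normal(68.0, 2.0, size=200)                       # parents' heights (inches)
child = 24.0 + 0.65 * parent + rng.normal(0, 1.5, size=200)    # true b0 = 24, b1 = 0.65

# Fit the line by ordinary least squares (numpy's polyfit, degree 1).
b1, b0 = np.polyfit(parent, child, deg=1)

# Residuals: the vertical displacement of each point from the fitted line.
residuals = child - (b0 + b1 * parent)
```

With 200 simulated families, the estimated slope lands close to the value used to generate the data, and the residuals scatter around zero.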
The regression line describes the conditional mean of the outcome at specific values of the predictor. As such, it is a summary of the relationship between the two variables, which leads directly to a definition of regression: “[to understand] as far as possible with the available data how the conditional distribution of the response … varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg 1999, p. 27). This definition makes no reference to estimation (i.e., how are the regression coefficients determined?) or statistical inference (i.e., how well do the sample coefficients reflect the population from which they were selected?). Historically, regression has used least-squares estimation (i.e., coefficient values are found that minimize the squared errors εi) and frequentist inference (i.e., the variability of sample regression coefficients is examined within theoretical sampling distributions and summarized by p-values or confidence intervals). Although least-squares regression estimates and p-values based on frequentist inference are the most common default settings within statistical packages, they are not the only methods of estimation and inference available, nor are they inherently aspects of regression.
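The least-squares estimates mentioned above have a simple closed form in the bivariate case: the slope is the ratio of the covariance of x and y to the variance of x, and the line passes through the point of means. A sketch with made-up numbers:

```python
# Least-squares estimation in closed form: slope = cov(x, y) / var(x),
# intercept chosen so the line passes through (mean(x), mean(y)).
# Toy data, hypothetical heights in inches.
import numpy as np

x = np.array([60.0, 64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([63.0, 65.0, 66.0, 67.0, 68.0, 70.0])

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# These values minimize the sum of squared residuals; any other line
# yields a larger sum of squares.
sse = np.sum((y - (intercept + slope * x)) ** 2)
```

The closed-form estimates match what a library routine such as `np.polyfit` returns, since both solve the same minimization problem.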
If regression only summarized associations between two continuous variables, it would be a very limited tool for social scientists. However, regression has been extended in numerous ways. An initial and important expansion of the model allowed for multiple predictors and multiple types of predictors, including continuous, binary, and categorical. With the inclusion of categorical predictors, statisticians noted that analysis of variance models with a single error term and similar models are special cases of regression, and the two methods (i.e., regression and analysis of variance) are seen as different facets of a general linear model.
A second important expansion of regression allowed for different types of outcome variables such as binary, ordinal, nominal, and count variables. The basic linear regression model uses the normal distribution as its probability model. The generalized linear model, which includes non-normal outcomes, increases the flexibility of regression by allowing different probability models (e.g., binomial distribution for binary outcomes and Poisson distribution for count outcomes), and predictors are connected to the outcome through a link function (e.g., logit transformation for binary outcomes and natural logarithm for count outcomes).
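The link-function idea can be illustrated directly. In the sketch below the coefficient values are arbitrary placeholders; the point is only how the inverse link maps the linear predictor onto the outcome’s scale:

```python
# Sketch of the generalized linear model idea: a linear predictor
# eta = b0 + b1 * x is connected to the outcome's mean through a link function.
import numpy as np

def inv_logit(eta):
    """Inverse logit link: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Binary outcome (logit link): probability of the event at x = 2,
# using illustrative coefficients b0 = -1.0, b1 = 0.8 (hypothetical values).
p = inv_logit(-1.0 + 0.8 * 2)

# Count outcome (log link): expected count at x = 2 under a Poisson model,
# again with hypothetical coefficients b0 = 0.5, b1 = 0.3.
mu = np.exp(0.5 + 0.3 * 2)
```

Whatever the coefficient values, the logit link keeps predicted probabilities between 0 and 1, and the log link keeps predicted counts positive, which is exactly why these links pair naturally with binomial and Poisson probability models.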
Beyond the general and generalized linear models, numerous other extensions have been made to the basic regression model that allow for greater complexity, including multivariate outcomes, path models that allow for multiple predictors and outcomes with complex associations, structural equation models that nest measurement models for latent constructs within path models, multilevel models that allow for correlated data due to nested designs (e.g., students within classrooms), and nonlinear regression models that use regression to fit complex mathematical models in which the coefficients are not additively related to the outcome. Although each of the preceding methods has unique qualities, they all derive from the basic linear regression model.
Research is a marriage of three components: Theory-driven research questions dictate the study design, which in turn dictates the statistical methods. Thus, statistical methods map the research questions onto the empirical data, and the statistical results yield answers to those questions in a well-designed study. Within the context of scientific inquiry, regression is primarily an applied tool for theory testing with empirical data. This symbiosis between theoretical models and statistical models has been the driving force behind many of the advances and extensions of regression discussed above.
Although regression can be applied to either observational or experimental data, regression has played an especially important role in observational data. With observational data there is no randomization or intervention, and there may be a variety of potential causes and explanations for the phenomenon under study. Regression methods allow researchers to statistically control for additional variables that may influence the outcome. For example, in an observational study of infidelity that focuses on age as a predictor, it might be important to control for relationship satisfaction, as previous research has suggested it is related to both the likelihood of infidelity and age. Because regression coefficients in multiple regression models are estimated simultaneously, they control for the presence of the other predictors, often described as partialing out the effects of other predictors.
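The idea of statistical control can be demonstrated with simulated data. In this hypothetical sketch the variable names merely echo the infidelity example; the "age" variable has no direct effect on the outcome, yet appears predictive until the correlated predictor is included:

```python
# Sketch of statistical control: the coefficient for age changes once a
# correlated predictor (satisfaction) enters the model. All data are
# simulated; the variable names are illustrative, not from any real study.
import numpy as np

rng = np.random.default_rng(1)
n = 500
satisfaction = rng.normal(0, 1, n)
age = 40 + 5 * satisfaction + rng.normal(0, 5, n)    # age correlates with satisfaction
outcome = 2.0 * satisfaction + 0.0 * age + rng.normal(0, 1, n)  # no direct age effect

# Simple regression of outcome on age alone: age picks up satisfaction's effect.
b_age_alone = np.polyfit(age, outcome, 1)[0]

# Multiple regression on age AND satisfaction, estimated simultaneously.
X = np.column_stack([np.ones(n), age, satisfaction])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
b_age_controlled = coefs[1]   # near 0 once satisfaction is partialed out
```

The simple-regression slope for age is clearly nonzero, while the multiple-regression slope shrinks toward zero, which is the "partialing out" behavior described above.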
Regression can also play a practical role in conveying research results. Regression coefficients as well as regression summaries (e.g., percentage of the outcome variability explained by the predictors) quantitatively convey the importance of a regression model and consequently the underlying theoretical model. In addition, regression models are prediction equations (i.e., regression coefficients are scaling factors for predicting the outcome based on the predictors), and regression models can provide estimates of the outcome based on predictors, allowing the researcher to consider how the outcome varies across combinations of specific predictor values.
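As a sketch of regression as a prediction equation, the line fitted below (to made-up data) generates predicted outcomes, and the percentage of outcome variability explained follows directly from the residuals:

```python
# Sketch: prediction and variance explained (R^2) from a fitted line.
# Hypothetical data for illustration only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x                      # predicted outcomes at observed x values

# R^2: proportion of outcome variability explained by the predictor.
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

The same equation `b0 + b1 * x` can be evaluated at any predictor value of interest, letting the researcher consider how the predicted outcome varies across specific values.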
Even though regression is an extremely flexible tool for social science research, it is not without limitations. Not all research questions are well described by regression models, particularly questions that do not specify outcome variables. As an example, cluster analysis is a statistical tool used to reveal whether there are coherent groups or clusters within data; because there is no outcome or target variable, regression is not appropriate. At the same time, because regression focuses on an outcome variable, users of regression may believe that fitting a regression model connotes causality (i.e., predictors cause the outcome). This is patently false, and outcomes in some analyses may be predictors in others. Proving causality requires much more than the use of regression.
Another criticism of regression focuses on its use for statistical inference. To provide valid inference (e.g., p-values or confidence intervals), the data must be a random sample from a population, or involve randomization to a treatment condition in experimental studies. Most samples in the social sciences are samples of convenience (e.g., undergraduate students taking introductory psychology). Of course, this is not a criticism of regression per se, but of study design and the limitations of statistical inference with nonrandom sampling. Limitations notwithstanding, regression and its extensions continue to be an incredibly useful tool for social scientists.
SEE ALSO Vector Autoregression
Berk, Richard A. 2004. Regression Analysis: A Constructive Critique. Thousand Oaks, CA: Sage.
Cohen, Jacob, Patricia Cohen, Stephen G. West, and Leona S. Aiken. 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Cook, Dennis R., and Sanford Weisberg. 1999. Applied Regression Including Computing and Graphics. New York: John Wiley.
Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks, CA: Sage.
David C. Atkins
The Latin equivalent of regression means "return" or "withdrawal"; it also signifies a retreat or a return to a less-evolved state. There is no very precise psychoanalytic definition of the concept of regression, and it is useful to introduce the idea of temporality: regression could be said to represent an articulation between the atemporality of the unconscious and the primary processes, on the one hand, and the temporality of the secondary processes, on the other. Some analysts assign this notion a metaphoric value; it retains the connotations of a journey through time and the changes that will be necessary in psychoanalytic treatment.
Sigmund Freud introduced the notion of regression in The Interpretation of Dreams (1900a). The concept was necessary for his description of the psychic apparatus in terms of a topographical model, represented by an instrument whose component parts are agencies or systems with a spatial orientation. Excitation traverses the system in a determined temporal order, going from the sensory end to the motor end. In hallucinatory dreams, excitation follows a retrograde pathway. Dreams have a regressive character due to the shutdown of the motor system; the trajectory goes in the reverse direction, toward perception and hallucinatory visual representation. This regression is a psychological particularity of the dream process, but dreams do not have a monopoly on it. In the section of the last chapter of The Interpretation of Dreams titled "Regression," Freud wrote that "in all probability this regression, wherever it may occur, is an effect of a resistance opposing the progress of a thought into consciousness along the normal path. . . . It is to be further remarked that regression plays a no less important part in the theory of the formation of neurotic symptoms than it does in that of dreams" (pp. 547-548). In this last chapter Freud already distinguished between three types of regression: topographical regression, in the sense of the psychic system; temporal regression, in the case of a return to earlier psychic formations; and formal regression, where primitive modes of expression and representation replace the usual ones. He also noted: "All these three kinds of regression are, however, one at bottom and occur together as a rule; for what is older in time is more primitive in form and in psychical topography lies nearer to the perceptual end" (p. 548). This basic unity is central to his metapsychological use of the concept.
In Three Essays on the Theory of Sexuality (1905d) Freud implicitly invoked the idea of fixation, which is inseparable from regression. In "A Metapsychological Supplement to the Theory of Dreams" (1916-17f), he underscored the distinction between "temporal or developmental regression" (of the ego and the libido) and topographical regression, and the fact that "[t]he two do not necessarily always coincide" (p. 227). Then, in the twenty-second of the Introductory Lectures on Psychoanalysis (1916-17a [1915-17]), he distinguished two types of regression affecting the libido: a return to the earliest objects marked by the libido, which are of an incestuous nature, and a return of the entire sexual organization to earlier stages. Libidinal regression is only an effect of temporal regression, with a reactivation of old libidinal structures preserved by fixation. At that point he asserted that regression was a "purely descriptive" concept, adding: "we cannot tell where we should localize it in the mental apparatus" (pp. 342-343). In making this assertion, he retreated from his earlier position and denied regression its metapsychological status, which it would regain only after 1920 with the second theory of the instincts. Regression then becomes constitutive of the death instinct and can threaten to destroy psychic structures, but it also becomes a mechanism that can be used by the ego.
According to Marilia Aisenstein's article "Des régressions impossibles?" (Impossible regressions?), "Freud's reticence around the notion of regression in 1917 was linked to its relation to the first theory of the instincts and the first topography. He had difficulty in situating and formulating regression not only in topographical terms, but above all in terms of the libido and the instincts of the ego.... It then became necessary to separate regression from disorganization, as the latter was envisioned by Pierre Marty and the psychosomaticians of the Paris School.... If the retrograde movement is not stopped by regressive systems involving fixations, the end result can be a process of somatization." Regression is indispensable to the work of psychoanalytic treatment; it implies the notion of change and is part of the healing process, according to Donald W. Winnicott (1958). Regression is a form of defense and remains in the service of the ego. From the analyst's point of view, formal regression provides another way of listening.
See also: Acute psychoses; Amphimixia/amphimixis; Benign/malignant regression; Choice of neurosis; Defense mechanisms; Disorganization; Dream; Ego and the Mechanisms of Defense, The; Face-to-face situation; Fixation; Imago; Libidinal development; Libido; Maternal; "Metapsychological Supplement to the Theory of Dreams"; "Mourning and Melancholia"; Narcissistic withdrawal; Ontogenesis; Paranoia; Psychic causality; Psychic temporality; Psychoses, chronic and delusional; Psychosomatic; Psychotic transference; Representability; Sadomasochism; Self (true/false); Sleep/wakefulness; Stage (or phase); Suicide; Thalassa: A Theory of Genitality; Time; Wish, hallucinatory satisfaction of a.
Aisenstein, Marilia. (1992). Des régressions impossibles? Revue française de psychanalyse, 56 (4), 995-1004.
Freud, Sigmund. (1900a). The interpretation of dreams. Parts I and II. SE, 4-5.
——. (1905d). Three essays on the theory of sexuality. SE, 7: 123-243.
——. (1916-17a [1915-17]). Introductory lectures on psycho-analysis. Parts I and II. SE, 15-16.
Winnicott, Donald W. (1958). Through paediatrics to psycho-analysis. London: Tavistock.
Balint, Michael. (1968). The basic fault. Therapeutic aspects of regression. London: Tavistock.
Blum, Harold P. (1994). The conceptual development of regression. Psychoanalytic Study of the Child, 49, 60-76.
Inderbitzin, Lawrence, and Levy, Steven. (2000). Regression and psychoanalytic technique: A concept's concretization. Psychoanalytic Quarterly, 69, 195-224.
Sandler, Joseph, and Sandler, Anne-Marie. (1994). Theoretical, technical comments on regression and anti-regression. International Journal of Psychoanalysis, 75, 431-440.
In statistical usage, regression refers in the simplest case (bivariate linear regression) to fitting a line to the plot of data from two variables, in order to represent the trend between them. Regression is asymmetric: it assumes that one variable (Y, the dependent variable) is determined by the other (the independent variable, X); that the relationship is linear (and hence that the variables are at the interval level of measurement); and that the fit is not perfect: Yi = α + βXi + εi (that is, the value of the dependent variable Y for individual i varies in a straight line with the value of X, together with an individual error term, εi). The slope of this line is represented by a constant multiplier weight or ‘regression coefficient’, β, and a constant, α, represents the intercept, the point at which the regression line crosses the Y axis.
Statistically, it is assumed that the error terms (εi) are random with a mean of 0, and are independent of the independent variable values. The main purpose of regression analysis is to calculate the value of the slope (β), often interpreted as the overall effect of X. This is normally done by using the Least Squares principle to find a best-fitting line, in the sense that the sum of the squared error terms (the discrepancies between actual Yi values and those predicted by the regression line) is as small as possible. The correlation coefficient (r) gives a measure of how well the data fit this regression line (perfectly if r = ±1 and as poorly as possible if r = 0).
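The relationship between r and the fitted line can be checked numerically: for a bivariate least-squares fit, r² equals the proportion of variance explained by the line. A sketch with hypothetical data:

```python
# Sketch: the correlation coefficient r as a measure of fit. For a bivariate
# least-squares line, r^2 equals the proportion of variance explained.
# Hypothetical data for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2])

r = np.corrcoef(x, y)[0, 1]

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r2_from_fit = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```

Here r is close to 1, so the points lie nearly on the line and the squared residuals are small relative to the total variability in y.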
Simple regression can be extended in various ways: to more than one independent variable (multiple linear regression) and to other functions or relationships (for example, monotonic or non-metric regression for ordinal variables, used in multi-dimensional scaling, and logarithmic and power regression). In multiple linear regression, the model is written as: Yi = α + β1X1i + β2X2i + β3X3i + … + βkXki + εi, where each regression weight βk now represents the effect of the independent variable Xk on Y, controlling for (or ‘partialling out’, that is, removing the linear effect of) the other independent variables. These ‘partial regression coefficients’ or ‘beta weights’ are of especial interest in causal models and structural equation systems (see M. S. Lewis-Beck, Applied Regression: An Introduction, 1990). See also LOGISTIC (OR LOGIT) REGRESSION; MULTICOLLINEARITY; OUTLIER EFFECTS.
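The ‘partialling out’ interpretation can be verified numerically: the multiple-regression weight for one predictor equals the slope obtained by first removing the linear effect of the other predictor from both that predictor and the outcome (the Frisch-Waugh result). A sketch with simulated data:

```python
# Sketch of "partialling out": in multiple regression, the coefficient on X1
# equals the slope from regressing Y's residuals on X1's residuals, after both
# are purged of the other predictor X2. All data simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 300
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)          # predictors are correlated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multiple regression: Y on an intercept, X1, and X2.
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

def residualize(v, w):
    """Residuals of v after a simple regression (with intercept) on w."""
    s, c = np.polyfit(w, v, 1)
    return v - (c + s * w)

# Partial out x2 from both y and x1, then run a simple regression.
beta_partial = np.polyfit(residualize(x1, x2), residualize(y, x2), 1)[0]
```

The two estimates agree to numerical precision, which is what licenses reading a multiple-regression weight as the effect of one predictor with the others held constant.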
re·gres·sion / riˈgreshən/ • n. 1. a return to a former or less developed state. ∎ a return to an earlier stage of life or a supposed previous life, esp. through hypnosis or mental illness, or as a means of escaping present anxieties: [as adj.] regression therapy. ∎ a lessening of the severity of a disease or its symptoms: he seemed able to produce a regression in this disease. 2. Statistics a measure of the relation between the mean value of one variable (e.g., output) and corresponding values of other variables (e.g., time and cost).