1 Introduction
This chapter is still a draft and minor changes are made without notice.
Adrien-Marie Legendre was a French mathematician who, in the early 19th century, took a particular interest in predicting the trajectories of comets. The understanding of the movements of celestial objects represented one of the main applications of mathematics at that time. There were great scientific challenges in astronomy, such as the stability of the solar system, but there were also more earthbound applications, such as celestial navigation. Legendre wanted to describe the trajectory of a comet by a parabolic path, and he wanted to use observations of points from a comet’s trajectory to infer the best path coefficients. When Legendre substituted the observed points into the path equation, he was left with more linear equations in the unknown coefficients than there were coefficients. There was no solution; the observations did not fall exactly on any parabolic path.
Legendre was not the first to deal with such a path fitting problem, and various techniques had been developed, none of which were completely satisfactory. One solution was to disregard some equations so that a unique solution could be found to the remaining ones. The resulting path would then match some observed points exactly. But which equations should be disregarded? And would the resulting path be a good approximation to the omitted observation points?
Legendre’s parabolic paths could only produce an approximation of the observations and of the future trajectory of the comet. It would not be an exact match. The deviations – or errors – from predicted trajectories of celestial objects were primarily due to unsystematic inaccuracies in the measurements. For Legendre and his contemporaries there was no clear solution as to how deviations from a path should be described and handled. Statistical theory was in its infancy, and the numerical computations required to obtain a descriptive statistical understanding of the nature of the deviations were very labor-intensive. Today it is well known that such deviations can be described statistically, and the methods of path or curve fitting are beneficially studied using statistical theory.
Legendre found an elegant and applicable algebraic solution to his path fitting problem – as described in the section below – which was quickly adopted by astronomers and geodesists in France and Germany. He did not, however, realize the statistical nature of his method. This aspect was developed and expanded by others in the 19th and 20th centuries, and the theory evolved into the statistical subject we know today as regression.
Regression is one of the core pillars of statistics, and the statistical theory of regression is a theory about models and methods for relating observables, for example, the coordinates on the path of a comet. The general idea of relating observables through a combination of a systematic and an unsystematic component penetrates most areas of science, technology and business today. The systematic component represents a predictable relation between observables, while the unsystematic component is statistical by nature. It may represent measurement errors, sampling inaccuracies, lack of information or other mechanisms or combinations of mechanisms whose contribution cannot be captured by the systematic component.
The premise of the regression theory is that unsystematic components can be treated statistically. Based on this premise, the primary objective of the theory is to devise methods for fitting regression models to data and to investigate properties of such methods.
1.1 A brief history of regression
We give in this section a historical perspective on regression. Tracing the origin of regression methods and their terminology provides a historical context for the theory developed in this book. It is a multifaceted story, and the introduction will provide a bird’s-eye perspective on what this book is about.
1.1.1 Linear regression
Legendre’s book (Legendre 1805) on the determination of the orbits of comets was published in 1805 and contained the appendix Sur la Méthode des moindres carrés. It outlined a method for fitting a linear equation with unknown coefficients to observations. The English translation of the title of the appendix is On the Method of least squares, and it is the earliest known appearance of the method in print. Independently of Legendre, Carl F. Gauss (Gauss 1809) discovered the Gaussian distribution and used it to derive the method of least squares from Laplace’s principle of inverse probability. Gauss was, like Legendre, motivated by problems in astronomy – specifically the description and prediction of the motions of planets. The invention of the method of least squares marks a turning point in the history of regression. Earlier work by Pierre-Simon Laplace, and related work by Roger J. Boscovich with applications to geodesy, using the method of least absolute deviation, had been less successful. The method of least absolute deviation leads to a nonlinear estimator that is hard to compute. By contrast, the least squares estimator is a linear estimator that solves the (linear) normal equation, and Gauss presented a useful method for its numerical computation. The generality of the linear model framework considered by Gauss is impressive and corresponds essentially to that treated in Chapter 2 of this book.
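To make the normal equation concrete, here is a minimal R sketch – with data simulated purely for illustration – that fits a parabolic path by solving the normal equation directly and compares the result with R’s built-in lm() function.

```r
## Least squares fit of a parabolic path y = b0 + b1*x + b2*x^2 to noisy
## observations by solving the normal equation X'X b = X'y directly.
## The data are simulated for illustration only.
set.seed(1)
x <- seq(-2, 2, length.out = 25)
y <- 1 + 0.5 * x - 0.8 * x^2 + rnorm(25, sd = 0.2)

X <- cbind(1, x, x^2)                              # design matrix
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves X'X b = X'y
beta_hat

coef(lm(y ~ x + I(x^2)))                           # the same fit via lm()
```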
In the years after the invention of the method of least squares, Gauss and Laplace derived several theoretical results about the method, including sampling properties of the least squares estimator and confidence intervals for the parameters. The logical foundation of the method of least squares was, however, confusing at the time. Gauss’s initial derivation was – as mentioned above – based on Laplace’s principle of inverse probability. Later, Laplace as well as Gauss sought justification of the method in the sampling properties of the resulting estimator, specifically by investigating whether the estimator has a form of minimal estimation error. Gauss showed what is today known as the Gauss-Markov theorem: the least squares estimator has minimal variance among all linear unbiased estimators.
The models that Gauss fitted to data using his method of least squares were not known to him as regression models. That terminology originates from Sir Francis Galton (Galton 1886). In studies of hereditary traits, Galton observed an apparent regression (in the sense of a movement) towards a mediocre state in the offspring compared to the parent(s). Originally he considered the size of plant seeds and later the height of human individuals. What Galton had discovered was the phenomenon known today as regression toward the mean. He explained the phenomenon statistically. The trait is partly inherited from the parent and partly a consequence of additional variation. This results in a bivariate distribution showing a co-relation of the trait in parent and offspring. Though Galton correctly described the phenomenon statistically, he wrongly explained the regression phenomenon as being caused by inheritance from the ancestry of the parent in addition to the parent itself.
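The mechanism can be illustrated by a small simulation – with arbitrarily chosen numbers – in which the offspring trait is partly inherited and partly additional variation; the fitted slope below one is the regression toward the mean.

```r
## Simulated regression toward the mean: offspring trait = partial
## inheritance from the parent + additional variation. The numbers are
## arbitrary and for illustration only.
set.seed(1)
parent    <- rnorm(1000)
offspring <- 0.5 * parent + rnorm(1000, sd = sqrt(0.75))

cor(parent, offspring)        # the "co-relation" of the trait
coef(lm(offspring ~ parent))  # slope below 1: regression toward the mean
```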
To understand the co-relation Galton observed in his hereditary studies, he contributed to the theory of the bivariate normal distribution and introduced the general notion of correlation. The technical details of what is known today as the Pearson correlation coefficient were, however, worked out by Karl Pearson. Galton nevertheless realized the usefulness of regression and correlation in various application areas, and the regression terminology was adopted by others, notably Karl Pearson. Interestingly, the initial developments of regression analysis in British mathematical statistics were independent of the continental literature, and the regression models were, in particular, not framed in the linear model terminology of Gauss.
In the first half of the 20th century, Sir Ronald A. Fisher had an enormous impact on mathematical statistics. Among other things, he developed Analysis of Variance (ANOVA), introduced the likelihood concept and advocated the general principle of maximum likelihood estimation. The primary application of ANOVA was to designed experiments (Fisher 1935), and though far from Gauss’s applications to the modeling of the motions of celestial objects, the ANOVA models fit into the framework of the linear model.
Fisher did not rely on or acknowledge the contributions of his predecessors to any great extent, and he rediscovered known results about the method of least squares. But he also contributed significantly to the theory, for instance by deriving sampling distributions of test statistics, notably the \(F\)-test statistic. His presentation and applications of ANOVA models have been highly influential on the current terminology, which is, for instance, reflected in the anova() function in R for testing linear hypotheses within the linear model. Subsequent developments with contributions from many people have turned the linear model into a unified and general statistical theory, as outlined in one of the first unifying textbooks on the subject by Shayle R. Searle (Searle 1971).
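For illustration, the following sketch uses anova() to test a linear hypothesis by comparing two nested linear models fitted to R’s built-in ToothGrowth data; the data set is chosen here only for illustration.

```r
## Comparing two nested linear models with anova(); the comparison is an
## F-test of the hypothesis that the smaller model is adequate.
fit0 <- lm(len ~ dose, data = ToothGrowth)          # smaller model
fit1 <- lm(len ~ dose + supp, data = ToothGrowth)   # larger model
anova(fit0, fit1)
```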
The history of regression is intertwined with the general history of statistics, and for the history before 1935 the work by Anders Hald (Hald 2007) is an indispensable source. For the more recent, and much richer, development of nonlinear regression, we give a brief outline in the following sections, focusing on a few selected original sources.
1.1.2 Beyond linear regression
Regression models, methods and applications proliferated in the second half of the 20th century. Computers played a pivotal role, first of all by easing the solution of nonlinear estimating equations by iterative methods, which paved the way for applications of nonlinear regression models and nonlinear estimation methods. The use of computer simulations and Markov Chain Monte Carlo (MCMC), in particular, also made Bayesian regression modeling a practical possibility. In this brief account of the history, it is impossible to do justice to all the different directions, and we will therefore narrow the focus to those regression models that are treated in this book.
Not all types of regression problems fit into the linear modeling framework, and one of the early applications of a nonlinear regression model was to the modeling of mortality as a function of the dosage of a toxic agent. Chester I. Bliss (Bliss 1935) proposed a transformation method where dosage is log-transformed and death frequency is transformed by the inverse distribution function for the normal distribution with mean 5 and unit variance. The resulting transformed death frequencies were called probits. The transformed data should fall on a straight line, and Bliss suggested using the method of weighted least squares for fitting such a line.
Bliss (1934) had presented a similar idea one year earlier, with the probits computed from an S-shaped curve inspired by, but not identical to, the distribution function for the normal distribution. His 1935 paper had, however, an important appendix written by Fisher, which showed how to appropriately deal with dosages for which none or all of the organisms survived. In this appendix, Fisher outlined how the method of maximum likelihood could be used for estimation. In doing so he came up with a correction step in the weighted least squares fit, which is what we today know as the Fisher scoring algorithm.
The probit model introduced by Bliss is one among a number of nonlinear regression models – notably models of counts – that share a similar structure. These regression models were unified by John Nelder and Robert Wedderburn (Nelder and Wedderburn 1972) in the framework they called generalized linear models, which includes the linear model as a special case but extends it considerably. Nelder and Wedderburn showed that the maximum likelihood estimator in a generalized linear model can be computed iteratively by the Fisher scoring algorithm. Each step in this algorithm is the solution of a weighted least squares problem, and they called it the Iterative Weighted Least Squares (IWLS) algorithm. They also carried over terminology and methodology from linear models, in particular the use of analysis of deviance as a generalization of ANOVA.
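A dose-response probit model of the kind Bliss considered can, for example, be fitted as a generalized linear model with R’s glm() function, which computes the maximum likelihood estimate by iterative weighted least squares. The counts below are made up purely for illustration.

```r
## Probit regression of mortality on log-dose fitted as a generalized
## linear model; glm() uses IWLS (Fisher scoring) to compute the MLE.
## The dose-response counts are fictitious.
dose  <- c(1, 2, 4, 8, 16, 32)
dead  <- c(1, 4, 9, 13, 18, 20)
total <- rep(20, 6)

fit <- glm(cbind(dead, total - dead) ~ log(dose),
           family = binomial(link = "probit"))
summary(fit)
```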
As a curious detail, Nelder and Wedderburn outlined in their original paper how generalized linear models could form a useful basis for courses in statistics, which would unify the treatment of a number of statistical models that had previously been treated separately. The theory of generalized linear models was further developed in the 70s and 80s, and an authoritative reference on this classical theory is the book by McCullagh and Nelder (1989). In addition to his theoretical contributions, Nelder also initially chaired the GLIM Working Party, which developed the statistical software program GLIM (Generalized Linear Interactive Modeling). This program contributed to making generalized linear models applicable in practice, and it was a source of inspiration for the later implementations (Chambers and Hastie 1991) in S and R.
Another type of regression problem that does not fit directly into the linear modeling framework is the modeling of survival times. As for generalized linear models, there is a question of modeling scale: at which scale is survival time most appropriately associated with other variables? In addition, there is the practical problem that observations may be right censored. Survival models and estimation of life tables have a long history, in particular in actuarial science and medical statistics, but a systematic approach to survival regression models was not taken until the seminal contributions by Sir David R. Cox (Cox 1972). He introduced the semiparametric proportional hazards model, which has since become the most widely used survival regression model. Cox made two major contributions in his paper. He introduced what he later called the partial likelihood, which provided a means for deriving a sensible estimating equation, and he presented approximate sampling distributions of the resulting estimator.
The paper by Cox inspired a rapid development in the theory of survival regression models, which clarified some of his heuristic arguments and extended his model class considerably. Central to this development was the use of methods from the theory of stochastic processes, in particular continuous time martingale theory. Modern survival analysis theory is heavily based on the use of stochastic processes, and an authoritative account of this theory is the by now classical book Statistical Models Based on Counting Processes (Andersen et al. 1993).
On the practical side, the estimating equation derived from Cox’s partial likelihood in the semiparametric framework is nonlinear – and so are parametric likelihood-based estimating equations. The estimator thus has to be computed by iterative methods. It turns out that the practical computations can be carried out using the framework of generalized linear models, and the GLIM program could be used to fit proportional hazards models in Cox’s semiparametric framework as well as several parametric models. Though this is today mostly a curious observation, it was important for the early applicability of these models and methods. Later, specialized software implementations were developed, such as the survival package for S-PLUS by Terry Therneau (Therneau and Grambsch 2000). The package was ported to R, and it is today the classical core package for survival analysis in R.
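As a small illustration, the sketch below fits a proportional hazards model with coxph() from the survival package to the package’s lung data set; the choice of data and predictors is for illustration only.

```r
## Cox's semiparametric proportional hazards model fitted with the
## survival package; estimation is based on the partial likelihood.
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)
```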
1.1.3 Regression, machine learning and computing
There were notable developments related to regression models outside of mainstream statistics, but little interaction between the different scientific communities. Frank Rosenblatt (Rosenblatt 1962) explored perceptrons and artificial neural networks for binary classification from multiple input variables. Support vector machines (Cortes and Vapnik 1995) were developed for similar purposes, with the conceptual framework established in 1965 by Vladimir Vapnik.
Artificial neural networks and support vector machines are today core subjects in the field of machine learning. Both can be understood as regression models of a binary variable, and both can be modified to models of a continuous variable. A support vector machine is, however, not a probabilistic model, and neither model gives a simple interpretable relation between the input variables and the distribution of the binary variable. The developments in statistics were dominated by the interaction with other scientific fields, where explanations, interpretations and answers to inferential questions were sought. Inferential questions that are central to statistics did not appeal to the young machine learning community, whose focus was on classification performance.
In 2001 Leo Breiman (Breiman 2001) divided the data modeling communities into “two cultures”: those using stochastic (or probabilistic) data models and those using algorithmic models. Breiman ascribed the developments of algorithmic modeling largely to the machine learning community, and he strongly criticized the conventional statistical community for their (simplistic) probabilistic data modeling. Undoubtedly, contemporary regression modeling has benefited greatly from developments in machine learning, and the “two cultures” interact more closely today than ever before.
The development of lasso (Tibshirani 1996) is an interesting early example. While initially developed for the linear regression model, the lasso estimator is nonlinear and biased – as opposed to the least squares estimator. It had, however, certain computational and statistical benefits that made it possible to tackle large scale data analysis in a new and useful way.
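As an illustration – using the glmnet package, which is an assumption here and not otherwise discussed in this chapter – the following sketch computes lasso estimates on simulated data, selecting the penalty by cross-validation.

```r
## Lasso estimation in a linear regression model with the glmnet package;
## the data are simulated and the package choice is illustrative.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")         # sparse (and biased) estimates
```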
Not long after the introduction of lasso, in 2001, the first edition of The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman 2009) was published. It is hardly a coincidence that this was the same year Breiman’s paper appeared. The book had a notable impact on the statistics community by bridging the gap between the machine learning literature and the statistics literature. Along with lasso, the book introduced a range of nonlinear regression models and optimization methods to a broad statistical audience. It was a timely publication, and in combination with new applications to large scale data modeling it spurred a substantial development of new regression modeling methods and posed exciting challenges.
This brief historical account concludes by mentioning some developments that have been highly influential on how we work with regression models today.
Data analysis with regression models has been transformed completely by the increase in computer power and the development of statistical programming environments such as R. The computer serves, first of all, the data analyst, and interactive and exploratory data analysis in the spirit of Tukey (1977) is now in routine use. Developments in nonparametric and resampling-based statistical methods, which require substantial computations, have also made data analysis possible in situations where we lack theoretical methods, where we want to impose minimal assumptions or where model assumptions are violated.
The computer has, in addition, become the lab bench of statistics, and it has brought an experimental component to statistics. Most methodologies and theoretical results are today scrutinized via simulation studies, which can inform us about the applicability and limitations of the methods and the theory.
1.1.4 Regression modeling
The historical developments in regression modeling contain three recurrent components.
- Models
- Estimation methodology
- Sampling properties
The first component – the model development – was driven by applications, and the second component was developed to compute estimates of model parameters from data. Thus, over the course of history, a supply of models, such as the linear and generalized linear models, has been developed together with methods or principles for parameter estimation, such as the methods of least squares or maximum likelihood. The third component – the sampling properties – was developed to provide frequentist quantification of uncertainty.
There is an important fourth component.
- Model critique
From Gauss to Cox there has been concern about whether the model assumptions were fulfilled in applications, and techniques of more or less formal character have been used throughout history for validating that model assumptions are plausible.
The four components – models, estimation methodology, sampling properties and model critique – are recurring components in this book as they are in history. The intention of this book is to teach the reader a modern perspective on these cornerstones of statistics in the framework of regression modeling.
1.2 Regression modeling in this book
This book is on associational regression modeling. A regression model relates the distribution of a response variable to one or more predictor variables. All models considered are direct models of the conditional distribution of the response given the predictors. Such models are descriptive models of the association between response and predictors, and one main purpose of an associational model is to be predictive of the response given observations of the predictors. Such predictions are used for medical diagnosis and prognosis, by businesses to predict customer behavior, by insurance companies, pension funds and banks to predict risk, for weather forecasting, and in many other areas where it is of interest to know what we cannot (yet) observe. These are arguably all important applications, but there are also other purposes. Most importantly, associational regression models are used to investigate causal relations, which is discussed further below.
The book treats linear models, generalized linear models and survival regression models. They are widely used in practice and obligatory parts of an education in statistics. At the same time they serve as good examples when developing general statistical principles and methodologies such as likelihood based methods. The intention of this book is to bridge the gap between a mathematical treatment of the model classes considered and their practical use.
In predictive modeling it is fairly clear what makes a good model. Given a method for quantification of predictive accuracy, the best model is the most accurate model. The accuracy of a prediction is typically quantified by a loss function, which actually quantifies how inaccurate a prediction is: the smaller the loss, the more accurate the prediction. The expected loss quantifies how accurate the predictive model is on average. The specification of a relevant loss function is, perhaps, not always easy. A good loss function should ideally reflect the consequences of wrong predictions. On the other hand, there is a selection of useful, reasonable and convenient standard loss functions that can cope with many situations of practical interest. Examples include (weighted) squared error loss, the 0-1-loss and the negative log-likelihood loss. The book will not plunge into elaborate discussions of the choice of loss but will focus on how to fit models to data with standard choices.
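To fix ideas, the standard losses mentioned above can be written as simple R functions; the 0.5 threshold in the 0-1-loss below is an illustrative convention.

```r
## Standard loss functions for an observation y and a prediction yhat,
## or a predicted probability p of a binary outcome y in {0, 1}.
squared_error_loss <- function(y, yhat) (y - yhat)^2
zero_one_loss      <- function(y, p) as.numeric(y != (p > 0.5))
neg_log_lik_loss   <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

## The expected loss can be estimated by averaging a loss over test data,
## e.g., mean(squared_error_loss(y_test, yhat_test)).
```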
One of the biggest challenges in practice is to select and fit a model to a data set in such a way that it will preserve its predictive accuracy when applied to new independent data. The model should be able to generalize well to cases not used for the model fitting and selection process. Otherwise the model has been overfitted to the data, and this is to be avoided. The book will develop a range of methods for fitting models to data, for assessing the model fit and for making appropriate model selections while avoiding overfitting.
Associational regression models are also widely used for investigating causal relations between the response and one or more predictors. It is emphasized that an associational model does not have a causal interpretation in itself. It only describes an association. If we want to extract causal information from an associational model, or if we want to use associational models for causal inference in other ways, we need additional knowledge, for example that the data is from a randomized study. This additional knowledge is often referred to as untestable assumptions because we cannot infer it from data. The assumptions are not just technical distributional assumptions but structural modeling assumptions about how some or all parts of a model remain invariant under interventions. A thorough understanding of the subject matter field, or knowledge from other studies, might allow us to make such necessary structural assumptions.
Irrespective of the purpose of the associational regression model, this book deals with some basic statistical questions of general interest within the framework of regression models. Besides fitting good regression models to data, the two central problems are:
- Quantifications of conditional associations.
- Tests of conditional independencies.
The first problem deals with quantifying the strength of the association between the response and one predictor given the other predictors. The second problem deals with formally testing the hypothesis that there is no association between the response and one predictor given the other predictors.
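As a small illustration with a linear model on R’s built-in trees data (chosen purely for illustration), a regression coefficient quantifies the association of one predictor with the response given the other predictor, and drop1() provides an F-test of the corresponding hypothesis of no conditional association.

```r
## Quantifying and testing conditional associations in a linear model.
fit <- lm(Volume ~ Girth + Height, data = trees)
coef(summary(fit))["Girth", ]   # association with Volume given Height
drop1(fit, test = "F")          # tests of each predictor given the other
```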
1.3 Organization
The book consists of a combination of theory and larger case studies. The theory chapters focus on presenting the mathematical framework for regression modeling together with the relevant mathematical theory, such as derivations of likelihood functions, estimators, estimation algorithms and properties of estimators. Other chapters focus on more general methodological questions. This includes developing methods for handling the complex decision process that practical data analysis involves.
While smaller data examples are presented in the theory chapters, the larger case studies are developed to better illustrate how theory supports applications. The hope is that this will ease the transition from theory to practice. The price to pay is that there are some distractions due to real world data challenges. Data rarely behaves well. There are missing values and outliers, the models do not fit the data perfectly, the data comes with a strange encoding of variables, and there are many other issues – issues that require decisions to be made and on which many textbooks on statistical theory are silent.
The hope is that, by working through the case studies in detail, many relevant practical problems are illustrated and appropriate solutions are given in such a way that the reader is better prepared to turn the theory into applications on their own.
1.4 Using R
We use the programming language R throughout to illustrate how good modeling strategies can be carried out in practice on real data, but the book will not provide an introduction to the R language. Consult the R manuals (R Core Team 2025+) or the many textbooks on R programming and data analysis (Wickham 2019; Wickham, Çetinkaya-Rundel, and Grolemund 2023). The classical and influential book Modern Applied Statistics with S (Venables and Ripley 2002) and the corresponding MASS package (which comes with the R distribution) should also be mentioned. Several classical statistical models and methods are supported by the MASS package and documented in that book.
The case studies in this book are complete with R code that covers all aspects of the analysis. They represent an integration of data analysis with documentation and reporting. This is an adaptation to data analysis of literate programming (Knuth 1984). The main idea is that the writing of a report documenting the data analysis and the actual data analysis are merged into one document. This supports the creation of reproducible analyses, which is a prerequisite for reproducible research. To achieve this integration, the book was written using the Quarto publishing system and the R package knitr. Quarto and knitr allow you to integrate text with chunks of R code and produce an output document in various formats, e.g., a webpage or a pdf document, that includes text, code and output from evaluating the code. The use of the RStudio or Positron integrated development environments (IDEs) is also recommended. Both are developed by Posit PBC as open source projects, and the desktop versions are free to use.
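A minimal sketch of what such an integrated document can look like is shown below: a Quarto file with a YAML header, ordinary text and an R code chunk whose code and output are included in the rendered document. The content is illustrative only.

````
---
title: "A minimal Quarto document"
format: html
---

Some text describing the analysis.

```{r}
# An R code chunk; the code and its output appear in the rendered document
summary(cars)
```
````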
The book source is available from the RwR GitHub repository, which also contains information on how to recreate the R environment used for writing the book and how to obtain the RwR package containing the datasets used.