On Overfitting in Statistics and Machine Learning: Uncovering the Connection
Overfitting is a concept that arises in both statistics and machine learning when attempting to fit a model to data. It involves the danger of creating a model that adapts too well to the provided data, resulting in a loss of its ability to accurately generalize to new, unseen data points. The implications of overfitting, and the means to combat it, are topics of great interest and significance in both fields.
Recently, a lecturer held an accelerated course in Statistical data analysis for fundamental science for the Instats site. In just 15 hours of online lectures, the course covered a broad range of topics, including parameter estimation, hypothesis testing, modeling, goodness of fit, and ancillary concepts such as ancillarity and conditioning. Additionally, an introduction to machine learning was provided to the students. The lecturer reflects on the success of the course and the interesting points that emerged regarding the connection between statistics and machine learning.
In statistics, the problem of overfitting is closely tied to parameter estimation, which is divided into point estimation and interval estimation. Interval estimation, the larger part of the subject, remains an active area of research. The lectures focused on practical application rather than theoretical issues, aiming to equip students to understand the procedures and to guard against common misconceptions. Leveraging the students’ existing knowledge of machine learning, the lecturer drew parallels between statistics and machine learning to enhance understanding.
When fitting a model to data in statistics, the goal is to estimate the model parameters and their uncertainty. Regardless of the method employed, the resulting estimates only have value if the underlying model is sensible. It is often assumed, rightly or wrongly, that data are generated by complex mechanisms that cannot be perfectly replicated by any functional form, no matter how flexible. This assumption drives an inclination towards more complex models with a larger number of parameters. In principle, this may seem reasonable, since a more flexible model can collapse into a simpler one when certain coefficients are fixed appropriately. However, excessively complex models tend to provide a poor representation of the true distribution being inferred. A warning sign of overfitting is a chi-squared probability (the p-value of the fit) that becomes too high, that is, close to 1, indicating that the model adheres to the observed data more closely than their uncertainties warrant.
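The warning sign described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the course: the toy data (a straight line with Gaussian noise), the polynomial model family, and the helper names `chi2_sf`, `fit_quality` and `make_toy_data`. It fits polynomials of increasing degree by weighted least squares and reports the chi-squared, the number of degrees of freedom, and the corresponding fit probability.

```python
import math
import numpy as np


def chi2_sf(x, k):
    """Upper-tail probability P(chi2_k >= x) for integer k >= 1."""
    if x <= 0:
        return 1.0
    z = 0.5 * x
    # Closed-form starting points: k = 2 (even k) or k = 1 (odd k).
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < 0.5 * k:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def fit_quality(x, y, sigma, degree):
    """Weighted least-squares polynomial fit; returns (chi2, ndof, p_value)."""
    # For Gaussian uncertainties the polyfit weights are 1/sigma.
    coeffs = np.polyfit(x, y, degree, w=1.0 / sigma)
    residuals = (y - np.polyval(coeffs, x)) / sigma
    chi2 = float(np.sum(residuals**2))
    ndof = len(x) - (degree + 1)
    return chi2, ndof, chi2_sf(chi2, ndof)


def make_toy_data(n=12, noise=0.3, seed=1):
    """Hypothetical data set: a straight line plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)
    sigma = np.full(n, noise)
    y = 1.0 + 2.0 * x + rng.normal(0.0, noise, n)
    return x, y, sigma


if __name__ == "__main__":
    x, y, sigma = make_toy_data()
    for degree in range(1, 9):
        chi2, ndof, p = fit_quality(x, y, sigma, degree)
        print(f"degree {degree}: chi2 = {chi2:6.2f}, ndof = {ndof}, p = {p:.3f}")
```

Because the models are nested, the chi-squared can only go down as the degree grows. Whether the fit probability climbs towards 1 depends on the particular data set, so the printed values are best read as one illustration and not as a general rule.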
During the course, the lecturer conducted an experiment in which students were asked to choose, from three options, a model that reasonably fit a given set of data points. Interestingly, the students often opted for models with one or two more parameters than necessary. To illustrate the mistake, the lecturer prompted the students to consider the meaning of the uncertainty bars associated with the data and to count how many of these bars were intercepted by their chosen models. It became evident that a sound model, when the bars represent one-sigma uncertainties, should miss approximately 32% of them. This exercise helped students understand the importance of avoiding overfitting and the need for model selection based on sound statistical reasoning.
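The 32% figure follows from the Gaussian distribution: a one-sigma interval covers about 68.3% of the probability, so a correct model passes through roughly two thirds of the bars. The sketch below checks this on simulated data. The linear "true" model, the noise level and the function names are assumptions chosen for illustration, not the data set used in the course.

```python
import math
import random


def expected_hit_fraction(n_sigma=1.0):
    """Probability that a Gaussian measurement lies within n_sigma of the truth."""
    return math.erf(n_sigma / math.sqrt(2.0))


def hit_fraction(y_values, model_values, sigmas):
    """Fraction of uncertainty bars (y +/- sigma) intercepted by the model."""
    hits = sum(abs(y - m) <= s for y, m, s in zip(y_values, model_values, sigmas))
    return hits / len(y_values)


def simulate(n_points=10000, sigma=0.5, seed=0):
    """Hypothetical measurements scattered around a straight line."""
    rng = random.Random(seed)
    xs = [i / n_points for i in range(n_points)]
    truth = [1.0 + 2.0 * x for x in xs]
    ys = [t + rng.gauss(0.0, sigma) for t in truth]
    return ys, truth, [sigma] * n_points


if __name__ == "__main__":
    ys, truth, sigmas = simulate()
    print(f"expected hit fraction:  {expected_hit_fraction():.4f}")
    print(f"true model hits:        {hit_fraction(ys, truth, sigmas):.4f}")
    # A model that passes through every point hits every bar: a sign of overfitting.
    print(f"interpolating model:    {hit_fraction(ys, ys, sigmas):.4f}")
```

A model that intercepts nearly all of the bars is therefore suspicious, not reassuring: it is following the fluctuations of the data instead of the underlying trend.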
On the other hand, in machine learning, the overfitting problem manifests differently. It is not possible to count hits and misses when training models; one has to rely on validation data or cross-validation techniques instead. Without validation, the loss function used in training will keep decreasing, making it difficult to determine the point at which training should stop. The loss computed on a validation set, by contrast, typically starts to rise once the model begins to fit the noise in the training data, and that turning point indicates where to stop. Training past it is the counterpart of the psychological bias observed in data fitting, which favors overly complex models. Interestingly, increasing the size of the machine learning model, such as adding more nodes and layers to a neural network or increasing the number of trees in a random forest, can still result in a decrease in validation loss. The question arises: where does the extra generalization power come from, and can such a phenomenon be observed in statistical data fitting?
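The role of validation data can be sketched with a simple stand-in for a training run: instead of training epochs, the example below sweeps the degree of a polynomial, which plays the role of model capacity. The sine-shaped toy data, the 50/50 split and the function names are assumptions made for illustration only. The training loss cannot increase as the degree grows, while the validation loss is what identifies a reasonable stopping point.

```python
import numpy as np


def make_split(n=60, noise=0.2, seed=3):
    """Hypothetical noisy data, split in half into training and validation sets."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, noise, n)
    half = n // 2
    return (x[:half], y[:half]), (x[half:], y[half:])


def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))


def sweep_degrees(train, valid, max_degree=10):
    """Fit on the training set only; record (degree, train loss, validation loss)."""
    rows = []
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(train[0], train[1], degree)
        rows.append((degree, mse(coeffs, *train), mse(coeffs, *valid)))
    return rows


def best_degree(rows):
    """Degree with the smallest validation loss."""
    return min(rows, key=lambda row: row[2])[0]


if __name__ == "__main__":
    rows = sweep_degrees(*make_split())
    for degree, train_loss, valid_loss in rows:
        print(f"degree {degree:2d}: train = {train_loss:.4f}, validation = {valid_loss:.4f}")
    print("degree chosen by validation:", best_degree(rows))
```

The same bookkeeping applies to iterative training: record the validation loss after each epoch and keep the model from the epoch where it was lowest.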
In machine learning, overparametrization is a crucial technique that enhances the predictive power of models. Because it allows many equally good sets of parameters, it smooths the shape of the loss function in parameter space, which helps gradient methods converge to a minimum. The non-uniqueness of the solution raises no concern as long as the solution generalizes well. If the same relaxed approach were applied to statistical model fitting, one might wonder whether additional precision could be gained in the fits. The answer lies in considerably expanding the space of functions used and devising a stochastic method to select the best one. Surprisingly, the distinction between a closed-form function used for fitting and a complex neural network with numerous nodes and layers is mostly artificial. The core problem is the same, and it is the algorithms that make the two appear distinct.
While there is value in recognizing the connection between statistics and machine learning, it is essential to maintain a clear separation between the statistical problem of data fitting and the machine learning problem of finding highly generalizing models. The severity of overfitting differs significantly between the two disciplines because their objectives are distinct: correctness of inference claims in statistics, and performance of results in machine learning.
This exploration of the complex challenges faced in data analysis serves as a powerful didactic tool. It encourages lateral thinking and prompts reflection on other examples where seemingly disparate fields share common underlying principles. The comprehension and avoidance of overfitting are vital for researchers and practitioners seeking accurate representations and predictions with their data.