Connecting OLS and Deep Learning

Linear regression is more related to deep learning than you might think

If you’re familiar with linear regression models, such as OLS, you’re already familiar with some of the key aspects of deep learning models. This article shows how linear regression models can be thought of as a special case of deep learning.
Deep Learning · Published December 18, 2024

I’ve talked with a number of people who have a strong background in statistical analysis and use linear and nonlinear regression models on a regular basis, but who haven’t had the opportunity to use neural networks yet. My area of focus is Natural Language Processing (NLP)—the analysis of text—so my research strongly depends on deep learning models. But if you’re mostly doing hypothesis testing and the like, deep learning likely has less relevance in your daily work.

With this article, I hope to show that if you’re familiar with linear regression models, such as Ordinary Least Squares (OLS), you’re already familiar with some of the key aspects of deep learning models. At the very least, I hope to provide a good bridge to let you explore some of the basics of deep learning.

Warning

I want to be clear that I’m not suggesting you should use neural networks or deep learning over OLS or some other traditional regression model, because they’re intended for very different purposes. They are not really substitutes for most situations. For example, OLS is much more interpretable and is appropriate for hypothesis testing, something neural networks are not generally suited for. Neural networks excel at making predictions with the greatest possible accuracy.

First, we’ll generate some very simple linear data and then fit an OLS model to that data. I’ll then show how we can fit that model numerically instead of using the usual OLS closed-form solution. That will allow us to draw comparisons to neural networks and then connect neural networks to deep learning. Along the way I’ll show you the smallest bit of Python code needed to fit each model. Each step is quite small on its own, so hopefully by the end you’ll see the tight connections between all these methods.

Sample Data

First, let’s create a simple dataset. We’ll keep it in two dimensions to make it easier to visualize, but essentially everything I discuss in this article extends immediately to higher dimensions.

For this data, we’ll use the function \(y = x\) as our data generating process. We’ll choose 100 values for \(x\), where each value will be a random number between zero and ten. \(y\) will be the same as \(x\) but with some noise added so that it will approximate a straight line (\(y = x + \epsilon\)). Due to our choice of \(x\) and \(y\), any models we fit to this data should tell us that the data is from the function \(y = x\).

Using NumPy to generate the data:

import numpy as np

rng = np.random.default_rng(42)                      # ensure we get the same dataset every time
x = rng.random((100, 1)) * 10                        # x is a vector of random numbers in [0, 10)
noise = rng.normal(loc=0, scale=0.5, size=x.shape)   # random noise from the normal distribution
y = x + noise                                        # observed y is approximately x; true relationship is y = x

If we plot these points and add the line \(y = x\), we can see that the data is approximately linear:

           x         y
0   7.739560  7.939448
1   4.388784  3.936045
2   8.585979  8.396898
3   6.973680  7.623294
4   0.941773  0.763641
..       ...       ...
95  6.302826  7.033548
96  3.618126  3.064603
97  0.876499  0.429136
98  1.180059  1.501722
99  9.618977  9.421674

[100 rows × 2 columns]

Ordinary Least Squares (OLS)

For this section, I assume a basic understanding of OLS regression.

As a reminder, a linear regression model with one independent variable (\(x\)) and one dependent variable (\(y\)) can be written as:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient for \(x\), and \(\epsilon\) is the error term, or residual.

Two key features of OLS models are that they’re linear and that they minimize the sum of squared residuals. In other words, the model is trying to find the straight line that minimizes the sum of the squared differences between the observed \(y\) values and the predicted \(y\) values, which is equivalent to minimizing the mean-squared error (MSE):

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

With OLS, we can solve this directly using a closed-form solution that is guaranteed to give the best possible fit:

\[ \hat{\beta} = (x^Tx)^{-1}x^Ty \]

where \(\hat{\beta}\) is the vector of fitted coefficients (\(\beta_0\) and \(\beta_1\) in our example) and \(x\) and \(y\) are our data vectors.
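If you want to see this computed by hand, here is a minimal NumPy sketch of the closed-form solution using the \(x\) and \(y\) generated above (the variable names are my own; we’ll fit it properly with a library next):

import numpy as np

# design matrix: a column of ones (for the intercept) alongside x
x_design = np.column_stack([np.ones_like(x), x])

# beta_hat = (x^T x)^(-1) x^T y, using the pseudoinverse for numerical stability
beta_hat = np.linalg.pinv(x_design.T @ x_design) @ x_design.T @ y

intercept_hat, slope_hat = beta_hat.ravel()  # should be close to 0 and 1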

Now, we’ll fit a standard OLS model to the data:

import statsmodels.api as sm

x_with_const = sm.add_constant(x)          # add a constant term for the intercept
ols_model = sm.OLS(y, x_with_const).fit()
print(ols_model.summary())                 # display the regression results below
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.969
No. Observations:                 100   F-statistic:                     3053.
Covariance Type:            nonrobust   Prob (F-statistic):           1.14e-75
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0257      0.101     -0.253      0.801      -0.227       0.175
x1             1.0038      0.018     55.257      0.000       0.968       1.040
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the intercept coefficient is close to zero and the slope coefficient is close to one, which nearly matches our initial data generating process of \(y = x\).1 The fitted line basically overlaps our line for \(y = x\), indicating the OLS model fits the data well.

1 The numbers are not exactly zero and one due to randomness in the data generation; they will approach the expected values as the sample size increases.

If we used the same data to fit the model again, we’d get the same result because we’re using the closed-form solution, which is guaranteed to be the best possible solution.2

2 The only potential source of variation is how the inverse of the \(x^Tx\) matrix is computed, since it is approximated numerically rather than calculated exactly. However, the method for calculating this is usually stable within libraries (almost always the Moore-Penrose pseudoinverse).

Stochastic Gradient Descent (SGD)

Rather than the typical OLS approach, we could instead fit the model numerically using an iterative method. This approach could look something like this:

  1. Pick random starting values for \(\beta_0\) and \(\beta_1\).
  2. Calculate the mean-squared error between the predicted values (\(\hat{y}\)) and the true values (\(y\)).
  3. Pick new values for \(\beta_0\) and \(\beta_1\) and see if they fit better. If so, keep them; otherwise, keep the previous values.
  4. Repeat steps 2-3 N times or until the model converges.

This is a brute force approach. We could get clever and try to pick our values for \(\beta\) more strategically depending on whether they seem to improve as we move in one direction or the other. There are a variety of algorithms for doing this—optimizing values numerically is an entire subfield of academic research.

One such approach is called Stochastic Gradient Descent (SGD). The gradient is the derivative of our loss function (MSE) with respect to the coefficient and intercept. It tells us the direction and magnitude of the greatest increase in the loss function. We can use the gradient to give us a more informed guess about how we should choose the new \(\beta\) values to improve the model fit.3

3 The gradient (\(\nabla\)) of the function \(f(\cdot)\) with respect to the parameters \(\beta\) is the vector of partial derivatives of \(f\) with respect to each model parameter: \(\nabla_{\beta} f(\beta) = \left( \frac{\partial f}{\partial \beta_0}, \frac{\partial f}{\partial \beta_1} \right)\)
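To make the idea concrete, here is a minimal sketch of plain (full-batch) gradient descent for our two parameters, using the \(x\) and \(y\) generated earlier. The learning rate and iteration count are arbitrary choices of mine, and SGD proper updates on random subsets of the data rather than the full dataset:

b0, b1 = 0.0, 0.0       # start from an arbitrary guess
learning_rate = 0.01

for _ in range(5_000):
    y_hat = b0 + b1 * x                # predictions with the current guess
    error = y_hat - y                  # residuals
    grad_b0 = 2 * error.mean()         # dMSE/db0
    grad_b1 = 2 * (error * x).mean()   # dMSE/db1
    b0 -= learning_rate * grad_b0      # step against the gradient...
    b1 -= learning_rate * grad_b1      # ...to reduce the loss

# after enough iterations, b0 drifts toward 0 and b1 toward 1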

Without getting into the precise details of SGD, we can use it to fit a linear model by numerically optimizing the coefficients and intercept:

from sklearn.linear_model import SGDRegressor

# loss="squared_error" is the mean-squared error loss function (this is the default loss,
# so the argument is not strictly necessary); penalty=None means no L1/L2 regularization,
# which makes it closer to what OLS does
sgd_model = SGDRegressor(loss="squared_error", penalty=None)
sgd_model.fit(x, y.ravel())
sgd_coefficient = sgd_model.coef_[0]
sgd_intercept = sgd_model.intercept_[0]

print(f"{sgd_coefficient=:.4f}")
print(f"{sgd_intercept=:.4f}")

sgd_coefficient=0.9816
sgd_intercept=0.0730

We’re still using the MSE loss function, although we do have other options now. We can also regularize the coefficients, but we’ve turned that off here (penalty=None) to make it closer to OLS.4 In the end, we have a well-fitting linear model that serves the same purpose as OLS but gets us there in a different way.

4 Setting penalty="l1" performs a Lasso regression and penalty="l2" performs a Ridge regression.

One difference with OLS is that every time we fit the SGD linear model, we’ll get a different coefficient and intercept. The randomness in this case is due to the input data being randomized before fitting. The takeaway here is that where you begin the numerical optimization matters. It matters so much that you may end up in a local optimum, rather than the global one. Getting stuck like this isn’t possible with OLS.
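For example, here is a quick sketch on our data (the random_state values are arbitrary); the two fits will typically land on slightly different estimates:

from sklearn.linear_model import SGDRegressor

fit_a = SGDRegressor(penalty=None, random_state=1).fit(x, y.ravel())
fit_b = SGDRegressor(penalty=None, random_state=2).fit(x, y.ravel())

# close to each other and to the OLS estimates, but not identical
print(fit_a.coef_[0], fit_a.intercept_[0])
print(fit_b.coef_[0], fit_b.intercept_[0])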

What SGD Loses Versus OLS

OLS has interpretable t-statistics and p-values for each coefficient, which can tell us how confident we are that the coefficient is not zero. We have options for this with SGD, such as bootstrapping, but we don’t get it “out of the box.”
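As a rough sketch of what the bootstrap option could look like on our data (my own illustration, not a rigorous inference procedure; 200 resamples is an arbitrary choice):

import numpy as np
from sklearn.linear_model import SGDRegressor

boot_rng = np.random.default_rng(0)
boot_slopes = []
for _ in range(200):
    # resample the data with replacement and refit the model
    idx = boot_rng.integers(0, len(x), size=len(x))
    boot_model = SGDRegressor(penalty=None).fit(x[idx], y[idx].ravel())
    boot_slopes.append(boot_model.coef_[0])

# rough 95% interval for the slope from the bootstrap distribution
low, high = np.percentile(boot_slopes, [2.5, 97.5])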

Unless we set the random_state parameter of SGDRegressor() to a constant number, we also get some instability in the results. Even if we do prevent the randomness, OLS is a deterministic, guaranteed-best solution, while SGD is only an approximation.

On the other hand, the coefficients and intercept are still interpretable as a straight line. If we do get some sort of confidence interval, we can also make claims about the significance of the coefficients. Overall, we haven’t lost much compared to OLS, although we may have caused ourselves a bit more work.5

5 We also need to choose various hyperparameters for SGD, such as the learning rate and number of iterations.

What SGD Gains Versus OLS

So is there any reason to choose SGD over OLS?

The short answer is that if OLS works for you, then no. But SGD gives us much more flexibility. For example:

  1. We could regularize the parameters to encourage sparsity in our coefficients without switching to a different model. Regularization can also be helpful in addressing multicollinearity.
  2. As an iterative method, SGD can process data in batches, so it can fit the linear model even when the data is too big to fit into our machine’s memory. Realistically, simple linear models are rarely used in situations like this, though.
  3. Related to #2, we can add more data later and “update” the model without having to start over from scratch (see the sketch below).
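As a sketch of points 2 and 3 (the batch size of 25 is arbitrary, and x_new/y_new below are hypothetical future data), SGDRegressor’s partial_fit() lets us feed the data in chunks and update the fitted model later:

from sklearn.linear_model import SGDRegressor

batch_model = SGDRegressor(penalty=None)

# feed the data in chunks; each call nudges the current coefficients
# (in practice we would loop over the data several times)
for start in range(0, len(x), 25):
    batch_model.partial_fit(x[start:start + 25], y[start:start + 25].ravel())

# later, new observations can update the same model without refitting from scratch
# batch_model.partial_fit(x_new, y_new.ravel())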

Neural Network

So far we have used two techniques—OLS from classical statistics and SGD from machine learning—to fit a linear regression model. So how does this connect to deep learning and neural networks? A neural network is a series of linear models. Each linear model is called a layer, and each layer is connected to the ones before and after it in the chain. These connections have weights (what we’ve been calling coefficients) and biases (what we’ve been calling intercepts).

The first layer is called the input layer and the last layer is called the output layer. Any layers in between those two are called hidden layers. We must have an input layer and an output layer, but hidden layers are optional. If we have quite a few hidden layers, we call the neural network a “deep” neural network, thus the term deep learning.6

6 “Quite a few” depends on context and who you ask, but this probably means at least 3-5 layers but could mean hundreds or thousands. Even within deep learning models, some have many layers and parameters, others have far fewer.

For our simple linear model, we have one input layer (\(x\)), one output layer (\(y\)), and no hidden layers. In general, each layer can take vectors and matrices as inputs and outputs, but our simple example has just a single value (a scalar) for both the input and output layers. If \(x\) were instead a vector (\(\mathbf{x}\)), like we would see with multivariate regression, we would have the more general linear model that we frequently see when using OLS:

graph LR
    x["**x**"] -->|bias| y["y"]
    x -->|weight 1| y
    x -->|weight 2| y
    x -->|...| y
    x -->|weight n| y

With OLS, we have a closed-form, deterministic solution, although we saw earlier that stochastic gradient descent can do something similar numerically, albeit with some randomness and other potential drawbacks. Unfortunately, we no longer have a closed-form solution for the general representation used by neural networks. On the bright side, since we have to fit the model using optimization techniques anyway, we can make a vast number of changes to the model architecture with little additional penalty.

The optimal weights and biases for the neural network are found iteratively through a process called training. I’m not going to get into the specific details of how to train neural networks, but the general procedure is to:

  1. Initialize the weights (coefficients) and biases (intercept) to some random values.
  2. Pass the data through the network to get the predicted values using the current weights and biases (the forward pass).
  3. Calculate the difference between the predicted values and the true values. OLS calls this the error or residual; it’s called the loss when training neural networks. We calculate it using a loss function, which is often the mean-squared error.
  4. Use the loss to determine how we can adjust the weights and biases to hopefully improve the model fit. We can use the SGD method from the section above to do a lot of the heavy lifting for us by using the loss function as the input for the SGD optimizer (this begins the backward pass).
  5. Update the weights and biases in the direction that minimizes the loss.
  6. Repeat steps 2-5 N times or until the model converges. Each pass through the data is called an epoch. We will typically set some maximum number of epochs to prevent the model from training forever, with the possibility of stopping early if the model fit is no longer improving.

This training loop is implicit in the SGDRegressor() function from the previous section; for neural networks, we’ll need to build it ourselves for maximum control and flexibility. In Python code, the iterative training loop looks something like this for our sample data:

import torch
from torch.optim import SGD

nn_model = torch.nn.Linear(1, 1)      # 1 linear layer with 1 input and 1 output
calculate_mse = torch.nn.MSELoss()    # loss function is mean-squared error
sgd_optimizer = SGD(nn_model.parameters(), lr=0.01)  # optimizer is stochastic gradient descent

# convert numpy arrays to pytorch tensors
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# training loop
NUM_EPOCHS = 1000
for _ in range(NUM_EPOCHS):
    # forward pass to calculate the loss
    outputs = nn_model(x_tensor)  # equivalent to nn_model.forward(x_tensor); you may see either
    loss = calculate_mse(outputs, y_tensor)

    # backward pass to update weights/biases to try to improve next epoch
    sgd_optimizer.zero_grad()  # prepare to run backward pass
    loss.backward()            # compute gradient
    sgd_optimizer.step()       # use gradient to update weights

nn_weight = nn_model.weight.item()
nn_bias = nn_model.bias.item()

print(f"{nn_weight=:.4f}")
print(f"{nn_bias=:.4f}")

nn_weight=1.0026
nn_bias=-0.0179

Once again we see our weight/coefficient is close to one and the bias/intercept is close to zero, as expected. OLS, SGD, and neural networks all give us approximately the same result when fitting a linear model to our simple dataset.

Simplicity Matters

We should always choose the simplest effective model. Researchers frequently favor interpretability over accuracy so they can effectively test hypotheses, leading them to choose OLS. Simpler models also help reduce overfitting. But if raw accuracy is paramount—often the case when making predictions—or we know we have highly nonlinear data, neural networks give us extreme flexibility.

Deep Learning

The next question is how to get from our simple neural network to practical deep learning models. At a basic level, a deep neural network is simply a neural network with many hidden layers. But there are many considerations and options when building a usable deep learning model.

Activation Functions

Our first instinct might be to add more layers to our current neural network, since having “many layers” is a prerequisite for deep learning. But it turns out that stacked linear layers mathematically collapse down to a single linear layer. To get any benefit from additional layers, we need to add some sort of nonlinearity to our model. We can do this with a nonlinear activation function, which sits between the hidden layers and transforms the linear output of one hidden layer into a nonlinear input for the next layer (see the sketch after the list below).7 Some activation functions work better than others; commonly used options are: 8

7 The Universal Approximation Theorem says that a neural network can approximate any continuous function with the use of hidden layers and nonlinear activation functions. For example, a logistic regression looks similar to our linear regression model but with a sigmoid activation function.

8 Research is ongoing regarding optimal activation function choice, but some version of the ReLU is commonly used these days.

  1. Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
  2. Hyperbolic Tangent: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
  3. Rectified Linear Unit (ReLU): \(\text{relu}(x) = \max(0, x)\)
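As a sketch of why the nonlinearity matters (my own illustration; the hidden size of 8 is arbitrary), here is how the two cases look in PyTorch. The training loop from the earlier section would work unchanged for either model:

import torch

# two stacked Linear layers with nothing between them collapse to a single linear map
collapsed = torch.nn.Sequential(
    torch.nn.Linear(1, 8),
    torch.nn.Linear(8, 1),
)

# a nonlinear activation between the layers prevents that collapse
deep_model = torch.nn.Sequential(
    torch.nn.Linear(1, 8),   # input -> hidden layer
    torch.nn.ReLU(),         # nonlinear activation
    torch.nn.Linear(8, 1),   # hidden layer -> output
)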

An Abundance of Choice

In addition to choosing an activation function, we need to make some other decisions, which is where the art of working with deep learning models comes in. For example:

  • Model Size How many hidden layers should we have, and how many weights should each layer have? Too many parameters (# weights + # biases) could overfit the data, but too few will underfit—fortunately, there are many techniques for handling overfitting. Large models also take considerably longer and require more resources to train and use.
  • Training Procedure We need to tune the training loop by choosing, for example:
    1. optimizer algorithm (SGD is only one option) and related parameters, such as learning rate
    2. loss function
    3. number of epochs
    4. batching
    5. early stopping rules
  • Training Data How should we structure the training data? We usually have some sort of train-test split, but there are many options for splitting the data.9 Do we have enough of the right type of data and how do we know that?
  • Model Architecture Real-world models often use a more complex structure than what I’ve presented here. The following architectures are common, with new options being developed all the time:
    1. Convolutional Neural Networks (CNNs)
    2. Recurrent Neural Networks (RNNs)
    3. Long Short-Term Memory (LSTM) networks
    4. Transformer networks: commonly used in Natural Language Processing (NLP)

9 The training set is used to fit the model, the test set is used to evaluate the model fit. However, terminology varies wildly and can be used inconsistently between people.

Classification

So far, I’ve mainly discussed regression, where we try to fit/predict a continuous output variable. I’ve mostly avoided references to classification, which assigns an observation to one of two or more categories.10 A great deal of deep learning is actually focused on classification, including in the natural language processing (NLP) field.

10 A common example from statistics is logistic regression, which supports two classes. It uses the sigmoid function as its “activation function,” so we could easily replicate it with our neural network.

11 \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\)

12 The outputs are all between 0 and 1, and they collectively sum to 1.

Not only can the inputs to a neural network be a vector—as they are in multivariate regression—but the outputs can be a vector as well. For classification, we can use this ability to have one output per class we’re interested in predicting. Then we use the softmax function 11 to convert the outputs of the neural network into a set of numbers we can interpret as probabilities.12 We then predict the class with the highest probability, and how close that probability is to 1 gives us an idea of how confident the model is in its prediction.
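Here is a minimal NumPy sketch of that final step (the logits are made-up outputs from a hypothetical three-class network):

import numpy as np

logits = np.array([2.0, 0.5, -1.0])     # raw network outputs, one per class

shifted = logits - logits.max()         # subtract the max for numerical stability
probabilities = np.exp(shifted) / np.exp(shifted).sum()  # softmax: non-negative, sums to 1

predicted_class = int(np.argmax(probabilities))  # index of the most probable class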

Conclusion

I’ve barely scratched the surface of deep learning in this article, but if you’ve been curious about deep learning and haven’t yet taken the plunge, I hope I’ve helped you see that you already have a foundation to build on if you know a bit about basic statistics.

Next steps for you might include:
