R^2: A Simple Guide to Understanding Model Fit
A number to answer, 'How well does my model fit the data?'
Published on 2023-08-16 | 4m 52s
There was a time in one of my previous work projects where I had to help someone model data coming from our database. I exported the data, which meant I now had to pick the best model. Is a linear model better? An exponential model? A polynomial model? There is one number I used to help me in my decision-making: the R^2.
Explanation of R^2
Before we go into the mathematics of the R^2, here is what the number tells us: how much of the given data a model can explain. If a model’s R^2 is 1, every data point is perfectly explained by the model and lies exactly on its fitted line or curve. If it’s 0.5, the model explains 50% of the variation in the data.
R^2 tells us, “What percentage of the variation in our data can be explained by our model?”
This is the reason it’s one of the numbers I used to pick which model is best to use. If a model has an R^2 of 0.95, there’s only 5% of the variation it cannot explain, which is usually better than a model with a lower R^2.
The Math Part
The formula for R^2 is very simple: 1 minus the ratio of the sum of squared residuals to the total sum of squares. I’ll explain what those are in a moment. For now, here is the equation:

R^2 = 1 - (SSR / SST)
SSR stands for the “Sum of Squared Residuals”, while SST stands for the “Total Sum of Squares”.
The SSR is computed by taking the difference between each true value and its predicted value, squaring it, and summing the results: SSR = Σ(y_i - ŷ_i)^2. It measures how wrong the model was in predicting the values.
The SST is computed by subtracting the mean of the dataset from each true value, squaring the result, and summing: SST = Σ(y_i - ȳ)^2. It measures the total variation in the data itself.
The reason we square both is to ensure we don’t get negative values and to give larger weight to larger residuals and variations - i.e., larger errors get magnified. Looking at the equation for R^2, you’ll notice that we’re dividing the model’s discrepancy (SSR) by the total variation in the data (SST). That ratio answers, “What fraction of the variation is left unexplained by the model?” Subtracting it from 1 gives us how much is explained by the model itself.
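As a quick worked example with made-up numbers: say the true values are 2, 4, and 6 (so the mean is 4) and the model predicts 2.5, 3.5, and 6. Then SSR = 0.25 + 0.25 + 0 = 0.5, SST = 4 + 0 + 4 = 8, and R^2 = 1 - 0.5/8 = 0.9375 - the model explains about 94% of the variation.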
Code Part
Now, unless you hate computers, you won’t be doing this by hand. The code below calculates it for us in both Python and Google Sheets. The amazing thing is that the formula works for any regression model - including machine learning or deep learning models.
# NumPy Version
import numpy as np

def calculate_r_2(y_true, y_predicted):
    """Calculates R^2 for a particular model.

    Args:
        y_true: A numpy array of the true values of the data.
        y_predicted: A numpy array of the predicted values of the data.

    Returns:
        A floating point number that can be expressed as a percentage
        of how much of the data is explained by the model.
    """
    y_mean = np.mean(y_true)
    ss_residual = np.sum((y_true - y_predicted)**2)
    ss_total = np.sum((y_true - y_mean)**2)
    r_squared = 1 - (ss_residual / ss_total)
    return r_squared
# Pure Python Version
def calculate_r_2(y_true, y_predicted):
    """Calculates R^2 for a particular model.

    Args:
        y_true: A list of the true values of the data.
        y_predicted: A list of the predicted values of the data.

    Returns:
        A floating point number that can be expressed as a percentage
        of how much of the data is explained by the model.
    """
    y_mean = sum(y_true) / len(y_true)
    ss_residual = sum((y_true_val - y_pred_val)**2 for (y_true_val, y_pred_val) in zip(y_true, y_predicted))
    ss_total = sum((y_true_val - y_mean)**2 for y_true_val in y_true)
    r_squared = 1 - (ss_residual / ss_total)
    return r_squared
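As a quick sanity check, here is how either version can be called using the toy numbers from the worked example earlier (the values are made up for illustration; wrap the lists in np.array first if you use the NumPy version):

y_true = [2, 4, 6]
y_predicted = [2.5, 3.5, 6.0]
print(calculate_r_2(y_true, y_predicted))  # 0.9375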
You can also use the formula below to calculate the R^2 value in Google Sheets. You will also see it when you graph the results and add a best-fit trendline.
=1 - SUM(ARRAYFORMULA((y_true - y_predicted)^2)) / SUM(ARRAYFORMULA((y_true - AVERAGE(y_true))^2))
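In that formula, y_true and y_predicted stand for whatever cell ranges hold your actual and predicted values - for example, A2:A100 and B2:B100 (those exact addresses are just placeholders).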
Caveats and Conclusion
Now, I must give a caveat: R^2 is not the only metric you should use to evaluate whether you should use a particular model. Even if we assume your data is perfect and your methods are sound, picking the model with the highest R^2 isn’t the be-all and end-all.
For one, you’re most likely going to explain the model to someone else, and from my experience, no one likes to hear, “I don’t know why, but it works.” That is why I sometimes picked an exponential model over a polynomial model despite it having a slightly lower R^2. It is easier to say, “We currently grow 5% year on year” than “Well, if we square the number of years since this time and multiply it by this number, we get this, plus the time multiplied by this number, plus this constant.”
Another consideration is the model’s computational performance. Sometimes it is acceptable for a high-R^2 model to take a while to produce results; other times, you want it to be as fast as possible.
Also, R^2 isn’t meant for classification models - don’t use it there. Better choices are accuracy, F1 scores, and confusion matrices.
Also, don’t forget that you could be overfitting. An interpolating function passes through every training point, so its R^2 on that data is exactly 1, yet it is usually the wrong model for predicting future values.
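If you want to see that trap in action, here is a minimal sketch (the data is made up, and it reuses the NumPy calculate_r_2 from above): a degree-5 polynomial through six noisy points interpolates them exactly, but falls apart on new values.

import numpy as np

rng = np.random.default_rng(42)
x_train = np.arange(6, dtype=float)            # six training points
y_train = 2 * x_train + rng.normal(0, 1, 6)    # a noisy straight line

# A degree-5 polynomial through 6 points passes through all of them exactly.
coeffs = np.polyfit(x_train, y_train, deg=5)
print(calculate_r_2(y_train, np.polyval(coeffs, x_train)))  # ~1.0 on the training data

# On new x values, the interpolant swings wildly away from the true line.
x_new = np.arange(6, 10, dtype=float)
y_new = 2 * x_new
print(calculate_r_2(y_new, np.polyval(coeffs, x_new)))      # far below 1, usually negative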
Furthermore, don’t discount a model just because of a low R^2. For extremely complex phenomena, low values are normal. R^2 tells you how much of the variation in the data the model explains. If R^2 is 0.5, the model explains 50% of the variation and cannot explain the other 50%. If that is the best model you can come up with, you probably still want to use it, with that caveat.
In summary, R^2 is a tool - don’t use it for things it was never meant to do. Always take the numbers together and judge models on all of them. If R^2 is high, p-values are low, and test accuracy is high, there’s a good chance you have a good model. But if your R^2 is high and every other relevant number points the wrong way, then you should probably ditch the model.