R-squared
\( R^2 \) measure
\( R^2 \), the coefficient of determination, often just referred to as "variance explained", is defined as:

\[ R^2 = 1 - \frac{A}{B}, \qquad A = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad B = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

where \( y_i \) are the observed values, \( \hat{y}_i \) the model's predictions, and \( \bar{y} \) the mean of the observed values.
Linear regression
Think of the standard linear regression and two differences for each data point: the residual, the vertical distance from the point down to the regression line, and the deviation from the mean, the "error" term that contributes to the data variance.
The numerator \( A \) of \( \frac{A}{B} \) can be thought of as the residual, unexplained term that contributes to the variance, and the denominator \( B \) as the total variation of the data about its mean. The fraction \( \frac{A}{B} \) is then the fraction of unexplained variance, and \( R^2 = 1 - \frac{A}{B} \) the fraction explained by the regression.
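A minimal numpy sketch of these two sums, using small hypothetical arrays for the observations y and the model predictions y_hat:

import numpy as np

# Hypothetical observations and model predictions
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

A = np.sum((y - y_hat) ** 2)         # residual (unexplained) sum of squares
B = np.sum((y - np.mean(y)) ** 2)    # total sum of squares about the mean

print(A / B)      # fraction of unexplained variance
print(1 - A / B)  # R^2, the fraction of variance explained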
Mean squared error perspective
The \( R^2 \) value can also be phrased in terms of mean squared error.
- Baseline mean squared error: For the 1D array of \( y \) values, calculate \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \). Note how the calculation involves \( \frac{1}{n} \) and not \( \frac{1}{n-1} \), which highlights why MSE is a good perspective: the calculation is not an estimate of the true variance.
- Model mean squared error: After calculating the model predictions, a 1D array \( \hat{y} \), the model mean squared error is \( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \) (see the sketch after this list).
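Dividing the model MSE by the baseline MSE and subtracting from one gives \( R^2 \); the \( \frac{1}{n} \) factors cancel, so this is the same quantity as the sum-of-squares definition above. A minimal sketch of the equivalence, assuming y and y_hat are 1D numpy arrays of observations and predictions:

import numpy as np

y = np.array([2.0, 4.1, 5.9, 8.2])       # hypothetical observations
y_hat = np.array([2.1, 4.0, 6.0, 8.0])   # hypothetical model predictions

baseline_mse = np.mean((y - np.mean(y)) ** 2)  # note: 1/n, not 1/(n-1)
model_mse = np.mean((y - y_hat) ** 2)

r_squared = 1 - model_mse / baseline_mse
# The 1/n factors cancel, so this matches the sum-of-squares form:
assert np.isclose(r_squared, 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2))
print(r_squared)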
From this perspective, the peculiarity of flat lines is more apparent: a flat fitted line necessarily has a low \( R^2 \) value, as its MSE will be similar to the baseline MSE, and so data that genuinely lies along a flat line scores poorly no matter how well the line describes it. The \( R^2 \) value is therefore not just a goodness-of-fit measure, but a measure of the dependency of \( y \) on \( x \). Alarm bells should ring if it makes sense to transform the \( x \)-\( y \) space, say by a matrix, as such a transform would suggest an equivalence between all straight lines, each of which would nevertheless have a different \( R^2 \) value.
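To illustrate the flat-line case, here is a small sketch with hypothetical data in which \( y \) barely depends on \( x \); the best-fit line describes the points well, yet \( R^2 \) stays near zero:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 3.1, 2.9, 3.0, 3.2, 2.8])  # essentially flat: y does not depend on x

slope, intercept = np.polyfit(x, y, 1)  # best-fit straight line
y_pred = slope * x + intercept

r_squared = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(r_squared)  # close to 0, even though the line tracks the data closely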
Calculate \( R^2 \) with numpy
import numpy as np


def test_linearity(x, y):
    """
    Tests if a straight line models the given data well.

    Parameters:
        x (np.ndarray): 1D array of independent variable values.
        y (np.ndarray): 1D array of dependent variable values.

    Returns:
        dict: A dictionary containing the slope, intercept, and R^2 score.
    """
    # Fit a line using least squares
    A = np.vstack([x, np.ones_like(x)]).T
    slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

    # Predict y values based on the line
    y_pred = slope * x + intercept

    # Calculate R^2
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
    }


# Example data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 2.1, 4.2, 6, 8.2, 10])

result = test_linearity(x, y)
print(f"Slope: {result['slope']}")
print(f"Intercept: {result['intercept']}")
print(f"R^2 Score: {result['r_squared']}")

# Interpretation
if result["r_squared"] > 0.95:  # Threshold can be adjusted
    print("The data is well modeled by a straight line.")
else:
    print("The data is not well modeled by a straight line.")