CISC482 - Lecture15

Evaluating Model Performance 2

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 6-2: Mar 10 @ 12PM, Friday
  • Reading 6-3: Mar 22 @ 12PM, Wednesday
  • Proposal: Mar 22, Wednesday
  • Proposal Template!
  • HW5 - Mar 29 @ Midnight, Wednesday

Today

  • Review Overfit/Underfit
  • Review Regression and Classification Metrics
  • Training vs Testing Set

Review

Overfit/Underfit

  • Overfit - model is too complex for the data.
    • Fitting the data too closely
    • Incorporating too much noise (meaningless variation)
    • Misses the general trend
  • Underfit - model is too simple to fit the data well.
    • Large systematic error (both cases are sketched in code below)
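A minimal sketch of both failure modes on made-up data, using numpy's polyfit (the same tool used in the example later in these slides): a degree-0 fit is too simple and has a large systematic error, while a high-degree fit bends toward the noise.

Code
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + rng.normal(scale=3, size=x.size)  # true trend is linear, plus noise

underfit = np.poly1d(np.polyfit(x, y, 0))   # a constant: too simple, large systematic error
good_fit = np.poly1d(np.polyfit(x, y, 1))   # a line: matches the general trend
overfit = np.poly1d(np.polyfit(x, y, 9))    # high degree: chases the noise

for name, model in [("underfit", underfit), ("good fit", good_fit), ("overfit", overfit)]:
    rmse = np.sqrt(np.mean((y - model(x)) ** 2))
    print(f"{name}: training RMSE = {rmse:.2f}")
# The overfit model has the lowest *training* error, but that does not mean it
# will predict new data well -- which is the point of the later slides.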

Question - Underfit or Overfit?

Regression Metrics

  • R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)

    \[ R^2 = \frac{\text{variation explained by regression}}{\text{total variation in the data}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

\[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

  • What is the range? Units? (See the sketch below.)
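A minimal sketch on made-up data: fit a simple linear regression, compute \(R^2\) and RMSE directly from the formulas above, and compare against sklearn's helpers. (\(R^2\) is unitless; RMSE carries the units of the outcome.)

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.5 * x + 2 + rng.normal(scale=2, size=x.size)   # linear trend plus noise

model = LinearRegression().fit(x[:, np.newaxis], y)
y_hat = model.predict(x[:, np.newaxis])

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # explained / total variation
rmse = np.sqrt(np.mean((y - y_hat) ** 2))                           # same units as y

print(f"R^2  = {r2:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"sklearn check: R^2 = {r2_score(y, y_hat):.3f}, "
      f"RMSE = {np.sqrt(mean_squared_error(y, y_hat)):.3f}")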

Binary Classification Metrics

  • True Positive (TP) is an outcome that was correctly identified as positive.
  • True Negative (TN) is an outcome that was correctly identified as negative.
  • False Positive (FP) is an outcome that was incorrectly identified as positive.
  • False Negative (FN) is an outcome that was incorrectly identified as negative (counted in the sketch below).
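A minimal sketch with made-up labels (1 = positive, 0 = negative): count the four outcomes with sklearn's confusion_matrix, whose binary output is laid out as [[TN, FP], [FN, TP]].

Code
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual outcomes
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model's predictions

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1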

Metrics

  • Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • Precision: \(\frac{TP}{TP + FP}\)
  • Recall: \(\frac{TP}{TP + FN}\) (all three metrics are computed in the sketch below)
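A minimal sketch continuing the made-up labels above: compute all three metrics straight from the formulas and confirm against sklearn's built-in scorers.

Code
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

print(f"accuracy  = {(tp + tn) / (tp + tn + fp + fn):.2f} "
      f"(sklearn: {accuracy_score(y_true, y_pred):.2f})")
print(f"precision = {tp / (tp + fp):.2f} "
      f"(sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"recall    = {tp / (tp + fn):.2f} "
      f"(sklearn: {recall_score(y_true, y_pred):.2f})")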

True Model Evaluation

Purpose of model evaluation

  • \(R^2\), recall, etc. tell us how well our model predicts the data we already have
  • But generally we are interested in prediction for a new observation, not for one that is already in our sample, i.e. out-of-sample prediction
  • We have a couple ways of simulating out-of-sample prediction before actually getting new data to evaluate the performance of our models

Splitting data

  • There are several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.
  • Doing all of this on the entire data we have available leaves us with no other data to assess our choices
  • We can allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)

The Split

  • Training data is used to fit a model.
  • Validation data is used to evaluate model performance while tuning hyperparameters and conducting feature selection. We also use it to choose between competing models. This split is not always needed!
  • Test data is used to evaluate final model performance and compare different models.
  • The ratio for this split is commonly 80/10/10 or 70/10/20 (a three-way split is sketched below)
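A minimal sketch with placeholder X and y (hypothetical names, not data from these slides): an 80/10/10 train/validation/test split built from two calls to sklearn's train_test_split.

Code
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # placeholder features
y = np.arange(100)                  # placeholder outcomes

# First hold out 20% of the data, then split that holdout evenly into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 80 10 10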

Visualization

An Example!

Code
# Assumes the feature array x, its 2-D version X = x[:, np.newaxis], and the
# outcome y were generated on an earlier slide.
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

linear_model = LinearRegression()
linear_model.fit(X, y)
print(f"Linear Model R^2 = {r2_score(y, linear_model.predict(X)):.2f}")

degree = 5
quadratic_model = np.poly1d(np.polyfit(x, y, degree)) # degree-5 polynomial (despite the variable name)
print(f"Polynomial Model R^2 = {r2_score(y, quadratic_model(x)):.2f}")

x_graph = np.linspace(0, 10, 100)
ax = sns.scatterplot(x=x, y=y, label="All Data")
ax.plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax.plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression')
ax.legend();

An Example!

Linear Model R^2 = 0.55
Polynomial Model R^2 = 0.64

Split the Data!

Code
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

fig, ax = plt.subplots(nrows=1, ncols=2) # create 2 plots!
# Plot training result
sns.scatterplot(x=X_train[:,0], y=y_train, ax=ax[0], label="Training Data")
ax[0].set_title("Training Set")
ax[0].legend()
# Plot testing result
sns.scatterplot(x=X_test[:,0], y=y_test, ax=ax[1], label="Testing Data")
ax[1].set_title("Testing Set")
ax[1].legend();

Train/Test Split

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

linear_model.fit(X_train, y_train)
quadratic_model = np.poly1d(np.polyfit(X_train[:,0], y_train, degree)) # degree-5 polynomial fit on the training set only

print(f"TRAIN SET - Linear Model R^2 = {r2_score(y_train, linear_model.predict(X_train)):.2f}; Polynomial Model R^2 = {r2_score(y_train, quadratic_model(X_train[:,0])):.2f}")
print(f"TEST SET - Linear Model R^2 = {r2_score(y_test, linear_model.predict(X_test)):.2f}; Polynomial Model R^2 = {r2_score(y_test, quadratic_model(X_test[:,0])):.2f}")


fig, ax = plt.subplots(nrows=1, ncols=2) # create 2 plots!
# Plot training result
x_graph = np.linspace(0, 10, 100)
sns.scatterplot(x=X_train[:,0], y=y_train, ax=ax[0], label="Training Data")
ax[0].plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax[0].plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression');
# ax[0].scatter(np.sort(X_train[:, 0]), quadratic_model(np.sort(X_train[:, 0])), color='m');
ax[0].set_title("Training Set")
ax[0].legend()
# Plot testing result
sns.scatterplot(x=X_test[:,0], y=y_test, ax=ax[1], label="Testing Data")
ax[1].plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax[1].plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression');
# ax[1].scatter(np.sort(X_test[:, 0]), quadratic_model(np.sort(X_test[:, 0])), color='m');
ax[1].set_title("Testing Set")
ax[1].legend();
TRAIN SET - Linear Model R^2 = 0.44; Polynomial Model R^2 = 0.62
TEST SET - Linear Model R^2 = 0.65; Polynomial Model R^2 = 0.41

Cross Validation

Cross Validation

  • It seems a little strange to judge our models on a single random selection of observations for the validation set.
  • It's as if our choice of model is biased toward that particular random validation set.
  • Solution: repeat this process (fit a model, validate the model) multiple times using different subsets of the data -> Cross Validation (see the sketch below)
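A minimal sketch on made-up data: sklearn's cross_val_score refits the model on several different train/validation splits and reports one score per fold, so the evaluation no longer hinges on a single random validation set.

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=3, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)                                        # one R^2 per validation fold
print(f"mean R^2 = {scores.mean():.2f} +/- {scores.std():.2f}")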

K-Folds

  • k-fold cross-validation is a popular method of evaluating model performance
  • Step 1: Choose k, usually 10
  • Step 2: Shuffle and divide our data, X, into X_fold and X_test.
  • Step 3: Divide X_fold into k groups (folds)
  • Step 4: The model is trained and validated repeatedly using these groups (sketched in code below)
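A minimal sketch of the four steps above on made-up data: hold out X_test, split the remaining X_fold into k groups with sklearn's KFold, and train/validate once per fold.

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=3, size=100)

# Step 2: shuffle and divide into X_fold (for cross-validation) and X_test (kept aside).
X_fold, X_test, y_fold, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Steps 3-4: divide X_fold into k groups; train on k-1 groups, validate on the remaining one.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X_fold)):
    model = LinearRegression().fit(X_fold[train_idx], y_fold[train_idx])
    score = r2_score(y_fold[val_idx], model.predict(X_fold[val_idx]))
    print(f"fold {i}: validation R^2 = {score:.2f}")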

K-Folds Split (k=10)

K-Folds Train and Validate

Choosing k

  • The larger the k, the more models need to be trained.
  • The larger the k, the larger each training fold.
  • Together, this vastly increases computational requirements.
  • For very large models you need a small \(k\), e.g. ChatGPT!
  • The most common choice for small/medium models is k=10.

Class Activity

Class Activity

Model Evaluation