CISC482 - Lecture15

Evaluating Model Performance 2

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 6-2: Mar 10 @ 12PM, Friday
  • Reading 6-3: Mar 22 @ 12PM, Wednesday
  • Proposal: Mar 22, Wednesday
  • Proposal Template!
  • HW5 - Mar 29 @ Midnight, Wednesday

Today

  • Review Overfit/Underfit
  • Review Regression and Classification Metrics
  • Training vs Testing Set

Review

Overfit/Underfit

  • Overfit - model is too complex for the data.
    • Fitting the data too closely
    • Incorporating too much noise (meaningless variation)
    • Misses the general trend
  • Underfit - model is too simple to fit the data well.
    • Large systematic error (both cases are sketched in code below)
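A minimal sketch of both failure modes on made-up data, using numpy's polyfit (the same tool used in the example later in these slides): a degree-0 fit is too simple and has a large systematic error, while a high-degree fit bends toward the noise.

Code
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + rng.normal(scale=3, size=x.size)  # true trend is linear, plus noise

underfit = np.poly1d(np.polyfit(x, y, 0))   # a constant: too simple, large systematic error
good_fit = np.poly1d(np.polyfit(x, y, 1))   # a line: matches the general trend
overfit = np.poly1d(np.polyfit(x, y, 9))    # high degree: chases the noise

for name, model in [("underfit", underfit), ("good fit", good_fit), ("overfit", overfit)]:
    rmse = np.sqrt(np.mean((y - model(x)) ** 2))
    print(f"{name}: training RMSE = {rmse:.2f}")
# The overfit model has the lowest *training* error, but that does not mean it
# will predict new data well -- which is the point of the later slides.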

Question - Underfit or Overfit?

Regression Metrics

  • R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)

    \[ R^2 = \frac{\text{variation explained by regression}}{\text{total variation in the data}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

\[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

  • What is the range? Units? (See the sketch below.)
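A minimal sketch on made-up data: fit a simple linear regression, compute \(R^2\) and RMSE directly from the formulas above, and compare against sklearn's helpers. (\(R^2\) is unitless; RMSE carries the units of the outcome.)

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.5 * x + 2 + rng.normal(scale=2, size=x.size)   # linear trend plus noise

model = LinearRegression().fit(x[:, np.newaxis], y)
y_hat = model.predict(x[:, np.newaxis])

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # explained / total variation
rmse = np.sqrt(np.mean((y - y_hat) ** 2))                           # same units as y

print(f"R^2  = {r2:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"sklearn check: R^2 = {r2_score(y, y_hat):.3f}, "
      f"RMSE = {np.sqrt(mean_squared_error(y, y_hat)):.3f}")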

Binary Classification Metrics

  • True Positive (TP) is an outcome that was correctly identified as positive.
  • True Negative (TN) is an outcome that was correctly identified as negative.
  • False Positive (FP) is an outcome that was incorrectly identified as positive.
  • False Negative (FN) is an outcome that was incorrectly identified as negative (counted in the sketch below).
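A minimal sketch with made-up labels (1 = positive, 0 = negative): count the four outcomes with sklearn's confusion_matrix, whose binary output is laid out as [[TN, FP], [FN, TP]].

Code
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual outcomes
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model's predictions

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1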

Metrics

  • Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • Precision: \(\frac{TP}{TP + FP}\)
  • Recall: \(\frac{TP}{TP + FN}\) (all three metrics are computed in the sketch below)
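A minimal sketch continuing the made-up labels above: compute all three metrics straight from the formulas and confirm against sklearn's built-in scorers.

Code
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

print(f"accuracy  = {(tp + tn) / (tp + tn + fp + fn):.2f} "
      f"(sklearn: {accuracy_score(y_true, y_pred):.2f})")
print(f"precision = {tp / (tp + fp):.2f} "
      f"(sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"recall    = {tp / (tp + fn):.2f} "
      f"(sklearn: {recall_score(y_true, y_pred):.2f})")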

True Model Evaluation

Purpose of model evaluation

  • \(R^2\), recall, etc. tell us how well our model predicts the data we already have
  • But generally we are interested in prediction for a new observation, not for one that is already in our sample, i.e. out-of-sample prediction
  • We have a couple ways of simulating out-of-sample prediction before actually getting new data to evaluate the performance of our models

Splitting data

  • There are several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.
  • Doing all of this on the entire data we have available leaves us with no other data to assess our choices
  • We can allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)

The Split

  • Training data is used to fit a model.
  • Validation data is used to evaluate model performance while tuning hyperparameters and conducting feature selection. We also use it to choose between competing models. This split is not always needed!
  • Test data is used to evaluate final model performance and compare different models.
  • The ratio for this split is commonly 80/10/10 or 70/10/20 (a three-way split is sketched below)
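A minimal sketch with placeholder X and y (hypothetical names, not data from these slides): an 80/10/10 train/validation/test split built from two calls to sklearn's train_test_split.

Code
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # placeholder features
y = np.arange(100)                  # placeholder outcomes

# First hold out 20% of the data, then split that holdout evenly into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 80 10 10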

Visualization

An Example!

Code
# Assumes the feature array x, its 2-D version X = x[:, np.newaxis], and the
# outcome y were generated on an earlier slide.
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

linear_model = LinearRegression()
linear_model.fit(X, y)
print(f"Linear Model R^2 = {r2_score(y, linear_model.predict(X)):.2f}")

degree = 5
quadratic_model = np.poly1d(np.polyfit(x, y, degree)) # degree-5 polynomial (despite the variable name)
print(f"Polynomial Model R^2 = {r2_score(y, quadratic_model(x)):.2f}")

x_graph = np.linspace(0, 10, 100)
ax = sns.scatterplot(x=x, y=y, label="All Data")
ax.plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax.plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression')
ax.legend();

An Example!

Linear Model R^2 = 0.55
Polynomial Model R^2 = 0.64

Split the Data!

Code
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

fig, ax = plt.subplots(nrows=1, ncols=2) # create 2 plots!
# Plot training result
sns.scatterplot(x=X_train[:,0], y=y_train, ax=ax[0], label="Training Data")
ax[0].set_title("Training Set")
ax[0].legend()
# Plot testing result
sns.scatterplot(x=X_test[:,0], y=y_test, ax=ax[1], label="Testing Data")
ax[1].set_title("Testing Set")
ax[1].legend();

Train/Test Split

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

linear_model.fit(X_train, y_train)
quadratic_model = np.poly1d(np.polyfit(X_train[:,0], y_train, degree)) # degree-5 polynomial fit on the training set only

print(f"TRAIN SET - Linear Model R^2 = {r2_score(y_train, linear_model.predict(X_train)):.2f}; Polynomial Model R^2 = {r2_score(y_train, quadratic_model(X_train[:,0])):.2f}")
print(f"TEST SET - Linear Model R^2 = {r2_score(y_test, linear_model.predict(X_test)):.2f}; Polynomial Model R^2 = {r2_score(y_test, quadratic_model(X_test[:,0])):.2f}")


fig, ax = plt.subplots(nrows=1, ncols=2) # create 2 plots!
# Plot training result
x_graph = np.linspace(0, 10, 100)
sns.scatterplot(x=X_train[:,0], y=y_train, ax=ax[0], label="Training Data")
ax[0].plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax[0].plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression');
# ax[0].scatter(np.sort(X_train[:, 0]), quadratic_model(np.sort(X_train[:, 0])), color='m');
ax[0].set_title("Training Set")
ax[0].legend()
# Plot testing result
sns.scatterplot(x=X_test[:,0], y=y_test, ax=ax[1], label="Testing Data")
ax[1].plot(x_graph, linear_model.predict(x_graph[:, np.newaxis]), color='r', label='Linear Regression')
ax[1].plot(x_graph, quadratic_model(x_graph), color='m', label='Polynomial Regression');
# ax[1].scatter(np.sort(X_test[:, 0]), quadratic_model(np.sort(X_test[:, 0])), color='m');
ax[1].set_title("Testing Set")
ax[1].legend();
TRAIN SET - Linear Model R^2 = 0.44; Polynomial Model R^2 = 0.62
TEST SET - Linear Model R^2 = 0.65; Polynomial Model R^2 = 0.41

Cross Validation

Cross Validation

  • It seems a little strange to judge our models on a single random selection of observations for the validation set.
  • It's as if our choice of model is biased toward that particular random validation set.
  • Solution: repeat this process (fit a model, validate the model) multiple times using different subsets of the data -> Cross Validation (see the sketch below)
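A minimal sketch on made-up data: sklearn's cross_val_score refits the model on several different train/validation splits and reports one score per fold, so the evaluation no longer hinges on a single random validation set.

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=3, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)                                        # one R^2 per validation fold
print(f"mean R^2 = {scores.mean():.2f} +/- {scores.std():.2f}")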

K-Folds

  • k-fold cross-validation is a popular method of evaluating model performance
  • Step 1: Choose k, usually 10
  • Step 2: Shuffle and divide our data, X, into X_fold and X_test.
  • Step 3: Divide X_fold into k groups (folds)
  • Step 4: The model is trained and validated repeatedly using these groups (sketched in code below)
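A minimal sketch of the four steps above on made-up data: hold out X_test, split the remaining X_fold into k groups with sklearn's KFold, and train/validate once per fold.

Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=3, size=100)

# Step 2: shuffle and divide into X_fold (for cross-validation) and X_test (kept aside).
X_fold, X_test, y_fold, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Steps 3-4: divide X_fold into k groups; train on k-1 groups, validate on the remaining one.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X_fold)):
    model = LinearRegression().fit(X_fold[train_idx], y_fold[train_idx])
    score = r2_score(y_fold[val_idx], model.predict(X_fold[val_idx]))
    print(f"fold {i}: validation R^2 = {score:.2f}")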

K-Folds Split (k=10)

K-Folds Train and Validate

Choosing k

  • The larger the k, the more models need to be trained.
  • The larger the k, the larger each training fold.
  • Together, this vastly increases computational requirements.
  • For very large models you need a small \(k\), e.g. ChatGPT!
  • The most common choice for small/medium models is k=10.

Class Activity

Class Activity

Model Evaluation