Machine Learning in Risk and Finance
Developing a Residual Value Risk Model
About this Course¶
You will find all course material and setup instructions in the following repository.
What you will learn¶
- Fundamental Concepts: Overview of ML and its relationship with Artificial Intelligence (AI) and Deep Learning (DL).
- ML in Risk Modelling: Application of ML in managing residual value risk in car leasing.
- Developing a ML Risk Model: Based on a real-world use case:
- Problem understanding
- Data preprocessing
- Feature selection
- Model training
- Model evaluation
- Commonly Used ML Algorithms: Overview of Bagging and Boosting methods.
- Data Splitting Process: Techniques for splitting data into training and test sets, including sampling strategies like cross-validation.
- ML Pipeline: Automate the inference process of applying the ML model.
What is Machine Learning?¶
To grasp the definition of Machine Learning (ML), it's helpful to narrow the focus by exploring its relationship with Artificial Intelligence (AI) and Deep Learning (DL).
ML is a subset of AI that focuses on developing algorithms capable of learning from data to make predictions or decisions, while DL is a specialized branch of ML that utilizes neural networks to model complex patterns in large datasets.
Image source: Author
Artificial Intelligence (AI)
AI is the overarching field that encompasses all technologies and systems designed to simulate human intelligence. This includes tasks like
reasoning, problem-solving, learning, and decision-making.
Machine Learning (ML):
ML is a subset of AI that involves constructing models from data, enabling systems to make informed predictions or decisions. It is categorized into three main types:
- Supervised Learning: The model learns from labeled data, where input-output pairs are provided, allowing it to predict outcomes for new, unseen data.
- Unsupervised Learning: The model identifies patterns and structures in unlabeled data, discovering hidden relationships without predefined labels.
- Reinforcement Learning: The model learns through interactions with an environment, receiving rewards or penalties based on its actions, which guides its decision-making process.
In this course, we will focus specifically on supervised learning, utilizing tree-based regression methods to develop predictive models.
Deep Learning (DL):
Deep learning is a specialized branch of ML that uses artificial neural networks inspired by the human brain. These networks are
particularly effective at processing large amounts of unstructured data, such as images, text, and audio.
Application of Machine Learning in Risk Management¶
Machine Learning (ML) has become an essential tool in risk management due to its ability to analyze vast amounts of data and uncover patterns that traditional methods may overlook. In today's complex financial landscape, organizations face numerous risks, and ML provides a robust framework for modeling and steering these risks effectively.
Why Use Machine Learning?¶
- Enhanced Predictive Accuracy: ML algorithms improve risk predictions by learning from historical data and adapting to new information, crucial for credit risk modeling.
- Real-Time Analysis: ML processes data in real-time, enabling swift responses to emerging risks in dynamic markets.
- Complex Data Handling: ML excels at processing vast amounts of complex financial data, allowing for comprehensive risk assessments.
- Model Risk Management: ML helps identify and mitigate model risk by providing insights into model performance and reliability.
- Automation and Efficiency: ML automates routine risk assessment tasks, freeing up resources for strategic decision-making, especially in credit risk management.
A real-world use case: Residual Value Risk Modelling¶
In the following, we will work on a real-world use case that car leasing entities typically face: the uncertainty about the price at which a leased car can be resold when it returns to the leasing entity. This asset risk is commonly known as residual value risk.
Why do leasing companies care about residual values?
Leasing companies prioritize residual values because they are essential for estimating an asset's worth at the end of its lease term, which influences lease payments and financial risk management. Accurate predictions of residual values help prevent financial losses when remarketing used vehicles.
What determines the residual value of a vehicle?
The residual value of a vehicle is determined by various factors, including its condition (age, mileage, and damage), shifting customer preferences, economic fluctuations, market transparency, advancements in technology (like electric vehicles), and regulatory uncertainties, making accurate predictions challenging.
Developing a Machine Learning Model¶
Developing a ML model involves a multi-step process that is typically organized into pipeline structures.
Image source: Author
Understanding the Problem¶
Creating a residual value model is a regression task focused on predicting the future resale price of a vehicle at the end of its lease term, leveraging features like vehicle characteristics. The starting point for the model development is historical resale data that then needs to undergo a thorough cleaning and feature engineering process. We work with used car data for the US market that is freely available on Kaggle.
Data Preprocessing¶
Data Cleaning¶
First, we import pandas, a popular Python library designed to wrangle rectangular data.
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)
Then we read the used car data.
df_uc = pd.read_csv('data/used_car.csv', parse_dates=['posting_date'])
How many observations (samples) and variables (features) do we have?
df_uc.shape
(426812, 17)
Overall, there are 17 variables with 426,812 samples.
Next, we will take a first look at the data and examine the data types.
df_uc.sample(3)
id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | transmission | VIN | drive | type | paint_color | state | posting_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
248122 | 7314143256 | las vegas | 33999 | 2015.00 | ford | f150 supercrew cab | NaN | 6 cylinders | gas | 66812.00 | automatic | 1FTEW1CPXFKD59968 | NaN | pickup | white | nv | 2021-04-29 |
200197 | 7316290358 | grand rapids | 6900 | 2000.00 | ford | f350 | good | 8 cylinders | diesel | 293000.00 | automatic | NaN | 4wd | NaN | white | mi | 2021-05-03 |
161659 | 7311899881 | omaha / council bluffs | 29592 | NaN | NaN | n Frontier | NaN | 6 cylinders | gas | 4142.00 | automatic | 1N6DD0EV0KN871266 | 4wd | pickup | silver | ia | 2021-04-24 |
A short description of each variable can be found here.
df_uc.dtypes
id                       int64
region                  object
price                    int64
year                   float64
manufacturer            object
model                   object
condition               object
cylinders               object
fuel                    object
odometer               float64
transmission            object
VIN                     object
drive                   object
type                    object
paint_color             object
state                   object
posting_date    datetime64[ns]
dtype: object
Note the following definition of data types:
- int64: integer type.
- float64: floating point type.
- object: generic container for any Python object (here used for string variables).
- datetime64[ns]: date and time information with nanosecond precision.
The automatic data type conversion of pandas seems to make sense.
To avoid skewing the learning process, it is essential to eliminate duplicate entries from our data, as machine learning models may assign undue importance to these duplicates, particularly if they belong to a specific class. This can lead to a biased model that fails to accurately represent the true distribution of the data.
df_uc.drop(columns=['id']).duplicated().sum().__str__()
'4222'
There are 4222 duplicates in our raw data. We keep only the first entry.
df_uc = df_uc.drop(columns=['id']).drop_duplicates(keep='first')
A common approach to gain initial insights into the variables is to examine their descriptive statistics, which can be easily accomplished using the pandas method describe(). This method provides a summary of key statistical measures, helping to understand the distribution and characteristics of the data.
# Descriptive stats for integers, floats and datetime
df_uc.describe(exclude=['object'])
price | year | odometer | posting_date | |
---|---|---|---|---|
count | 422590.00 | 421475.00 | 418324.00 | 422590 |
mean | 75774.14 | 2011.23 | 97993.36 | 2021-04-23 08:18:04.557609216 |
min | 0.00 | 1900.00 | 0.00 | 2021-04-04 00:00:00 |
25% | 5950.00 | 2008.00 | 37531.00 | 2021-04-17 00:00:00 |
50% | 13988.00 | 2013.00 | 85377.00 | 2021-04-25 00:00:00 |
75% | 26500.00 | 2017.00 | 133600.00 | 2021-05-01 00:00:00 |
max | 3736928711.00 | 2022.00 | 10000000.00 | 2021-05-05 00:00:00 |
std | 12243930.69 | 9.47 | 213704.62 | NaN |
We have identified several outliers in the data:
- The maximum price of $3,736,928,711 is unreasonably high for a used car.
- The maximum odometer reading of 10,000,000 is also unrealistic.
- A model year of 1900 seems to be another data quality issue.
In the following we remove these outliers.
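# Note: comparisons with NaN evaluate to False, so rows with a missing
# year or odometer are dropped by these filters as well
# 1: Remove samples with a price >= 100000
df_uc = df_uc.loc[df_uc['price']<100000]
# 2: Remove samples with a year < 2000
df_uc = df_uc.loc[df_uc['year']>=2000]
# 3: Remove samples with an odometer > 200000
df_uc = df_uc.loc[df_uc['odometer']<=200000]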
# Descriptive stats for strings only
df_uc.describe(include=['object'])
region | manufacturer | model | condition | cylinders | fuel | transmission | VIN | drive | type | paint_color | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 372280 | 361468 | 369065 | 218465 | 214072 | 370244 | 370764 | 247417 | 259796 | 297770 | 261213 | 372280 |
unique | 404 | 41 | 23395 | 6 | 8 | 5 | 3 | 108744 | 3 | 13 | 12 | 51 |
top | columbus | ford | f-150 | good | 6 cylinders | gas | automatic | 1FMJU1JT1HEA52352 | 4wd | sedan | white | ca |
freq | 3194 | 60453 | 7019 | 106911 | 83390 | 310918 | 293591 | 261 | 117001 | 80398 | 69925 | 43164 |
Looking at manufacturer, for example, we see that
- there is remarketing data from 41 distinct manufacturers
- the most frequent manufacturer is Ford, with 60,453 observations
Alternatively, we can visualize the univariate distribution of each variable. To facilitate this, I have provided a convenient function called plot_univariate(), which can be found in the util.py module.
from util import plot_univariate
plot_univariate(
df_uc,
columns=[
'manufacturer', 'state', 'price', 'year', 'odometer',
'condition', 'fuel', 'type', 'drive', 'transmission',
'cylinders', 'paint_color'],
log_dict={'price': True, 'odometer': True},
ncols=4, size=(20,15), wspace=.3)
The plot indicates that the dataset includes used car prices from various manufacturers across different states in the US and model years, with significant variance in odometer readings, a key factor in determining a used car's residual value. To refine the analysis, we will remove samples with prices below $1,000, as they are not relevant in the context of residual value assessment.
df_uc = df_uc.loc[df_uc['price'] > 1000]
In the context of residual value risk management, we aim to determine whether we can train a model that accurately predicts the residual value, or used-car price, **based on vehicle information** available at the inception of the lease contract.
Translated into an ML problem, this can be stated as follows:
- We want to predict vehicles' used car prices which is the target variable, $y$ ...
- ... based on information about the vehicle at time of remarketing, referred to as features, $X$ ...
- ... whereas the resulting model needs to fulfill a certain minimum level of prediction quality, assessed through a scoring function that evaluates the prediction error, denoted as $s\left(\epsilon \right)$.
# Target variable
y = df_uc['price']
# Features
X = df_uc.drop(columns=['price'])
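The scoring function $s\left(\epsilon \right)$ is left open at this point. A common choice for regression, and the one used later in the hyperparameter search (neg_root_mean_squared_error), is the root mean squared error (RMSE). The following is a minimal sketch with made-up prices; the helper name rmse is an illustrative assumption, not part of the course code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical resale prices (USD) and model predictions
y_true = [10000, 20000, 30000]
y_pred = [11000, 18000, 30000]
score = rmse(y_true, y_pred)
print(round(score, 2))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones, and the score is expressed in the same unit as the target (here US dollars).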
Train-Test Split¶
In machine learning projects, splitting the dataset into training and test sets is essential.
Why Split the Data?
- Model Evaluation: Train the model on the training set and evaluate its performance on the test set to ensure it generalizes well to new data.
- Prevent Overfitting: Monitor and mitigate overfitting by keeping a separate test set. Overfitting occurs when a model performs well on training data but poorly on new data. Later in the course, we will learn about cross-validation to further prevent overfitting.
When to Split the Data?
- Early in the Pipeline: Splitting early ensures data integrity and prevents data leakage, maintaining an unbiased estimate of the model's performance.
We split the data at random into training and test sets using the train_test_split() function from the popular scikit-learn library, which provides a wide range of ML algorithms and other convenience functions that are typically used in ML pipelines. We will keep 20% of the data aside in the test set. For reproducibility we set a seed via the random_state argument.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=333)
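Splitting early matters because every statistic learned during preprocessing should come from the training data only. As a rough sketch on toy data (the array values are made up), the imputer below is fitted on the training split and merely applied to the test split:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature column with missing values (illustrative data only)
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0],
                  [np.nan], [7.0], [8.0], [9.0], [10.0]])
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=333)

# The mean is estimated from the training rows only
imputer = SimpleImputer(strategy='mean').fit(X_tr)
X_tr_imp = imputer.transform(X_tr)
# The test rows reuse the training mean; nothing is learned from them
X_te_imp = imputer.transform(X_te)
```

Fitting the imputer on the full data before splitting would let information from the test rows leak into the training features.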
Handling Missing Values¶
Most machine learning models cannot handle missing values, so we employ data imputation to replace these missing values with reasonable substitutes, which helps maintain the model's ability to learn from the training data. This method is preferred over simply dropping samples with missing values, as it preserves more observations and, consequently, more patterns for the model to learn.
X_train.isnull().sum()
region               0
year                 0
manufacturer      7169
model             2011
condition       101347
cylinders       111302
fuel              1601
odometer             0
transmission      1087
VIN              90247
drive            79740
type             51882
paint_color      73975
state                0
posting_date         0
dtype: int64
Most of the features have missing values. There are different imputation strategies and, depending on the feature, one or the other makes more sense.
year¶
Domain knowledge is crucial for defining an effective imputation approach for vehicle data. The 10th character of the Vehicle Identification Number (VIN) specifically indicates the vehicle's model year. The VIN serves as a unique identifier for each vehicle and is included in our dataset.
Image source: AutoCheck
We import a dictionary that maps the 10th character of the VIN to the model year. The dictionary can be read from the util.py module.
from util import vin_to_year
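# Only rows with a missing year and an available VIN can be imputed
mask = X_train.year.isnull() & X_train.VIN.notnull()
# The 10th character (index 9) of the VIN encodes the model year
X_train.loc[mask, 'year'] = X_train.loc[
mask].VIN.apply(lambda x: x[9]).map(vin_to_year)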
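To make the lookup concrete, here is a small sketch with a partial mapping of the standard model-year codes. The dictionary VIN_YEAR_CODES and the helper model_year_from_vin are illustrative assumptions; the vin_to_year dictionary in util.py may be defined differently. The codes repeat on a 30-year cycle, so this mapping assumes model years from 2010 onward.

```python
# Partial, illustrative mapping of the VIN model-year code (10th character).
# Assumption: model years 2010-2021 only; the letter codes repeat every 30 years.
VIN_YEAR_CODES = {'A': 2010, 'B': 2011, 'C': 2012, 'D': 2013, 'E': 2014,
                  'F': 2015, 'G': 2016, 'H': 2017, 'J': 2018, 'K': 2019,
                  'L': 2020, 'M': 2021}

def model_year_from_vin(vin):
    """Return the model year encoded in a 17-character VIN, or None."""
    if not isinstance(vin, str) or len(vin) != 17:
        return None
    return VIN_YEAR_CODES.get(vin[9].upper())

# VINs taken from the sample rows shown earlier
year_ford = model_year_from_vin('1FTEW1CPXFKD59968')    # row lists year 2015
year_nissan = model_year_from_vin('1N6DD0EV0KN871266')  # year is missing in that row
print(year_ford, year_nissan)
```

The type and length check guards against missing or malformed VINs, which would otherwise raise an error when indexing the 10th character.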
manufacturer¶
The same applies to the manufacturer. The first three characters of the VIN, the World Manufacturer Identifier, give information about the car manufacturer. Again, we use this information to impute the manufacturer column.
from util import vin_to_manufacturer
X_train.loc[
X_train.manufacturer.isnull() & X_train.VIN.notnull(),
'manufacturer'
] = X_train.loc[
X_train.manufacturer.isnull() & X_train.VIN.notnull()
].VIN.apply(lambda x: x[0:3]).map(vin_to_manufacturer)
model¶
A valid imputation strategy for categorical variables is to introduce a new category labeled "unknown" to replace missing values. We will implement this approach for the model column using the SimpleImputer class from scikit-learn.
from sklearn.impute import SimpleImputer
X_train[['model']] = SimpleImputer(
strategy='constant',
fill_value='unknown').fit_transform(X_train[['model']])
paint_color¶
We can also use the SimpleImputer to impute missing values with the most frequent value. We do so for the paint_color variable.
X_train[['paint_color']] = SimpleImputer(
strategy='most_frequent').fit_transform(X_train[['paint_color']])
cylinders, fuel, transmission, drive¶
We impute missing values in cylinders, fuel, transmission and drive with the most frequent value conditioned on the manufacturer and the vehicle's body type. The type column itself serves as a conditioning variable and is not imputed. We use the ConditionalImputer class from the util.py module for this purpose.
from util import ConditionalImputer
from tqdm import tqdm
columns_to_impute = ['cylinders', 'fuel', 'transmission', 'drive']
for column in tqdm(columns_to_impute):
    # Columns passed to the imputer: the condition columns plus the target column
    relevant_cols = ['manufacturer', 'type'] + [column]
    # Impute the target column with the most frequent value per (manufacturer, type) group
    X_train[column] = ConditionalImputer(
        target_col=column,
        condition_cols=['manufacturer', 'type'],
        strategy='most_frequent').fit_transform(X_train[relevant_cols])[column]
100%|██████████| 4/4 [01:10<00:00, 17.56s/it]
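The implementation of ConditionalImputer lives in util.py and is not shown here. The idea of a group-wise most-frequent imputation can be sketched with plain pandas as follows; the function impute_group_mode and the toy data are assumptions for illustration and not the course's implementation.

```python
import pandas as pd

def impute_group_mode(df, target_col, condition_cols):
    """Fill missing target values with the most frequent value in each group."""
    def fill(group):
        modes = group.mode()  # mode() ignores missing values by default
        # Groups without any observed value are left unchanged
        return group.fillna(modes.iloc[0]) if len(modes) else group
    return df.groupby(condition_cols)[target_col].transform(fill)

# Toy data: one missing fuel value per (manufacturer, type) group
toy = pd.DataFrame({
    'manufacturer': ['ford', 'ford', 'ford', 'bmw', 'bmw'],
    'type': ['pickup', 'pickup', 'pickup', 'sedan', 'sedan'],
    'fuel': ['gas', 'gas', None, 'diesel', None],
})
toy['fuel'] = impute_group_mode(toy, 'fuel', ['manufacturer', 'type'])
print(toy)
```

Groups in which the target is entirely missing keep their missing values, which is consistent with dropping the remaining incomplete rows afterwards.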
odometer¶
An intuitive imputation strategy for the odometer (miles driven) is to assume that older cars generally have on average a higher mileage. To implement this, we first calculate the vehicle's age by determining the difference between the model year and the posting date.
X_train['age'] = X_train['posting_date'].dt.year - X_train['year']
We now impute the missing odometer values using the most similar used cars in terms of vehicle age. For this purpose, we use scikit-learn's KNNImputer class.
from sklearn.impute import KNNImputer
X_train[['age', 'odometer']] = KNNImputer(
n_neighbors=10).fit_transform(X_train[['age', 'odometer']])
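As a small illustration of what KNNImputer does, consider four made-up cars, one of which has no odometer reading. With n_neighbors=2, the missing value is replaced by the mean odometer of the two cars closest in age:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age (years), odometer (miles); the last car has no odometer reading
cars = np.array([
    [1.0, 10000.0],
    [2.0, 20000.0],
    [10.0, 120000.0],
    [2.0, np.nan],
])
imputed = KNNImputer(n_neighbors=2).fit_transform(cars)
# The two nearest cars by age are 2 and 1 years old
filled_odometer = imputed[3, 1]
print(filled_odometer)
```

The distance is computed only on the features that are present in both rows, so here age alone decides which neighbors are used.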
The remaining observations with missing values are going to be dropped.
X_train = X_train.loc[X_train.drop(
columns=['VIN', 'condition']).notnull().all(axis=1)]
y_train = y_train.loc[X_train.index]
X_train.isnull().sum()
region              0
year                0
manufacturer        0
model               0
condition       99550
cylinders           0
fuel                0
odometer            0
transmission        0
VIN             86985
drive               0
type                0
paint_color         0
state               0
posting_date        0
age                 0
dtype: int64
X_train.shape
(259935, 16)
We now have a clean training data set with 259,935 samples. Missing values remain only in the condition and VIN columns.
Feature Engineering¶
Feature Selection¶
When selecting features for a machine learning model, it's essential to ensure that the information is available at the time of inference, particularly in risk management where predictions, such as lease rates based on residual values, must be made at the start of a lease without relying on end-of-lease data.
Key Considerations:
- The car's sale region is unknown at the lease's start, so we disregard it unless the off-lease strategy is pre-planned.
- The vehicle's final condition is also unknown.
- Age and mileage can be estimated from the contractual lease period and maximum mileage.
Conclusion: We will exclude 'condition', 'region', 'state', and 'VIN' from our feature set, because they are either unknown at the start of the lease or, in the case of the VIN, merely a vehicle identifier.
X_train = X_train.drop(columns=['condition', 'region', 'state', 'VIN'])
Further feature engineering involves creating new features to enhance a model's predictive power. For instance, we have derived the age feature from the model year and the posting date of the used car. Since age now carries this information, we drop the year and posting_date columns.
X_train = X_train.drop(columns=['year', 'posting_date'])
Vehicles with special equipment, such as sport packages, often depreciate faster than standard versions. We can utilize the model variable to determine if a vehicle is a "sports version".
X_train['sport'] = X_train['model'].apply(
lambda x: 'sport' in x.lower()).astype(int)
X_train = X_train.drop(columns=['model'])
One could consider additional feature engineering steps. For instance, counting how often the same VIN occurs in the data could serve as a proxy for the number of previous owners. However, this is beyond the scope of this course.
Encoding Categorical Variables¶
Encoding categorical variables is essential before training a machine learning model, as most ML algorithms require numerical input and cannot process categorical data directly. Techniques like one-hot encoding, which converts each category into a binary vector, can be implemented using the OneHotEncoder class from scikit-learn.
from sklearn.preprocessing import OneHotEncoder
# Categorical columns to encode
cat_cols = ['manufacturer', 'fuel', 'cylinders',
            'paint_color', 'type', 'drive', 'transmission']
# Step 1: Fit the encoder on the training data
encoder = OneHotEncoder(
    sparse_output=False,
    min_frequency=5,
    handle_unknown='infrequent_if_exist').fit(X_train[cat_cols])
# Step 2: Transform the data and create a DataFrame from the encoded data
encoded_df = pd.DataFrame(
    encoder.transform(X_train[cat_cols]),
    columns=encoder.get_feature_names_out())
# Step 3: Concatenate the original DataFrame with the encoded DataFrame
# (both indices are reset so that the rows align by position)
X_train = pd.concat([
    X_train.reset_index(drop=True),
    encoded_df.reset_index(drop=True)],
    axis=1)
# Step 4: Drop the original categorical columns
X_train = X_train.drop(columns=cat_cols)
All features are now converted into numerical format.
X_train.head(3)
odometer | age | sport | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | ... | type_sedan | type_truck | type_van | type_wagon | drive_4wd | drive_fwd | drive_rwd | transmission_automatic | transmission_manual | transmission_other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 86000.00 | 8.00 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
1 | 52159.00 | 2.00 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
2 | 107862.00 | 6.00 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 |
3 rows × 91 columns
Model Training¶
A crucial step in the development of an ML model is the choice of algorithm: different algorithms have varying strengths and weaknesses, and selecting the most suitable one can significantly impact the model's performance and accuracy.
A wide range of algorithms is available for supervised regression tasks. A good overview can be found here.
In this course we focus on tree-based models. The basic building block, also known as base learner, in a tree-based model is a Decision Tree.
Decision Tree¶
Decision Tree Logic:
- Splitting: The data is divided into subsets based on feature values to identify the best feature and split point that separate the data according to the target variable.
- Decision Rules: Each node represents a binary decision rule based on a feature, such as "Is feature X greater than value Z?"
- Leaf Nodes: Terminal nodes (leaves) provide the model's prediction, typically the mean value of the target variable for samples in that leaf.
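The splitting step can be sketched for a single numeric feature as follows. This is a simplified illustration that minimizes the sum of squared errors around the leaf means; the function best_split and the data are made up, and scikit-learn's actual implementation is considerably more optimized.

```python
def best_split(x, y):
    """Return (threshold, sse) of the split x <= threshold with minimal squared error."""
    def sse(values):
        # Sum of squared deviations from the mean, i.e. the error of a leaf
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = (None, float('inf'))
    xs = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical odometer readings (miles) and prices (USD)
odometer = [20000, 40000, 60000, 90000, 120000, 150000]
price = [30000, 28000, 27000, 14000, 12000, 11000]
threshold, error = best_split(odometer, price)
print(threshold)
```

A full tree repeats this search over all features at every node and then recurses into the two resulting subsets until a stopping rule such as the maximum depth is reached.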
In the following, we train a simple decision tree using DecisionTreeRegressor from scikit-learn. With the arguments max_depth and min_samples_leaf we control the complexity of the final tree:
- max_depth: The maximum number of consecutive binary splits, i.e. how deep the tree is allowed to grow
- min_samples_leaf: The minimum number of samples that need to be in a terminal leaf
from sklearn import tree
model = tree.DecisionTreeRegressor(
    max_depth=2,
    min_samples_leaf=1000).fit(X_train, y_train)
tree_plot = tree.export_graphviz(
    model,  # The trained decision tree model to visualize.
    feature_names=X_train.columns,  # Names of the features used in the model for labeling nodes.
    filled=True,  # Colors the nodes according to the predicted value (regression).
    rounded=True,  # Draws node boxes with rounded corners for better aesthetics.
    special_characters=True,  # Allows the use of special characters in feature names for proper rendering.
    proportion=True,  # Displays the proportion of samples at each node instead of absolute counts.
    impurity=False,  # Omits impurity values (e.g. squared error) from the node display.
    precision=0  # Sets the precision for floating-point values to zero for cleaner output.
)
import graphviz
graphviz.Source(tree_plot)
The decision tree visualization shows that training observations are first split based on the vehicle's odometer reading, with cars having 70,976 miles or less moving left and those with more than 70,976 miles moving right. For vehicles under the mileage threshold, 4-cylinder engines average $19,877, while other engine types average $32,813; for those above the threshold, diesel engines average $32,553, compared to $12,411 for non-diesel engines.
Decision trees are considered weak learners due to their tendency to overfit training data, particularly when they are deep, which can hinder their ability to generalize to unseen data. Conversely, shallow trees, or stumps, also exhibit weak learning characteristics as they capture limited information, resulting in high bias and low prediction accuracy.
Overfitting is related to the concepts of variance and bias:
- High Variance: Overfitting models have high variance, meaning they are too sensitive to the fluctuations in the training data.
- Low Bias: Overfitting models typically have low bias, meaning they fit the training data very closely.
Image source: Author
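The trade-off can be made visible by comparing training and validation errors for trees of different depth. The sketch below uses synthetic data (a noisy sine curve), so the numbers say nothing about the used car data; it only illustrates the pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.5, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0)

def rmse(model, X, y):
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

scores = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    m = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (rmse(m, X_tr, y_tr), rmse(m, X_val, y_val))

for depth, (train_err, val_err) in scores.items():
    print(depth, round(train_err, 3), round(val_err, 3))
```

The fully grown tree reproduces the training data almost perfectly, yet its validation error stays well above zero: it has memorized the noise (low bias, high variance). The stump with max_depth=1 has a high error on both sets (high bias).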
One effective way to visualize a model's performance is by plotting its predictions against the true values, which we will demonstrate next. This comparison allows us to assess how well the model captures the underlying patterns in the data.
from util import plot_predictions
y_pred = model.predict(X_train)
plot_predictions(y_train, y_pred)
Random Forest¶
Random Forest logic:
- Ensemble of Trees: It builds a large number of decision trees (hence "forest") and merges their results to improve accuracy and control overfitting.
- Bootstrap Aggregation (Bagging): Each tree is trained on a random subset of the training data, sampled with replacement (bootstrapping). This helps in reducing variance.
- Random Feature Selection: At each split in the tree, a random subset of features is considered for splitting, which helps in reducing correlation among the trees and improves model robustness.
We train a Random Forest using RandomForestRegressor from scikit-learn. We control the training process by means of the following arguments:
- n_estimators: The number of decision trees
- max_depth: The maximum depth of a tree
- bootstrap: Whether each tree is trained on a bootstrap sample, i.e. a random sample of the training data drawn with replacement
- max_features: The number of features to consider when looking for the best split in a tree (here 'sqrt', i.e. the square root of the number of features)
- random_state: Since bootstrap and max_features add randomness to the training process, random_state sets a seed for reproducibility
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(
n_estimators=100,
max_depth=20,
bootstrap=True,
max_features='sqrt',
random_state=333).fit(X_train, y_train)
Model performance.
y_pred = model.predict(X_train)
plot_predictions(y_train, y_pred)
Gradient Boosting¶
Gradient Boosting logic:
- Sequential Learning: Add decision trees one at a time. Each new tree is trained to predict the residual errors (the difference between the actual values and the predictions of the current model) of the combined ensemble of previous trees.
- Gradient Descent: The model uses gradient descent to minimize the loss function. The gradients of the loss function with respect to the model's predictions are used to fit the new tree.
- Update Model: The predictions of the new tree are added to the ensemble with a certain weight (learning rate), which controls the contribution of each tree.
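The three steps above can be sketched by hand for the squared error loss, where the negative gradient is simply the residual. This is a simplified illustration on synthetic data and not how GradientBoostingRegressor is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_syn, y_syn.mean())  # stage 0: constant model
history = [np.mean((y_syn - prediction) ** 2)]  # training MSE per stage

for _ in range(50):
    # Residuals are the negative gradient of 1/2 * squared error
    residuals = y_syn - prediction
    # Each new shallow tree is fitted to the current residuals
    stage = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_syn, residuals)
    # The tree's contribution is shrunk by the learning rate
    prediction += learning_rate * stage.predict(X_syn)
    history.append(np.mean((y_syn - prediction) ** 2))

print(round(history[0], 4), round(history[-1], 4))
```

A smaller learning rate makes each stage contribute less, so more trees are needed, but the ensemble typically generalizes better. This is why learning_rate and n_estimators are usually tuned together.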
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(
n_estimators=100,
max_depth=20,
learning_rate=0.1,
max_features='sqrt',
random_state=333).fit(X_train, y_train)
Model performance.
y_pred = model.predict(X_train)
plot_predictions(y_train, y_pred)
Model Fine Tuning¶
So far, we have arbitrarily set the hyperparameters of the machine learning models without considering whether these values are optimal for minimizing prediction errors in estimating vehicle residual values. Hyperparameters play a crucial role in determining the effectiveness of the model in achieving its predictive task.
Here are some key points about hyperparameters:
- Predefined: Hyperparameters are set manually before training the model.
- Model Control: They control aspects of the training process, such as the learning rate or the complexity of the model.
- Tuning: Hyperparameter tuning is the process of finding the optimal values for these parameters to improve model performance.
Hyperparameters | Decision Tree | Random Forest | Gradient Boosting |
---|---|---|---|
max_depth | Maximum depth of the tree | Maximum depth of each tree | Maximum depth of each tree |
min_samples_split | Minimum number of samples required to split an internal node | Minimum number of samples required to split an internal node | Minimum number of samples required to split an internal node |
min_samples_leaf | Minimum number of samples required to be at a leaf node | Minimum number of samples required to be at a leaf node | Minimum number of samples required to be at a leaf node |
max_features | Number of features to consider when looking for the best split | Number of features to consider when looking for the best split | Number of features to consider when looking for the best split |
criterion | Function to measure the quality of a split (e.g., squared_error, absolute_error) | Function to measure the quality of a split (e.g., squared_error, absolute_error) | Function to measure the quality of a split (e.g., friedman_mse, squared_error); the loss to be optimized is set separately via loss |
splitter | Strategy used to choose the split at each node (e.g., best, random) | - | - |
n_estimators | - | Number of trees in the forest | Number of boosting stages to be run |
bootstrap | - | Whether bootstrap samples are used when building trees | - |
learning_rate | - | - | Shrinks the contribution of each tree by learning_rate |
Important aspects of fine-tuning ML models:
- Prevent Overfitting: The goal is to find hyperparameters that generalize well to unseen data rather than just fitting the training data. Cross-validation is essential for assessing model performance on independent datasets and selecting optimal hyperparameters.
- Computational Efficiency: Testing all combinations of hyperparameters can be computationally expensive, so methods like Grid Search (restricted to a predefined grid of values) or Random Search (a fixed number of sampled combinations) are used to approximate the best hyperparameter configuration within a limited budget.
The cross-validation process involves splitting the dataset into k subsets, training the model on k-1 folds while validating it on the remaining fold, and repeating this k times to obtain an average performance metric.
Image source: Author
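The fold logic can be sketched in a few lines of plain Python. The helper k_fold_indices is an illustrative assumption; in practice scikit-learn's cross-validation utilities handle this, including shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k consecutive folds (no shuffling)."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print('train:', train_idx, 'validate:', val_idx)
```

Every sample is used for validation exactly once and for training k-1 times, which is why the averaged score is a more stable estimate than a single train-validation split.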
Grid Search is a hyperparameter tuning technique that exhaustively searches through a specified subset of hyperparameters to find the optimal combination for a machine learning model.
Random Search is a hyperparameter tuning technique that randomly samples a specified number of hyperparameter combinations from a given range to find the optimal set for a machine learning model.
Image source: Author
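The difference between the two strategies comes down to how the candidate configurations are generated. A minimal sketch, using a made-up search space:

```python
import itertools
import random

# Hypothetical search space for illustration
space = {
    'max_depth': [2, 5, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Grid search: every combination of the listed values is a candidate
grid = [dict(zip(space, values))
        for values in itertools.product(*space.values())]

# Random search: a fixed budget of independently sampled candidates
rng = random.Random(333)
budget = 5
sampled = [{name: rng.choice(options) for name, options in space.items()}
           for _ in range(budget)]

print(len(grid), len(sampled))
```

The grid grows multiplicatively with every additional hyperparameter (here 4 x 3 x 3 = 36 candidates), while the cost of random search is fixed by the budget, regardless of how many hyperparameters are searched.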
We will now implement 3-fold cross-validation using a Random Search strategy to identify the optimal hyperparameters. We will use RandomizedSearchCV to do so.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_grid = {
    'decision_tree': {
        'max_depth': randint(1, 20),
        'min_samples_leaf': randint(1, 20),
    },
    'random_forest': {
        'n_estimators': randint(100, 300),
        'max_depth': randint(1, 20),
        'min_samples_leaf': randint(1, 20),
        'max_features': ['sqrt', 'log2'],
        'bootstrap': [True, False],
    },
    'gradient_boosting': {
        'n_estimators': randint(100, 300),
        'max_depth': randint(1, 20),
        'min_samples_leaf': randint(1, 20),
        'max_features': ['sqrt', 'log2'],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
    }
}
models = {
    'decision_tree': tree.DecisionTreeRegressor(),
    'random_forest': RandomForestRegressor(random_state=333),
    'gradient_boosting': GradientBoostingRegressor(random_state=333),
}
results = {}
for model in models.keys():
    # Initialize Random Search
    grid_search = RandomizedSearchCV(
        estimator=models[model],
        param_distributions=param_grid[model],
        n_iter=20,
        cv=3,
        scoring='neg_root_mean_squared_error',
        return_train_score=True,
        random_state=333,
        n_jobs=-1)
    # Fit Random Search
    grid_search.fit(X_train, y_train)
    # Store the cross-validation results of all sampled configurations
    results[model] = grid_search.cv_results_
from util import print_best_models

for model in models:
    print_best_models(results, model)
Model: decision_tree
Best score: 5709.03 RMSE
Best parameters: {'max_depth': 17, 'min_samples_leaf': 2}
Model: random_forest
Best score: 6423.77 RMSE
Best parameters: {'bootstrap': False, 'max_depth': 18, 'max_features': 'sqrt', 'min_samples_leaf': 16, 'n_estimators': 140}
Model: gradient_boosting
Best score: 4973.33 RMSE
Best parameters: {'learning_rate': 0.2, 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'n_estimators': 251}
The best-performing model is gradient boosting, with a cross-validated Root Mean Squared Error (RMSE) of $4,973.
The fine-tuned hyperparameters look as follows:
Hyperparameter | Optimal value |
---|---|
learning_rate | 0.2 |
max_depth | 15 |
max_features | 'sqrt' |
min_samples_leaf | 8 |
n_estimators | 251 |
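The helper `print_best_models` comes from the course's `util` module and its implementation is not shown here. As a rough sketch of how such a summary could be derived from `cv_results_`, the hypothetical function `summarize_best` below picks the top-ranked candidate and flips the sign of the score, since the `neg_root_mean_squared_error` scorer reports negative RMSE. The data is synthetic and the search space is deliberately small.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor


def summarize_best(cv_results):
    """Hypothetical helper: return (rmse, params) of the top-ranked candidate."""
    best_idx = int(np.argmin(cv_results['rank_test_score']))
    # The scorer returns negative RMSE, so flip the sign to report RMSE
    best_rmse = -float(cv_results['mean_test_score'][best_idx])
    return best_rmse, cv_results['params'][best_idx]


# Synthetic data, purely illustrative (not the course dataset)
rng = np.random.default_rng(333)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=333),
    param_distributions={'max_depth': randint(1, 20),
                         'min_samples_leaf': randint(1, 20)},
    n_iter=5,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=333)
search.fit(X_demo, y_demo)

best_rmse, best_params = summarize_best(search.cv_results_)
print(f"Best score: {best_rmse:.2f} RMSE")
print(f"Best parameters: {best_params}")
```

The same information is also exposed directly through the fitted search object's `best_score_` and `best_params_` attributes.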
Model Evaluation¶
It is good practice to put all steps of data preprocessing, feature engineering and model training into a pipeline. This encapsulation helps ensure that all steps are executed in the correct order, not only on the training data but also on the test data and later on production data. We use the Pipeline class from scikit-learn for this purpose.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from util import VINReplacer, ConditionalImputer, AgeCalculator, SportColumn, ColumnDropper, DataFrameSimpleImputer
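The custom steps imported from `util` are not shown in this section. As a rough idea of what a pipeline-compatible transformer looks like, here is a minimal sketch of a column-dropping step. The class name `DropColumns` and its behaviour are assumptions for illustration and may differ from the course's `ColumnDropper`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a column-dropping pipeline step."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        # Columns that are absent are ignored instead of raising an error
        return X.drop(columns=self.columns, errors='ignore')


demo = pd.DataFrame({'VIN': ['1A', '2B'], 'odometer': [42000, 87000]})
demo_pipeline = Pipeline(
    steps=[('column_dropper', DropColumns(columns=['VIN']))])
demo_out = demo_pipeline.fit_transform(demo)
print(demo_out.columns.tolist())
```

Any object that implements `fit` and `transform` in this way can be placed in a `Pipeline` alongside scikit-learn's built-in transformers.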
Data cleaning steps.
pipeline_steps = [
    # Data Preprocessing
    ('vin_replacer', VINReplacer(
        vin_to_year=vin_to_year, vin_to_manufacturer=vin_to_manufacturer)),
    ('column_transformer_model', ColumnTransformer(
        transformers=[('model_imputer', SimpleImputer(
            strategy='constant', fill_value='unknown'), ['model'])],
        remainder='passthrough',
        verbose_feature_names_out=False)
     ),
    ('column_transformer_paint_color', ColumnTransformer(
        transformers=[('paint_color_imputer', SimpleImputer(
            strategy='most_frequent'), ['paint_color'])],
        remainder='passthrough',
        verbose_feature_names_out=False)
     ),
    ('cylinders_imputer', ConditionalImputer(
        target_col='cylinders', condition_cols=['manufacturer', 'type'])),
    ('fuel_imputer', ConditionalImputer(
        target_col='fuel', condition_cols=['manufacturer', 'type'])),
    ('transmission_imputer', ConditionalImputer(
        target_col='transmission', condition_cols=['manufacturer', 'type'])),
    ('drive_imputer', ConditionalImputer(
        target_col='drive', condition_cols=['manufacturer', 'type']))
]
Feature engineering steps.
pipeline_steps.extend([
    # Feature Engineering
    ('age_calculator', AgeCalculator()),
    ('column_transformer_knn', ColumnTransformer(
        transformers=[('knn_imputer', KNNImputer(
            n_neighbors=10), ['age', 'odometer'])],
        remainder='passthrough',
        verbose_feature_names_out=False)
     ),
    ('sport_column', SportColumn()),
    ('column_transformer_encoder', ColumnTransformer(
        transformers=[('one_hot_encoder', OneHotEncoder(
            sparse_output=False, min_frequency=5, handle_unknown='infrequent_if_exist'),
            ['manufacturer', 'fuel', 'cylinders',
             'paint_color', 'type', 'drive', 'transmission'])],
        remainder='passthrough',
        verbose_feature_names_out=False)
     ),
    ('column_dropper', ColumnDropper(
        columns=['VIN', 'year', 'posting_date', 'model'])),
    ('final_simple_imputer', DataFrameSimpleImputer(
        strategy='most_frequent'))
])
Create pipeline object.
pipeline = Pipeline(steps=pipeline_steps)
pipeline
Pipeline(steps=[('vin_replacer', VINReplacer(vin_to_manufacturer={'AA9': 'tr-tec', 'AAA': 'audi', 'AAK': 'faw', 'AAM': 'man', 'AAP': '', 'AAV': 'volkswagen', 'AAW': 'challenger-trailer', 'ABJ': 'mitsubishi', 'ABM': 'bmw', 'AC5': 'hyundai', 'ACV': 'isuzu', 'ADB': 'mercedes-benz', 'ADD': '', 'ADM': 'general-motors', 'ADN': 'nissan', 'ADR': 'renault', 'ADX': 'tata', 'AFA': '', 'AFB': 'maz... OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=5, sparse_output=False), ['manufacturer', 'fuel', 'cylinders', 'paint_color', 'type', 'drive', 'transmission'])], verbose_feature_names_out=False)), ('column_dropper', ColumnDropper(columns=['VIN', 'year', 'posting_date', 'model'])), ('final_simple_imputer', DataFrameSimpleImputer(strategy='most_frequent'))])
We now fit the pipeline steps on the training data using the fit() method.
pipeline_fitted = pipeline.fit(X_train)
Given the fitted pipeline object, we can then transform both training and test data.
X_train_transformed = pipeline_fitted.transform(X_train)
X_test_transformed = pipeline_fitted.transform(X_test)
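Fitting on the training data only and then transforming both sets means that every learned statistic (imputation values, encoder categories, and so on) comes from the training data. A minimal sketch with a single mean-imputation step, using made-up numbers, shows the effect:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X_train_demo = np.array([[1.0], [3.0], [np.nan]])
X_test_demo = np.array([[np.nan], [100.0]])

demo_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])
demo_pipeline.fit(X_train_demo)

# The missing test value is filled with the training mean (2.0),
# not with anything computed from the test data
X_test_filled = demo_pipeline.transform(X_test_demo)
print(X_test_filled.ravel())
```

Calling `fit` on the test data instead would leak information from the test set into the preprocessing and make the out-of-sample evaluation overly optimistic.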
For evaluating a machine learning (ML) model, having a benchmark model is essential for several reasons:
- Baseline Performance: It provides a reference point to compare the performance of more complex models, helping to determine if they offer significant improvements.
- Model Validation: A benchmark model validates the effectiveness of the ML model; if the ML model doesn't outperform it, it may not be capturing the data patterns effectively.
- Simplicity and Interpretability: Benchmark models are typically simpler and more interpretable, offering insights into data relationships.
- Setting Expectations: They help set realistic performance expectations for more complex models by establishing a target based on the benchmark's performance.
Common benchmark models include:
- Mean Predictor: Always predicts the mean of the target variable.
- Median Predictor: Always predicts the median of the target variable.
- Simple Linear Regression: A basic linear model to capture linear relationships.
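The mean and median predictors from the list above are available in scikit-learn as `DummyRegressor`. The sketch below uses a handful of made-up prices to show that such a model ignores the features entirely; it is not part of the course code.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up prices; the features are irrelevant for a dummy model
X_demo = np.zeros((5, 1))
y_demo = np.array([5000.0, 7000.0, 9000.0, 12000.0, 40000.0])

mean_model = DummyRegressor(strategy='mean').fit(X_demo, y_demo)
median_model = DummyRegressor(strategy='median').fit(X_demo, y_demo)

mean_pred = float(mean_model.predict(X_demo[:1])[0])
median_pred = float(median_model.predict(X_demo[:1])[0])
print(mean_pred)    # 14600.0
print(median_pred)  # 9000.0
```

With skewed targets such as prices, the two baselines can differ substantially, because the mean is pulled upward by a few expensive observations.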
In the following, we estimate a linear regression as the baseline model.
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train_transformed, y_train)
Model performance.
y_pred_benchmark = model.predict(X_test_transformed)
plot_predictions(y_test, y_pred_benchmark)
The linear regression yields an RMSE of $7,698 on the held-out test data.
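The helper `plot_predictions` is not shown in this section, so how it computes the reported RMSE is not visible here. As a reminder of what the metric measures, a minimal sketch of the calculation on made-up values could look as follows:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, for illustration only
y_true_demo = np.array([10000.0, 15000.0, 20000.0])
y_pred_demo = np.array([11000.0, 14000.0, 23000.0])

# RMSE is the square root of the mean squared prediction error
rmse = float(np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
print(f"RMSE: {rmse:.2f}")
```

Because the errors are squared before averaging, RMSE penalises large deviations more heavily than small ones and is expressed in the same unit as the target, here dollars.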
Now we (re-)estimate the best-performing gradient boosting model with the fine-tuned hyperparameters and evaluate its out-of-sample performance.
model = GradientBoostingRegressor(
    n_estimators=251,
    max_depth=15,
    learning_rate=0.2,
    max_features='sqrt',
    min_samples_leaf=8,
    random_state=333).fit(X_train_transformed, y_train)
Model performance.
y_pred = model.predict(X_test_transformed)
plot_predictions(y_test, y_pred)
Model Deployment¶
When we talk about deploying a model, we mean making a trained machine learning model available for use. This is like putting a finished product on a store shelf so that customers can buy it.
FastAPI for deployment¶
FastAPI streamlines the deployment process by enabling developers to create APIs (Application Programming Interfaces), which facilitate communication between different software programs. An API exposes an inference endpoint where a trained model is hosted, allowing users to send input data and receive real-time predictions. This integrates the model's capabilities into various applications without requiring users to understand the underlying technology.
!fastapi dev ml_fastapi.py
Access the API documentation at http://127.0.0.1:8000/docs.
Streamlit for deployment¶
Streamlit is an open-source app framework that simplifies the deployment of machine learning models by allowing developers to create interactive web applications with minimal effort. It provides a user-friendly interface for users to input data and interact with the model in real-time, making it easier to demonstrate the model's capabilities and gather user feedback.
!streamlit run ml_streamlit.py
Wait for the local server to open in your web browser.