Here I will show you how to use generalised linear models (GLMs) with mortality claims data. We will fit a Poisson GLM to model count data using the h2o package. This post will cover:
- Preparing data
- Basic h2o functions
- Use of offsets
- Fitting a basic GLM
- Extracting a rating plan from the model
This article will be split into two parts - part 1 will cover the above and part 2 will cover cross validation and 'tuning' a GLM using h2o's grid search functionality. Finally, in a later post we will look at how to build simple pricing apps using the models we develop.
Background¶
Imagine a portfolio of lives with lump sum death benefit cover - we want to be able to model how many claims will arise from this portfolio. The number of claims will obviously vary depending on how long each life has had insurance cover in place.
GLMs are widely used by actuaries to set rating plans that are used to price insurance products - particularly in general insurance (personal lines etc). One of the benefits of using GLMs is that it's usually easy to extract an interpretable, often simple, rating plan from the model (this is discussed in more detail later). A GLM also allows us to update an existing rating plan rather than set an entirely new one. An existing rating plan may have been set, for example, using standard tables.
Data¶
I'll use a simulated dataset created in a previous post. This dataset contains 1.5m records. Each record represents an individual and the columns contain their age, gender, occupation, location, salary details, exposure (how long, in years, the individual was insured) and whether or not a claim was made under their policy. To create this dataset you can follow the previous tutorial or alternatively any similar dataset should be fine.
# Read in the data
dt = read.csv('SimData.csv')
head(dt)
For ideas on exploring this dataset using summary statistics and visualisations, refer to this post.
Import Libraries¶
The h2o package will be used extensively. The dplyr package will also be needed. Import these libraries using the following commands:
library(h2o)
library(dplyr)
In order to use h2o functions, we need to initialise an h2o cluster using the h2o.init() function. For example:
h2o.init(nthreads = -1)
# convert data to h2o frame
h2o_data = as.h2o(dt)
Existing Plan¶
Often, a rating plan is already in place and rather than start from scratch we may just want to add to (or update) this existing model. For example, we might already have a best estimate of claim rates for the general population which aren't being updated as part of this exercise, but we want to use the GLM to refine these rates. In this example I'll define the existing estimate of claim rates to be:
- Individuals under 40 will be assumed, on average, to claim at a rate of $\frac{1}{1000}$
- Individuals aged 40-49 have a claim rate of $\frac{1.5}{1000}$
- Ages 50+ have a claim rate of $\frac{2}{1000}$
These baseline rates vary by age but not gender, occupation and so on. The GLM can be used to refine these initial estimates so that the final predicted claim count allows for our other predictor variables.
Next, we need to add these base rates to each record. Of course, each individual won't necessarily have been the same age for the entire period they were covered under a policy! However, for simplicity I'm going to assume that age is the average age over the exposure period and set our best estimate of mortality based on this average age. In practice, a more refined expectation of mortality could be used, based on age varying over the exposure period.
# Assign a base rate to each record based on age
ages = as.vector(h2o_data$Age_Last)
Base_Rate = ifelse(ages %in% 0:39, 0.001,
                   ifelse(ages %in% 40:49, 0.0015, 0.002))
h2o_data$Base_Rate = h2o.asnumeric(as.h2o(Base_Rate))
To allow for these base rates in the GLM, an offset column needs to be added.
Adding an Offset Column¶
An offset is a covariate in the model with a coefficient constrained to be one (rather than being estimated). In this model, the offset needs to do two jobs - one part allows for exposure and the other allows for the fact that we do not wish to update our base rates as part of this process. As we'll see below, the two parts combine into a single offset column.
Why use an offset to allow for exposure?¶
Mathematically, the Poisson model takes the form:
$$log(\mu) = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
However, since our response variable is actually claims per unit of exposure, we are modelling a rate, which takes the form:
$$log(\frac{\mu}{exposure}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
This is equivalent to:
$$log(\mu) - log(exposure) = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
Finally, $log(exposure)$ enters the right hand side of the equation as an offset with coefficient constrained to one, giving:
$$log(\mu) = log(exposure) + \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
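As an aside, this is how the exposure offset is expressed in base R's glm(), shown purely for illustration (a minimal sketch, assuming the raw data.frame dt loaded earlier; in h2o the equivalent is the offset_column argument used later in this post):
# log(exposure) enters the linear predictor with its coefficient fixed at one
fit = glm(Claim ~ Age_Last + Male, family = poisson(),
          offset = log(Exposure), data = dt)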
Why use an offset for base rates?¶
As mentioned earlier, our base rates are derived outside of this model and are not being updated here. However, these base rates will still form part of the overall rating plan, and so the GLM needs to be made aware of them. The remaining loadings are then optimised on top of the applied base rates. Therefore, base_rate enters the RHS of the equation as follows:
$$log(\mu) = log(exposure) + log(base\_rate) + \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
And since $log(a) + log(b) = log(ab)$ this can be simplified as:
$$log(\mu) = log(exposure\times base\_rate) + \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
Which is equivalent to:
$$log(\mu) = log(\mathbb{E}[deaths]) + \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$
Therefore, the final offset to be included in the model is $log(\mathbb{E}[deaths])$.
So, next we need to calculate $\mathbb{E}[deaths]$ for each record. This will be our base rate multiplied by the exposure period. Then the data can be split into train, test and holdout sets and the offset columns are added.
# Calculate the 'expected deaths' for each record
h2o_data$Expected_Deaths = h2o_data$Exposure * h2o_data$Base_Rate
# First, set Occupation and Location variables as nominal factor variables
# (salary band isn't strictly nominal but we'll treat it as such here)
h2o_data[c('Occupation', 'Location', 'Salary_Band')] =
h2o.asfactor(h2o_data[c('Occupation', 'Location', 'Salary_Band')])
# Separate into a train, test and holdout set
# 60% in training and 20% in test & holdout respectively
split = h2o.splitFrame(h2o_data, ratios = c(0.6,0.2), seed = 123)
train = split[[1]]
test = split[[2]]
holdout = split[[3]]
Since the dataset is large and the claim event is rare, I am going to reduce the datasets using aggregation. In the aggregated datasets each row will represent a feature set, and the claims and exposure columns will be the sum of all claims/exposure in that given set. For a Poisson GLM this makes no difference to the fit - the coefficient estimates are identical whether we use aggregated or full individual data, provided claims and the offset quantity are summed within each cell.
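To see why this is safe, here is a minimal base R sketch of the equivalence on a single predictor (using glm() purely for illustration, and assuming the train frame created above):
# Fit on individual rows vs aggregated cells - the coefficients should match
dat = as.data.frame(train)
fit_ind = glm(Claim ~ factor(Male), family = poisson(),
              offset = log(Expected_Deaths), data = dat)
agg_dat = aggregate(cbind(Claim, Expected_Deaths) ~ Male, data = dat, FUN = sum)
fit_agg = glm(Claim ~ factor(Male), family = poisson(),
              offset = log(Expected_Deaths), data = agg_dat)
cbind(individual = coef(fit_ind), aggregated = coef(fit_agg))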
# Define a helper function to aggregate the train, test and holdout sets
agg = function(h2o_frame){
dat = as.data.frame(h2o_frame)
agg_data = aggregate(x = dat[c('Exposure','Claim','Expected_Deaths')],
by = list(Age = dat$Age_Last,
Male = dat$Male,
Occupation = dat$Occupation,
Location = dat$Location,
Salary_Band = dat$Salary_Band),
FUN = sum)
agg_data
}
# create aggregated versions of datasets
train_aggregated = as.h2o(agg(train))
test_aggregated = as.h2o(agg(test))
holdout_aggregated = as.h2o(agg(holdout))
# Aggregated data now looks like:
head(train_aggregated)
# Now, add the offsets to train, test and holdout sets
train_aggregated$Offset = log(train_aggregated$Expected_Deaths)
test_aggregated$Offset = log(test_aggregated$Expected_Deaths)
holdout_aggregated$Offset = log(holdout_aggregated$Expected_Deaths)
# Finally, let's just make sure we set all our factor variables properly in the
# aggregated data sets
factorVars = c('Salary_Band', 'Occupation', 'Location', 'Male')
train_aggregated[factorVars] = h2o.asfactor(train_aggregated[factorVars])
test_aggregated[factorVars] = h2o.asfactor(test_aggregated[factorVars])
holdout_aggregated[factorVars] = h2o.asfactor(holdout_aggregated[factorVars])
Fitting the GLM¶
The GLM will be built using the h2o package. First, define the set of predictors we will use and the name of the column to be used as an offset. Then, the model is fit using h2o.glm(). Be careful to set the family to 'poisson'. Setting lambda = 0 means that this baseline model will not apply any regularisation. To check the performance against holdout data, use h2o.performance().
# Fit a basic GLM - with NO regularisation (discussed in part 2)
# h2o applies elastic-net regularisation by default; setting lambda = 0
# disables it (alpha, the L1/L2 mixing parameter, then has no effect).
# Note: compute_p_values = TRUE requires an unregularised model (lambda = 0).
predictors = c('Salary_Band', 'Occupation', 'Location', 'Male')
response = 'Claim'
offset = 'Offset'
mod1 = h2o.glm(x = predictors,
y = response,
training_frame = train_aggregated,
validation_frame = test_aggregated,
family = 'poisson',
offset_column = offset,
alpha = 0, lambda = 0,
intercept = TRUE,
compute_p_values = TRUE,
seed = 123)
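With the model fit, we can check its performance against the holdout data mentioned above using h2o.performance():
# Evaluate the model on unseen (holdout) data
h2o.performance(mod1, newdata = holdout_aggregated)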
Extracting the Ratings¶
To set a rating plan we need to extract the coefficient estimates from the model and convert these to ratings. Since the Poisson GLM uses a log link function, we can get to the final ratings by exponentiating both sides of the GLM equation. Solving for $\mu$, the equation becomes:
$$\mu = \exp(log(\mathbb{E}[deaths]) + \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n) $$
Which is equivalent to:
$$\mu = \mathbb{E}[deaths] \times \exp(\alpha) \times \exp(\beta_1 x_1) \times \exp(\beta_2 x_2) \times ... \times \exp(\beta_n x_n)$$
We can see that the final rating plan will therefore be multiplicative.
# Extracting the ratings
mod1_ratings = mod1@model$coefficients_table
mod1_ratings$rating = exp(mod1_ratings$coefficients)
# View the table of coefficients
# (rounded to 3 decimal places using dplyr's mutate_if)
mod1_ratings %>%
  mutate_if(is.numeric, ~ round(., 3))
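If you want to apply the ratings programmatically rather than read them off the table, one option is a named lookup vector (a small hypothetical helper - it relies on the names column of h2o's coefficients table):
# Map each model term to its multiplicative rating
rating_lookup = setNames(mod1_ratings$rating, mod1_ratings$names)
rating_lookup['Intercept']  # exp(alpha), the intercept rating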
The final rating plan takes the form:
Base rate $\times$ exposure $\times$ intercept $\times$ salary rating $\times$ location rating $\times$ occupation rating $\times$ gender rating
This simple plan can then be used to estimate the claim rate for an individual or a group of lives with similar features. We can work through the first row in the holdout dataset to see how the model could be used in practice.
head(holdout, 1)
The ratings for a male in occupation class 2, living in location category 4 and in salary band 4 are as follows:
- Gender - 0.919
- Occupation - 0.859
- Location - 1.182
- Salary band - 0.986
So the claim rate would be:
$$ 0.001 \times 1.124758 \times 0.919 \times 0.859 \times 1.182 \times 0.986 = 0.00103$$
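The same calculation in R, using the rounded ratings quoted above (the 0.001 base rate applies because this life is under 40):
# Reproduce the worked example for this holdout profile
base_rate = 0.001
intercept = 1.124758                  # exp(alpha) from the fitted model
ratings = c(gender = 0.919, occupation = 0.859,
            location = 1.182, salary = 0.986)
base_rate * intercept * prod(ratings) # ~0.00103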
Conclusion¶
You should now be able to fit a basic GLM using the h2o package and use the resulting model to set an insurance rating plan. In part 2 I will cover:
- Cross validating the model
- Automatically tuning the model using h2o's grid search functionality - this allows us to try hundreds (even thousands) of different GLMs with different input parameters, in order to select the best model for this dataset