Brazil vehicle insurance analysis of claim drivers

Summary

This report presents an analysis of a portfolio of policies from the Brazilian automobile insurance industry, based on AUTOSEG (Statistical System for Automobiles) data for the year 2011. The aim of the analysis was to understand the factors that influenced the claims performance of the portfolio, especially with respect to claims frequency (i.e. number of claims) and claims severity (i.e. average claim size). Generalized linear models were used to model the components of claims cost, and gender, driver's age and sum insured were identified as significant factors in explaining the variation in claims frequency and claims severity. The average experience also differed by state, with Sao Paulo experiencing the largest average claims frequency but, interestingly, also the lowest claims severity. A more in-depth description of the methodology, results and conclusions follows below.

Introduction

Gross Premiums Written (GPW) by the Brazilian auto insurance industry were US$14.7BN in 2011 and were estimated at US$20BN in 2019. According to 2017 statistics, the GPW for the entire insurance industry was approximately 4.1% of GDP. These figures highlight the significance of the sector to the social and economic welfare of the country.

The main objective of this report is to examine the trends observed in the claims experience for this book of policies in the calendar year 2011. The questions of interest are: –

  • What were the main drivers of loss/claims costs for the portfolio in question? In particular, did the number and size of claims differ across the demographic factors? If so, how sensitive are the claims to these factors?
  • Did the experience differ by state?

Data

As a starting point, I reviewed the dataset, which contained 1,965,355 vehicle insurance policies in total. It had R$7.3BN (US$4BN in 2011) in GPW, which was approximately 27% of the market. Claims cost amounted to R$4.6BN (US$2.5BN), a claims ratio of 63.4%, arising from 1,484,861 claims, although 81% of policies had no claims. The variables of interest in the data were: –

  • Gender - A character string (“factor”) for the gender (also indicates corporate policies).
  • DrivAge - A character string (“factor”) for the driver age group.
  • VehYear - A numeric for the vehicle year.
  • VehCode - A character string (“factor”) for the vehicle group.
  • State - A character string (“factor”) for the state name.
  • Area - Local area name (“factor”).
  • ExposTotal - Total exposure for period.
  • SumInsAvg - Average of sum insured.
  • ClaimNbRob, ClaimNbPartColl, ClaimNbTotColl, ClaimNbFire, ClaimNbOther - Number of claims during the exposure period, respectively for robbery, partial collision, total collision, fire and other guarantees. These were combined to calculate ClaimsNumberTotal.
  • ClaimAmountRob, ClaimAmountPartColl, ClaimAmountTotColl, ClaimAmountFire, ClaimAmountOther - Claim amounts during the exposure period, respectively for robbery, partial collision, total collision, fire and other guarantees. These were combined to calculate Claims Total.
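
The combination of the per-guarantee columns into totals can be sketched as follows. This is a minimal illustration in plain Python, assuming each policy record is a dictionary keyed by the column names listed above. The sample record is made up for illustration, and `ClaimsTotal` is an assumed spelling of the Claims Total field.

```python
# Column names as listed in the data description above.
CLAIM_COUNT_COLS = [
    "ClaimNbRob", "ClaimNbPartColl", "ClaimNbTotColl",
    "ClaimNbFire", "ClaimNbOther",
]
CLAIM_AMOUNT_COLS = [
    "ClaimAmountRob", "ClaimAmountPartColl", "ClaimAmountTotColl",
    "ClaimAmountFire", "ClaimAmountOther",
]


def add_claim_totals(record):
    """Return a copy of the record with total claim count and amount added."""
    out = dict(record)
    # Missing columns are treated as zero (an assumption of this sketch).
    out["ClaimsNumberTotal"] = sum(record.get(c, 0) for c in CLAIM_COUNT_COLS)
    out["ClaimsTotal"] = sum(record.get(c, 0.0) for c in CLAIM_AMOUNT_COLS)
    return out


# Hypothetical record, not taken from the AUTOSEG data.
example = {
    "ClaimNbRob": 1, "ClaimNbPartColl": 2, "ClaimNbTotColl": 0,
    "ClaimNbFire": 0, "ClaimNbOther": 0,
    "ClaimAmountRob": 4000.0, "ClaimAmountPartColl": 1500.0,
    "ClaimAmountTotColl": 0.0, "ClaimAmountFire": 0.0,
    "ClaimAmountOther": 0.0,
}
totals = add_claim_totals(example)
```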

Exploratory Data Analysis (EDA)

In the exploratory data analysis, I identified that the data had missing information. Upon further investigation, I discovered that Gender had 3 levels: Male, Female and Corporate. The corporate policies had a lot of missing data (e.g. no driver age), but the quantities in the ExposTotal field were large in comparison to the individual policies, leading me to conclude that these were possibly policies covering fleets. As a result, I decided to exclude these policies from the analysis and focus only on individual policies.

As part of the EDA, a lot of interesting trends and observations were identified which were incorporated in the model building process. Some of the key findings are noted below.

The illustration above shows how the average size of a claim (bottom panel) and the average number of claims (top panel) vary by Gender over the different age groups, for select states (Sao Paulo, Rio de Janeiro and Minas Gerais). As can be seen in the plots, Males tend to have a higher average claim size, and this varies by age group. VehCode was considered as a potential variable of interest, but it was highly related to sum insured, so only one of the two would be used for modelling, to avoid multicollinearity.

Model

Model Selection

To get higher resolution into the factors underlying the performance of the portfolio, and to avoid the confounding effect of two opposing trends cancelling each other out and hence appearing insignificant at the aggregate level, the loss cost was decomposed as follows: –

\(Loss\:Cost=\frac{Total\:Claims}{Exposure\:Total}=\frac{Number\:of\:Claims}{Exposure\:Total}\times\frac{Total\:Claims}{Number\:of\:Claims}=(Claims\:Frequency){\times}(Claims{\:}{Severity})\)

Two separate models were then created for Claims Frequency and Claims Severity, using Poisson and Gamma distributions respectively. The structure of the models is explained further in the Final Models section. The Poisson and Gamma distributions were chosen because their support is non-negative and because they can accommodate the long right tail that was apparent in the EDA plots of the distributions of claim frequency and severity.
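
The decomposition of loss cost into frequency and severity can be checked numerically. The sketch below is illustrative only: the function name and the input figures are hypothetical, and it simply confirms that frequency multiplied by severity recovers the loss cost.

```python
def loss_cost_components(total_claims, number_of_claims, exposure_total):
    """Split loss cost into claims frequency and claims severity."""
    if exposure_total <= 0:
        raise ValueError("exposure_total must be positive")
    frequency = number_of_claims / exposure_total
    # Severity is only defined when there is at least one claim.
    severity = total_claims / number_of_claims if number_of_claims > 0 else 0.0
    loss_cost = total_claims / exposure_total
    return frequency, severity, loss_cost


# Hypothetical figures: 4 claims totalling 12,000 over 20 units of exposure.
freq, sev, cost = loss_cost_components(
    total_claims=12000.0, number_of_claims=4, exposure_total=20.0
)
# freq * sev equals cost (up to floating-point rounding).
```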

Model selection was undertaken by creating various models to account for different interactions and effects, considering both random and fixed effects. EDA findings and forward and backward selection were used in conjunction with ANOVA tests to determine candidate predictors and interactions to add to or remove from the models. Additionally, the AIC of the models and the p-values of the predictors were assessed at each step to check model fit and prevent overfitting. In the final models chosen, the majority of predictors were significant and the AIC was lower than that of the next best models, with the significance of the difference confirmed by ANOVA tests.

State was the only hierarchy eventually used. Nested hierarchies using Area were considered as well, but models including these did not converge. Random slopes, such as for Gender and Driver Age, were also considered; however, the models again failed to converge for any random slopes, so only random intercepts by State were used in the model building process.

Final Models

The final models chosen were as follows: –

Claim Frequency

\(log(\lambda_{ij})=\beta_0+\gamma_{0j}^{State} +\beta_1{GenderMale_{i}} +{\sum_{j=2}^5}\beta_{2j}{DrivAge_{ij}} +\beta_3{SumInsAvg_{i}} +{\sum_{j=2}^5}\beta_{4j}{GenderMale_i:DrivAge_{ij}} +{\sum_{j=2}^5}\beta_{5j}{{SumInsAvg}_i:DrivAge_{ij}} + log({Exposure\:Total}_{i})\)

where \(ClaimNumber_{i} \mid x_{ij} \sim Poisson(\lambda_{ij})\) and \(\gamma_{0j}^{State} \sim N(0,\sigma^2)\)

Fixed Effects
Term                      exp(Est.)  2.5%     97.5%    z val.     p
(Intercept)               0.19872    0.18024  0.21911  -32.43393  0.00000
GenderMale                0.79268    0.78600  0.79943  -53.76766  0.00000
DrivAge18-25              0.89001    0.87431  0.90599  -12.83155  0.00000
DrivAge26-35              0.85754    0.84936  0.86579  -31.43867  0.00000
DrivAge36-45              0.85835    0.85080  0.86598  -33.86064  0.00000
DrivAge46-55              0.97949    0.97045  0.98861   -4.38130  0.00001
SumInsAvg                 0.87425    0.86719  0.88136  -32.51229  0.00000
GenderMale:DrivAge18-25   1.65656    1.62173  1.69213   46.56025  0.00000
GenderMale:DrivAge26-35   1.47979    1.46201  1.49778   63.55781  0.00000
GenderMale:DrivAge36-45   1.13345    1.12058  1.14646   21.50899  0.00000
GenderMale:DrivAge46-55   1.04924    1.03668  1.06195    7.82312  0.00000
DrivAge18-25:SumInsAvg    1.16396    1.13742  1.19112   12.90022  0.00000
DrivAge26-35:SumInsAvg    1.04539    1.03283  1.05811    7.19610  0.00000
DrivAge36-45:SumInsAvg    1.15875    1.14718  1.17043   28.78803  0.00000
DrivAge46-55:SumInsAvg    1.06150    1.04990  1.07323   10.64747  0.00000

Claim Severity

\(log(\mu_{ij})=\beta_0+\gamma_{0j}^{State} +\beta_1{GenderMale_{i}} +{\sum_{j=2}^5}\beta_{2j}{DrivAge_{ij}} +\beta_3{SumInsAvg_{i}} +{\sum_{j=2}^5}\beta_{4j}{GenderMale_i:DrivAge_{ij}} +{\sum_{j=2}^5}\beta_{5j}{SumInsAvg_i:DrivAge_{ij}}\)

where \({ClaimSeverity}_{i} \mid x_{i} \sim Gamma(\alpha_{i},\lambda_{i})\) and \(\gamma_{0j}^{State} \sim N(0,\sigma^2)\)

Fixed Effects
Term                      exp(Est.)  2.5%     97.5%    t val.     p
(Intercept)               2.24913    2.04749  2.47063   16.91313  0.00000
GenderMale                1.27468    1.26146  1.28804   45.61882  0.00000
DrivAge18-25              1.74252    1.70552  1.78033   50.70972  0.00000
DrivAge26-35              1.43276    1.41657  1.44914   62.00845  0.00000
DrivAge36-45              1.28679    1.27326  1.30047   46.74329  0.00000
DrivAge46-55              1.14945    1.13678  1.16227   24.62609  0.00000
SumInsAvg                 1.69923    1.68562  1.71296  129.15353  0.00000
GenderMale:DrivAge18-25   1.12817    1.09882  1.15830    8.96770  0.00000
GenderMale:DrivAge26-35   0.99041    0.97572  1.00531   -1.26516  0.20582
GenderMale:DrivAge36-45   1.05769    1.04297  1.07263    7.84069  0.00000
GenderMale:DrivAge46-55   1.05099    1.03546  1.06675    6.54784  0.00000
DrivAge18-25:SumInsAvg    1.09924    1.07076  1.12847    7.06462  0.00000
DrivAge26-35:SumInsAvg    1.03977    1.02717  1.05251    6.27120  0.00000
DrivAge36-45:SumInsAvg    0.85682    0.84828  0.86544  -30.23269  0.00000
DrivAge46-55:SumInsAvg    0.89208    0.88250  0.90177  -20.72510  0.00000

In the final models, Female drivers aged 55 and older form the baseline levels, which are absorbed into the intercept of both the Claim Frequency and Claim Severity models.
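
Because both models use a log link, the exp(Est.) values act as multiplicative factors on the baseline. The sketch below shows how the Claim Frequency table could be read for a given profile. It is illustrative only and rests on several assumptions: SumInsAvg is held at its reference value (so its terms drop out), the State random intercept is taken as zero, exposure is one unit, and the label `"55+"` for the baseline age group is introduced here for convenience.

```python
# exp(Est.) values copied from the Claim Frequency fixed-effects table.
FREQ_EXP_EST = {
    "(Intercept)": 0.19872,
    "GenderMale": 0.79268,
    "DrivAge18-25": 0.89001,
    "DrivAge26-35": 0.85754,
    "DrivAge36-45": 0.85835,
    "DrivAge46-55": 0.97949,
    "GenderMale:DrivAge18-25": 1.65656,
    "GenderMale:DrivAge26-35": 1.47979,
    "GenderMale:DrivAge36-45": 1.13345,
    "GenderMale:DrivAge46-55": 1.04924,
}
BASELINE_AGE = "55+"  # assumed label for the baseline age group


def expected_claim_frequency(gender, age_group, exposure=1.0):
    """Expected claims for a profile, at reference SumInsAvg and State."""
    rate = FREQ_EXP_EST["(Intercept)"]
    if gender == "Male":
        rate *= FREQ_EXP_EST["GenderMale"]
    if age_group != BASELINE_AGE:
        rate *= FREQ_EXP_EST["DrivAge" + age_group]
        if gender == "Male":
            # Interaction term applies only to non-baseline males.
            rate *= FREQ_EXP_EST["GenderMale:DrivAge" + age_group]
    return rate * exposure


baseline = expected_claim_frequency("Female", BASELINE_AGE)
young_male = expected_claim_frequency("Male", "18-25")
relativity = young_male / baseline  # about 1.17 under these assumptions
```

Under these assumptions, a male driver aged 18-25 has an expected frequency roughly 17% above the female 55-and-older baseline, since the main effects (both below 1) are more than offset by the interaction term.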

Model Assessment

Model assessment was primarily undertaken by considering the RMSE of both models. The final RMSE values for the claims frequency and claims severity models were 4.67 and 54228 respectively. Multicollinearity was also checked in both models using the generalized VIF. No factor had a generalized VIF value above 5 in either model, suggesting no problematic multicollinearity. The plot of predicted vs actual values showed a linear trend with a few outliers, but nothing raised immediate concerns. The outliers did, however, signal a potential future need to split the modelling of claims into attritional (moderate) claims and very large claims. This is a consideration because in the insurance industry, it has been observed that extreme claim events tend to follow different claim distributions from attritional claims, justifying the use of extreme value distributions such as the Gumbel distribution.
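
For reference, the RMSE used above is the square root of the mean squared difference between actual and predicted values. The following is a minimal sketch of that calculation; the function name and the toy inputs are hypothetical and unrelated to the portfolio data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: one prediction is off by 2, the others are exact.
toy_rmse = rmse([0, 1, 3], [0, 1, 1])  # sqrt(4 / 3)
```

Because RMSE is expressed in the units of the response, the frequency and severity figures are not comparable with each other; each is only meaningful relative to alternative models for the same response.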

Results

Of primary interest in the model summaries shown above is that the maximum likelihood estimates of the coefficients for the economic and demographic indicators (Gender, Driver's Age, Sum Insured, and the interactions of Driver's Age with Gender and Sum Insured) are significant in both the Claims Frequency and Claims Severity models, with the exception of GenderMale:DrivAge26-35 in the claims severity model. This suggests that the expected number of claims and the average size of each claim differ significantly by demographic group. For example, the models suggest that males aged 55 and older have 21% fewer claims than their female counterparts, but an average claim size that is 27% higher. Similarly, female drivers aged 55 and older seem to experience a higher number of claims but a lower size per claim when compared to the other age groups; e.g. the 26-35 age group has an approximately 14% lower expected number of claims but a 43% higher average claim size. The interactions highlight that as male drivers get younger, the expected number of claims and the average size of a claim both increase, with the rate of increase highest for 18-25 drivers, at 66% and 12% respectively. On the other hand, it was interesting to note that as the sum insured increased, the expected number of claims decreased while the average size of a claim increased (although at a slightly higher rate). The above comments on the sensitivity of claims severity and frequency to changes in the demographic or economic factors are made assuming all other factors remain constant.

The expected number of claims per policy for females aged 55 and over with an average sum insured of R$38,287 is 0.2, i.e. we expect 2 claims from every 10 insured policies with this demographic profile. This is not surprising, given the large proportion of policies in the data that had no claims. As reflected in the dotplot above for claims frequency, this experience varies by state, with Sao Paulo roughly doubling the expected frequency (\(e^{0.746}-1 \approx 1.11\), i.e. an increase of about 111%) and Rio Grande do Sul decreasing it by 38%.
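
The conversion between a State random intercept on the log scale and a percentage change in expected frequency can be sketched as below. The helper names are hypothetical; the value 0.746 is the Sao Paulo intercept quoted above, and the log-scale value for Rio Grande do Sul is back-calculated from the quoted 38% decrease rather than taken from the model output.

```python
import math


def pct_change_from_log_effect(log_effect):
    """Percentage change in the expected value implied by a log-scale effect."""
    return (math.exp(log_effect) - 1.0) * 100.0


def log_effect_from_pct_change(pct_change):
    """Inverse conversion: log-scale effect implied by a percentage change."""
    return math.log(1.0 + pct_change / 100.0)


sao_paulo_pct = pct_change_from_log_effect(0.746)       # about +110.9%
rio_grande_log = log_effect_from_pct_change(-38.0)      # about -0.478
```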

Given that a policy receives at least one claim, the model predicts that the average size of a claim will be R$2,249 for females aged 55 and over with an average sum insured of R$33,872. This varies by state, with Parana increasing the average claim size by 67% and Sao Paulo, at the opposite end, reducing it by 46%.

Conclusion

The above results show that gender, driver's age and the sum insured on the policy are significant indicators of the difference in policy experience with respect to the number of claims and the average size of each claim. In addition, the expected average experience for both claims frequency and severity differs by state to a fairly significant extent. This investigation shows that it could be detrimental to model the aggregate claims distribution directly, because some trends offset each other, which can make a factor appear insignificant at the aggregate level. For example, being Female seems to increase claims frequency but reduce claims severity, with the reverse true for Males. At an aggregate level, Gender might therefore appear not to be a significant predictor of loss/claims cost, but depending on the insurer's specific circumstances, the underlying trends might have different implications. For example, if claims processing costs are high for an insurer, the frequency of claims might be of particular importance to them. For claims severity, the sum insured seems to be the most important predictor, so the insurer could use it as the main risk proxy for managing claims severity exposure, or as a dashboard metric.
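
The offsetting effect of Gender can be illustrated with the GenderMale estimates from the two fixed-effects tables. The sketch below multiplies the frequency and severity relativities for males aged 55 and older relative to females of the same age. It is a rough illustration only: it assumes the two effects can be multiplied directly, ignoring that the two models were fitted on different subsets with different reference sums insured, and it ignores the State effects.

```python
# GenderMale exp(Est.) from the Claim Frequency and Claim Severity tables.
MALE_FREQUENCY_RELATIVITY = 0.79268
MALE_SEVERITY_RELATIVITY = 1.27468


def loss_cost_relativity(frequency_relativity, severity_relativity):
    """Loss cost relativity implied by frequency and severity relativities."""
    return frequency_relativity * severity_relativity


male_loss_cost_rel = loss_cost_relativity(
    MALE_FREQUENCY_RELATIVITY, MALE_SEVERITY_RELATIVITY
)
# About 1.01: fewer claims and larger claims nearly cancel in loss cost.
```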

It is important to note at this stage some limitations of this analysis and modelling framework. Firstly, residual plots for the models were produced, but because most variables are categorical, they were difficult to interpret. Model assessment was limited in this respect, and considerable weight was placed on RMSE and on the consistency of performance between the training and test sets. Secondly, the removal of Corporate policies from the data might have introduced indirect bias into the analysis, so a separate analysis of these policies might be needed to check whether the inferences on individual policies translate to corporate ones. Finally, only a limited number of variables were included in the data, so a more comprehensive data set might be required to further reduce the residual variation that is still unexplained by the models.

For further work, I would consider comparing the model performance with more advanced algorithms such as random forests. Modelling the aggregate claims/loss cost using a Tweedie distribution would also be beneficial, to compare performance with the approach taken here.

Appendix

References