Let’s talk about the confusion matrix.
I have some things I need to get off my chest related to this construct and how actuaries, including me, have leveraged it in the accelerated underwriting (AU) space. First off, let’s address where the confusion matrix came from. Actuaries who have been involved in predictive analytics should be quick to recognize that the confusion matrix is an essential tool for evaluating machine-learning classification models. Medical studies have used the confusion matrix to determine the efficacy of different diagnostic tests.
On the surface, a confusion matrix is nothing more than a table showing counts of actual and predicted results for a given number of categories. However, using this matrix, other disciplines have created a variety of metrics to help choose between different models or evaluate the performance of a diagnostic test. For those interested in a very straightforward summary related to machine learning methods and analysis, including the confusion matrix, see Jeff Heaton’s article “Regression and Classification: A Deeper Look” in the July 2016 issue of the Predictive Analytics and Futurism newsletter.
How did the confusion matrix make it to the AU space? Some might assume that AU uses machine learning classification models and therefore, actuaries are using them to help determine the most appropriate model to classify risks. While this can be true for some programs, the more widespread use of the confusion matrix is tied to AU auditing programs. For the purposes of this article, I’m going to assume that the reader is familiar with random holdout (RHO) and post-issue attending physician statement (APS) audits. For background on these topics, you can read “Accelerated Underwriting: Checking the Gauges” in the July 2019 issue of Product Matters!, authored by Taylor Pickett and me.
So, why did we start using the confusion matrix to summarize AU audit results? Audit results consist of cases that have been underwritten twice—once using a more traditional method and then again using the accelerated process. This data structure, an applicant with two observed outcomes, seemed to fit the confusion matrix construct well, with the traditional underwriting decision serving as the “actual” outcome and the accelerated decision as the “predicted” outcome. One wrinkle here is that a confusion matrix traditionally has the same number of rows and columns. In other words, the list of actual categories is identical to the list of predicted categories. For AU audits, we typically have a situation where certain actual risk classes are not available as a predicted risk class. This is because AU decisions are typically limited to standard or better risks, whereas the actual risk class decision can be substandard or even a decline.
With the actual and predicted risk classes defined, categorical asymmetry aside, the confusion matrix was a natural fit for summarizing results from an audit sample. So, now we can leverage all those metrics that other disciplines have created for analyzing model performance, right? Wrong. Unfortunately, most of the metrics that have been developed by other disciplines have been largely ignored by actuaries in our analysis of audit results. The reason for this is two-fold. First, many of the metrics that have been developed are used in choosing between multiple models. In the case of an audit sample, we only have one AU process, and the result is what it is; there is no need to compare the metric to an alternate process. Second, one of the main uses of audit results is to try to quantify the mortality impact for an AU program. This problem is unique to our discipline, and as such, it has spurred the creation of other metrics and calculations, which have a variety of assumptions associated with them. It is those assumptions that I would like to spend the remainder of this article discussing.
Now that we’ve established the origins of the confusion matrix and how actuaries have borrowed it for the purposes of AU, let’s dig into the assumptions behind calculating the mortality impact from audit results. In particular, let’s talk about relative mortality, assigning actual audit results, and lastly, on-top adjustments. Let me start by giving a short definition of what I mean by mortality impact. At the policy level, mortality impact is simply the relative mortality for the actual class divided by the relative mortality for the predicted class.
Table 1: Relative Mortality (table not reproduced here; its columns were Risk Class and Relative Mortality)

Table 2: Mortality Impacts (table not reproduced here)
In Tables 1 and 2, you can see an illustrative example of relative mortalities and the associated mortality impacts for a selection of actual and predicted classes. With this definition of mortality impact, the result is a factor that could be applied to our pre-AU mortality expectation to adjust for the impact from misclassification associated with AU. It is also possible for these factors to be less than 100 percent, implying that sometimes the AU decision may be more conservative than what traditional underwriting would have been. With each audit case assigned a mortality impact, you can then determine the mortality impact for any given cohort, such as by predicted class, by summing up the mortality impacts for that cohort and dividing by the number of observations in that cohort (a simple average). You can also weight the mortality impact by face amount to get a view of mortality impact by amount. I’d like to point out here that with this methodology, we can calculate mortality impact at a policy level and in aggregate without even referencing a confusion matrix. Thus, a confusion matrix is a useful way to visualize audit results and provide some shortcuts to the calculations for mortality impact, but it is not synonymous with mortality impact nor necessary to calculate it.
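As a minimal sketch of the calculation just described, the following computes policy-level mortality impacts and then aggregates them by simple average and by face amount. All relative mortalities, audit cases, and face amounts are made-up illustrative values, not the figures from Tables 1 and 2:

```python
# Illustrative relative mortalities (as a multiple of aggregate standard).
relative_mortality = {
    "Preferred": 0.85,
    "Standard": 1.00,
    "Table 2": 1.50,
    "Decline": 4.00,
}

def policy_impact(actual, predicted):
    """Policy-level mortality impact: relative mortality of the actual
    class divided by relative mortality of the predicted class."""
    return relative_mortality[actual] / relative_mortality[predicted]

# Hypothetical audit cases: (actual class, predicted class, face amount).
audits = [
    ("Preferred", "Preferred", 250_000),
    ("Standard", "Preferred", 500_000),
    ("Preferred", "Standard", 100_000),  # AU more conservative: impact < 1
]

impacts = [policy_impact(a, p) for a, p, _ in audits]

# Simple average (by count) and face-amount-weighted average.
by_count = sum(impacts) / len(impacts)
by_amount = (sum(i * f for i, (_, _, f) in zip(impacts, audits))
             / sum(f for _, _, f in audits))
```

Note that nothing here touches a confusion matrix: the impacts are computed case by case and then averaged, exactly as described above.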
With mortality impact outlined, the first question becomes what is the relative mortality? In simplest terms, relative mortality is the mortality outcome of one group relative to the mortality outcome of a reference group. For life insurance, the reference group is typically aggregate standard (non-rated) mortality. In terms of a traditional mortality study, one could create relative mortality percentages by taking actual to expected ratios for each risk class and divide them by the overall actual to expected ratio (excluding substandard risks). This shows how each risk class performed, on average, relative to the overall result of the study.
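A quick sketch of that calculation, using hypothetical study results (the actual and expected claim counts are illustrative): relative mortality for each class is its actual-to-expected ratio divided by the overall actual-to-expected ratio, with substandard classes excluded from the reference aggregate.

```python
# Hypothetical mortality study results by risk class (illustrative values).
study = {
    "Preferred": {"actual": 80, "expected": 100},
    "Standard":  {"actual": 115, "expected": 100},
    "Table 2":   {"actual": 150, "expected": 100},  # substandard; excluded below
}

# Reference group: aggregate standard-or-better (excluding substandard risks).
standard_or_better = ["Preferred", "Standard"]
overall_ae = (sum(study[c]["actual"] for c in standard_or_better)
              / sum(study[c]["expected"] for c in standard_or_better))

def relative_mortality(risk_class):
    """A/E ratio for the class divided by the overall A/E ratio,
    showing how the class performed relative to the study overall."""
    class_ae = study[risk_class]["actual"] / study[risk_class]["expected"]
    return class_ae / overall_ae
```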
When it comes to determining relative mortalities, there are a few key considerations. The source for relative mortality could be from a mortality study as I just mentioned, or it could be from an internal mortality assumption. When using mortality assumptions, should you use the duration 1 difference between your preferred classes or take an actuarial present value? An important consideration with duration 1 differences is that they may vary by age, as seen with the 2015 VBT RR tables. It’s very common for AU demographics to skew toward younger ages, so should you adjust the duration 1 relative mortality to account for this? Fortunately, there are no wrong answers here. These are instead items that should be discussed by the actuaries and underwriters involved in the AU program to ensure all are comfortable with the estimated mortality impacts.
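To illustrate the duration-1 versus present-value choice, here is a simplified sketch comparing the two. The mortality rates and discount factor are made up, and the present value here ignores survivorship and lapses for brevity, so it is a simplification of a true actuarial present value:

```python
# Hypothetical qx vectors by duration for two risk classes (illustrative).
qx_preferred = [0.0004, 0.0005, 0.0007, 0.0009]
qx_standard  = [0.0006, 0.0008, 0.0010, 0.0012]
v = 0.97  # illustrative annual discount factor

def duration1_relative(qx_a, qx_b):
    """Relative mortality using only the first-duration rates."""
    return qx_a[0] / qx_b[0]

def apv_relative(qx_a, qx_b, v):
    """Relative mortality as a ratio of discounted mortality rates,
    a simplified present-value view (ignores survivorship and lapses)."""
    pv_a = sum(q * v ** (t + 1) for t, q in enumerate(qx_a))
    pv_b = sum(q * v ** (t + 1) for t, q in enumerate(qx_b))
    return pv_a / pv_b
```

With these inputs the two approaches give noticeably different answers, which is exactly why the choice should be discussed and agreed upon by the actuaries and underwriters involved.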
With actuaries keenly dialed into mortality assumption setting, determining the relative mortality for standard or better classes can be straightforward. Declines, on the other hand, are one area where we don’t typically have mortality experience. We need a relative mortality assumption for declines because it is one of the possible “actual” results for our audit cases. The question is, what is the mortality for a declined case? There are a few factors that I think about when considering how to set the relative mortality assumption for declines. Number one, what is the maximum table rating that your company issues? Number two, what are common reasons for cases being declined? Number three, what is the mix (if known) of tobacco and non-tobacco applicants for declined cases? The maximum table rating helps set the stage for what the highest possible issued medical risk might look like. Because not all decline reasons are medical, looking at the common decline reasons could lead you to pull back your assumption from the maximum table rating. Taking the maximum table rating, adjusted for decline reasons, and blending based on an assumed mix of non-tobacco/tobacco users, we can land on an estimate for the relative mortality of declines.
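The three-step blend just described can be sketched as follows. Every input below is an illustrative assumption for the sketch, not a recommended value, and real programs would set each one from their own decline data and judgment:

```python
# Illustrative assumptions only -- not recommended values.
max_table_rating = 5.00       # e.g., highest issued table at 500% of standard
medical_share = 0.70          # assumed share of declines driven by medical reasons
nonmedical_relative = 1.50    # assumed mortality level for non-medical declines
tobacco_share = 0.20          # assumed mix of tobacco applicants among declines
tobacco_multiple = 2.00       # assumed tobacco vs. non-tobacco mortality multiple

def decline_relative_mortality():
    """Start from the maximum table rating, pull back for non-medical
    decline reasons, then blend across the tobacco/non-tobacco mix."""
    nontobacco = (medical_share * max_table_rating
                  + (1 - medical_share) * nonmedical_relative)
    tobacco = nontobacco * tobacco_multiple
    return (1 - tobacco_share) * nontobacco + tobacco_share * tobacco
```

The structure matters more than the numbers: each input is a lever the actuaries and underwriters can debate explicitly rather than burying the decline assumption in a single picked factor.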
I’d like to turn my attention now to assigning the “actual” audit results, specifically as it relates to post-issue APS audits. In determining the mortality impact as I’ve defined it, we have an actual and predicted relative mortality. Relative mortalities for “actual” decisions are grounded in the risk class decisions associated with how underwriting has been performed historically for those classes. For full underwriting at the ages and face amounts typical of AU, these decisions are usually based on a paramedical exam and an insurance lab panel. An APS, while possible, is not a typical requirement for the core ages and amounts associated with AU.
With this understanding, the question becomes, can an underwriter recreate the decision they would make with a paramedical exam and labs using just an APS? For the purposes of calculating our mortality impact, we assume this to be yes. In practice, however, we know that an APS could have more information or less information than a traditional exam and labs. This will lead to discrepancies between what the “actual” decision would be using traditional evidence vs. the “actual” decision using the APS. One saving grace here is that these discrepancies tend to go both ways, and in total, they could net out to no impact overall. For this assumption to have the best chance at holding together, underwriters performing audits should use the APS to re-underwrite the case using all the information available to them. If post-issue APS audits are exclusively used to check for material misrepresentation, then it is less likely that audit results would tie back to the actual class needed to calculate the mortality impact.
Lastly, I’d like to talk about adjustments to mortality impact that occur on top of the impact from misclassification. Tim Morant and Philip Janz wrote an article titled “The Impact on Relative Mortality and Prevalence from Triage in an Accelerated Underwriting Program” in the July 2019 issue of Reinsurance News. In this article, they illustrate how some tools, like credit-based mortality scores, can identify risks that have better mortality outcomes relative to their risk class. For example, preferred risks with a score below a given score threshold may exhibit mortality outcomes of 90 percent of all preferred risks, while preferred risks over that threshold would exhibit mortality outcomes that are 120 percent of all preferred risks (see Graph 3 from their article for a clear example of this). If this threshold is part of how applicants are selected to be accelerated, should we include the impact from these scores in the mortality impact calculation? One item to note is that the two impacts together should balance out in total. The threshold is just a way of further segmenting the policies within a given risk class; we have not created or removed any mortality just by dividing up the preferred class (although this may change if placement rates vary by accelerated status). If you do choose to include this type of impact, you could simply multiply the actual relative mortality by the anticipated adjustment related to the risk selection tool. The important thing to recognize, however, is that you must reflect both sides of this impact, the upside and downside. In the case of audit results, we’re typically only thinking about the upside, or applying a discount based on selecting the best risks for acceleration. If our estimate of the mortality impact from the audits includes this adjustment, we must then apply the residual load to our expectation for cases that are not accelerated.
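A small sketch of the balancing argument, using factors that mirror the 90 percent / 120 percent example above (the share of risks below the threshold is an assumed illustrative value chosen so the pieces reconcile):

```python
# Illustrative split of a preferred class by a credit-based score threshold.
below_share = 2 / 3    # assumed share of preferred risks below the threshold
below_factor = 0.90    # mortality of below-threshold risks vs. class average
above_factor = 1.20    # mortality of above-threshold risks vs. class average

# Segmenting the class creates or removes no mortality: the weighted
# average of the two segments must reproduce the class average (100%).
class_average = below_share * below_factor + (1 - below_share) * above_factor

# If the discount is reflected on accelerated cases, the residual load
# must be applied to cases that are not accelerated.
preferred_relative = 0.85                          # illustrative class relative mortality
accelerated = preferred_relative * below_factor    # discounted best risks
not_accelerated = preferred_relative * above_factor  # residual load
```

If you apply the 90 percent discount to the accelerated block but never apply the 120 percent load to the rest, you have effectively claimed a mortality improvement out of thin air.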
Mortality impact as I’ve discussed here is very dependent on the relative mortalities that have been cultivated through years of fully underwritten business. These relative mortalities are based on the average result of any given risk class, and there can be a range of mortality outcomes within any given risk class. With underwriting programs adopting new evidence such as medical claims, clinical labs, electronic health records, and who knows what else, will we start to shift our underwriting outcomes such that the prevailing risk class decisions no longer align with those that created the historical relative mortality outcomes? Put another way, what if the evidence used in a new process can better classify mortality compared to our prior underwriting practices, thus narrowing, or even shifting, the range of results for a given class? If this is the case, how do we determine relative mortality and mortality impacts for new underwriting programs? Personally, I believe that this is a likely outcome, and just as actuaries borrowed the confusion matrix to assist with accelerated underwriting, we will need to adapt and create new ways to solve this evolving challenge.
Posted with permission of the ©Society of Actuaries, Schaumburg, Illinois.