Underwriting
  • Articles
  • June 2026

RGA Validation: Milliman Irix® Risk Score 4.0

By
  • Laiping Wong-Stewart
  • Dr. Guizhou Hu
  • Hezhong (Mark) Ma
In Brief
RGA’s independent validation finds that Milliman’s Irix® Risk Score 4.0 delivers stronger mortality segmentation and improved calibration accuracy compared with version 3.0. The analysis confirms that combining Prescription Data, Medical Data, and Credit Data enhances predictive power, while also highlighting structural changes in score distribution and performance across age, duration, and use cases.

Key takeaways

  • Milliman Risk Score version 4.0 more effectively distinguishes mortality risk across applicants and aligns predicted risk more closely with actual outcomes than version 3.0.
  • Combining Prescription Data, Medical Data, and Credit Data consistently enhances mortality segmentation, with the most comprehensive scores delivering the strongest results – especially in younger age groups.
  • Changes in score distributions mean legacy cutoffs from version 3.0 may not translate directly, potentially affecting class size and risk selection if not adjusted. 

Milliman’s Irix® Risk Score (mortality) is a predictive model based on Prescription Data (Rx) that quantifies the relative mortality of life insurance applicants. The model’s Rx-based input can be combined with Medical Data (Rx, Dx), Credit Data (Rx, Cr), or both (Rx, Dx, Cr) to produce a single multivariate score. Milliman recently released Risk Score 4.0, which includes a Credit Data-only score (Cr). RGA conducted an independent analysis of the five risk scores’ performance using data provided by Milliman.

Study Data 

Milliman provided the dataset, which includes five types of risk scores for 57.6 million individual insurance applicants ages 18 to 85 whose application dates span 2005 to 2024. The data includes the following lines of business: 

  • Life (62% of the dataset)  
  • Health (11%)  
  • Final expense (7%) 
  • Medicare supplement (6%)  
  • Long-term care (3%)  
  • Disability (1%) 
  • Other (10%) 

The cohort was followed through September 2024, yielding 383 million exposure years and 3.1 million deaths identified through the Social Security Master Index and third-party death records. Unlike the data behind RGA’s previous series of white papers validating Milliman Risk Scores, this dataset also includes Credit Data-only scores and a nicotine status flag, AnyNicotine. Milliman indicated that some or all of these datasets may have been used in score development, which should be considered when interpreting validation results. 

Mortality was measured using the actual-to-expected (AE) ratio. Expected mortality was based mostly on U.S. population tables, although certain versions of the 2015 VBT tables were also evaluated. To account for underreporting of deaths in the data, the analysis used relative mortality, or Relative AE, defined as the raw AE of a group divided by the AE of the total population. 
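The Relative AE definition above can be sketched in a few lines of Python. This is a minimal illustration rather than RGA's actual calculation: the function names and all death counts are hypothetical, and the comment about underreporting assumes deaths are underreported at the same rate in every group.

```python
def ae_ratio(actual_deaths, expected_deaths):
    """Raw actual-to-expected (AE) ratio for a group."""
    return actual_deaths / expected_deaths


def relative_ae(group_actual, group_expected, total_actual, total_expected):
    """Raw AE of a group divided by the raw AE of the total population."""
    return ae_ratio(group_actual, group_expected) / ae_ratio(total_actual, total_expected)


# Hypothetical counts for illustration only (not taken from the study).
total_actual, total_expected = 800.0, 1000.0   # total population raw AE = 0.80
group_actual, group_expected = 120.0, 300.0    # subgroup raw AE = 0.40

rel = relative_ae(group_actual, group_expected, total_actual, total_expected)
print(f"Relative AE: {rel:.2f}")

# If the same fraction of deaths is missing from every group (an assumption),
# the missing fraction cancels out of the ratio. Halving all death counts
# gives the same Relative AE.
rel_half = relative_ae(group_actual / 2, group_expected, total_actual / 2, total_expected)
print(f"Relative AE with half the deaths recorded: {rel_half:.2f}")
```

Because the numerator and denominator are scaled by the same factor, a uniform level of underreporting leaves Relative AE unchanged, which is one way to read why a relative measure is used here.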


Hit rate and mortality of no-hit 

Due to the methodological shift in Prescription Data used in the study, Milliman stopped providing “Eligible only” as an Rx-hit type in the current dataset. Compared with the dataset used in RGA’s 2020 validation study of Risk Score 3.0, both Rx (excluding eligibility only) and Dx hit rates increased in the current dataset. 

Interestingly, the current dataset shows lower mortality in the Rx no-hit population than in the Rx-hit population. This may reflect changes in how Rx data was compiled between the two datasets. Applicants from 2020 onward have a higher AE, most likely due to the COVID-19 pandemic, and within that cohort those with Rx hits have higher Relative AE than those without. In addition, some cases prior to 2020 were reclassified from no-hit to hit, and those cases likely have higher mortality than cases that were already a hit in the 2020 study. Carriers need to evaluate the expected Rx hit rate in their production cases, along with the distribution of Rx histories and scores, to make sure their mortality expectation is aligned with the population in this study.

Milliman ordered Credit Data on every adult for this study; therefore, “Credit Not Ordered” is no longer a hit type. 

Nicotine flags 

For each case, Milliman provided a nicotine indicator, AnyNicotine, which is consistent with Irix Rules with ICG1 = B. In the study, 8.6% of the population was flagged as ever-nicotine users. The AnyNicotine indicator was developed using a seven-year look-back period following receipt of the Medical and/or Prescription Data order from carriers. Milliman also noted that the study dataset included a lower prevalence of nicotine users than observed in production data, where more than 18% of Medical and/or Prescription Data queries include a nicotine indicator. Cases with a nicotine indicator have more than a two-fold increase in mortality, compared to those without the indicator. They also tend to have higher risk scores. Table 2 compares the average risk scores for cases with and without the nicotine indicators.  

Assessing the mortality segmentation of risk scores 

The primary purpose of these scores is risk selection and classification. In this context, a better-performing score is one that provides stronger mortality segmentation. 

Figure 1 illustrates cumulative mortality by percentile for Risk Score 4.0 (RxDx) compared with Risk Score 3.0 (RxDx). Better-performing scores will have lower mortality at the same percentile. A horizontal comparison indicates that the 4.0 score qualifies a larger share of cases at the same mortality level. A vertical comparison indicates lower mortality for the 4.0 score at the same share of qualified cases. 

For example, as shown in Figure 1, at a 30% straight-through processing (STP) target, cumulative mortality is 29% for Risk Score 4.0 compared with 32% for Risk Score 3.0. At a cutoff corresponding to 50% cumulative mortality, Risk Score 4.0 qualifies 67% of applications, compared with 64% for Risk Score 3.0. Although not shown here, the analysis found similar improvements from version 3.0 to version 4.0 for all five sets of risk scores, not just RxDx. 
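To make the vertical and horizontal readings of Figure 1 concrete, the sketch below builds a cumulative mortality curve from scored records. It assumes that "cumulative mortality" means the Relative AE of all cases at or below a given score percentile, which the article implies but does not state. The function names and the four sample records are hypothetical.

```python
def cumulative_relative_ae(records):
    """records: list of (score, actual_deaths, expected_deaths), one per case
    or per equal-sized bucket. Returns (share_of_cases, cumulative Relative AE)
    pairs, ordered from the lowest score upward."""
    ordered = sorted(records, key=lambda r: r[0])
    total_ae = sum(r[1] for r in ordered) / sum(r[2] for r in ordered)
    curve = []
    cum_actual = cum_expected = 0.0
    for i, (_, actual, expected) in enumerate(ordered, start=1):
        cum_actual += actual
        cum_expected += expected
        curve.append((i / len(ordered), (cum_actual / cum_expected) / total_ae))
    return curve


def mortality_at_share(curve, share):
    """Vertical reading: cumulative mortality once `share` of cases qualify."""
    for s, m in curve:
        if s >= share:
            return m
    return curve[-1][1]


def share_at_mortality(curve, level):
    """Horizontal reading: largest share of cases with cumulative mortality <= level."""
    qualifying = [s for s, m in curve if m <= level]
    return max(qualifying) if qualifying else 0.0


# Hypothetical equal-sized buckets: (score, actual deaths, expected deaths).
records = [(0.2, 1.0, 10.0), (0.5, 2.0, 10.0), (1.0, 4.0, 10.0), (2.0, 13.0, 10.0)]
curve = cumulative_relative_ae(records)

print(mortality_at_share(curve, 0.50))   # mortality of the best-scoring half
print(share_at_mortality(curve, 0.50))   # share qualifying at 50% cumulative mortality
```

Running the same functions on two score versions and comparing the results at a fixed share, or at a fixed mortality level, mirrors the two comparisons described for Figure 1.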

 

Score performance by sex and age group 

Table 3 summarizes mortality segmentation with relative mortality at the 30th percentile across various scores by age group and sex, along with the improvement from version 3.0 to version 4.0. For illustrative purposes, the 30th percentile serves as an approximation of a cutoff level that carriers may consider when identifying a preferred class. Scores that better identify the best mortality risks within the lowest-scoring 30% of cases will have lower relative mortality in Table 3. 

The key findings are as follows: 

  • Adding Dx to Rx, or adding Cr to Rx and Dx, improves mortality segmentation across all demographic groups. 
  • Score performance is broadly similar between males and females. 
  • All scores perform best in the 36–55 age group, followed by ages 56–65 and 18–35, and worst at ages 66+. 
  • The largest performance gain is observed in the 18–35 age group when Cr is added to RxDx, reducing relative mortality from 31% to 26%, a five-percentage-point improvement. 

Notable improvements from version 3.0 to version 4.0 include: 

  • All groups show improvement. This suggests that more experience and better models could increase segmentation power without introducing entirely new types of consumer data. 
  • Scores that already include Cr, such as RxCr and RxDxCr, show less improvement than Rx only and RxDx scores. 
  • Improvement is greater for males than for females. 
  • The largest gains occur at younger ages, with improvement declining as age increases. 

Score performance by duration 

In Figure 2, mortality segmentation is measured by the mortality ratio between the top and bottom deciles of the risk score. The mortality segmentation of all risk scores gradually declines with duration, and the pattern of decline is similar for versions 3.0 and 4.0. These findings suggest that the interpretation of risk scores depends on duration. 

Carriers should therefore use caution when applying the mortality segmentation observed in this study to their own populations, particularly when drawing pricing-related conclusions. It is worth noting that, even after 16 years, positive segmentation power remains evident. 
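The decile-based segmentation measure used for Figure 2 can be sketched as follows. This assumes that the "top" decile is the highest-scoring 10% of records and the "bottom" decile is the lowest-scoring 10%. The duration labels and all records are hypothetical.

```python
def decile_mortality_ratio(records):
    """records: list of (score, actual_deaths, expected_deaths) for one duration band.
    Returns the AE of the highest-score decile divided by the AE of the
    lowest-score decile. Fails with ZeroDivisionError if the lowest decile
    has no recorded deaths."""
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered) // 10
    if size == 0:
        raise ValueError("need at least 10 records to form deciles")

    def ae(group):
        return sum(r[1] for r in group) / sum(r[2] for r in group)

    return ae(ordered[-size:]) / ae(ordered[:size])


# Hypothetical data: ten records per duration band, expected deaths of 10 each.
by_duration = {
    "early durations": [(i / 10, float(i), 10.0) for i in range(1, 11)],
    "late durations": [(i / 10, float(i + 4), 10.0) for i in range(1, 11)],
}

ratios = {label: decile_mortality_ratio(recs) for label, recs in by_duration.items()}
for label, ratio in ratios.items():
    print(f"{label}: top-to-bottom decile mortality ratio = {ratio:.1f}")
```

In this made-up data the ratio falls from the early band to the late band while staying above 1, which is the shape of result the article describes: weaker but still positive segmentation at longer durations.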

Assessing the mortality prediction accuracy of risk scores 

Risk score accuracy has two dimensions: 

  1. Discrimination accuracy, or mortality segmentation, which measures how well a score separates mortality risk 
  2. Calibration accuracy, which measures how closely predicted risk aligns with actual mortality 

As discussed in the previous section, discrimination accuracy is particularly relevant for underwriting. Calibration accuracy, by contrast, is most relevant for making mortality inferences – for example, whether a 10% difference in risk score corresponds to an approximately 10% difference in mortality between two groups. 

Figure 3 presents the calibration accuracy of two scores – RxDx and RxDxCr – across both versions, 3.0 and 4.0. It shows the percentage difference between actual mortality (relative AE) and predicted risk (the score) across the full score percentile range. When actual mortality is higher than the predicted mortality, the point lies above the 0% line; when actual mortality is lower than predicted mortality, the point lies below the 0% line. 

In Figure 3, both version 3.0 scores show points above the 0% line at lower-risk percentiles and below the 0% line at higher-risk percentiles, creating a downward trend. This pattern indicates that version 3.0 tends to understate mortality for lower-risk cases and overstate mortality for higher-risk cases. By contrast, the version 4.0 scores are generally closer to the 0% line and appear relatively flat across percentiles, indicating improved calibration accuracy compared with version 3.0. 
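As a rough illustration of the quantity plotted in Figure 3, the sketch below computes the percentage difference between actual and predicted mortality for a score bucket. The formula, (Relative AE minus score) divided by score, is an assumption based on the description above, not a formula published in the article. The bucket labels and values are hypothetical.

```python
def calibration_gap(bucket_relative_ae, bucket_mean_score):
    """Percentage difference between actual mortality (Relative AE) and
    predicted risk (mean score) for one percentile bucket.
    Positive: the score understates mortality. Negative: it overstates it."""
    return (bucket_relative_ae - bucket_mean_score) / bucket_mean_score


# Hypothetical buckets: (label, Relative AE, mean score).
buckets = [
    ("lower-risk percentiles", 0.33, 0.30),
    ("middle percentiles", 1.00, 1.00),
    ("higher-risk percentiles", 1.80, 2.00),
]

gaps = {label: calibration_gap(rel_ae, score) for label, rel_ae, score in buckets}
for label, gap in gaps.items():
    print(f"{label}: {gap:+.0%}")
```

The made-up values reproduce the downward pattern described for version 3.0: a positive gap at lower-risk percentiles and a negative gap at higher-risk percentiles. A well-calibrated score would keep every bucket close to 0%.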

As noted above, the mortality segmentation of the risk scores declines with duration, which inevitably causes calibration accuracy to vary over time. Accordingly, the calibration results shown in Figure 3 should be interpreted in the context of the study dataset used in this analysis. 

Score distribution shift between 3.0 and 4.0 

Carriers currently using Risk Score 3.0 and considering an upgrade to version 4.0 should recognize that applying the same score cutoff may lead not only to a different mortality impact, as discussed above, but also to a different class size because of the shift in the score distribution. Using a 0.3 cut point as an example, Table 4 below shows the percentage of cases that qualify at select cut points for the various scores. Note that the risk scores were rounded to the nearest tenth before comparison with the cut point. 
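A carrier could check class size at a given cut point with a calculation like the one sketched below. Two points are assumptions rather than details from the study: a case "qualifies" when its rounded score is at or below the cut point, and ties are rounded half up. Both score lists are hypothetical and do not represent either score version.

```python
from decimal import Decimal, ROUND_HALF_UP


def share_qualifying(scores, cut_point="0.3"):
    """Share of cases whose score, rounded to the nearest tenth, is at or
    below the cut point. Decimal avoids binary floating-point surprises
    when rounding values such as 0.35."""
    cut = Decimal(cut_point)
    tenth = Decimal("0.1")
    rounded = [Decimal(str(s)).quantize(tenth, rounding=ROUND_HALF_UP) for s in scores]
    return sum(r <= cut for r in rounded) / len(rounded)


# Hypothetical score samples: one spread widely, one concentrated in the middle.
scores_wide = [0.21, 0.28, 0.34, 0.36, 0.52, 0.75, 1.10, 1.80]
scores_concentrated = [0.33, 0.38, 0.41, 0.47, 0.55, 0.70, 0.95, 1.40]

share_wide = share_qualifying(scores_wide)
share_concentrated = share_qualifying(scores_concentrated)
print(f"wide distribution: {share_wide:.1%} qualify at 0.3")
print(f"concentrated distribution: {share_concentrated:.1%} qualify at 0.3")
```

With the same 0.3 cut point, the two hypothetical distributions produce very different class sizes, so a legacy cutoff should be re-tested whenever the score distribution changes.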

  

This pattern also highlights a broader difference in score behavior across versions. Adding more data to a score – for example, Dx to Rx or Cr to RxDx – generally widens the score distribution, allowing more cases to qualify at lower score thresholds. However, version 4.0 score distributions are more concentrated in the middle range than their counterparts in version 3.0. This reflects the refinement of version 4.0 to improve calibration accuracy. 

Carriers should assess how a score version update may affect class size within their own portfolios. The percentages shown in Table 4 may not reflect a carrier’s actual business, as results can vary with differences in population mix. 

Hypothetical business case 

Many carriers use risk-based scores as rule-out criteria: if an applicant’s score exceeds a certain cut point, the case can be declined or become ineligible for accelerated underwriting. This makes sense because the mortality rate is generally low, and the overall mortality of a pool can be driven by a small percentage of people with very poor mortality risks. However, that kind of use case does not apply the segmentation power of risk-based scores across the entire spectrum of mortality risks. 

Carriers can further use the segmentation power of these risk-based scores to separate preferred risks from a group consistent with a standard class in full medical underwriting. To represent that standard class, this analysis carved out a population with target mortality close to 100% of the 15 VBT AGG table. The criteria were ages 18-60, a line of business labeled as life, no nicotine flags, and no Prescription Data flagged as “red.” Nicotine effects were assumed to be addressed separately.  

For illustrative purposes, the analysis assumed a 14% underreporting rate for deaths. Under that assumption, the population with an RxDx 4.0 score below 0.7 has mortality of approximately 100% of 15 VBT AGG. Further, the lowest-scoring 40% of the population in the study was considered the preferred class, which translates to a cut point of 0.370. The mortality ratio of the preferred class relative to the standard class is 62%, which is within a reasonable range for a typical life insurance underwriting program. Risk-based scores could therefore potentially help with preferred underwriting classification. 
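The arithmetic behind this business case can be sketched as below. The article does not spell out how the 14% assumption is applied, so the gross-up shown here (observed deaths divided by one minus the underreporting rate) is an assumption. Treating the standard class as the whole carved-out pool is also an assumption. The death counts are hypothetical and are not intended to reproduce the study's 62% result.

```python
UNDERREPORTING_RATE = 0.14  # illustrative assumption stated in the article


def adjusted_ae(observed_deaths, expected_deaths, underreporting=UNDERREPORTING_RATE):
    """AE ratio after grossing up observed deaths for underreporting.
    Assumes observed deaths = true deaths * (1 - underreporting)."""
    true_deaths = observed_deaths / (1.0 - underreporting)
    return true_deaths / expected_deaths


# Hypothetical counts; expected deaths are on the chosen table basis.
pool_observed, pool_expected = 860.0, 1000.0           # carved-out standard pool
preferred_observed, preferred_expected = 129.0, 250.0  # lowest-scoring segment

pool_ae = adjusted_ae(pool_observed, pool_expected)
preferred_ae = adjusted_ae(preferred_observed, preferred_expected)
preferred_to_standard = preferred_ae / pool_ae

print(f"standard pool AE after adjustment: {pool_ae:.0%}")
print(f"preferred AE after adjustment: {preferred_ae:.0%}")
print(f"preferred relative to standard: {preferred_to_standard:.0%}")
```

Under this sketch the underreporting assumption affects the absolute AE levels, and therefore which score threshold lines up with 100% of the table, but it cancels out of the preferred-to-standard ratio.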

Certainly, this finding should not replace a retrospective study conducted by an individual carrier to fully assess how the score compares with conventional underwriting. RGA is experienced in assisting clients with deeper and more contextual analysis. 

Summary and limitations 

This analysis indicates that, across demographic segments, 4.0 scores generally differentiate between lower- and higher-risk lives more effectively than the prior version, although performance remains strongest in middle ages and weaker at older ages. Adding Dx to Rx, and Cr to RxDx, is associated with improved segmentation, while the AnyNicotine flag is associated with materially higher mortality and higher average risk scores. 

The hypothetical business case further suggests that Risk Score 4.0 (RxDx) can identify a segment with mortality broadly consistent with a typical preferred class in a simulated population. 

These findings should be interpreted in context. Score performance declines with duration, and calibration may vary depending on the duration mix of the population. Changes in data compilation also limit comparability with prior studies, including differences in hit types, hit rates, and the observed relationship between hit status and mortality. 

In addition, score distributions differ meaningfully between versions 3.0 and 4.0, so legacy cut points may not translate directly to the new model. 

The source data provided by Milliman may overlap with model development data and include limitations such as underreported deaths and assumptions used in the illustrative business case. Carriers should therefore evaluate performance and implementation impacts within their own portfolios before relying on these scores for underwriting or pricing decisions. 



Meet the Authors & Experts

  • Laiping Wong-Stewart, Vice President and Actuary, USIM
  • Dr. Guizhou Hu, Vice President, Head of Risk Analytics, Global Underwriting, Claims, and Medical
  • Hezhong (Mark) Ma, Vice President and Managing Actuary, USIM