Dealing with inconsistent results
Enterprise systems that use LLMs must be designed to mitigate the effects of inconsistent results. Companies should test these systems at two levels: the LLM’s actual outputs and the enterprise system’s overall accuracy.
Starting with overall accuracy, prompts should be designed to mitigate variance in answers as much as possible. Consider the following poorly conceived prompt:
Produce a one-sentence summary of life insurance underwriting concerns for the following medical document.
Running this prompt on a synthetic medical dataset produced the following two outputs in response to two successive queries:
- A 50-year-old male presents with moderately increased underwriting risk due to uncontrolled hypertension, early type 2 diabetes (A1c 6.6%), overweight status, and suboptimal medication adherence, but with preserved renal function, no cardiovascular events, and a favorable non-smoking history.
- A 50-year-old male presents with life insurance underwriting concerns due to uncontrolled hypertension with inconsistent medication adherence, newly diagnosed early type 2 diabetes (A1c 6.6%), overweight status, dyslipidemia, sedentary lifestyle, and a family history of cardiometabolic disease, collectively increasing long-term cardiovascular risk despite currently normal renal function and no end-organ damage.
Making the prompt more precise produces much more consistent results. The following prompt requests a list of medical codes and the top concerns associated with them:
Extract the most important life insurance underwriting concerns from the medical document. Output only CSV with three fields per row: Concern, ICD-10 Chapter, Severity (1–10). Use standardized medical terms, rank by severity (highest first), and include only documented conditions or risk factors.
This prompt is considerably more robust: it describes exactly what is sought and the precise form in which it should be delivered. Among its benefits are more consistent results in the form of a simple list with three columns: the concern in standardized medical terms, its ICD-10 chapter, and the LLM’s assessment of severity.
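Because the prompt specifies an exact output contract, that contract can be checked programmatically before the output enters downstream systems. The following sketch is an illustration of how such a check might look – it is an assumption about the pipeline, not part of any specific product – validating the three-field rows, the 1–10 severity range, and the highest-first ordering:

```python
import csv
import io

def validate_underwriting_csv(text: str) -> list[dict]:
    """Parse the LLM's CSV output and enforce the prompt's contract:
    three fields per row, severity an integer 1-10, and rows ranked
    by severity (highest first)."""
    rows = []
    for fields in csv.reader(io.StringIO(text.strip())):
        if len(fields) != 3:
            raise ValueError(f"Expected 3 fields, got {len(fields)}: {fields}")
        concern, chapter, severity = (f.strip() for f in fields)
        sev = int(severity)  # raises ValueError if not an integer
        if not 1 <= sev <= 10:
            raise ValueError(f"Severity out of range: {sev}")
        rows.append({"concern": concern, "icd10_chapter": chapter, "severity": sev})
    # Verify the highest-first ranking the prompt requested.
    if any(a["severity"] < b["severity"] for a, b in zip(rows, rows[1:])):
        raise ValueError("Rows are not ranked by severity (highest first)")
    return rows
```

A failed check can trigger a retry of the LLM query, which is far easier to automate than re-reading free-form summary sentences.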
Severity scores will likely contain the most variance. In practice, insurers would likely dedicate an entire prompt to this severity ranking, with very detailed criteria – if they trust LLMs to rank severity at all. That said, even when using only the extracted conditions and not the assessment, severity scores can still help rank the LLM-extracted data.
From here, testing the system for accuracy can proceed more efficiently, focusing less on the output of individual LLM queries.
Do two different answers mean the same thing?
Determining the similarity of two responses requires testing an LLM’s actual output. From a quick read-through, the two responses in the previous section – elicited by an ill-conceived prompt – appear quite similar. However, some important, although relatively minor, differences become evident upon closer inspection.
The key is to apply a method to compare LLM responses and quantify their differences. Rather than simply comparing individual words, several mathematical approaches can be used. These include measuring the degree of semantic overlap between the two responses using vector similarity techniques or checking whether one response logically implies or contradicts the other.
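As a minimal illustration of the vector-similarity idea, the cosine of the angle between two embedding vectors yields a semantic-overlap score, with 1.0 meaning the responses point in the same semantic direction. The `embed()` call referenced in the comment is a hypothetical stand-in for whatever embedding model a team actually uses:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 means the
    vectors point in the same direction; 0.0 means they are orthogonal
    (semantically unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice the vectors come from an embedding model; embed() below
# is a hypothetical placeholder for such a call:
#   vec_a, vec_b = embed(response_a), embed(response_b)
#   score = cosine_similarity(vec_a, vec_b)
```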
A relatively new approach to measuring response similarity is to simply ask the LLM how similar two responses are. While this may seem like “asking the fox to guard the henhouse,” it has become one of the most common approaches. Consider this prompt:
Compare the following two LLM responses. Focus only on semantic meaning and factual content. Ignore wording and style. Note any clinically relevant information that is added, missing, or changed.
Output exactly:
[similarity score 0–10, where 10 = identical], one sentence describing the key difference(s). Do not add commentary.
For the same two responses examined above, this prompt produced the following:
7, Response B adds dyslipidemia, sedentary lifestyle, and family history while omitting the favorable non-smoking history and explicit absence of cardiovascular events noted in Response A.
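Because the judge prompt demands an exact output shape, its response can be parsed mechanically rather than read by a human. The following is a minimal sketch, assuming the `[score], one sentence` format requested above:

```python
import re

def parse_judge_output(text: str) -> tuple[int, str]:
    """Parse the judge prompt's expected output:
    '[similarity score 0-10], one sentence describing the key difference(s)'."""
    match = re.match(r"\s*(\d{1,2})\s*,\s*(.+)", text.strip(), re.DOTALL)
    if not match:
        raise ValueError(f"Unexpected judge output: {text!r}")
    score = int(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"Score out of range: {score}")
    return score, match.group(2).strip()
```

Strict parsing has a second benefit: if the judge model drifts into free-form commentary, the failure is caught immediately instead of silently skewing a similarity metric.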
How could two responses from the same model version produce a score of only 7/10? While this might be quickly chalked up to hallucination, something else is going on: the original imprecise prompt is the culprit.
The second, improved prompt consistently produced results with a score of 9 or 10. If an imprecise prompt already introduces a 7/10 level of variance, future testing becomes very difficult: the variance introduced by an entirely new model or version cannot be separated from the variance inherent in the prompt itself.
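One way to quantify this is to run the same prompt several times and summarize the pairwise judge scores; comparing these summaries across prompt revisions or model versions makes the improvement measurable. The sketch below assumes a `judge(a, b)` function that wraps a comparison prompt like the one above and returns its 0–10 score:

```python
import statistics
from itertools import combinations

def prompt_variance(responses: list[str], judge) -> dict:
    """Summarize pairwise similarity for N responses to the same prompt.

    `judge(a, b)` is assumed to return the 0-10 similarity score produced
    by an LLM comparison prompt. Higher mean and min scores, and a lower
    standard deviation, indicate a more consistent prompt."""
    scores = [judge(a, b) for a, b in combinations(responses, 2)]
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

Tracking these summaries over time also gives an audit trail: when a model version changes, a drop in the mean or minimum score flags the change before it reaches production decisions.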
These types of variance comparisons are important when designing prompts, and output should be evaluated regularly to determine whether the prompts need fine-tuning.
Conclusion
LLMs can be powerful tools within a regulated insurance system, but their inherent variability means accuracy, consistency, and auditability must be engineered into the system rather than assumed from the model. Precise prompt design and systematic variance testing are essential foundations for reliably evaluating LLM-based systems over time, especially as models evolve and are replaced.
Contact RGA today to learn more about how an AI-driven partnership can benefit your business.