Dealing with inconsistent results
Enterprise systems that use LLMs must be designed to mitigate the effects of inconsistent results. Companies should test these systems at two levels: the LLM’s actual outputs and the enterprise system’s overall accuracy.
Starting with overall accuracy, prompts should be designed to mitigate variance in answers as much as possible. Consider the following poorly conceived prompt:
Produce a one-sentence summary of life insurance underwriting concerns for the following medical document.
Running this prompt on a synthetic medical dataset produced the following two outputs in response to two successive queries:
- A 50-year-old male presents with moderately increased underwriting risk due to uncontrolled hypertension, early type 2 diabetes (A1c 6.6%), overweight status, and suboptimal medication adherence, but with preserved renal function, no cardiovascular events, and a favorable non-smoking history.
- A 50-year-old male presents with life insurance underwriting concerns due to uncontrolled hypertension with inconsistent medication adherence, newly diagnosed early type 2 diabetes (A1c 6.6%), overweight status, dyslipidemia, sedentary lifestyle, and a family history of cardiometabolic disease, collectively increasing long-term cardiovascular risk despite currently normal renal function and no end-organ damage.
Making the prompt more precise produces much more consistent results. The following prompt requests a list of medical codes and the top concerns associated with them:
Extract the most important life insurance underwriting concerns from the medical document. Output only CSV with three fields per row: Concern, ICD-10 Chapter, Severity (1–10). Use standardized medical terms, rank by severity (highest first), and include only documented conditions or risk factors.
This prompt is considerably more robust: it describes exactly what is sought and the precise form in which it should be delivered. Among its benefits are more consistent results in the form of a simple list with three columns: the concern in standardized medical terms, its ICD-10 chapter, and the LLM’s assessment of severity.
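Because the prompt specifies an exact output contract, that contract can be checked programmatically before the output enters downstream systems. The following sketch is an illustration of how such a check might look – it is an assumption about the pipeline, not part of any specific product – validating the three-field rows, the 1–10 severity range, and the highest-first ordering:

```python
import csv
import io

def validate_underwriting_csv(text: str) -> list[dict]:
    """Parse the LLM's CSV output and enforce the prompt's contract:
    three fields per row, severity an integer 1-10, and rows ranked
    by severity (highest first)."""
    rows = []
    for fields in csv.reader(io.StringIO(text.strip())):
        if len(fields) != 3:
            raise ValueError(f"Expected 3 fields, got {len(fields)}: {fields}")
        concern, chapter, severity = (f.strip() for f in fields)
        sev = int(severity)  # raises ValueError if not an integer
        if not 1 <= sev <= 10:
            raise ValueError(f"Severity out of range: {sev}")
        rows.append({"concern": concern, "icd10_chapter": chapter, "severity": sev})
    # Verify the highest-first ranking the prompt requested.
    if any(a["severity"] < b["severity"] for a, b in zip(rows, rows[1:])):
        raise ValueError("Rows are not ranked by severity (highest first)")
    return rows
```

A failed check can trigger a retry of the LLM query, which is far easier to automate than re-reading free-form summary sentences.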
Severity scores will likely contain the most variance. In practice, insurers would likely dedicate an entire prompt to this severity ranking, with very detailed criteria – if they trust LLMs to rank severity at all. That said, even when using only the extracted conditions and not the assessment, severity scores can still help rank the LLM-extracted data.
From here, testing the system for accuracy can proceed more efficiently, focusing less on the output of individual LLM queries.
Do two different answers mean the same thing?
Determining the similarity of two responses requires testing an LLM’s actual output. From a quick read-through, the two responses in the previous section – elicited by an ill-conceived prompt – appear quite similar. However, some important, although relatively minor, differences become evident upon closer inspection.
The key is to apply a method to compare LLM responses and quantify their differences. Rather than simply comparing individual words, several mathematical approaches can be used. These include measuring the degree of semantic overlap between the two responses using vector similarity techniques or checking whether one response logically implies or contradicts the other.
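As a minimal illustration of the vector-similarity idea, the cosine of the angle between two embedding vectors yields a semantic-overlap score, with 1.0 meaning the responses point in the same semantic direction. The `embed()` call referenced in the comment is a hypothetical stand-in for whatever embedding model a team actually uses:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 means the
    vectors point in the same direction; 0.0 means they are orthogonal
    (semantically unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice the vectors come from an embedding model; embed() below
# is a hypothetical placeholder for such a call:
#   vec_a, vec_b = embed(response_a), embed(response_b)
#   score = cosine_similarity(vec_a, vec_b)
```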
A relatively new approach to measuring response similarity is to simply ask the LLM how similar two responses are. While this may seem like “asking the fox to guard the henhouse,” it has become one of the most common approaches. Consider this prompt:
Compare the following two LLM responses. Focus only on semantic meaning and factual content. Ignore wording and style. Note any clinically relevant information that is added, missing, or changed.
Output exactly:
[similarity score 0–10, where 10 = identical], one sentence describing the key difference(s). Do not add commentary.
For the same two responses examined above, this prompt produced the following:
7, Response B adds dyslipidemia, sedentary lifestyle, and family history while omitting the favorable non-smoking history and explicit absence of cardiovascular events noted in Response A.
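Because the judge prompt demands an exact output shape, its response can be parsed mechanically rather than read by a human. The following is a minimal sketch, assuming the `[score], one sentence` format requested above:

```python
import re

def parse_judge_output(text: str) -> tuple[int, str]:
    """Parse the judge prompt's expected output:
    '[similarity score 0-10], one sentence describing the key difference(s)'."""
    match = re.match(r"\s*(\d{1,2})\s*,\s*(.+)", text.strip(), re.DOTALL)
    if not match:
        raise ValueError(f"Unexpected judge output: {text!r}")
    score = int(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"Score out of range: {score}")
    return score, match.group(2).strip()
```

Strict parsing has a second benefit: if the judge model drifts into free-form commentary, the failure is caught immediately instead of silently skewing a similarity metric.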
How could two responses from the same model version produce a score of only 7/10? While this might be quickly chalked up to hallucination, something else is going on: the original imprecise prompt is the culprit.
The second, improved prompt consistently produced results with a score of 9 or 10. If an imprecise prompt already introduces a 7/10 level of variance, future testing becomes very difficult: the variance introduced by an entirely new model or version cannot be separated from the variance inherent in the prompt itself.
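One way to quantify this is to run the same prompt several times and summarize the pairwise judge scores; comparing these summaries across prompt revisions or model versions makes the improvement measurable. The sketch below assumes a `judge(a, b)` function that wraps a comparison prompt like the one above and returns its 0–10 score:

```python
import statistics
from itertools import combinations

def prompt_variance(responses: list[str], judge) -> dict:
    """Summarize pairwise similarity for N responses to the same prompt.

    `judge(a, b)` is assumed to return the 0-10 similarity score produced
    by an LLM comparison prompt. Higher mean and min scores, and a lower
    standard deviation, indicate a more consistent prompt."""
    scores = [judge(a, b) for a, b in combinations(responses, 2)]
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

Tracking these summaries over time also gives an audit trail: when a model version changes, a drop in the mean or minimum score flags the change before it reaches production decisions.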
These types of variance comparisons are important when designing prompts, and output should be evaluated regularly to determine whether the prompts need fine-tuning.
Conclusion
LLMs can be powerful tools within a regulated insurance system, but their inherent variability means accuracy, consistency, and auditability must be engineered into the system rather than assumed from the model. Precise prompt design and systematic variance testing are essential foundations for reliably evaluating LLM-based systems over time, especially as models evolve and are replaced.
Contact RGA today to learn more about how an AI-driven partnership can benefit your business.