Imagine we have a Large Language Model (LLM) that generates responses to prompts. We also have a perfect ground truth oracle G, which always knows the right answer.
Probabilistic Soundness
We want to make sure that if our test oracle O says the LLM’s answer is correct, it is actually correct most of the time — better than a coin flip (1/2) plus a little extra.
Example
1. Suppose the LLM generates the answer “Paris is the capital of France.”
2. The test oracle O evaluates this answer and says it is correct.
3. We want the probability that O is right to be greater than 1/2 + a little extra, as confirmed by G.
4. If this condition holds, O is considered to be sound.
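The steps above can be sketched as a toy simulation. Everything here is a hypothetical stand-in: `G` is a perfect oracle for one toy fact, `O` is a noisy oracle that agrees with `G` 80% of the time, and the epsilon of 0.1 is an illustrative choice, not a value from the text.

```python
import random

def G(answer):
    # Perfect ground-truth oracle: only this answer is correct in our toy world.
    return answer == "Paris is the capital of France."

def make_noisy_oracle(accuracy, rng):
    # O returns G's verdict with probability `accuracy`, otherwise flips it.
    def O(answer):
        verdict = G(answer)
        return verdict if rng.random() < accuracy else not verdict
    return O

def empirical_soundness(O, answers, rng, trials=10_000):
    # Estimate P(answer truly correct | O says correct) over repeated trials.
    o_correct = truly_correct = 0
    for _ in range(trials):
        answer = rng.choice(answers)
        if O(answer):
            o_correct += 1
            if G(answer):
                truly_correct += 1
    return truly_correct / o_correct if o_correct else 0.0

rng = random.Random(42)
O = make_noisy_oracle(accuracy=0.8, rng=rng)
answers = [
    "Paris is the capital of France.",
    "Lyon is the capital of France.",
]
epsilon = 0.1  # the "little extra" — an assumed margin
soundness = empirical_soundness(O, answers, rng)
print(soundness > 0.5 + epsilon)
```

With an 80%-accurate O, the estimated soundness lands near 0.8, comfortably above 1/2 + 0.1, so this toy O would count as probabilistically sound.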
Probabilistic Completeness
We also want to ensure that if the LLM’s answer is truly correct according to the ground truth oracle G, our test oracle O will recognize it as correct more often than just guessing randomly plus that extra bit.
Example
1. Suppose the LLM generates the answer “Paris is the capital of France,” which the ground truth oracle G confirms as correct.
2. We want our test oracle O to recognize this as correct more often than 1/2 + a little extra.
3. If this condition holds, O is considered to be complete.
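Completeness flips the conditioning: we only feed O answers that G has already confirmed as correct, and ask how often O recognises them. The sketch below reuses the same hypothetical noisy-oracle setup (80% accuracy, epsilon of 0.1 — both illustrative assumptions):

```python
import random

def make_noisy_oracle(accuracy, rng):
    # O returns the true verdict with probability `accuracy`, otherwise flips it.
    def O(true_verdict):
        return true_verdict if rng.random() < accuracy else not true_verdict
    return O

def empirical_completeness(O, trials=10_000):
    # Estimate P(O says correct | G says correct): every input is truly correct.
    recognised = sum(O(True) for _ in range(trials))
    return recognised / trials

rng = random.Random(0)
O = make_noisy_oracle(accuracy=0.8, rng=rng)
epsilon = 0.1  # the "little extra" — an assumed margin
completeness = empirical_completeness(O)
print(completeness > 0.5 + epsilon)
```

Here the estimate hovers near 0.8, above 1/2 + 0.1, so the same toy O would also count as probabilistically complete.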
The Problem
The Ground Truth Oracle is itself a myth in most contexts. In reality there is no single O: there are many Os, plural. So, if we as testers think about automating the testing of LLMs, it is a very intricate challenge. The challenge remains even when humans test LLMs, because by nature we prefer determinism.
The “little extra” is an equally big problem. How much extra is good enough?
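One hedged way to make “how much extra is good enough?” concrete is statistical: demand that the lower end of a confidence interval on O’s observed agreement rate clears 1/2 plus the chosen margin. The sketch below uses a simple Wald (normal-approximation) interval; the trial counts and epsilon values are illustrative assumptions, not from the text.

```python
import math

def lower_confidence_bound(successes, trials, z=1.96):
    # Wald lower bound for a binomial proportion (z=1.96 ≈ 95% confidence).
    p = successes / trials
    return p - z * math.sqrt(p * (1 - p) / trials)

def clears_margin(successes, trials, epsilon):
    # True only if the evidence supports agreement above 1/2 + epsilon.
    return lower_confidence_bound(successes, trials) > 0.5 + epsilon

print(clears_margin(80, 100, epsilon=0.1))   # 80/100 agreements: prints True
print(clears_margin(56, 100, epsilon=0.05))  # 56/100 agreements: prints False
```

The second case shows the catch: an oracle that is right 56% of the time is nominally above 1/2 + 0.05, but 100 trials are not enough evidence to claim it, so “good enough” depends on both the margin and the sample size.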
Testing of LLMs is a big, interesting challenge before us as a community. We need to re-look at test design as well as the key issue: the Oracle Problem.
