Quantifying Trust: Ensuring Reliability in LLM Summaries
Building text summarisers with LLMs is a fascinating project to take on. Compressing complex documents becomes almost effortless, but establishing and maintaining trust in such a system is still vital, especially in production settings. How do we know if we’ve picked the best model for the task? Can we establish a baseline before we attempt to fine-tune a model?
I was unfamiliar with the standard metrics for evaluating text data, and I quickly learned that the classical metrics in this field often fall apart when used to evaluate the “factuality” of long-document summaries. Modern techniques revolve around embeddings, which provide a way to quantify the semantic and conceptual meaning of text (typically compared via cosine similarity). But two sentences can contradict each other and still have a high cosine similarity. This problem becomes even more pronounced as the text grows longer and the embeddings lose their granular meaning.
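To see this concretely, here is a minimal sketch using sentence-transformers and the same all-MiniLM-L6-v2 model used later in this post; the example sentences are purely illustrative:
from sentence_transformers import SentenceTransformer, util
# A sentence and its direct contradiction.
a = "The company's revenue increased in 2023."
b = "The company's revenue decreased in 2023."
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, b], convert_to_tensor=True)
# The cosine similarity is typically high despite the contradiction, because the
# two sentences share almost all of their vocabulary and topic.
print(util.cos_sim(emb[0], emb[1]).item())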
Natural language inference (NLI) models provide a way to evaluate factuality, but they too fall apart when naively applied to long inputs, partly because anything beyond the model’s maximum sequence length gets truncated.
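On a short premise/hypothesis pair, though, an NLI cross-encoder works well. A minimal sketch with the cross-encoder/nli-deberta-v3-base model (the same one used later; the sentence pair is again purely illustrative):
import torch
import transformers
nli_name = "cross-encoder/nli-deberta-v3-base"
nli_tokenizer = transformers.AutoTokenizer.from_pretrained(nli_name)
nli_model = transformers.AutoModelForSequenceClassification.from_pretrained(nli_name)
premise = "The company's revenue increased in 2023."
hypothesis = "The company's revenue decreased in 2023."
inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(nli_model(**inputs).logits, dim=-1)
# For this model the labels are contradiction (0), entailment (1), neutral (2),
# so a contradictory pair like this should put most of the mass on index 0.
print(probs)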
The publication on “LongDocFact Score” stuck out to me. Instead of naively computing entailment scores across every source/summary sentence pair, they extract the top K semantically similar source sentences for each sentence in the summary, add a bit of context (the sentence before and the sentence after), and then compute the NLI scores on each snippet/summary-sentence pair.
We can illustrate this in Python using nltk, sentence-transformers, and transformers:
# Define texts and summary
text = "..."
summary = "..."
# Segment text and summary
import nltk
nltk.download("punkt")

def split_sentences_nltk(data: str) -> list[str]:
    """Split a document into cleaned, non-empty sentences."""
    return [s.strip() for s in nltk.sent_tokenize(data) if s.strip()]

segments_text = split_sentences_nltk(text)
segments_summary = split_sentences_nltk(summary)
# Embed segments
import torch
from sentence_transformers import SentenceTransformer

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
sentence_model_name = "all-MiniLM-L6-v2"
sentence_model = SentenceTransformer(sentence_model_name)
sentence_model.eval()
sentence_model.to(device)
def get_similarity_matrix(
    segments_text: list[str],
    segments_summary: list[str],
    sentence_model: SentenceTransformer,
) -> torch.Tensor:
    """Returns similarity matrix with shape
    (len(segments_text), len(segments_summary)).
    """
    # L2-normalise the embeddings so the dot product below is a cosine similarity.
    norm_embedded_segments_text = torch.nn.functional.normalize(
        sentence_model.encode(
            segments_text,
            convert_to_tensor=True,
            show_progress_bar=False,
        ),
        p=2,
        dim=1,
    )
    norm_embedded_segments_summary = torch.nn.functional.normalize(
        sentence_model.encode(
            segments_summary,
            convert_to_tensor=True,
            show_progress_bar=False,
        ),
        p=2,
        dim=1,
    )
    return norm_embedded_segments_text @ norm_embedded_segments_summary.T
similarity_matrix = get_similarity_matrix(
    segments_text,
    segments_summary,
    sentence_model,
)
Once we have the similarity matrix, we can build the snippets for each sentence in the summary:
# Get top k similar text segments with context for each summary segment
k = 3
_, top_k_indices = torch.topk(similarity_matrix, k=k, dim=0, largest=True)
def get_snippet(data_sen: list[str], doc_idx: int) -> str:
    """Returns the sentence at doc_idx with its neighbouring sentences as context."""
    context_parts = []
    if doc_idx > 0:
        context_parts.append(data_sen[doc_idx - 1])
    context_parts.append(data_sen[doc_idx])
    if doc_idx < len(data_sen) - 1:
        context_parts.append(data_sen[doc_idx + 1])
    return " ".join(context_parts)
top_k_context_segments = [
    [get_snippet(segments_text, int(i)) for i in top_k_indices[:, idx]]
    for idx in range(similarity_matrix.shape[1])
]
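As a quick sanity check, we can inspect the snippets retrieved for the first summary sentence, using the variables defined above:
# Inspect the k source snippets retrieved for the first summary sentence.
print(segments_summary[0])
for snippet in top_k_context_segments[0]:
    print("-", snippet)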
Now we have a much more efficient way to calculate the NLI scores for this text/summary pair:
# Calculate scores for each summary segment
from typing import Any

import torch
import transformers

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()
model.to(device)
# Label order for cross-encoder/nli-deberta-v3-base.
label_idxs = {
    "contradiction": 0,
    "entailment": 1,
    "neutral": 2,
}
def flatten_list_of_lists(list_of_lists: list[list[Any]]) -> list[Any]:
    return [value for list_of_values in list_of_lists for value in list_of_values]

def explode_list_by_k(list_of_values: list[Any], k: int) -> list[Any]:
    return [v for v in list_of_values for _ in range(k)]
# Tokenize each (snippet, summary sentence) pair; each summary sentence is
# repeated k times so it lines up with its k retrieved snippets.
features = {
    key: value.to(device)
    for key, value in tokenizer(
        flatten_list_of_lists(top_k_context_segments),
        explode_list_by_k(segments_summary, k),
        padding=True,
        truncation=True,
        return_tensors="pt",
    ).items()
}

with torch.no_grad():
    logits = model(**features).logits

scores = torch.softmax(logits, dim=-1)
n_snippets, n_labels = scores.shape
# For each summary sentence, keep the highest score per label across its k snippets.
scores, _ = scores.reshape(n_snippets // k, k, n_labels).max(dim=1)
metrics = {
    "mean_entailment": float(scores[:, label_idxs["entailment"]].mean().item()),
    "mean_contradiction": float(scores[:, label_idxs["contradiction"]].mean().item()),
    "max_contradiction": float(scores[:, label_idxs["contradiction"]].max().item()),
}
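With these numbers we finally have something to track: higher mean_entailment and lower contradiction scores suggest a summary that is better supported by the source, and comparing them across candidate models (or before and after fine-tuning) gives us the baseline we were after. For example:
# Report the factuality metrics for this text/summary pair.
for metric_name, metric_value in metrics.items():
    print(f"{metric_name}: {metric_value:.3f}")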