Summary Is Not Enough: Source-Blind LLM Judges Mistake Faithful Citation for Hallucination

Using one language model to grade another's output is now standard practice, including for hallucination. But when the judge is shown only a short ground-truth summary instead of the source the candidate model actually read, it cannot tell a real fabrication from a faithful quote of the source. This paper documents that failure mode, demonstrates it with a deterministic audit over every output, and ships a fix of roughly 40 lines.

A judge that can't see the source

The LLM-as-judge literature already catalogues biases that distort scores independently of quality — position bias, verbosity bias. This paper adds another, specific to source-grounded scoring: when a judge is asked to rate hallucination but is given only a brief summary rather than the source the candidate model saw, it cannot verify source-grounded citations and systematically flags faithful ones as fabricated.

The mechanism is structural. When the candidate model cites content (a number, an identifier, a string) that is present in the source but absent from the judge's short summary, the judge has no way to check the citation and marks it as unsourced. The rate of mis-flagging is therefore governed by how each output is phrased, not by any difference in fabrication.
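The blind spot can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the identifier pattern, the function names, and the example strings are invented and are not taken from the paper's data. Two outputs with the same amount of fabrication (none) differ only in citation style, yet the more detailed one carries more identifiers that a summary-only reader cannot verify.

```python
import re

# Hypothetical identifier shape: uppercase prefix, hyphen, digits (e.g. "SEG-7").
# The paper's actual token definition is not reproduced here.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")


def identifiers(text: str) -> set[str]:
    """Return the set of identifier-shaped tokens found in text."""
    return set(ID_PATTERN.findall(text))


def bucket_citations(output: str, summary: str, source: str) -> dict[str, set[str]]:
    """Sort an output's identifiers by which reference text could verify them."""
    cited = identifiers(output)
    in_summary = identifiers(summary)
    in_source = identifiers(source)
    return {
        # Visible to a judge that only sees the summary.
        "checkable_from_summary": cited & in_summary,
        # Faithful to the source, but a summary-only judge cannot confirm them.
        "faithful_but_invisible_to_judge": (cited & in_source) - in_summary,
        # Not in the source at all: the only real fabrication candidates.
        "absent_from_source": cited - in_source,
    }


# Invented example texts.
source = "Campaign CMP-1042 ran on segment SEG-7 and used creative CRV-310."
summary = "Campaign CMP-1042 performed well."

terse = "CMP-1042 performed well."
detailed = "CMP-1042 performed well on SEG-7 with creative CRV-310."

terse_buckets = bucket_citations(terse, summary, source)
detailed_buckets = bucket_citations(detailed, summary, source)
```

In this toy case neither output cites anything absent from the source, but only the detailed one has identifiers in the middle bucket, which is where a source-blind judge would be expected to mis-flag.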

The judge's hallucination score tracks output citation style — not whether the model made anything up.

Four audits, one verdict

The claim is established by four audits ordered by increasing independence from the judge. A sampled review of the judge's own rationales found that every flagged item was in fact present in the source. A deterministic membership check extracted every identifier-shaped token from all 252 outputs and matched it against the source library; no output contained an identifier absent from the source. Finally, two human raters, the second blind to the hypothesis, re-scored a sample with source access. They diverged sharply from the judge on hallucination, by roughly +1.6 to +1.8 points, while agreeing with it on the other dimensions.
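A deterministic membership check of this kind might look like the sketch below. It is not the authors' released code: the regular expression, the function names, and the sample documents are assumptions chosen for illustration, and a real corpus would need its own definition of "identifier-shaped".

```python
import re

# Assumed identifier shape (uppercase prefix, hyphen, digits); adjust to the corpus at hand.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")


def build_source_index(source_docs: list[str]) -> set[str]:
    """Collect every identifier-shaped token that appears anywhere in the source library."""
    index: set[str] = set()
    for doc in source_docs:
        index.update(ID_PATTERN.findall(doc))
    return index


def audit_outputs(outputs: dict[str, str], source_docs: list[str]) -> dict[str, list[str]]:
    """Map each output id to the sorted identifiers it cites that are missing from the source.

    Outputs whose identifiers all appear in the source are omitted, so an
    empty result means no identifier-level fabrication was found.
    """
    index = build_source_index(source_docs)
    findings: dict[str, list[str]] = {}
    for output_id, text in outputs.items():
        missing = sorted(set(ID_PATTERN.findall(text)) - index)
        if missing:
            findings[output_id] = missing
    return findings


# Invented example data, not drawn from the paper.
source_docs = [
    "Order ORD-881 shipped from warehouse WH-12.",
    "Invoice INV-2045 references order ORD-881.",
]
outputs = {
    "out-001": "ORD-881 left WH-12 and was billed under INV-2045.",
    "out-002": "ORD-881 was billed under INV-9999.",
}

findings = audit_outputs(outputs, source_docs)
```

A check like this only covers identifier-level fabrication. It says nothing about invented prose claims, which is why it complements a judge rather than replacing one.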

252 outputs checked, with zero identifier-level fabrications

~40 lines of code in the deterministic membership check

+1.6 points: how much harsher the judge scored hallucination than humans with source access

This is more than a curiosity, because the contaminated dimension carried most of the apparent signal: a practitioner trusting the judge's hallucination score would have shipped a design decision on the strength of a judge artifact. The deterministic check would have caught the problem in seconds.

The remedy, and a second lesson

The fix is concrete: give the judge the same source the candidate model saw, or supplement it with a deterministic source-membership check against the corpus before trusting its hallucination scores. The authors release that check as a routine diagnostic for any pipeline scoring identifier-grounded outputs. An external arm on the public FiQA benchmark, where the judge is given full source access, supports the verification-oriented lesson — source-tagged context makes answers mechanically checkable even when answer quality is unchanged.
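One way to apply the second option is to treat the membership check as a gate on the judge's hallucination score. The sketch below is a hypothetical illustration: the `JudgedOutput` record, the score scale (higher meaning less hallucination), the `flag_below` threshold, and the status labels are all assumptions, not part of the released diagnostic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedOutput:
    output_id: str
    judge_hallucination_score: float  # assumed scale: higher means less hallucination
    cited_identifiers: frozenset[str]  # identifiers already extracted from the output


def review_status(
    item: JudgedOutput,
    source_identifiers: frozenset[str],
    flag_below: float = 3.0,  # assumed threshold under which the judge counts as "flagging"
) -> str:
    """Cross-check a judge's hallucination score against source membership."""
    missing = item.cited_identifiers - source_identifiers
    if missing:
        # At least one cited identifier is not in the source, whatever the judge said.
        return "fabrication_candidate"
    if item.judge_hallucination_score < flag_below:
        # The judge flagged the output, but every identifier is in the source.
        return "judge_flag_needs_review"
    return "consistent"


# Invented example values.
source_ids = frozenset({"ORD-881", "WH-12", "INV-2045"})

faithful_but_flagged = JudgedOutput("out-001", 1.5, frozenset({"ORD-881", "WH-12"}))
fabricated = JudgedOutput("out-002", 4.0, frozenset({"INV-9999"}))
clean = JudgedOutput("out-003", 4.5, frozenset({"INV-2045"}))

statuses = [review_status(o, source_ids) for o in (faithful_but_flagged, fabricated, clean)]
```

The middle status is deliberately a request for review rather than an override: the judge may be reacting to something other than identifiers, so a low score with clean membership is a reason to look, not proof that the judge is wrong.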

The experiment that surfaced the artifact — a four-condition ablation testing whether the shape of in-context data (prose, JSON, a typed object graph) changes downstream reasoning — carries its own lesson once the contaminated dimension is removed. No composite-level effect of representation survives; structured context helps only where the task demands it (typed-binding fidelity, multi-hop attribution), not as an architectural default. In short: structured representation is a per-task switch, and an unaudited LLM judge is a liability.
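To make "shape of in-context data" concrete, the sketch below renders one record in three of the shapes named above. The `Campaign` and `Segment` classes, their fields, and the rendering functions are invented for illustration. The paper's actual conditions and serialisation formats are not reproduced here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str
    name: str


@dataclass(frozen=True)
class Campaign:
    campaign_id: str
    budget_eur: int
    segment: Segment  # typed reference to another node in the graph


def as_prose(c: Campaign) -> str:
    """Free-text rendering: relations are implied by the sentence structure."""
    return (
        f"Campaign {c.campaign_id} has a budget of {c.budget_eur} EUR "
        f"and targets segment {c.segment.segment_id} ({c.segment.name})."
    )


def as_json(c: Campaign) -> str:
    """JSON rendering: explicit keys and nesting, but no type names."""
    return json.dumps(asdict(c), sort_keys=True)


def as_typed_graph(c: Campaign) -> str:
    """Typed rendering: the dataclass repr keeps class names alongside the fields."""
    return repr(c)


# Invented example record.
campaign = Campaign("CMP-1", 5000, Segment("SEG-7", "returning customers"))

prose_context = as_prose(campaign)
json_context = as_json(campaign)
typed_context = as_typed_graph(campaign)
```

All three strings carry the same facts. What changes is how explicitly the binding between the campaign and its segment is marked, which is the kind of difference the typed-binding and multi-hop tasks would be sensitive to.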

This paper has been submitted to CLiC-it 2026 — the Twelfth Italian Conference on Computational Linguistics (Palermo, 14–16 September 2026) — and is currently under peer review. The version below is the submitted manuscript.

Maryam Fooladi and Federico Bottino are affiliated with Kakashi Venture Accelerator.

Download the paper

PDF · 14 pages · Submitted to CLiC-it 2026

For enquiries about the research or partnership opportunities, contact the KVA team.

LLM-as-Judge · Hallucination · Evaluation · Structured Context · Ablation Study · Marketing Reasoning · Reproducibility

KVA Research · Kakashi Ventures Accelerator Srl · Turin, Italy