Benchmark In Plain English

The corpus is a collection of records. Each record has a question, one or more documents that contain the answer, and several plausible distractor documents.

The benchmark gives the substrate a question it has not been targeted at and measures whether the substrate routes retrieval toward the answer-bearing documents. Patches that improve that routing earn credit.

The metric penalizes plausible-but-wrong matches. Superficially related documents that lack the answer score below a true match ranked further down. This is what real memory failure looks like, confidently retrieving something that sounds right, and a metric that did not punish it would reward miners for getting better at surfacing convincing noise.