Benchmark In Plain English
The CoreTex benchmark asks whether the 32 KB substrate helps retrieve the right memories from a much larger corpus.
The corpus contains many records. Each record has a query, one or more answer-bearing documents, and several plausible wrong documents. A miner patch changes a few words in the substrate. The evaluator checks whether the patched substrate routes the query toward better documents than the parent substrate did.
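The key concept is a qrel, short for query relevance judgment. A qrel is a label attached to one (query, document) pair that says how relevant that document is for that query. The benchmark grades relevance on six levels from 0.0 to 1.0: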
| Relevance | Plain meaning |
|---|---|
| 1.0 | Direct answer; the document contains the canonical truth |
| 0.8 | Highly relevant; answer-bearing |
| 0.6 | Partially answer-bearing |
| 0.4 | Related but incomplete, stale, or only partly useful |
| 0.2 | Surface-related but not the answer |
| 0.0 | Irrelevant or adversarial distractor |
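As an illustration, the graded labels above could be held in memory roughly as follows. This is a minimal sketch, not the benchmark's actual schema: the `Qrel` class, its field names, the `index_qrels` helper, and the document IDs are all hypothetical.

```python
from dataclasses import dataclass

# The six graded relevance levels listed in the table.
VALID_GRADES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


@dataclass(frozen=True)
class Qrel:
    """One relevance judgment for a (query, document) pair (hypothetical layout)."""
    query_id: str
    doc_id: str
    relevance: float

    def __post_init__(self):
        # Reject grades that are not one of the six documented levels.
        if self.relevance not in VALID_GRADES:
            raise ValueError(f"unexpected relevance grade: {self.relevance}")


def index_qrels(qrels):
    """Group qrels as {query_id: {doc_id: relevance}} for fast lookup."""
    index = {}
    for q in qrels:
        index.setdefault(q.query_id, {})[q.doc_id] = q.relevance
    return index


# Made-up judgments for a single query.
qrels = [
    Qrel("q1", "doc_current", 1.0),
    Qrel("q1", "doc_stale", 0.4),
    Qrel("q1", "doc_lookalike", 0.2),
    Qrel("q1", "doc_distractor", 0.0),
]
by_query = index_qrels(qrels)
```

Grouping by query keeps the lookup cheap when a ranked list for one query has to be graded document by document.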
The evaluator does not ask a model to write an answer. It asks a narrower retrieval question: did the substrate put the most relevant documents near the top? That is why the primary metric is nDCG@10. The metric gives more credit when highly relevant documents appear early in the ranked list and less credit when plausible wrong documents rank above the truth.
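To make the metric concrete, here is a minimal sketch of nDCG@10 over graded qrels. The text does not say which nDCG variant the evaluator uses, so this sketch assumes linear gain (the relevance value itself), a log2 rank discount, and a relevance of 0.0 for unjudged documents. The function names and document IDs are illustrative only.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> graded relevance."""
    # Assumption: documents without a judgment count as irrelevant.
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists every judged document in descending relevance.
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


qrels = {"doc_current": 1.0, "doc_stale": 0.4, "doc_lookalike": 0.2, "doc_distractor": 0.0}

# The truth ranks first.
perfect = ndcg_at_k(["doc_current", "doc_stale", "doc_lookalike", "doc_distractor"], qrels)
# A stale document outranks the truth.
stale_first = ndcg_at_k(["doc_stale", "doc_current", "doc_lookalike", "doc_distractor"], qrels)
```

Under these assumptions `perfect` reaches the maximum score, while `stale_first` scores lower even though both lists contain exactly the same documents. Only the order differs.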
This matters because memory failure often looks superficially correct. A stale fact, a near-collision entity, or a causally related neighbor can all look semantically similar to the query. CoreTex rewards patches that improve the retrieval structure despite those traps.
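The parent-versus-patched comparison described earlier can be sketched as a per-query score difference. This is an assumption-laden illustration: the aggregation rule (mean delta over shared queries), the function name `compare_substrates`, and the scores are hypothetical, and the real evaluator may weigh queries or record families differently.

```python
from statistics import mean


def compare_substrates(parent_scores, patched_scores):
    """Compare per-query scores (for example nDCG@10) of two substrates.

    Both arguments map query_id -> score. Only queries present in both are compared.
    """
    shared = sorted(parent_scores.keys() & patched_scores.keys())
    if not shared:
        raise ValueError("no shared queries to compare")
    deltas = {qid: patched_scores[qid] - parent_scores[qid] for qid in shared}
    return {
        "mean_delta": mean(deltas.values()),
        "improved": [qid for qid in shared if deltas[qid] > 0],
        "regressed": [qid for qid in shared if deltas[qid] < 0],
    }


# Made-up scores: the patch helps q1 but pushes a trap above the truth on q3.
parent = {"q1": 0.5, "q2": 0.75, "q3": 1.0}
patched = {"q1": 0.75, "q2": 0.75, "q3": 0.5}
report = compare_substrates(parent, patched)
```

In this example the patch improves one query, but the regression on another is larger, so the mean delta is negative.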
What Counts As Real Improvement
A patch can improve the benchmark by:
- mapping the query toward a better answer-bearing record,
- encoding retrieval keys that discriminate near-collision documents,
- marking stale memories so current facts win,
- adding relation structure that helps multi-hop retrieval,
- compressing useful record IDs or routing hints into the fixed substrate,
- preserving protected records while improving other families.
A patch does not get canonical credit merely for filling slots, changing bytes, or increasing structural occupancy. The retrieval candidates must be reachable through active substrate slots, reranked by the pinned model, and scored against hidden qrels.
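The three conditions in the last sentence can be read as a pipeline: collect candidates from active slots, rerank them, then grade the result against the qrels. The sketch below illustrates that reading only. The slot layout (`active`, `doc_ids`), the `rerank` callable standing in for the pinned model, and all scores are hypothetical and not taken from the CoreTex implementation.

```python
def reachable_candidates(slots):
    """Collect doc IDs referenced by active slots, keeping first-seen order."""
    seen = {}
    for slot in slots:
        if slot["active"]:
            for doc_id in slot["doc_ids"]:
                seen.setdefault(doc_id, None)
    return list(seen)


def graded_top_k(slots, rerank, qrels, k=10):
    """Return the graded relevance of the top-k reranked reachable candidates."""
    candidates = reachable_candidates(slots)
    ranked = rerank(candidates)[:k]
    # Assumption: documents without a judgment count as irrelevant.
    return [qrels.get(doc_id, 0.0) for doc_id in ranked]


# Hypothetical stand-in for the pinned reranking model.
model_scores = {"doc_truth": 0.9, "doc_stale": 0.7, "doc_lookalike": 0.3}


def rerank(doc_ids):
    return sorted(doc_ids, key=lambda d: model_scores.get(d, 0.0), reverse=True)


qrels = {"doc_truth": 1.0, "doc_stale": 0.4, "doc_lookalike": 0.2}

# The truth is written into a slot, but that slot is inactive.
slots = [
    {"active": True, "doc_ids": ["doc_stale", "doc_lookalike"]},
    {"active": False, "doc_ids": ["doc_truth"]},
]
before = graded_top_k(slots, rerank, qrels)

# Once the slot is active, the truth becomes reachable and can be ranked.
slots[1]["active"] = True
after = graded_top_k(slots, rerank, qrels)
```

In this sketch, filling the inactive slot changes bytes but not the ranked list, so `before` contains no top-graded document. Credit appears only once the document is reachable and the reranker places it near the top.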