Corpus Generation And Labeling
The CoreTex corpus is generated from the BOTCOIN challenge library, not from a raw dump of prior solver traces. Existing challenges already contain the important ingredients for a memory benchmark: multi-hop questions, temporal modifiers, causal relationships, traps, domain entities, and constraint difficulty. The corpus builder turns those ingredients into retrieval-evaluation records.
Each generated record has:
- a query,
- answer-bearing truth documents,
- hard negatives that are plausible but wrong,
- graded qrels,
- a deterministic split,
- provenance back to the source domain/seed,
- BGE-M3 embeddings in the bundle layout,
- optional temporal and relation annotations.
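As a rough illustration of the record shape listed above, the following sketch models one record as a Python dataclass. The field names, the types, the `embedding_ref` pointer, and the `problems` helper are all assumptions made for illustration; the actual bundle schema is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

# Split names come from the text; everything else below is hypothetical.
KNOWN_SPLITS = {"visible", "calibration", "hidden", "canary"}


@dataclass(frozen=True)
class Provenance:
    domain: str  # source challenge domain
    seed: int    # generator seed (integer type is an assumption)


@dataclass(frozen=True)
class RetrievalRecord:
    query: str
    truth_doc_ids: tuple[str, ...]        # answer-bearing documents
    hard_negative_ids: tuple[str, ...]    # plausible but wrong documents
    qrels: dict[str, int]                 # doc_id -> graded relevance
    split: str
    provenance: Provenance
    embedding_ref: str                    # hypothetical pointer into the embedding layout
    temporal: Optional[dict] = None       # optional temporal annotations
    relations: Optional[tuple] = None     # optional relation annotations

    def problems(self) -> list[str]:
        """Return a list of structural problems; an empty list means none found."""
        found = []
        if not self.truth_doc_ids:
            found.append("no truth documents")
        if self.split not in KNOWN_SPLITS:
            found.append(f"unknown split: {self.split}")
        for doc_id in (*self.truth_doc_ids, *self.hard_negative_ids):
            if doc_id not in self.qrels:
                found.append(f"missing qrel for {doc_id}")
        return found
```

A record whose hard negative has no qrel entry would be reported by `problems()` rather than silently accepted, which is the kind of check the validation stage could run per record.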
The launch corpus uses synthesizer-category labels for production qrels. The challenge synthesizer already knows why a negative document was generated: entity swap, attribute swap, stale temporal fact, trap, lexical distractor, relation neighbor, or unrelated filler. The bundle maps those categories to relevance values through negCategoryRelevanceMap.
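The category-to-relevance mapping can be sketched as a plain lookup. The category names follow the list above, but the numeric grades, the truth-document grade, and the `build_qrels` helper are hypothetical; the real values live in the bundle's negCategoryRelevanceMap and are not given in this section.

```python
# Hypothetical grades: the real values are defined by the bundle profile.
NEG_CATEGORY_RELEVANCE_MAP = {
    "entity_swap": 1,
    "attribute_swap": 1,
    "stale_temporal_fact": 1,
    "trap": 1,
    "relation_neighbor": 1,
    "lexical_distractor": 0,
    "unrelated_filler": 0,
}

TRUTH_RELEVANCE = 3  # assumed top grade for answer-bearing documents


def build_qrels(truth_ids, negatives):
    """Build graded qrels from truth doc ids and (doc_id, category) negatives.

    No model inference is involved: the same inputs and the same map
    always produce the same qrels.
    """
    qrels = {doc_id: TRUTH_RELEVANCE for doc_id in truth_ids}
    for doc_id, category in negatives:
        if category not in NEG_CATEGORY_RELEVANCE_MAP:
            raise ValueError(f"unknown negative category: {category}")
        qrels[doc_id] = NEG_CATEGORY_RELEVANCE_MAP[category]
    return qrels
```

Rejecting unknown categories, rather than defaulting them to zero, keeps a new synthesizer category from entering the corpus without an explicit relevance decision.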
This design removes a permanent scaling bottleneck. If every corpus delta required running a 4B-parameter labeler over every document pair, corpus growth would be gated by continuous heavy model inference. Instead, qrels are deterministic outputs of the generator and bundle profile. The stronger MemReranker path is retained for audit and calibration, not for every production corpus update.
Corpus expansion is operationally repetitive by design. Adding domains, seeds, temporal updates, relation depth, hard negatives, qrels, embeddings, and split metadata should mean running the generator and validation pipeline again, not manually editing the benchmark.
| Stage | Purpose |
|---|---|
| Generate domains/seeds | Produce worlds, entities, temporal updates, relations, traps, and questions |
| Build retrieval records | Convert generated material into query/truth/negative/qrel tuples |
| Encode embeddings | Precompute BGE-M3 query and document vectors under the pinned layout |
| Assign splits | Deterministically divide records into visible, calibration, hidden, and canary sets |
| Validate corpus | Check roots, splits, embeddings, qrel distribution, and hidden-pack satisfiability |
| Audit labels | Sample labeled records and score them with stronger rerankers to confirm category labels track model relevance |
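One way to make the audit stage concrete is a pairwise agreement check between category-derived labels and reranker scores. This is a sketch under stated assumptions: the `pairwise_agreement` function and its inputs are hypothetical, and the actual audit metric used with MemReranker is not described here.

```python
from itertools import combinations


def pairwise_agreement(labels, scores):
    """Fraction of document pairs that the reranker orders the same way as the labels.

    labels: doc_id -> graded relevance from the category map
    scores: doc_id -> relevance score from a stronger reranker
    Pairs with equal labels are skipped because they imply no ordering.
    Returns 1.0 when no pair is comparable.
    """
    concordant = 0
    comparable = 0
    for a, b in combinations(labels, 2):
        if labels[a] == labels[b]:
            continue
        comparable += 1
        higher, lower = (a, b) if labels[a] > labels[b] else (b, a)
        if scores[higher] > scores[lower]:
            concordant += 1
    return concordant / comparable if comparable else 1.0
```

A low agreement value on a sampled batch would suggest that some category is mapped to the wrong relevance grade, which is a calibration question rather than a reason to relabel every record with a model.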
The visible split exists so miners can inspect the benchmark shape without seeing active hidden material. The hidden split is used for live scoring. The calibration split tunes thresholds and bundle profile values. The canary split is reserved for leakage and overfitting detection.
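Deterministic split assignment is commonly done by hashing a stable record key into a bucket. The sketch below assumes that approach; the key format, the salt, and the split proportions are hypothetical and are not taken from the CoreTex bundle.

```python
import hashlib

# Hypothetical proportions in percent; they must sum to 100.
SPLIT_WEIGHTS = [
    ("visible", 40),
    ("calibration", 10),
    ("hidden", 45),
    ("canary", 5),
]


def assign_split(domain, seed, record_index, salt="example-salt"):
    """Map a record key to a split using SHA-256, with no random state.

    The same (domain, seed, record_index, salt) always yields the same split,
    so regenerating the corpus does not move existing records between splits.
    """
    key = f"{salt}:{domain}:{seed}:{record_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    cumulative = 0
    for name, weight in SPLIT_WEIGHTS:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("SPLIT_WEIGHTS must sum to 100")
```

Because the assignment depends only on the record key, adding new domains or seeds leaves earlier records where they were. If hidden membership must not be predictable by miners, the salt would need to be kept private, which is a design choice this sketch does not settle.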