Corpus And Expansion

The CoreTex corpus is a retrieval benchmark corpus, not a dump of raw challenge traces. Each admitted event has the fields needed for retrieval evaluation:

| Field | Purpose |
| --- | --- |
| Query text | The user-facing retrieval question |
| Truth documents | Answer-bearing documents |
| Hard negatives | Plausible non-answer documents |
| Graded qrels | Relevance labels used by `nDCG@10` and reranker evaluation |
| Split | `train_visible`, `calibration`, `eval_hidden`, or `canary` |
| Family | Retrieval family such as near-collision, temporal, long-horizon, or multi-hop relation |
| Embeddings | Bundle-layout-compatible BGE-M3 payloads |
| Depth metadata | Optional `causalDepth` and `relationHopDepth` for deeper temporal and relation strata |
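
The fields above can be pictured as a single record type. The following is a minimal sketch, not the pinned schema: the class name `CorpusEvent`, the snake_case attribute names, the `embedding_ref` pointer, and the validation rules (truth documents carry a positive grade, hard negatives never overlap truth documents) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The four split names come from the field table; everything else is illustrative.
SPLITS = {"train_visible", "calibration", "eval_hidden", "canary"}


@dataclass
class CorpusEvent:
    """Hypothetical in-memory shape of one admitted corpus event."""

    query_text: str
    truth_doc_ids: List[str]
    hard_negative_ids: List[str]
    qrels: Dict[str, int]  # doc id -> graded relevance label
    split: str
    family: str  # e.g. "near-collision", "temporal", "long-horizon"
    embedding_ref: str  # assumed pointer to a BGE-M3 payload
    causal_depth: Optional[int] = None  # optional depth metadata
    relation_hop_depth: Optional[int] = None

    def validate(self) -> None:
        """Raise ValueError if the record breaks the assumed invariants."""
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")
        if not self.truth_doc_ids:
            raise ValueError("an event needs at least one truth document")
        overlap = set(self.truth_doc_ids) & set(self.hard_negative_ids)
        if overlap:
            raise ValueError(f"documents are both truth and negative: {sorted(overlap)}")
        # Assumption: every truth document has a positive relevance grade.
        ungraded = [d for d in self.truth_doc_ids if self.qrels.get(d, 0) <= 0]
        if ungraded:
            raise ValueError(f"truth documents without a positive grade: {ungraded}")
```

Calling `validate()` before a record is admitted is one way to keep qrels, truth documents, and hard negatives mutually consistent; the real builder may enforce different or stricter rules.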

The launch corpus path uses the challenge-library generator as the source of structured worlds, questions, relations, temporal modifications, and hard negatives. That is different from blindly bridging V3 dataset output. Raw V3 traces are useful audit material, but the CoreTex corpus needs explicit qrels, split discipline, hard-negative pools, embeddings, and relation/temporal metadata.
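
To illustrate why explicit graded qrels matter, here is a small sketch of how `nDCG@10` can be computed from them. It assumes the common exponential-gain form, gain = 2^grade - 1 with a log2 position discount; the evaluator's actual gain function is not stated here, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(grades: List[int], k: int = 10) -> float:
    """Discounted cumulative gain over the first k grades (exponential gain)."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; documents missing from qrels count as grade 0."""
    grades = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents, so the score is defined as 0 here
    return dcg_at_k(grades, k) / ideal_dcg


# A hard negative ranked above the truth documents lowers the score.
qrels = {"truth_a": 2, "truth_b": 1, "hard_neg": 0}
perfect = ndcg_at_k(["truth_a", "truth_b", "hard_neg"], qrels)
fooled = ndcg_at_k(["hard_neg", "truth_a", "truth_b"], qrels)
```

Without graded labels and an explicit hard-negative pool, a raw trace cannot distinguish these two rankings, which is the gap the corpus fields are meant to close.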

Expansion is append-only through signed corpus deltas:

1. New domains, seeds, question families, temporal updates, relations, and hard negatives are generated or admitted.
2. The corpus builder emits records with embeddings and qrels under the pinned schema.
3. Records are assigned to deterministic splits.
4. The new corpus root and delta manifest are published.
5. The next bundle or epoch metadata binds the root that the evaluator uses.
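
Two of the steps above, deterministic split assignment and publishing a new root with a delta manifest, can be sketched as follows. This is an illustration under loud assumptions: the split proportions, the salt, the manifest keys, and the simple hash chain (rather than, say, a Merkle tree) are all hypothetical, and signing of the delta is omitted entirely.

```python
import hashlib
import json
from typing import Dict, List, Tuple

# Hypothetical split proportions in percent; the real ratios are not given here.
SPLIT_WEIGHTS = [("train_visible", 70), ("calibration", 10), ("eval_hidden", 18), ("canary", 2)]


def assign_split(record_id: str, salt: str = "coretex-split-v1") -> str:
    """Map a record id to a split using only a hash, so reruns always agree."""
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    upper = 0
    for name, weight in SPLIT_WEIGHTS:
        upper += weight
        if bucket < upper:
            return name
    raise AssertionError("SPLIT_WEIGHTS must sum to 100")


def record_hash(record: Dict) -> str:
    """Hash a record via canonical JSON so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extend_root(prev_root: str, delta_records: List[Dict]) -> Tuple[str, Dict]:
    """Chain a delta onto the previous root; earlier records are never rewritten."""
    leaf_hashes = sorted(record_hash(r) for r in delta_records)
    delta_digest = hashlib.sha256("".join(leaf_hashes).encode("utf-8")).hexdigest()
    new_root = hashlib.sha256((prev_root + delta_digest).encode("utf-8")).hexdigest()
    manifest = {
        "prev_root": prev_root,
        "delta_digest": delta_digest,
        "record_count": len(delta_records),
        "new_root": new_root,
    }
    return new_root, manifest
```

Because the new root depends on the previous root, an evaluator bound to a given root can detect any attempt to alter earlier records, which is the property that append-only expansion relies on.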

The fixed substrate is what makes expansion matter. More records, deeper relation chains, more temporal revisions, and denser distractors increase the amount of useful retrieval structure competing for the same 1024 words.