Corpus

The current launch-family corpus is the v16 dgen1-r5-synth-300k corpus. The generation path builds retrieval-evaluation records from structured BOTCOIN challenge ingredients: worlds, entities, relations, temporal updates, traps, and hard negatives.

Each record carries:

| Field | Purpose |
| --- | --- |
| Query text | The retrieval task |
| Truth documents | Answer-bearing memory documents |
| Hard negatives | Plausible documents that should rank below truth |
| Graded qrels | Relevance labels for `nDCG@10`, MRR, recall, and audit metrics |
| Split | `train_visible`, `calibration`, `eval_hidden`, or `canary` |
| Public intent metadata | Temporal, relation, evidence, conflict, scope, entity, and abstention hints available to all miners |
| Embeddings | Bundle-layout-compatible BGE-M3 query and document vectors |
| Provenance | Source domain, seed, generator path, and deterministic roots |
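
As a rough picture of the record shape listed above, the sketch below models one record as a Python dataclass. The class name, attribute names, and types are assumptions made for illustration; the actual bundle layout is not specified here, and embeddings are left out for brevity.

```python
from dataclasses import dataclass, field

# Split names are taken from the table above.
SPLITS = {"train_visible", "calibration", "eval_hidden", "canary"}

@dataclass
class CorpusRecord:
    """Hypothetical in-memory view of one retrieval-evaluation record."""
    query_text: str
    truth_doc_ids: list[str]
    hard_negative_ids: list[str]
    qrels: dict[str, int]  # document id -> graded relevance (assumed integer grades)
    split: str
    intent: dict[str, str] = field(default_factory=dict)      # public intent hints
    provenance: dict[str, str] = field(default_factory=dict)  # domain, seed, roots

    def __post_init__(self):
        # Reject records whose split is not one of the four documented values.
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split}")
```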

The production qrel path uses synthesizer-category labels. The generator knows why a negative exists, such as stale fact, entity swap, relation neighbor, attribute swap, lexical distractor, or unrelated filler. The bundle maps those categories into graded relevance. Because the labels come from the generator, production corpus growth avoids a heavier relabeling pass over every pair; larger audit rerankers remain useful for calibration checks.
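
To illustrate how category labels could become graded qrels, the sketch below maps each synthesizer category to a grade and computes `nDCG@10` for one ranked list. The grade values, the `GRADES` table, and the function names are assumptions for illustration only; the bundle's real mapping is not given here.

```python
import math

# Hypothetical grades: the bundle's actual category-to-grade mapping is not documented here.
GRADES = {
    "truth": 3,
    "stale_fact": 1,
    "relation_neighbor": 1,
    "entity_swap": 0,
    "attribute_swap": 0,
    "lexical_distractor": 0,
    "unrelated_filler": 0,
}

def dcg(grades):
    """Discounted cumulative gain using the common (2^g - 1) / log2(rank + 1) form."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_categories, k=10):
    """nDCG@k for one query, given the category of each ranked document.

    Simplification: the ideal ordering is built from the same candidate list.
    """
    grades = [GRADES[c] for c in ranked_categories]
    ideal_dcg = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places a stale fact above the truth document scores below 1.0 under this mapping, which is the behavior graded labels are meant to capture.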

Corpus growth is published through signed deltas. Validators retain the launch base corpus and can reconstruct historical corpus roots by walking the signed delta chain forward. A manual `--corpus-for-root 0x...=path` shortcut exists for operators, while the normal validator path auto-resolves and verifies historical roots before post-reveal rescoring.
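
The forward walk over the delta chain can be sketched as follows. Everything here is an assumption for illustration: the root update rule, the tuple layout of a delta, and the function names are hypothetical, and HMAC stands in for whatever signature scheme the deployment actually uses.

```python
import hashlib
import hmac

def next_root(parent_root: bytes, payload: bytes) -> bytes:
    """Hypothetical root update: hash of the parent root and the payload digest."""
    return hashlib.sha256(parent_root + hashlib.sha256(payload).digest()).digest()

def sign(key: bytes, parent_root: bytes, payload: bytes) -> bytes:
    """HMAC stand-in for the real signature scheme, which is not described here."""
    return hmac.new(key, parent_root + payload, hashlib.sha256).digest()

def resolve_root(base_root, deltas, target_root, key):
    """Walk (payload, signature) deltas forward from the launch base.

    Returns the payloads needed to rebuild the corpus at target_root.
    Raises ValueError on a bad signature or if the root is never reached.
    """
    root, applied = base_root, []
    if root == target_root:
        return applied
    for payload, signature in deltas:
        # Each delta is checked against the root it claims to extend.
        if not hmac.compare_digest(sign(key, root, payload), signature):
            raise ValueError("bad delta signature")
        root = next_root(root, payload)
        applied.append(payload)
        if root == target_root:
            return applied
    raise ValueError("target root not found in delta chain")
```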

Miner-facing corpus access is coordinator-proxied. Start from:

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/coretex/public-corpus/manifest` | Public corpus manifest, model IDs, corpus root, paging limits, split policy, and endpoint templates |
| `GET` | `/coretex/public-corpus/events?offset=N&limit=M` | Paged public visible events and public qrels |
| `GET` | `/coretex/public-corpus/events?offset=N&limit=M&includeEmbeddings=true` | Same event page with canonical public embedding hex; lower page limit |
| `GET` | `/coretex/public-corpus/event/:eventId` | One public visible event by id |
| `GET` | `/coretex/public-corpus/entities?offset=N&limit=M` | Paged public entity table |
| `GET` | `/coretex/public-corpus/family-summary` | Query-family counts and bounded representative public examples |
| `GET` | `/coretex/public-corpus/relation-summary` | Public relation edge-type counts and bounded representative public examples |
| `GET` | `/coretex/public-corpus/query-examples?surface=...&family=...&relation=...` | Bounded public examples filtered by intended surface, family, or relation |
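
A minimal paging loop over the events endpoint might look like the sketch below. The response shape (a JSON object with an `events` list), the stop-on-empty-page rule, and the helper names are assumptions; consult the manifest for the real paging limits and response layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def fetch_json(url):
    """Default fetcher: plain HTTP GET returning parsed JSON."""
    with urlopen(url) as resp:
        return json.load(resp)

def iter_public_events(base_url, limit=100, fetch=fetch_json):
    """Yield public visible events page by page until an empty page comes back.

    Assumes each page is a JSON object with an "events" list (hypothetical shape).
    """
    offset = 0
    while True:
        query = urlencode({"offset": offset, "limit": limit})
        page = fetch(f"{base_url}/coretex/public-corpus/events?{query}")
        events = page.get("events", [])
        if not events:
            return
        yield from events
        offset += len(events)
```

Passing `fetch` as a parameter keeps the loop testable offline and makes it straightforward to add retries or caching around the coordinator proxy.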

The public corpus proxy serves unprotected `train_visible` rows. Calibration, hidden eval, canary, protected, or nonexistent event IDs are not served through these endpoints. Use the coordinator proxy for miner research even when S3 artifact links are also advertised, because bucket ACLs and publication timing can differ from the miner-facing corpus API.

`/coretex/schema` advertises the current public artifact base URL, S3 URL templates, public corpus links, and eval-report URL template under `publicArtifacts`. Validators use those artifact URLs to hydrate launch files, historical corpus material, signing keys, and post-reveal evaluation reports.

Useful `publicArtifacts` fields:

| Field | Purpose |
| --- | --- |
| `artifactBaseUrl` | Base URL for launch-family public artifacts |
| `epochSigningPublicKeyUrl` | Public key material for verifying epoch-signed artifacts |
| `evalReportUrlTemplate` | Template for post-reveal eval reports by artifact hash |
| `publicCorpus` | Coordinator-proxied manifest, event, entity, family, relation, and query-example endpoints |
| `s3Hints` | Operator notes about which artifacts are best fetched through S3 versus coordinator proxy |

Corpus evolution is part of the memory model. New information enters, older information becomes stale, conflicts appear, retired hidden tasks leave the scoring pool, and new hidden tasks are added. Each evolve event is calibrated against its own corpus root, query pack, baseline, and pinned scorer context.

The calibration path also checks whether useful substrate changes generalize across corpus generations. A representative test starts with substrate design A on corpus/query pack A, evolves into corpus/query pack B, accepts a miner patch that beats the newly calibrated baseline, then backtests the resulting substrate design B against the pre-evolve corpus/query pack A. When design B preserves pre-evolve performance while improving the evolved context, the result shows a retrieval-routing improvement rather than corpus-specific indexing churn.
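
The backtest decision described above can be expressed as a small predicate. This is a sketch under stated assumptions: the `BacktestResult` fields, the single scalar score per run, and the `tolerance` value are hypothetical and not part of the documented calibration path.

```python
from dataclasses import dataclass

@dataclass
class BacktestResult:
    """Hypothetical scores (for example nDCG@10) from the four runs involved."""
    design_a_on_pack_a: float  # pre-evolve design on the pre-evolve corpus/query pack
    design_b_on_pack_a: float  # patched design backtested on the pre-evolve pack
    baseline_on_pack_b: float  # newly calibrated baseline after the evolve event
    design_b_on_pack_b: float  # patched design on the evolved pack

def generalizes(r: BacktestResult, tolerance: float = 0.01) -> bool:
    """True when design B beats the evolved baseline without regressing on pack A."""
    improves_evolved = r.design_b_on_pack_b > r.baseline_on_pack_b
    preserves_prior = r.design_b_on_pack_a >= r.design_a_on_pack_a - tolerance
    return improves_evolved and preserves_prior
```

A patch that only gains on pack B while losing ground on pack A fails this check, which is the corpus-specific indexing churn the backtest is meant to catch.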