Retrieval Evaluation
The primary metric is nDCG@10, which measures how highly the substrate ranks answer-bearing documents within the top ten retrieved results.
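For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 using the common linear-gain, log2-discount formulation. It is not taken from the evaluation code: the function names, the graded-relevance input format, and the optional judged_relevances argument are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranks (linear gain)."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked_relevances: relevance grade of each retrieved document, in rank order.
    judged_relevances: grades of all judged documents for the query, used to
        build the ideal ranking. Assumption: if omitted, the ideal ranking is
        built from the retrieved list itself, which overstates the score when
        relevant documents were not retrieved at all.
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant documents exist for this query; score it as zero.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One answer-bearing document, retrieved at rank 3 instead of rank 1.
    print(ndcg_at_k([0, 0, 1, 0, 0]))
```

With binary relevance (1 for answer-bearing, 0 otherwise), linear gain and the exponential-gain variant give the same result, so the choice only matters if graded judgments are used.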
Secondary metrics catch other failure modes: temporal accuracy (current versus stale), multi-hop relation recall, abstention behavior, and structural validity. The composite is retrieval-dominant, with structural sanity acting as a small guardrail. Putting retrieval at the center keeps the reward law anchored to the property miners are actually competing on. The structural checks exist to catch obviously malformed submissions, not to become a parallel reward that miners can optimize against.
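One way to picture a retrieval-dominant composite is a weighted sum in which nDCG@10 carries most of the weight. The sketch below is illustrative only: the text above does not state the actual weights or the combination rule, so the weight values, the metric key names, and the composite_score function are all hypothetical placeholders.

```python
from typing import Mapping

# HYPOTHETICAL weights, chosen only to illustrate "retrieval-dominant" with a
# small structural guardrail. They are not the real evaluation weights.
ILLUSTRATIVE_WEIGHTS: Mapping[str, float] = {
    "ndcg_at_10": 0.80,
    "temporal_accuracy": 0.05,
    "multi_hop_recall": 0.05,
    "abstention": 0.05,
    "structural_validity": 0.05,
}


def composite_score(
    metrics: Mapping[str, float],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    total_weight = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        value = metrics[name]  # KeyError if a required metric is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        score += weight * value
    # Normalise so the result stays in [0, 1] even if weights do not sum to 1.
    return score / total_weight


if __name__ == "__main__":
    # A structurally perfect submission that retrieves nothing useful earns
    # very little, because structural validity carries only a small weight.
    print(composite_score({
        "ndcg_at_10": 0.0,
        "temporal_accuracy": 0.0,
        "multi_hop_recall": 0.0,
        "abstention": 0.0,
        "structural_validity": 1.0,
    }))
```

Under weights like these, maximizing structural validity alone cannot compensate for poor ranking, which is the property the paragraph above describes. A gating rule (rejecting malformed submissions outright) would be another way to implement the same guardrail.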