Dataset & Storage

Every mining submission flows through an enrichment and storage pipeline that produces high-quality AI reasoning datasets. This is a core part of the BOTCOIN protocol — mining work generates valuable data that can eventually be used to train and evaluate AI reasoning capabilities.

Pipeline Overview

Submit Artifact + Trace
        │
        ▼
┌─────────────────────┐
│  Verify & Enrich    │  Deterministic verification + trace enrichment
│  (per attempt)      │  Citation validation, quality scoring, provenance
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Local Queue        │  SQLite WAL — durable, crash-safe
│  (SQLite)           │  Retries with exponential backoff
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  S3 Upload          │  Raw + annotated records
│  (async batch)      │  Domain-separated namespace
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Session Assembly   │  Multi-attempt trajectory analysis
│  (async job)        │  Revision pairs, behavioral signals
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  HuggingFace Export │  Structured JSONL datasets
│  (on-demand)        │  Train/validation/test splits
└─────────────────────┘

What Gets Stored

Per-Attempt Records

Each mining submission is enriched with coordinator-computed annotations:

Category           Fields
Core               record_id, challenge_id, challenge_seed, challenge_domain, miner_id, model_version
Verification       pass, acceptance_path, constraint_results, constraints_passed, constraints_failed
Submission         artifact (verbatim), reasoning_trace (enriched with provenance), model
Trace Quality      total_steps, verified_steps, citation_match_rate, reasoning_trace_quality_score
Spatial Summary    paragraphs_touched, unique_paragraphs_count, paragraph_span, extraction_order_correlation
Reasoning Depth    composite score across paragraph coverage, non-monotonic access, reasoning/compute ratios
Error Annotation   trap-chain divergence details, wrong vs. correct values used, downstream constraint impact
Retry Metadata     attempt_index, constraint_flip_summary, time_since_previous_attempt_ms

Trace Enrichment

Each extract_fact step in the reasoning trace is enriched with coordinator provenance:

Field                   Description
paragraph_index         1-indexed paragraph where the fact was found
document_position_pct   Position in the document (0.0–1.0)
char_start / char_end   Character offsets in the full document
semantic_zone           Classification of the paragraph's content role
quote_match             How the citation was verified (exact match, value-anchored, entity-anchored, unverified)
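As an illustration, an enriched step might look like the following. The surrounding step structure (`type`, `fact`, `citation`) is an assumption for this sketch; only the provenance fields are documented above.

```python
# Hypothetical enriched extract_fact step. The fact/citation values and the
# step envelope are illustrative; the coordinator-added provenance fields
# match the table above.
enriched_step = {
    "type": "extract_fact",
    "fact": "Q3 revenue was $4.2M",
    "citation": "revenue for the third quarter came to $4.2 million",
    # Coordinator-computed provenance:
    "paragraph_index": 7,            # 1-indexed paragraph containing the fact
    "document_position_pct": 0.42,   # relative position in the document (0.0-1.0)
    "char_start": 3120,
    "char_end": 3189,
    "semantic_zone": "financial_results",
    "quote_match": "exact",          # exact | value-anchored | entity-anchored | unverified
}
```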

Per-Session Records (Multi-Pass)

When a challenge session completes (whether it passes or expires), all attempts are assembled into a single session record with:

Component                Description
Answer trajectories      How question answers changed across attempts
Constraint trajectories  Which constraints flipped between pass/fail across attempts
Behavioral signals       Convergence patterns, citation improvement arcs, regressions
Transition annotations   Per-attempt deltas (what changed from the previous attempt)
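A minimal sketch of session assembly, assuming each attempt record carries `attempt_index`, `submitted_answers`, and `constraint_results` (field names taken from the tables above; the record shape itself is an assumption):

```python
# Derive answer trajectories and per-transition constraint flips from a
# list of attempt records. Illustrative only; the real assembler also
# computes behavioral signals and richer transition annotations.
def assemble_session(attempts):
    attempts = sorted(attempts, key=lambda a: a["attempt_index"])
    answer_trajectories = {}
    for a in attempts:
        for qid, ans in a["submitted_answers"].items():
            answer_trajectories.setdefault(qid, []).append(ans)
    constraint_flips = []
    for prev, curr in zip(attempts, attempts[1:]):
        flipped = {
            c for c in curr["constraint_results"]
            if curr["constraint_results"][c] != prev["constraint_results"].get(c)
        }
        constraint_flips.append(sorted(flipped))
    return {
        "answer_trajectories": answer_trajectories,
        "constraint_flips": constraint_flips,
    }
```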

Revision Pairs

The pipeline generates training-ready preference pairs from multi-attempt sessions:

  • Sequential pairs — Adjacent attempts where the later attempt is strictly better
  • Bookend pairs — First vs. final attempt when overall improvement exists

Each pair includes full attempt payloads, quality scores, and pair-level annotations (constraint deltas, trace quality deltas).
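The two pair types can be sketched as follows. "Strictly better" is simplified here to a higher quality score with no more failing constraints; the coordinator's actual criteria and field names are assumptions in this sketch.

```python
# Generate sequential and bookend preference pairs from ordered attempts.
# Each attempt dict is assumed to carry attempt_index, quality_score, and
# a constraints_failed count.
def revision_pairs(attempts):
    attempts = sorted(attempts, key=lambda a: a["attempt_index"])

    def better(earlier, later):
        # Simplified "strictly better": score improved, failures did not grow.
        return (later["quality_score"] > earlier["quality_score"]
                and later["constraints_failed"] <= earlier["constraints_failed"])

    # Sequential pairs: adjacent attempts where the later one improved.
    sequential = [(a, b) for a, b in zip(attempts, attempts[1:]) if better(a, b)]

    # Bookend pair: first vs. final attempt, only if overall improvement exists.
    bookend = []
    if len(attempts) >= 2 and better(attempts[0], attempts[-1]):
        bookend.append((attempts[0], attempts[-1]))
    return sequential, bookend
```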

Research-Ready Filtering

Not all submissions make it into the research-ready dataset. Records must pass quality gates:

  • Trace validation passes (structurally valid, not fabricated)
  • Citation match rate above minimum threshold
  • At least one extract and one compute step present
  • Programmatic behavior score below threshold (detects scripted traces)
  • Meaningful document engagement
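The gates above compose into a single predicate. The threshold values and the record field names below are illustrative assumptions, not the protocol's actual parameters:

```python
# Hedged sketch of the research-ready quality gate. Thresholds and field
# names are assumed for illustration.
MIN_CITATION_MATCH_RATE = 0.8   # assumed threshold
MAX_PROGRAMMATIC_SCORE = 0.5    # assumed threshold; higher = more script-like

def is_research_ready(record):
    trace = record["reasoning_trace"]
    steps = trace["steps"]
    has_extract = any(s["type"] == "extract_fact" for s in steps)
    has_compute = any(s["type"] == "compute" for s in steps)
    return (trace["valid"]                                            # structurally valid
            and record["citation_match_rate"] >= MIN_CITATION_MATCH_RATE
            and has_extract and has_compute
            and record["programmatic_score"] < MAX_PROGRAMMATIC_SCORE # not scripted
            and record["unique_paragraphs_count"] >= 2)               # engagement proxy, assumed
```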

Storage Namespace

dataset/v2/domains/{domain}/seeds/{seed}/
  ├── context/
  │   ├── challenge.json          # Shared challenge context (questions, constraints)
  │   └── trap_metadata.json      # Challenge configuration metadata
  ├── attempts/
  │   ├── all/{record_id}.json    # All attempts
  │   └── research-ready/{record_id}.json  # Quality-filtered
  ├── sessions/
  │   ├── all/{challenge_id}.json
  │   └── research-ready/{challenge_id}.json
  └── pairs/
      └── session/
          ├── sequential/{pair_id}.json
          ├── sequential/research-ready/{pair_id}.json
          ├── bookend/{pair_id}.json
          └── bookend/research-ready/{pair_id}.json
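A small helper mirroring the attempt portion of this layout (purely illustrative; the exporter's actual key builder is not shown here):

```python
# Build the S3 key for an attempt record, following the namespace above.
def attempt_key(domain, seed, record_id, research_ready=False):
    tier = "research-ready" if research_ready else "all"
    return (f"dataset/v2/domains/{domain}/seeds/{seed}"
            f"/attempts/{tier}/{record_id}.json")
```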

HuggingFace Export

The dataset is exported to structured JSONL format organized by category:

Category                           Description
raw_attempts                       Individual attempts with full context, trace, and quality metrics
session_trajectories               Complete multi-attempt sessions
process_sft_revision_chain         Multi-attempt chains with transitions for process-supervision fine-tuning
session_revision_pairs_sequential  Adjacent attempt pairs (rejected vs. chosen)
session_revision_pairs_bookend     First vs. last attempt pairs

Each export row includes a structured response with:

  • think — Reasoning trace rendered as prose
  • artifact — The constrained generation output
  • submitted_answers — Extracted question answers
  • trace_quality — Quality metrics

Splits are deterministic: a hash of challengeId assigns each record to train (~90%), validation (~5%), or test (~5%), so the same challenge always lands in the same split.
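One way to implement such a split (the exact hash function and bucketing the exporter uses are assumptions here):

```python
import hashlib

# Deterministically assign a challenge to a split by hashing its ID into
# one of 100 buckets: 0-89 train, 90-94 validation, 95-99 test.
def split_for(challenge_id: str) -> str:
    digest = hashlib.sha256(challenge_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "validation"
    return "test"
```

Because the assignment depends only on the ID, re-exports never move a challenge between splits.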

Durability

  • SQLite WAL mode with synchronous=FULL — submissions survive process crashes
  • Retry with exponential backoff — up to 20 attempts before dead-letter
  • Lock-based batch processing — prevents duplicate uploads
  • Seed context deduplication — shared challenge context is written once per seed/domain
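The first two durability properties can be sketched with stdlib SQLite. The schema, the backoff parameters, and the 300-second cap are illustrative assumptions; only WAL mode, synchronous=FULL, and the 20-attempt dead-letter cap come from the list above:

```python
import sqlite3

MAX_ATTEMPTS = 20  # dead-letter after 20 attempts, per the list above

def open_queue(path=":memory:"):
    # WAL journaling plus synchronous=FULL: committed rows survive crashes.
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute("PRAGMA synchronous=FULL")
    db.execute("""CREATE TABLE IF NOT EXISTS upload_queue (
        id INTEGER PRIMARY KEY,
        payload TEXT,
        attempts INTEGER DEFAULT 0,
        dead INTEGER DEFAULT 0)""")
    return db

def backoff_seconds(attempt, base=1.0, cap=300.0):
    # Exponential backoff: 1s, 2s, 4s, ... capped (cap value assumed).
    return min(cap, base * 2 ** attempt)

def mark_failure(db, row_id):
    # Bump the retry counter; dead-letter once the cap is reached.
    db.execute("UPDATE upload_queue SET attempts = attempts + 1 WHERE id = ?",
               (row_id,))
    db.execute("UPDATE upload_queue SET dead = 1 WHERE id = ? AND attempts >= ?",
               (row_id, MAX_ATTEMPTS))
    db.commit()
```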