2026-05-02 · 7 min read · ai, dspy, engineering

Inside the Scout-Draft-Validator pipeline

How Mentat's DSPy agent chain turns a one-line idea into a structured market specification — and what we learned tuning it.

A common reaction the first time someone sees Mentat’s Creator Studio is: “wait, the AI is just writing the whole market.” That is half right and worth unpacking carefully, because the way the pipeline composes — Scout, Draft, Validator, with the curator loop on top — is doing real work and the choice of DSPy as the orchestration framework matters.

This post walks through how the three agents compose, what each one actually does, why we wrote it in DSPy instead of raw API calls, and the empirical lessons from tuning the pipeline against the M2 dataset.

The pipeline at a high level

user idea (one line)
     │
     ▼
┌────────┐       ┌────────┐       ┌────────────┐
│ Scout  │──────▶│ Draft  │──────▶│ Validator  │──┐
└────────┘       └────────┘       └────────────┘  │
                     ▲                            │ fails?
                     └──── refine ◀───────────────┘
                           max 2 retries
                           else escalate
                                 │
                                 ▼
                       human curator review

The user types a one-line idea. Scout surfaces structured candidates. Draft produces a full market spec. Validator scores the spec. If validation fails, Draft refines based on Validator feedback for up to two iterations. If validation still fails, the draft escalates to a human curator with the failure notes. If validation passes, the draft enters the curator queue with a green badge.

The whole loop runs in approximately fifteen seconds on average for OpenAI GPT-4-class backends.

Scout: candidate generation

Scout’s job is to take a vague creator prompt — “I want a market about the upcoming SpaceX Starship test” — and produce 3-5 candidate market concepts with metadata: topic tags, suggested timeframe, confidence score, source candidates.

Why not skip Scout and go straight to Draft? Because the user input is usually ambiguous. “A market about the SpaceX Starship test” could mean:

  • Will Starship reach orbit on a specific flight number?
  • Will the booster soft-land successfully?
  • Will the test happen before a specific date?
  • Will a specific milestone be achieved?

Each of these is a different market with different sources, different timeframes, different fee profiles. Scout surfaces the options and the creator picks. This UX detail — exposing the ambiguity instead of papering over it — was one of the biggest creator satisfaction lifts in M2.

The Scout agent is a DSPy program that uses retrieval against a topic taxonomy plus a structured prompt that demands JSON output. The retrieval surface includes our internal taxonomy plus optional connectors (news APIs, social trending feeds) that the Scout can use to suggest concepts grounded in recent events.

A typical Scout output, given the Starship prompt:

{
  "candidates": [
    {
      "concept": "Starship Flight 12 reaches orbit before 2026-09-30",
      "topic_tags": ["space", "spacex", "milestones"],
      "timeframe_days": 120,
      "confidence": 0.78,
      "source_candidates": ["spacex.com/launches", "nasaspaceflight.com/feed"]
    },
    {
      "concept": "Super Heavy booster soft-lands on Flight 12",
      "topic_tags": ["space", "spacex", "engineering"],
      "timeframe_days": 120,
      "confidence": 0.71,
      "source_candidates": ["spacex.com/launches", "wikipedia.org/wiki/SpaceX_Starship"]
    }
  ]
}

The creator picks a candidate. That candidate becomes input to Draft.
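
On the application side, one way to consume that output is to parse it into typed records before the creator sees it. The sketch below uses plain dataclasses to stay dependency-free; the names `ScoutCandidate` and `parse_scout_output` are illustrative and not taken from the Mentat codebase, and sorting by confidence is an assumption about how candidates might be presented.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoutCandidate:
    """One candidate market concept, mirroring the JSON fields shown above."""
    concept: str
    topic_tags: tuple[str, ...]
    timeframe_days: int
    confidence: float
    source_candidates: tuple[str, ...]


def parse_scout_output(raw: str) -> list[ScoutCandidate]:
    """Parse Scout's JSON and return candidates, highest confidence first.

    Raises ValueError on malformed output so bad JSON never reaches the UI.
    """
    try:
        payload = json.loads(raw)
        candidates = [
            ScoutCandidate(
                concept=str(item["concept"]),
                topic_tags=tuple(item["topic_tags"]),
                timeframe_days=int(item["timeframe_days"]),
                confidence=float(item["confidence"]),
                source_candidates=tuple(item["source_candidates"]),
            )
            for item in payload["candidates"]
        ]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"malformed Scout output: {exc}") from exc

    for c in candidates:
        if not 0.0 <= c.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {c.confidence}")
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)
```

Failing with a single exception type keeps the calling code simple: either the creator gets a ranked list or the request is retried.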

Draft: full spec generation

Draft is the heavy lifter. Given a candidate concept and the market schema, it produces structured JSON conforming to the canonical Mentat market spec: question text, AI rationale, outcomes, resolution criteria, source list, trigger condition, timestamp window, fallback logic, invalidation clause, economic parameters, discovery summary.

The DSPy module composition for Draft looks roughly like:

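import dspy


class DraftAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        # Pulls past successful markets from the configured retriever
        # (set via dspy.configure(rm=...)) to use as few-shot examples.
        self.retrieve_examples = dspy.Retrieve(k=4)
        self.compose = dspy.ChainOfThought(
            "concept, schema, examples -> market_spec_json"
        )

    def forward(self, concept, schema):
        # dspy.Retrieve expects a query string, so join the topic tags.
        query = " ".join(concept.topic_tags)
        # The retriever returns a Prediction; the text lives in .passages.
        examples = self.retrieve_examples(query).passages
        result = self.compose(
            concept=concept,
            schema=schema,
            examples=examples,
        )
        return result.market_spec_json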

The retrieval against past successful markets is doing important work. The Validator catches obvious schema violations, but the examples retrieval is what makes the Draft output stylistically consistent with markets that previously cleared curation. Few-shot examples are dynamic per-topic instead of static.

The Draft prompt enforces JSON via response_format on supported models and falls back to JSON-mode parsing on others. The system prompt is explicit: every question text must reference a specific timeframe in UTC; every source must include a domain, an endpoint type, and a one-line justification for why it is zkTLS-verifiable; and fallback_logic must be present, even if its value is explicitly “none.”

These constraints exist because we measured — over the M2 buildout — exactly where AI drafts fail validation. The two most common failure modes were (1) implicit timeframes (“by year-end” without specifying a year), and (2) source lists without justification (just “Reuters”). Forcing both in the prompt schema cut Validator failure rate by roughly 40% in our internal sample.
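
The same two failure modes can also be caught mechanically before spending a Validator call. The following is a minimal sketch of such a pre-check, not something the post says Mentat runs. The field names `question_text`, `domain`, `endpoint_type`, and `justification` are assumptions for illustration; only `source_list` and `fallback_logic` appear in the post.

```python
import re

# Hypothetical field names; the post only confirms `source_list`,
# `trigger_condition` and `fallback_logic`.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def precheck_draft(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means the cheap checks pass."""
    problems = []

    question = spec.get("question_text", "")
    # Failure mode 1: implicit timeframes such as "by year-end".
    if not ISO_DATE.search(question) or "UTC" not in question:
        problems.append("question_text needs an explicit date and a UTC marker")

    # Failure mode 2: sources listed without a justification (just "Reuters").
    for i, source in enumerate(spec.get("source_list", [])):
        missing = [
            key for key in ("domain", "endpoint_type", "justification")
            if not (isinstance(source, dict) and str(source.get(key, "")).strip())
        ]
        if missing:
            problems.append(f"source_list[{i}] is missing: {', '.join(missing)}")

    # fallback_logic must be present, even if its value is the string "none".
    if "fallback_logic" not in spec:
        problems.append("fallback_logic must be present (use 'none' explicitly)")

    return problems
```

A check like this is deliberately shallow. It cannot judge whether a justification is any good, only that one was written.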

Validator: rubric-based critique

Validator is the gate. It receives a draft and emits a structured report:

{
  "pass": false,
  "scores": {
    "clarity": 0.82,
    "verifiability": 0.45,
    "policy_compliance": 1.0
  },
  "issues": [
    {
      "field": "trigger_condition",
      "severity": "block",
      "message": "Trigger references 'official announcement' without specifying which channel; ambiguous."
    },
    {
      "field": "source_list[1]",
      "severity": "warn",
      "message": "twitter.com is in the allowlist but rate-limited; recommend adding redundant source."
    }
  ],
  "required_fixes": [
    "Specify trigger channel: 'spacex.com/launches' or 'twitter.com/SpaceX official account'."
  ]
}

Validator’s rubric covers:

  • Clarity. Question text contains explicit timeframe, measurable condition, non-overlapping outcomes.
  • Verifiability. Trigger condition is deterministic against a source in the allowlist; fallback logic is explicit.
  • Policy compliance. Market does not fall into blocked categories (self-harm, violence, hate speech, illegal activity).

Block-severity issues prevent passage; warn-severity issues annotate the draft for curator attention but do not block. A draft with all block-severity issues resolved and an aggregate score above 0.7 passes Validator and enters the curator queue.
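
As a rough illustration, the pass rule can be written as a small function over the report structure shown above. The post does not say how the aggregate score is computed, so the unweighted mean here is an assumption, and the function name `validator_passes` is hypothetical.

```python
def validator_passes(report: dict, threshold: float = 0.7) -> bool:
    """Apply the pass rule: no block-severity issues and aggregate above threshold."""
    has_block = any(
        issue.get("severity") == "block" for issue in report.get("issues", [])
    )
    scores = report.get("scores", {})
    if has_block or not scores:
        return False
    # Assumption: unweighted mean. The real rubric may weight dimensions.
    aggregate = sum(scores.values()) / len(scores)
    return aggregate > threshold
```

Under that assumption, the sample report above has a mean of about 0.76, so it is the block-severity issue on trigger_condition, not the score, that fails it.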

When Validator fails, Draft is invoked again with the issues and required_fixes as additional context, for up to two refinement iterations. If the second refinement still fails, the draft escalates to a curator with a “needs human attention” badge and all three sets of reports (Scout, Draft x N, Validator x N) attached.
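
The refinement loop is simple enough to sketch with plain callables. This is an illustration of the control flow only: `draft` and `validate` stand in for the real agents, and the function name and return shape are assumptions.

```python
from typing import Callable, Optional

MAX_RETRIES = 2  # refinement cap described in the post


def draft_with_refinement(
    concept: dict,
    draft: Callable[[dict, Optional[dict]], dict],
    validate: Callable[[dict], dict],
    max_retries: int = MAX_RETRIES,
) -> dict:
    """Run Draft, then Validator, refining on failure up to `max_retries` times."""
    draft_reports, validator_reports = [], []
    feedback = None  # the first Draft call has no Validator feedback

    for _attempt in range(1 + max_retries):
        spec = draft(concept, feedback)
        report = validate(spec)
        draft_reports.append(spec)
        validator_reports.append(report)
        if report["pass"]:
            status = "passed"
            break
        # Feed only the actionable parts of the report back into Draft.
        feedback = {
            "issues": report.get("issues", []),
            "required_fixes": report.get("required_fixes", []),
        }
    else:
        status = "escalated"  # every attempt failed: needs human attention

    return {
        "status": status,
        "spec": spec,
        "draft_reports": draft_reports,
        "validator_reports": validator_reports,
    }
```

With the default cap, Draft runs at most three times: the initial attempt plus two refinements. Every intermediate report is kept so the curator can see what changed between attempts.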

Why DSPy

We started M2 with raw OpenAI API calls. We switched to DSPy halfway through and the rewrite paid for itself in two weeks.

The case for DSPy in this kind of pipeline:

Programs, not prompts. A DSPy module is a class built around typed signatures. The prompt is generated from the signature, not handwritten. When you want to swap the underlying model, you change one config line and the prompts are regenerated from the same signatures; re-running the optimizer then adapts them to the new model’s idiosyncrasies. We swap between GPT-4-class and Claude depending on category profile without rewriting prompts.

Optimizers. DSPy ships optimizers (BootstrapFewShot, MIPRO) that can tune few-shot example selection and prompt phrasing against a labeled dataset. We use this to recalibrate Draft against the growing dataset of curator-approved markets every two weeks. The improvement is measurable: aggregate validation pass rate went from 58% on initial deployment to 79% over the M2 buildout.

Composition. Scout, Draft, Validator are each dspy.Module subclasses. The pipeline is a parent module that calls them in sequence. Logging, tracing, and error handling are uniform across the pipeline because they live at the parent module level.
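
The idea of keeping logging and error handling at the parent level can be shown without DSPy at all. In the sketch below, stages are plain callables standing in for the dspy.Module subclasses; `Pipeline` and `StageError` are hypothetical names, and the real parent module presumably does more, such as tracing.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("pipeline")


class StageError(RuntimeError):
    """Wraps any stage failure with the name of the stage that raised it."""


class Pipeline:
    """Calls stages in sequence; logging and error handling live here only."""

    def __init__(self, stages: list[tuple[str, Callable[[Any], Any]]]):
        self.stages = stages

    def __call__(self, value: Any) -> Any:
        for name, stage in self.stages:
            started = time.perf_counter()
            try:
                # Each stage receives the previous stage's output.
                value = stage(value)
            except Exception as exc:
                logger.exception("stage %s failed", name)
                raise StageError(f"{name}: {exc}") from exc
            logger.info("stage %s ok in %.3fs", name, time.perf_counter() - started)
        return value
```

Because no stage logs or catches anything itself, adding a fourth agent means adding one tuple to the list rather than copying boilerplate.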

Type-safe outputs. Pairing DSPy signatures with Pydantic models for the output schema gives us runtime validation that the JSON actually conforms. Malformed AI outputs fail fast rather than poisoning downstream consumers.

The case against DSPy is real too: it is opinionated, the documentation occasionally lags features, and there is a learning curve. For a pipeline this complex the investment paid back fast. For simpler one-shot prompts it would be overkill.

What we learned tuning

A few takeaways from M2 that were not obvious upfront:

Validator rubric weights matter more than Draft prompt tweaks. We spent the first month iterating on Draft and got modest gains. The next two weeks were spent recalibrating Validator’s severity thresholds against a labeled dataset of curator decisions, and that produced larger improvements in curator satisfaction. The Validator is the load-bearing piece because it shapes what Draft is allowed to ship.

Refinement loops have diminishing returns past 2 iterations. We tested 1, 2, 3, and 5 iteration caps. Two is the sweet spot. Three rarely produces a passable draft that two could not; five wastes tokens and creator wait time without measurable benefit.

Confidence scoring helps Scout but hurts Validator. Scout candidates with confidence scores let the user pick well. Validator scores expose too much detail to curators and create review overhead. Validator now exposes pass/fail plus issues; raw scores are logged but not surfaced in the console.

Per-topic prompt examples are worth the engineering. A general few-shot set produces general output. Topic-tagged retrieval (politics, sports, crypto, science, economics) produces topic-appropriate output. The retrieval indexing is non-trivial but the quality lift is real.
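
To make the idea concrete, here is a toy version of topic-tagged retrieval that ranks stored examples by how many tags they share with the incoming concept. It is a stand-in for illustration: the class name `TopicExampleIndex` is hypothetical, and a production index would likely use a proper retriever and not tag overlap alone.

```python
from collections import defaultdict


class TopicExampleIndex:
    """Tiny in-memory index of approved markets, keyed by topic tag."""

    def __init__(self):
        self._by_tag = defaultdict(list)
        self._examples = []

    def add(self, example_id: str, topic_tags: list[str], spec: dict) -> None:
        position = len(self._examples)
        self._examples.append((example_id, frozenset(topic_tags), spec))
        for tag in topic_tags:
            self._by_tag[tag].append(position)

    def retrieve(self, topic_tags: list[str], k: int = 4) -> list[str]:
        """Return ids of up to k examples, ranked by number of shared tags."""
        wanted = set(topic_tags)
        candidates = {pos for tag in wanted for pos in self._by_tag.get(tag, [])}
        ranked = sorted(
            candidates,
            # More shared tags first; insertion order breaks ties.
            key=lambda pos: (-len(self._examples[pos][1] & wanted), pos),
        )
        return [self._examples[pos][0] for pos in ranked[:k]]
```

A concept tagged space, spacex, and milestones would pull other space markets ahead of anything from crypto or sports, which is the behaviour a static few-shot set cannot provide.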

Curator feedback loops decay quickly without instrumentation. We added explicit “request AI revision with notes” in the curator console. Curators use it heavily for the first week, then stop unless we surface aggregate metrics back to them. Showing curators their impact on Draft quality keeps the loop alive.

What is next

M3 adds two pieces. The first is the Reviewer agent, an adversarial pass that tries to identify exploitation vectors before curation. We have a prototype, but the false positive rate is high enough that we are still tuning before shipping. The second is the Summarizer agent for discovery surfaces, a smaller piece that ships with the trading UI.

M4 brings the proof side of the loop into the pipeline. When the Validator agent checks trigger verifiability today, it relies on a static allowlist plus heuristics. When the proof service is live, Validator can actually probe whether a candidate source has a known zkTLS recipe, which sharpens the verifiability score substantially.

The pipeline is the engine. Curators are the brake. Both matter, and the whole system fails if either side gets lazy.

Read the source.

All of this is implemented in the open at github.com/cryptuon/mentat.