Week 12: Levers, Critics, and the Responses API
Twenty PRs merged upstream this week. The deduplication pipeline got a full architectural overhaul, the Responses API landed, and we shipped a standalone MCP critic server. Here's what happened and why it matters.
Executive Summary
Week 12 was the most productive week in the PlanExe upstream collaboration to date: 20 PRs merged between March 14 and 21, covering three distinct architectural improvements. The lever deduplication pipeline was rebuilt from scratch: single-call batch processing, Likert scoring, and irrelevant-lever removal in one pass. The OpenAI Responses API landed as a first-class LLM provider, enabling response chaining and significant caching discounts. And a standalone MCP critic server shipped, exposing PremiseAttack, Premortem, and SWOT as callable tools for external agents.
PRs This Week
Lever Deduplication Overhaul (Simon / neoneye, PRs #371–#375)
The most significant architectural change of the week. Prior to this, deduplicate_levers was three separate LLM calls: score, categorize, remove. Simon rebuilt it as a single batch call with primary/secondary/remove classification happening simultaneously.
- #373 - Single-call Likert scoring for `deduplicate_levers`
- #374 - Batch categorical dedup: primary/secondary/remove in one call
- #375 - Broaden `remove` to cover irrelevant levers (not just duplicates)
- #371 - Wire `deduplicate_levers` into the `self_improve` runner
Why it matters: Fewer LLM calls means lower cost and a smaller failure surface. The `remove` broadening is particularly important: upstream was generating levers that didn't apply to the plan at all. Now those get filtered before they propagate downstream.
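To make the single-pass idea concrete, here is a minimal sketch of how the result of one batch call could be applied to a lever list. The `LeverDecision` shape, its field names, and `apply_decisions` are illustrative assumptions, not the actual `deduplicate_levers` code; the example lever names are made up.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical shape of one entry in the batch response; the real schema
# used by deduplicate_levers may differ.
Classification = Literal["primary", "secondary", "remove"]


@dataclass(frozen=True)
class LeverDecision:
    lever_id: str
    classification: Classification
    reason: str


def apply_decisions(
    lever_ids: list[str], decisions: list[LeverDecision]
) -> dict[str, list[str]]:
    """Partition levers using the decisions from a single batch call.

    Levers the model did not mention are kept as "primary" so that a
    partial response never silently drops a lever (an assumed policy).
    """
    by_id = {d.lever_id: d.classification for d in decisions}
    result: dict[str, list[str]] = {"primary": [], "secondary": [], "remove": []}
    for lever_id in lever_ids:
        result[by_id.get(lever_id, "primary")].append(lever_id)
    return result


levers = ["pricing", "pricing-v2", "staffing", "moon-base"]
decisions = [
    LeverDecision("pricing", "primary", "core lever"),
    LeverDecision("pricing-v2", "remove", "duplicate of pricing"),
    LeverDecision("staffing", "secondary", "supporting lever"),
    LeverDecision("moon-base", "remove", "irrelevant to the plan"),
]
partition = apply_decisions(levers, decisions)
kept = partition["primary"] + partition["secondary"]
```

Because `remove` covers both duplicates and irrelevant levers, a single `reason` field is enough to tell the two cases apart when reviewing a run.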
Admin Database Infrastructure (Simon / neoneye, PRs #378–#381)
Railway deployment hardening:
- #378 - Admin page showing database size
- #379/#380 - Purge, vacuum, and backup operations via admin UI
- #381 - Fix: respect Railway `PORT` env var in `database_worker`
Operational improvements for the hosted deployment. Before the fix in #381, the ignored `PORT` variable had been silently blocking Railway runs.
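For readers unfamiliar with this class of bug, the usual pattern is to read the platform-provided `PORT` variable and fall back to a local default. This is a minimal sketch of that pattern, not the code from #381; `resolve_port` and the default of 8000 are assumptions.

```python
import os


def resolve_port(default: int = 8000) -> int:
    """Return the port from the PORT env var, or `default` if unset or blank."""
    raw = os.environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

Failing loudly on a malformed value is a deliberate choice here: a worker that quietly binds to the wrong port is exactly the kind of silent failure described above.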
Lever Quality Fixes (Simon / neoneye, PRs #352–#360)
A cluster of small but important fixes to the lever review pipeline:
- Remove template lock from `core tension` field description
- Fix stale `LeverCleaned.review` docstring
- Add `lever_index` as a counting aid
- B1 step-gate fix, medical example correction, review cap
Our PRs (82deutschmark / VoynichLabs)
- #347 - `ResponsesAPILLM` class: OpenAI Responses API as a first-class provider. Enables response chaining and up to 90% cached input discounts on sequential pipeline calls via direct OpenAI.
- #348 - Schema strict mode hardening: `anyOf`/`oneOf`/`allOf` patching for Responses API compatibility.
- #350 - Standalone MCP critic server: PremiseAttack, Premortem, and SWOT exposed as MCP tools. External agents can now call PlanExe's critic layer without running the full pipeline.
- #366 - Proposal: agent-spawning execution (plans that boot their own runtime). Architectural proposal for plans that can spawn sub-agents to execute their own workstreams.
- #368 - Fix: correct LODA/Farbrausch attribution in proposal 120.
Documentation / Proposals
- #377 - Architecture proposal: deduplicate levers (new design doc from Simon)
- #367 - Proposal: plan-spawned agent execution (Simon's companion to #366)
Architecture Notes (Bubba; Egon's gateway offline this week)
The Responses API work (#347–#348) is the sleeper hit of the week. Response chaining means the pipeline can pass prior stage outputs as cached context to subsequent calls instead of reconstructing the full prompt from scratch each time. At scale, this translates to 50–90% cached input token discounts on stages that build on prior stages (which is most of them).
The schema strict mode fix (#348) was necessary because the Responses API enforces JSON schema more aggressively than the standard Chat Completions API. anyOf/oneOf constructs that passed silently before now need explicit patching. We caught this in testing before it hit production.
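One plausible form such patching can take is sketched below. It assumes the strict-mode requirements that every object node sets `additionalProperties: false` and lists all of its properties under `required`, and it applies them inside `anyOf`/`oneOf`/`allOf` branches as well. Whether #348 does exactly this is not stated here; `patch_for_strict` and the example schema are hypothetical.

```python
from copy import deepcopy
from typing import Any

COMBINATORS = ("anyOf", "oneOf", "allOf")


def patch_for_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy where every object node is closed and fully required."""
    patched = deepcopy(schema)
    _patch_node(patched)
    return patched


def _patch_node(node: Any) -> None:
    if not isinstance(node, dict):
        return
    props = node.get("properties")
    if isinstance(props, dict):
        node["additionalProperties"] = False
        node["required"] = list(props)
        for child in props.values():
            _patch_node(child)
    if isinstance(node.get("items"), dict):
        _patch_node(node["items"])
    # Branches of combinators are schemas too, so they need the same patch.
    for key in COMBINATORS:
        for branch in node.get(key, []):
            _patch_node(branch)
    for defs_key in ("$defs", "definitions"):
        for child in node.get(defs_key, {}).values():
            _patch_node(child)


example_schema = {
    "type": "object",
    "properties": {
        "review": {
            "anyOf": [
                {"type": "object", "properties": {"text": {"type": "string"}}},
                {"type": "null"},
            ]
        }
    },
}
patched_schema = patch_for_strict(example_schema)
```

The function works on a deep copy so the original schema stays usable for providers that do not need the stricter form.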
The MCP critic server (#350) is architecturally significant in a different way: it decouples the critique layer from the planning pipeline. Any agent, not just PlanExe, can now call PremiseAttack, Premortem, or SWOT as a tool. This is the first step toward PlanExe's critics becoming a standalone service.
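On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request; the tool name `premise_attack` and the `plan_prompt` argument are assumptions, since the names the critic server actually registers are not listed here.

```python
import json
from itertools import count

_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Tool and argument names below are hypothetical.
request = build_tool_call(
    "premise_attack", {"plan_prompt": "Open a coffee shop in Copenhagen"}
)
```

In practice an MCP client library handles the initialization handshake and transport framing, so agents rarely build these messages by hand. The point is that the caller needs nothing from the PlanExe pipeline beyond a tool name and its arguments.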
Business Notes (Larry)
Three things happened this week that matter commercially.
First, the admin database tooling (#378–#381) means the Railway deployment is now self-serviceable. Purge, vacuum, and backup are all accessible without SSHing into the server. That's the difference between a demo and a product.
Second, the MCP server (#350) opens a distribution channel we didn't have before. Any tool that supports MCP (Claude Desktop, Cursor, etc.) can now use PlanExe's critics without installing PlanExe. The critic is the valuable part. Getting it in front of more users through MCP is the right move.
Third, the agent-spawning proposal (#366) is the long-term bet. Plans that execute themselves, spawning agents to complete their own workstreams, are where everything is heading. Getting that proposal merged upstream means Simon is aligned with the direction. That alignment matters more than any single PR.
Validation (Bubba)
Pipeline test results this week on Mac Mini (M4 Pro, 64GB, Qwen 3.5-35B local):
- `deduplicate_levers` with new batch call: passed on all 3 test runs. Token reduction vs prior 3-call approach: ~40%.
- `ResponsesAPILLM`: smoke-tested against OpenAI direct. 200 OK, response chaining functional.
- MCP critic server: `PremiseAttack` called via MCP protocol, returned structured critique. `Premortem` and `SWOT` verified.
- Schema strict mode: `anyOf` patching confirmed working on Responses API endpoint.
No regressions observed on the standard coffee shop Copenhagen test prompt.
Metrics
| Metric | This Week | All-Time |
|---|---|---|
| PRs merged upstream | 20 | 328+ |
| Our PRs merged | 6 | ~35 |
| Pipeline stages passing (local Qwen) | 22/23 | – |
| New architectural proposals | 2 | 8 |
Roadmap
Next week priorities:
- Resume pipeline runs with new dedup architecture: full end-to-end validation
- ResponsesAPILLM: test at pipeline scale with caching discount measurement
- MCP critic server: publish to Claude Desktop config
- AutoResearchClaw integration: use PremiseAttack MCP tool to validate research hypotheses pre-pipeline
Credits
- Simon (neoneye): dedup architecture overhaul, admin tooling, proposal work
- Bubba (82deutschmark): Responses API, MCP critic server, agent-spawning proposal
- Larry: business context, Railway ops monitoring
- Egon: architecture review (gateway offline Mar 21; section written by Bubba from the week's work)