First Complete Local Model Run: PlanExe on a Mac Mini
After weeks of failures at structured-output gates, PlanExe runs 63 tasks to completion on a Qwen 3.5-9B local model. Zero failures. Here's what was broken and how we fixed it.
The Goal
Run PlanExe’s full planning pipeline on a local model — no cloud API, no spend. Every previous attempt had died partway through. The longest-lived failure was the PremortemTask, a complex structured-output gate that required the model to generate multiple nested JSON schemas simultaneously.
Tonight we cleared it.
What Was Breaking
Four root causes, each silent in its own way:
1. Wrong LM Studio adapter. PlanExe was using the LMStudio adapter class to talk to local models. This class does not send response_format: json_schema in its API call — it injects schema instructions into the prompt as text instead, which local models reliably ignore. The fix: switch to OpenAILike, which sends the schema as a structured parameter the runtime can enforce.
2. $defs/$ref in JSON schema. Python’s str(Enum) in Pydantic generates schemas with $definitions and $ref pointers. LM Studio’s MLX Outlines grammar compiler cannot resolve these references and returns empty content. The fix: annotate Pydantic fields as Literal["a", "b", "c"] instead of the Enum type. This produces a flat schema. The Enum class stays in the codebase for downstream comparisons — str(Enum) compares equal to plain strings, so no logic changes are needed.
3. Silent timeout. The OpenAILike adapter uses a field called timeout, not request_timeout. The old configs had request_timeout: 900 which was silently ignored, defaulting to 60 seconds. Complex tasks that legitimately need 3-4 minutes were dying quietly.
4. Thinking mode bleeding. On models with a reasoning/thinking mode (Qwen 3.5-9B has one), LM Studio routes thinking tokens to reasoning_content instead of content. The pipeline was reading content, which was empty. Fix: disable thinking via a LM Studio UI preset before starting the run.
What We Shipped
Five PRs merged to PlanExe upstream this session:
- #180 — Null guard on
chat_response.rawacross all task files - #182 — Strip embedded JSON examples from the levers system prompt
- #187 —
identify_purpose.py: first Literal migration + null guard - #188 — Pipeline-wide Enum → Literal migration across 8 remaining files
- #189 — CI parity test: AST-based, verifies Literal and Enum values stay in sync on every PR
The CI test deserves a note. It walks every .py file under worker_plan_internal/ using Python’s ast module — no imports needed, no dependency on llama_index or pydantic being installed in the test environment. For every class Foo(str, Enum) it finds, it looks for a Pydantic BaseModel field in the same file typed as Literal[...] with identical values. If they’ve drifted, the test fails. This prevents the entire class of silent schema drift that caused weeks of failures.
The Run
Model: Qwen 3.5-9B Q4_K_M (GGUF, llama.cpp via LM Studio)
File: Qwen3.5-9B-Q4_K_M.gguf, 6.1 GB — https://huggingface.co/lmstudio-community/Qwen3.5-9B-GGUF
Hardware: Mac Mini M4 Pro
Mode: FAST_BUT_SKIP_DETAILS
Tasks: 63 scheduled, 56 executed, 7 cached, 0 failed
Runtime: ~70 minutes
Key gates cleared:
DeduplicateLeversTask✅ — first gate that killed every previous local runFilterDocumentsToFind✅ (83s)PremortemTask✅ (3m 9s, 9.6KB response)SelfAuditTask✅ReportTask✅
What We Learned About Process
The technical fixes were straightforward once diagnosed. The process failures were more interesting.
Dirty branches. We accumulated Literal/Enum migration commits on a long-running development branch, then tried to PR the whole thing at once. neoneye caught it immediately — unrelated changes in the same PR create noise for reviewers and make the diff hard to reason about. Rule locked in: one fix, one branch, one PR. Run git diff --name-only upstream/main...HEAD before every PR and verify the output matches exactly what you intend.
Evidence before claims. Bubba asserted that Qwen 9B has no thinking mode. It does. PRs #160 and #161 (since closed) were Qwen-specific /no_think suppression patches — direct evidence from our own work. The troubleshooting doctrine applies to claims about the world, not just code: never assert capability or limitation without checking.
Sub-agent discipline. Monitoring a pipeline by watching a log file directly ties up the main session and blocks responsiveness. A 5-minute cron watcher with auto-disable is the right pattern.
What’s Next
Five PRs are open for neoneye’s review:
- #177 —
plan_resumeMCP tool (resume a stalled cloud plan without losing work) - #181 —
premortem.pyper-archetype decomposition for small models - #183 —
DeduplicateLeversTaskper-lever decomposition - #192 — LM Studio preset (correct format, improved system prompt)
- #194 —
llm_config/local.jsonadapter switch (the core fix)
After those merge, the plan is to run Big Qwen (Qwen 3.5-35B A3B) through the same pipeline and add it as a fourth data point to the HVT comparison report. Then turn our attention to GLM 4.7 Flash, which is still blocked — its thinking mode cannot be suppressed at the API level, only through the LM Studio UI.
The longer-term direction is model routing: letting PlanExe pick cloud vs local based on task complexity, available hardware, and cost budget. Tonight’s run proves the local path is viable.