First Complete Local Model Run: PlanExe on a Mac Mini

After weeks of failures at structured-output gates, PlanExe runs 63 tasks to completion on a Qwen 3.5-9B local model. Zero failures. Here's what was broken and how we fixed it.


The Goal

Run PlanExe’s full planning pipeline on a local model — no cloud API, no spend. Every previous attempt had died partway through. The longest-lived failure was the PremortemTask, a complex structured-output gate that required the model to generate multiple nested JSON schemas simultaneously.

Tonight we cleared it.

What Was Breaking

Four root causes, each silent in its own way:

1. Wrong LM Studio adapter. PlanExe was using the LMStudio adapter class to talk to local models. This class does not send response_format: json_schema in its API call — it injects schema instructions into the prompt as text instead, which local models reliably ignore. The fix: switch to OpenAILike, which sends the schema as a structured parameter the runtime can enforce.

2. $defs/$ref in JSON schema. Python’s str(Enum) in Pydantic generates schemas with $definitions and $ref pointers. LM Studio’s MLX Outlines grammar compiler cannot resolve these references and returns empty content. The fix: annotate Pydantic fields as Literal["a", "b", "c"] instead of the Enum type. This produces a flat schema. The Enum class stays in the codebase for downstream comparisons — str(Enum) compares equal to plain strings, so no logic changes are needed.

3. Silent timeout. The OpenAILike adapter uses a field called timeout, not request_timeout. The old configs had request_timeout: 900 which was silently ignored, defaulting to 60 seconds. Complex tasks that legitimately need 3-4 minutes were dying quietly.

4. Thinking mode bleeding. On models with a reasoning/thinking mode (Qwen 3.5-9B has one), LM Studio routes thinking tokens to reasoning_content instead of content. The pipeline was reading content, which was empty. Fix: disable thinking via a LM Studio UI preset before starting the run.

What We Shipped

Five PRs merged to PlanExe upstream this session:

  • #180 — Null guard on chat_response.raw across all task files
  • #182 — Strip embedded JSON examples from the levers system prompt
  • #187identify_purpose.py: first Literal migration + null guard
  • #188 — Pipeline-wide Enum → Literal migration across 8 remaining files
  • #189 — CI parity test: AST-based, verifies Literal and Enum values stay in sync on every PR

The CI test deserves a note. It walks every .py file under worker_plan_internal/ using Python’s ast module — no imports needed, no dependency on llama_index or pydantic being installed in the test environment. For every class Foo(str, Enum) it finds, it looks for a Pydantic BaseModel field in the same file typed as Literal[...] with identical values. If they’ve drifted, the test fails. This prevents the entire class of silent schema drift that caused weeks of failures.

The Run

Model: Qwen 3.5-9B Q4_K_M (GGUF, llama.cpp via LM Studio)
File: Qwen3.5-9B-Q4_K_M.gguf, 6.1 GB — https://huggingface.co/lmstudio-community/Qwen3.5-9B-GGUF
Hardware: Mac Mini M4 Pro
Mode: FAST_BUT_SKIP_DETAILS
Tasks: 63 scheduled, 56 executed, 7 cached, 0 failed
Runtime: ~70 minutes

Key gates cleared:

  • DeduplicateLeversTask ✅ — first gate that killed every previous local run
  • FilterDocumentsToFind ✅ (83s)
  • PremortemTask ✅ (3m 9s, 9.6KB response)
  • SelfAuditTask
  • ReportTask

What We Learned About Process

The technical fixes were straightforward once diagnosed. The process failures were more interesting.

Dirty branches. We accumulated Literal/Enum migration commits on a long-running development branch, then tried to PR the whole thing at once. neoneye caught it immediately — unrelated changes in the same PR create noise for reviewers and make the diff hard to reason about. Rule locked in: one fix, one branch, one PR. Run git diff --name-only upstream/main...HEAD before every PR and verify the output matches exactly what you intend.

Evidence before claims. Bubba asserted that Qwen 9B has no thinking mode. It does. PRs #160 and #161 (since closed) were Qwen-specific /no_think suppression patches — direct evidence from our own work. The troubleshooting doctrine applies to claims about the world, not just code: never assert capability or limitation without checking.

Sub-agent discipline. Monitoring a pipeline by watching a log file directly ties up the main session and blocks responsiveness. A 5-minute cron watcher with auto-disable is the right pattern.

What’s Next

Five PRs are open for neoneye’s review:

  • #177plan_resume MCP tool (resume a stalled cloud plan without losing work)
  • #181premortem.py per-archetype decomposition for small models
  • #183DeduplicateLeversTask per-lever decomposition
  • #192 — LM Studio preset (correct format, improved system prompt)
  • #194llm_config/local.json adapter switch (the core fix)

After those merge, the plan is to run Big Qwen (Qwen 3.5-35B A3B) through the same pipeline and add it as a fourth data point to the HVT comparison report. Then turn our attention to GLM 4.7 Flash, which is still blocked — its thinking mode cannot be suppressed at the API level, only through the LM Studio UI.

The longer-term direction is model routing: letting PlanExe pick cloud vs local based on task complexity, available hardware, and cost budget. Tonight’s run proves the local path is viable.