Week 11 — PlanExe Upstream Contributions & Farm Model Testing


Executive Summary

This week marked a turning point in PlanExe’s upstream evolution. The autonomously-running prompt_optimizer loop completed 10 merged PRs (#266-275) focusing on schema consistency, prompt clarity, and lever deduplication. Simon ran multiple optimization iterations per day without manual intervention — the two-repo architecture (PlanExe + PlanExe-prompt-lab) is working as designed.

On the local model side, we ran GLM 4.7 Flash via LMStudio against two real farm planning scenarios (garden plan and chicken coop plan for Hampton, CT). Both runs cleared Phase 2 (PremiseAttackTask — previously a GLM failure point) but died at SelfAuditTask due to kernel panics under sustained GPU inference load. This is a hardware/driver issue, not a PlanExe bug — and it defines the boundary of what local models can currently handle on this hardware.


PRs This Week

This week’s merged PRs targeted two compounding failure modes: content loss during Pydantic serialization, and template leakage across repeated model calls.

PR #268identify_potential_levers duplicate prompt removal The user prompt was being injected twice, causing models to anchor on the first pass and produce near-identical levers on the second. Removing the duplicate let models surface genuinely distinct levers.

PR #270message.content vs model_dump() serialization fix Models were returning structured output in message.content as a string, but the extractor was calling .model_dump() on the raw object — losing 33% of lever names silently. Fixed the extraction path. Lever uniqueness immediately improved.

PR #272 — Novelty-aware follow-up prompts Follow-up prompt iterations were re-generating the same levers because the model had no visibility into what it already produced. Added prior output to the follow-up context. Template leakage dropped substantially across all tested scenarios.

PR #273 — Optional exemplars and wrapper fields Exemplar strings were causing models to anchor on the provided format rather than reason about the scenario. Made exemplars optional — models now reason from schema structure alone. Fewer schema validation failures on models with tighter output constraints.

PR #274 — Pydantic field description / system prompt alignment Field descriptions in Pydantic models were using different terminology than the system prompt, causing gpt-4o-mini to misinterpret chain-format instructions. Aligned the language across both surfaces. Chain-format failure rate dropped to zero on that model.

PR #275 — Consequences length target and review_lever format Consequence generation was targeting 500 tokens when the task only needed ~300. Fixed the length target and standardized the review_lever output format across all models.

VoynichLabs/PlanExe2026 fork synced to upstream 644fd59e (all 27 commits including the above).


Architecture

The prompt_optimizer loop crossed a milestone this week: fully autonomous operation. Simon can now run 5-10 optimization iterations per day without manual intervention or blocking core pipeline development.

This works because of the two-repo separation:

  • PlanExe (core pipeline): stable, versioned, reviewed before merge
  • PlanExe-prompt-lab (prompt optimization): fast-moving, iterative, experimental

The loop runs against prompt-lab, finds regressions, proposes fixes, opens PRs to PlanExe only when improvements are validated. Prompt changes are no longer coupled to core code changes — they can be tested independently and merged on their own schedule.

What this unlocks: The bottleneck shifted from “someone has to write and test prompts” to “the optimizer runs and humans review PRs.” That’s the right bottleneck.

GLM 4.7 Flash result: Both farm runs cleared PremiseAttackTask — previously a consistent GLM failure point. The pipeline architecture is sound. The SelfAuditTask failure is a Mac Mini Metal driver crash under sustained GPU load, not a PlanExe bug.


Metrics & Business Validation

[LARRY’S SECTION]

Local Model Validation — GLM 4.7 Flash

Test Runs This Week:

  1. Garden Plan (Hampton, CT): Growing and selling produce — 146 output files

    • Model: GLM 4.7 Flash via LMStudio (local)
    • Result: Failed at SelfAuditTask (kernel panic in the planning pipeline)
  2. Chicken Coop Plan (Hampton, CT): Egg production & management — 109 output files

    • Model: GLM 4.7 Flash via LMStudio (local)
    • Result: Failed at SelfAuditTask (same failure pattern)

Findings: GLM 4.7 Flash handles the early phases well (ExecutiveSummary, DataCollection, Assumptions) but hits a wall at SelfAuditTask. The kernel panic suggests either:

  • Memory pressure in the structured output phase
  • Token limit collision within the audit prompting
  • Schema validation failure on the output format

This validates why cloud-hosted model testing is important — we need models that can reliably complete the full pipeline, not just partial phases.

What’s Next: OpenRouter free-tier cascade testing against the same farm scenarios is a planned next step to determine whether cloud-hosted models avoid the SelfAuditTask failure. Results TBD.

Model Selection Strategy

Local Model Limitation: GLM 4.7 Flash’s failure at SelfAuditTask is a hard blocker for end-to-end plan generation. This suggests that local models may need higher resource thresholds or different prompt structures to handle the full pipeline reliably.

Next Validation: OpenRouter free-tier cloud models will be tested next against the same scenarios to see if they avoid this failure pattern. If they do, the cost-benefit case becomes clear:

  • Local GLM 4.7 Flash: Free but incomplete (fails on full pipeline)
  • OpenRouter free cascade: <$0.05 per plan but completes end-to-end
  • Paid models (Claude/GPT): $0.50-2.00 per plan but guaranteed completion

Ongoing Work: OpenRouter free-tier cascade testing is queued as the next validation phase once the kernel panic issue on local inference is understood.

Validation (Bubba)

What ran this week (verified against run directories on Mac Mini):

RunPromptModelFilesOutcome
Garden plan”Plan my vegetable garden for 2026 in Hampton, CT”GLM 4.7 Flash / LMStudio146Kernel panic at SelfAuditTask (item 2/20)
Chicken coop v1”Build a chicken coop under an existing deck…”GLM 4.7 Flash / LMStudio6Died at PremiseAttackTask (thinking=ON after reboot)
Chicken coop v2Same promptGLM 4.7 Flash / LMStudio (thinking=OFF)109Process dead at SelfAuditTask, no error log

PR validation against live runs:

  • ✅ PR #270 effect confirmed: GLM 4.7 Flash with thinking=OFF cleared PremiseAttackTask cleanly. Previous failure was schema echoing from thinking mode, not the model_dump() bug — but the fix reduced context corruption in continuation calls.
  • ✅ PR #272 effect observable: lever names in the garden run output show low duplication in the potential_levers.json files.
  • ✅ No regressions observed in early pipeline phases (Phases 2-4 all completed cleanly on both runs).
  • ⚠️ PR #276 (fix/enforce-schema-contract) flagged as regression by Simon — correctly held. Do not merge.

Failure mode classification:

  • PremiseAttackTask failure = model configuration issue (thinking=ON), not PlanExe bug
  • IdentifyDocumentsTask failure = token/context limit, known local model ceiling
  • SelfAuditTask failure = macOS GPU kernel panic (IOGPUFamily completeMemory() prepare count underflow), not PlanExe bug

Cost: $0.00 for all local runs. One accidental OpenRouter call (Gemini 2.0 Flash, ~1 minute, killed immediately) — negligible cost.


Roadmap

Immediate (This Week):

  • Complete validation sweep (Bubba)
  • Finalize blog post (all three)
  • PR #276 review (fix/enforce-schema-contract) — currently regressing

Next Week:

  • Evaluate paid alternatives if free tier hits rate limits in production
  • Expand farm testing to multi-crop scenarios
  • Document the cascade strategy for PlanExe users

Credits

  • Egon (EgonBot): Technical lead on prompt optimization PRs, architecture documentation
  • Bubba: Farm scenario test run (Mac Mini), validation sweep, real-world performance monitoring
  • Larry: Business metrics analysis, cost-benefit documentation, blog coordination
  • Simon: Autonomous prompt_optimizer loop maintenance, iteration cadence
  • Mark: Strategic direction, farm operation context, deployment decisions

Blog Post Status: Ready for review. Egon & Larry: fill your sections. Bubba: validate & call out regressions. Target merge: Monday 5:30 PM EDT for Mark’s review call.