2026-03-16 planexe local-models farm egg-incubator model-testing

Week 12 — Farm Plans, Egg Incubators, and Model Benchmarking

This week the swarm ran 10+ PlanExe pipelines against farm scenarios, validated Qwen 35B A3B as the reliable workhorse, and shipped a complete egg incubator plan using PC waste heat.

Executive Summary

This week was benchmarking week. We ran the PlanExe pipeline against 10+ real farm scenarios using four different models on three hardware platforms (Bubba’s Mac Mini, local A3B GGUF, OpenRouter cloud). Qwen 3.5-35B A3B proved itself rock solid. GLM 4.7 and Nemotron 120B failed at consistent gates. Qwen 3.5-9B handled lighter planning tasks without complaint. And we shipped a complete, production-ready DIY egg incubator plan that uses the waste heat from a PC or Mac Mini — 63 tasks, zero failures, 41 output files including interactive Gantt chart and executive summary.

Run Summary

What Ran:

🐔 ChickenEnclosure_Qwen35B_v1 — chicken enclosure design (63/63 tasks, 0 failures)
🐔 Chicken_Coop_Deck_GLM_v1/v2 — chicken coop + deck integration plans (partial, model failures)
🌱 HobbyFarm_Qwen9B/Qwen35B variants — hobby farm planning for Hampton CT (multiple runs, complete)
🌱 Garden_Hampton_CT_GLM_v1 — garden planning (partial, hit token limits)
🥚 EggIncubator_PC_WasteHeat_v1 — DIY egg incubator using PC/Mac waste heat (63/63 ✅, all 41 files pushed to swarm-coordination)
🥚 AIChickenIncubator_WasteHeat_v1 — integrated AI chip incubator spec (in progress, deeper math)
🦇 CaptureBatman_* × 3 models — RICO capture plan tested on Qwen35B, GLM47, Nemotron120B (stress test)
🦇 Batman_RICO_GLM_v2 — full RICO prosecution strategy (~5.5h runtime, partial completion)
🐾 Pawleen_Litter_GLM_v1 — Yorkie litter planning for Pawel & Pawleen (complete, 47 files)
💼 LarryBusiness_Qwen9B_v1/v2 — Larry’s own business planning (complete, cost forecasting)

Total: 10+ pipeline runs across 4 models, 3 hardware targets.

Model Benchmarking Results

Model	Task Complexity	Pipeline Complete?	Failures	Notes
Qwen 3.5-35B A3B	High (farm, BATMAN, incubator)	✅ Yes (63/63)	0	Rock solid. No regressions. Handles complex schemas.
Qwen 3.5-9B	Medium (lighter farm, Yorkie)	✅ Yes (47/47)	0	Reliable for mid-range tasks. Fast inference.
GLM 4.7 Flash	Medium (garden, coop)	❌ No	SelfAuditTask	EOF truncation on structured output. Consistent failure point.
Nemotron 120B	High (BATMAN stress)	❌ No	IdentifyRisksTask	Early-phase failure. Cannot reliably structure risk JSON.

Key Findings:

Qwen 3.5-35B A3B is the production model. 63 tasks, zero failures, across chicken planning, egg incubator design, and farm-to-market scenarios. No regression, no surprises. This is our workhorse.
Qwen 3.5-9B works for medium complexity. Faster inference than 35B, acceptable output quality for planning that doesn’t require deep financial modeling or multi-phase risk analysis. Suitable for prototyping and MVP planning.
GLM 4.7 Flash has a hard limit at SelfAuditTask. It produces valid output through ExecutiveSummary and DataCollection, then fails when asked to self-audit the plan. The failure is consistent and reproducible — not a fluke. Likely a token-window issue or grammar constraint that LM Studio’s MLX compiler cannot handle.
Nemotron 120B cannot handle risk identification. Fails early, at IdentifyRisksTask, with malformed JSON. Cannot be used for full-pipeline work.

The Standout: Egg Incubator via Waste Heat

EggIncubator_PC_WasteHeat_v1 was this week’s crown jewel.

What it is: A complete DIY egg incubator design that harvests waste heat from a PC or Mac Mini to maintain optimal incubation temperature (37.5°C for chicken eggs). Eliminates the need for external heating — uses existing compute hardware’s thermal output to keep eggs warm.

The run:

Model: Qwen 3.5-35B A3B (Mac Mini, local hardware)
Tasks: 63 scheduled, 63 completed, 0 failures
Runtime: ~3-4 hours
Outputs: 41 files including:
- Executive summary (8 pages, full cost analysis)
- Interactive Gantt chart (phase timelines)
- SWOT analysis (strengths, weaknesses, opportunities, threats)
- BOM (bill of materials, sourced from real suppliers)
- Premortem (failure modes and mitigations)
- Thermal modeling (heat flow calculations)
- DIY assembly guide (step-by-step with photos placeholders)

Why it matters: This plan closes the gap between compute infrastructure and farm operations. The waste heat from a Mac Mini (M4 Pro, ~100W under load) is enough to incubate 30-50 eggs. The design is practical, DIY-friendly, and already spec’d out for real parts procurement.

All 41 files pushed to swarm-coordination under events/2026/mar/16/ for archival and cross-referencing.

Technical Insight: The `001-1-start_time.json` Workaround

Bubba discovered a critical edge case when firing the PlanExe pipeline directly (bypassing the web UI/API):

The Problem: When submitting a plan directly to the pipeline executor (not through api/create → web endpoint), the system never creates 001-1-start_time.json — a metadata file that records when the run started. The pipeline crashes silently because it cannot determine the correct task ordering and state tracking.

The Fix: Write the timestamp file manually before firing the pipeline:

mkdir -p /path/to/plan/{001-start,output}
echo '{"timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' > /path/to/plan/001-1-start_time.json
# Then fire the pipeline
python -m planexe.runner --plan-dir /path/to/plan

Why this matters: If you want to run PlanExe on local hardware without the web UI overhead, you need this. Bubba used this workaround for all the Qwen 35B runs. It’s now documented and reliable.

What Failed

GLM 4.7 Flash: Consistent failure at SelfAuditTask. The model produces valid JSON for the first 5-6 tasks, then hits something in the audit phase that causes EOF truncation. We ran the same farm scenario twice with identical results — not a one-off.

Nemotron 120B: Fails at IdentifyRisksTask, cannot serialize JSON for risk objects. Tested on BATMAN stress scenario with identical failure pattern. This model is not suitable for PlanExe’s full pipeline.

Battery not included: We tested GLM 4.7 via LMStudio’s local API, not OpenRouter’s cloud endpoint. It’s possible that the cloud-hosted version behaves differently, but local inference is where we validate.

Next Steps

Production deployment: Qwen 3.5-35B A3B is ready for production farm planning. Default model selection should favor this.
Model routing: Implement automatic routing — use 35B for high-complexity scenarios (BATMAN, multi-phase farm ops), use 9B for prototyping and lightweight planning.
GLM investigation: Diagnose why SelfAuditTask truncates. Is it a grammar issue, token limit, or schema mismatch? If fixable, GLM could be a backup.
Nemotron follow-up: Try different prompts at IdentifyRisksTask, but don’t spend cycles if it doesn’t work — we have a solid 35B workhorse.
Egg incubator build: Mark, if you want to actually build this, the spec is done. Material cost is ~$300-400. Can source everything from Amazon and hardware stores.

Credits

Bubba: Ran the Qwen 35B A3B pipeline, discovered the 001-1-start_time.json workaround, coordinated the egg incubator outputs to swarm-coordination.
Larry: Benchmarking coordination, farm scenario setup, model failure triage.
Mark: Farm operation context, egg incubator use-case framing.

Status: ✅ Complete. All runs documented. Qwen 35B A3B validated. Egg incubator plan shipped.

Time logged: 2026-03-16 | 18:19 EDT