The Two Questions You're Answering At Once
Every time an AI assistant picks up a task, it's making two decisions. The first: how hard should I think about this? The second: how much should I spend doing it?
Most architectures collapse these into one routing decision. Complex task? Send it to the big model. Simple task? Use the cheap one. Problem solved, right?
Wrong. You've just built a system where deliberation depth and dollar spend are locked together. When GPT-5.5 gets deprecated — and it will — your cognitive logic goes with it. When a cheaper model turns out to plan just fine, you can't use it because your architecture assumes System 2 needs premium compute. You've conflated two orthogonal concerns into a single brittle decision.
Here's the fix: build two separate control layers. Let the cognition system decide how much deliberation a task needs. Let model policy decide which engine is worth paying for that level of work. The result is an architecture that survives model churn, respects budget, and keeps each layer's logic clean.
The Conflation Problem
The standard approach looks something like this:
// The common pattern: one routing function, two implicit decisions
function route(task, threshold) {
  if (task.complexity > threshold) {
    return "gpt-5.5"; // expensive AND deliberative
  }
  return "glm-5"; // cheap AND fast
}
This looks reasonable until you live with it. The coupling creates three failure modes:
- Expensive overkill: A routine formatting question that happens to look "complex" to the heuristic burns premium tokens. The model thinks harder and costs more — but neither was necessary.
- Cheap failure: A genuinely hard problem that happens to look simple gets routed to a budget model. The model can't handle it, and the architecture has no mechanism to say "think harder but stay cheap."
- Vendor lock-in: When you swap GPT-5.5 for Claude or Gemini or whatever comes next, you're not just changing the engine — you're rewriting the routing logic, because cognition and cost are tangled together in the same function.
The core mistake: treating model selection as a proxy for cognitive depth. A big model is not the same thing as deep deliberation. A cheap model is not the same thing as fast intuition. These are separate axes.
What Cognitive Routing Actually Does
Cognitive routing answers one question: how much deliberation does this task need?
In OpenClaw, this is the cognitive_assess function. It takes a raw user request and conversation context, evaluates the task's complexity, and returns a routing tag:
- SYSTEM_1_INTUITION — the task is straightforward. Pattern-match, respond directly. Fast, low deliberation.
- SYSTEM_2_FLARE — the task is complex enough to warrant structured lookahead planning. Slow, deliberative.
That's it. Two tags. No model names. No cost calculations. The cognition layer decides mode, not engine.
The naming is intentional. System 1 and System 2 are the labels Daniel Kahneman popularized in Thinking, Fast and Slow for fast, intuitive thinking and slow, deliberate thinking. FLARE stands for Future-aware LookAhead with Reward Estimation. When cognitive_assess returns SYSTEM_2_FLARE, it triggers a planning phase that simulates future trajectories, estimates the reward of each, and propagates those values backward to select the best-scoring first action. This is algorithmic deliberation, not just "send it to a bigger model."
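To make the lookahead idea concrete, here is a minimal sketch of depth-limited planning with backward value propagation. It is a generic toy, not FLARE's actual implementation: the `env` callbacks (`actions`, `step`, `estimateReward`), the transition table, and the reward numbers are all hypothetical.

```javascript
// Generic depth-limited lookahead. NOT the real FLARE code; every name
// and number here is an illustrative assumption.
const transitions = {
  start: {
    cutover_now: { next: "cut_unverified", reward: 5 },
    dual_write: { next: "dual_writing", reward: 1 },
  },
  cut_unverified: { repair_data: { next: "done", reward: -10 } },
  dual_writing: { verify_and_cutover: { next: "done", reward: 8 } },
  done: {},
};

const env = {
  actions: (state) => Object.keys(transitions[state]),
  step: (state, action) => transitions[state][action].next,
  estimateReward: (state, action) => transitions[state][action].reward,
};

// Value of taking `action` now, then acting as well as possible afterwards.
function actionValue(state, action, depth, env) {
  const next = env.step(state, action);
  return env.estimateReward(state, action) + stateValue(next, depth - 1, env);
}

// Backward propagation: a state is worth its best available action.
function stateValue(state, depth, env) {
  const candidates = env.actions(state);
  if (depth <= 0 || candidates.length === 0) return 0;
  return Math.max(...candidates.map((a) => actionValue(state, a, depth, env)));
}

// Only the first action is committed; the rest of the plan can be redone later.
function chooseFirstAction(state, depth, env) {
  let best = { action: null, value: -Infinity };
  for (const action of env.actions(state)) {
    const value = actionValue(state, action, depth, env);
    if (value > best.value) best = { action, value };
  }
  return best;
}

console.log(chooseFirstAction("start", 1, env).action); // cutover_now
console.log(chooseFirstAction("start", 2, env).action); // dual_write
```

With one step of lookahead the planner takes the immediately rewarding cutover; with two it sees the repair cost and prefers dual writes first. That difference comes from the search, not from which model produced the estimates.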
A concrete example: a user asks "what's the weather in Brooklyn?" — cognitive_assess returns SYSTEM_1_INTUITION. No lookahead needed. Same user asks "help me plan a migration from Postgres to CockroachDB for a 2TB production database" — cognitive_assess returns SYSTEM_2_FLARE. That planning problem has dependencies, trade-offs, and irreversible choices. It warrants structured reasoning.
Notice: neither routing tag says which model to use. That's a different decision.
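As a rough illustration of what a mode-only assessor could look like, the sketch below classifies the two example requests using a handful of keyword signals. The post does not show the internals of cognitive_assess, so the signal list and the "two or more signals" rule are assumptions made purely for illustration.

```javascript
// Hypothetical sketch: the real cognitive_assess internals are not shown
// in this post. Signals and threshold are illustrative assumptions.
const SYSTEM_1_INTUITION = "SYSTEM_1_INTUITION";
const SYSTEM_2_FLARE = "SYSTEM_2_FLARE";

const PLANNING_SIGNALS = [
  /\bplan\b/i,
  /\bmigrat/i,
  /\bproduction\b/i,
  /\btrade-?offs?\b/i,
];

// conversationContext is accepted to mirror the described signature,
// but this toy version does not use it.
function cognitiveAssess(userRequest, conversationContext = []) {
  const signalCount = PLANNING_SIGNALS.filter((re) => re.test(userRequest)).length;
  // Assumed rule: two or more planning signals warrant lookahead.
  return signalCount >= 2 ? SYSTEM_2_FLARE : SYSTEM_1_INTUITION;
}

console.log(cognitiveAssess("what's the weather in Brooklyn?"));
// SYSTEM_1_INTUITION
console.log(
  cognitiveAssess(
    "help me plan a migration from Postgres to CockroachDB for a 2TB production database"
  )
);
// SYSTEM_2_FLARE
```

The return value is a tag and nothing else. There is no model name or price anywhere in the function, which is the property that matters.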
What Model Routing Actually Does
Model routing answers a different question: given that we've decided how to think about this, which engine should we run it on?
This is a resource allocation decision. The inputs are:
- Context size: Does the task carry enough state to justify a model with a large context window?
- Output stakes: Is the final output user-visible, or is it intermediate scaffolding?
- Quota constraints: What's the remaining budget for premium models this session?
- Observed task fit: How has each model performed on similar tasks historically?
None of these are cognitive judgments. They're operational and economic. A model policy function looks like:
// Separate from cognition: purely operational.
// CONTEXT_THRESHOLD and quota are assumed to come from surrounding configuration.
function selectModel(task, cognitionTag, context) {
  const defaultModel = "ollama/glm-5:cloud";
  // Escalation triggers (independent of cognition depth)
  if (context.tokenCount > CONTEXT_THRESHOLD) return "openai-codex/gpt-5.5";
  if (task.isHighStakesVisible()) return "openai-codex/gpt-5.5";
  if (task.needsHardJudgment()) return "openai-codex/gpt-5.5";
  if (quota.premiumBudget > 0 && task.justifiesPremium()) return "openai-codex/gpt-5.5";
  return defaultModel;
}
Key point: this function can escalate to GPT regardless of whether the cognition tag is SYSTEM_1 or SYSTEM_2. A simple question that needs a high-quality visible answer? GPT is justified. A complex planning problem where intermediate reasoning doesn't need polish? GLM is fine.
The model layer decides engine, not mode.
Why Separation Matters
Model Churn
Models come and go. GPT-4 gave way to GPT-5. GPT-5.5 will give way to something else. New providers appear. Pricing changes. Quota limits shift.
When cognition and model routing are coupled, every model swap requires rethinking your cognitive architecture. When they're separate, you update the model policy function and the cognition layer doesn't notice. OpenClaw went through GPT-4 → GPT-5 → GLM experiments without touching the cognitive routing plugin. The cognitive_assess function still returns the same two tags it always did.
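One way to keep model swaps this cheap is to have policy reason about tiers and resolve tiers to engine identifiers in a separate registry. This is a sketch of that idea, not how OpenClaw is necessarily wired: `selectTier`, `resolveModel`, the signal names, and the `example-vendor/next-model` identifier are all hypothetical.

```javascript
// Hypothetical indirection: policy picks a tier, a registry maps tiers
// to engines. The model ids below the first two are placeholders.
const modelRegistry = {
  default: "ollama/glm-5:cloud",
  premium: "openai-codex/gpt-5.5",
};

// Policy reasons about tiers only; it never names a vendor.
function selectTier(signals) {
  return signals.highStakesVisible || signals.largeContext ? "premium" : "default";
}

function resolveModel(tier, registry = modelRegistry) {
  const model = registry[tier];
  if (!model) throw new Error(`No model registered for tier "${tier}"`);
  return model;
}

// Replacing the premium engine is a one-line registry change.
const nextRegistry = { ...modelRegistry, premium: "example-vendor/next-model" };

console.log(resolveModel(selectTier({ highStakesVisible: true })));
console.log(resolveModel(selectTier({ highStakesVisible: true }), nextRegistry));
```

Neither the tier logic nor anything in the cognition layer changes between the two calls; only the registry does.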
Policy Clarity
When the two decisions are tangled, policy edits are risky. Change the complexity threshold and you've accidentally changed cost behavior. Add a new model and you've accidentally changed deliberation depth.
With separation, policy edits are surgical. Want to lower the context threshold for escalation? Change one number in model policy. Want System 2 to trigger on a narrower set of tasks? Change the threshold in cognitive_assess. Each change is isolated, testable, and doesn't have surprise side effects on the other axis.
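A small sketch of what "surgical" means here: each layer reads its own settings object, so editing one cannot shift the other. The field names (`system2MinScore`, `contextTokenThreshold`) and the values are placeholders, not OpenClaw's actual configuration.

```javascript
// Hypothetical config: field names and values are placeholders.
const cognitionPolicy = Object.freeze({ system2MinScore: 0.7 });
const modelPolicy = Object.freeze({ contextTokenThreshold: 32000 });

function cognitionTagFor(complexityScore, policy = cognitionPolicy) {
  return complexityScore >= policy.system2MinScore
    ? "SYSTEM_2_FLARE"
    : "SYSTEM_1_INTUITION";
}

function needsLargeContextModel(tokenCount, policy = modelPolicy) {
  return tokenCount > policy.contextTokenThreshold;
}

// Narrowing System 2 changes tags but leaves escalation untouched.
const stricter = { ...cognitionPolicy, system2MinScore: 0.9 };

console.log(cognitionTagFor(0.8));           // SYSTEM_2_FLARE
console.log(cognitionTagFor(0.8, stricter)); // SYSTEM_1_INTUITION
console.log(needsLargeContextModel(40000));  // true
```

Because each function takes its policy as a parameter, a threshold change can be tested on its own before it ships.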
Cost Control
System 2 doesn't force GPT. System 1 doesn't force GLM. The combinations are:
- SYSTEM_1 + GLM: routine work, cheap — the common case
- SYSTEM_1 + GPT: simple task, high-stakes output — justified by visibility, not complexity
- SYSTEM_2 + GLM: complex planning, economical engine; viable when FLARE's algorithmic structure, rather than model size, carries the planning
- SYSTEM_2 + GPT: complex planning, premium engine — reserved for tasks where planning quality genuinely benefits
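Architectures built around a single routing function only give you two of these four combinations: SYSTEM_1 on the cheap model and SYSTEM_2 on the premium one. The missing ones, especially SYSTEM_2 + GLM, are where the savings are: you keep structured planning and pay default-tier prices for it, often without a meaningful drop in quality.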
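The grid can be reproduced with two tiny functions that each look at a different property of the task. This is a sketch with plain-data inputs; `needsPlanning` and `highStakesVisible` are illustrative field names, and the example tasks are made up.

```javascript
// Sketch with plain-data inputs; field names are illustrative assumptions.
function assessTag(task) {
  return task.needsPlanning ? "SYSTEM_2_FLARE" : "SYSTEM_1_INTUITION";
}

function pickModel(task) {
  return task.highStakesVisible ? "openai-codex/gpt-5.5" : "ollama/glm-5:cloud";
}

const tasks = [
  { name: "reformat a log line", needsPlanning: false, highStakesVisible: false },
  { name: "one-line customer-facing answer", needsPlanning: false, highStakesVisible: true },
  { name: "internal migration plan draft", needsPlanning: true, highStakesVisible: false },
  { name: "migration plan for sign-off", needsPlanning: true, highStakesVisible: true },
];

const grid = tasks.map((task) => ({
  name: task.name,
  tag: assessTag(task),
  model: pickModel(task),
}));

for (const row of grid) {
  console.log(`${row.tag} + ${row.model}: ${row.name}`);
}
```

Four tasks land in four different cells. A single function keyed on one complexity score could only ever reach two of them.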
Extensibility
Add a new model? Update model policy. Add a new cognitive mode (say, SYSTEM_3 for extended reasoning chains)? Update cognitive_assess. The layers don't need to know about each other's changes. This is basic separation of concerns, and it works here exactly like it works everywhere else in software architecture.
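One possible shape for that extensibility is a dispatch table keyed by cognition tag, so a new mode is a registration rather than a rewrite. The handler bodies and the `SYSTEM_3_EXTENDED` tag name below are hypothetical; the post only floats SYSTEM_3 as an idea.

```javascript
// Hypothetical dispatch table: each cognition tag maps to an execution
// strategy. Handler contents are placeholders.
const modeHandlers = new Map([
  ["SYSTEM_1_INTUITION", (task) => ({ steps: ["respond"], task })],
  ["SYSTEM_2_FLARE", (task) => ({ steps: ["plan", "respond"], task })],
]);

function runMode(tag, task) {
  const handler = modeHandlers.get(tag);
  if (!handler) throw new Error(`Unknown cognition tag: ${tag}`);
  return handler(task);
}

// Adding a mode later is one registration; model policy code is not involved.
modeHandlers.set("SYSTEM_3_EXTENDED", (task) => ({
  steps: ["plan", "reflect", "respond"],
  task,
}));

console.log(runMode("SYSTEM_3_EXTENDED", "audit the rollout").steps.join(" -> "));
// plan -> reflect -> respond
```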
Architecture Walkthrough
Here's how the two-layer architecture works in practice in OpenClaw:
// Layer 1: Cognition routing (decides deliberation depth)
const cognitionTag = cognitiveAssess(userRequest, conversationContext);
// Returns: SYSTEM_1_INTUITION or SYSTEM_2_FLARE
if (cognitionTag === "SYSTEM_2_FLARE") {
  // Trigger FLARE planning
  const plan = flarePlan(taskDescription, currentState);
  // FLARE simulates future trajectories,
  // estimates rewards, selects optimal first action
}
// Layer 2: Model routing (decides resource allocation)
const model = selectModel(task, cognitionTag, operationalContext);
// Returns: specific model identifier like "ollama/glm-5:cloud"
// Execute on selected model with selected cognition mode
const result = execute(model, cognitionTag, task);
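The critical insight: cognitiveAssess (the cognitive_assess function described earlier) and selectModel are independent functions. selectModel receives the tag as an argument, but neither function depends on the other's internals. They can be tested independently. They can be tuned independently. They can fail independently without breaking each other.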
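Independent testing is easiest to see when the two layers are passed in as plain functions. The sketch below assumes a small `handle` wrapper that takes both layers by injection; that wrapper is an illustration for testability, not something shown in the walkthrough above.

```javascript
// Sketch: the pipeline receives both layers as injected functions
// (an assumption for testability; `handle` is a hypothetical wrapper).
function handle(request, { assess, select, execute }) {
  const tag = assess(request);
  const model = select(request, tag);
  return execute(model, tag, request);
}

// Pin the cognition layer with a stub while exercising model selection.
const seenTags = [];
const output = handle("summarize this thread", {
  assess: () => "SYSTEM_1_INTUITION",
  select: (request, tag) => {
    seenTags.push(tag); // record what the model layer was told
    return "ollama/glm-5:cloud";
  },
  execute: (model, tag) => `${tag} on ${model}`,
});

console.log(output); // SYSTEM_1_INTUITION on ollama/glm-5:cloud
```

Swapping the `assess` stub for the real function and stubbing `select` instead gives the mirror-image test, with no change to `handle`.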
In practice, this means:
- When GLM-5 was observed performing better than expected on planning tasks, the model policy could route SYSTEM_2 tasks to GLM without any change to the cognitive layer.
- When a new class of complex requests was identified, cognitive_assess could route them to SYSTEM_2 without any change to model selection.
- When quota limits tighten, model policy can shift to GLM for more tasks without pretending those tasks are simpler than they are.
The system doesn't lie to itself about complexity to save money. It tells the truth about complexity and makes a separate, honest decision about spend.
What This Isn't
This isn't an argument against big models. GPT-5.5 exists for a reason — when you genuinely need a large context window, high-quality visible output, or nuanced judgment, it's the right tool. The point is that those reasons are operational, not cognitive.
This also isn't an argument that GLM is always sufficient. It's not. Some tasks genuinely benefit from premium compute. The argument is that the decision about when that benefit applies should be made explicitly, not by accident as a side effect of a complexity heuristic.
And this isn't theoretical. OpenClaw runs this architecture in production. cognitive_assess is a real function that returns real routing tags. FLARE is a real planning system that runs on whichever model the policy selects. The model policy is a real configuration that can be changed without touching cognitive logic. It works because the separation is real, not aspirational.
The Takeaway
Build two control layers. Let cognition choose depth. Let model policy choose spend. The four-combination grid (two cognition modes × two model tiers, at minimum) gives you flexibility that a single routing function can't provide.
The cognitive layer is stable. SYSTEM_1_INTUITION and SYSTEM_2_FLARE don't change when OpenAI releases a new model or when your budget shifts. The model layer is adaptive — it should change often, based on real usage data, quota constraints, and observed task fit.
Keep them separate. The architecture survives model churn. The policy stays clear. The budget stays controlled.
In the next post, we'll look at what happens when you flip the default assumption: instead of justifying fallback to a cheaper model, you justify escalation to a more expensive one. The GLM default isn't about settling — it's about putting the burden of proof in the right place.