The Default Nobody Questions
Here's a pattern I see everywhere in AI engineering: systems default to the most expensive model and treat cheaper models as fallback. The premium tier is the baseline. The budget tier is the exception that needs justification.
That's backwards.
When you default to GPT and route to GLM only when you've decided a task doesn't "deserve" premium compute, you've baked a structural assumption into your architecture: expensive is normal, cheap needs permission. The result? You burn premium quota on routine work, hit rate limits faster than you should, and then face the awkward choice between degraded service and increased spend, all because your default was the expensive option.
I've been running OpenClaw with a different default for months. GLM first. GPT when you can justify it. The economics are dramatically better, the architecture is more honest, and the output quality is indistinguishable for the vast majority of tasks. This isn't about being cheap. It's about putting the burden of proof where it belongs.
In the first post, we separated cognition depth from model spend. Two independent layers: one decides how hard to think, the other decides how much to spend. Now I want to talk about what happens when you take that separation seriously and flip the default assumption entirely. Instead of justifying why you're using the budget model, you justify why you're spending on the premium one.
The Default Fallacy
Most model routing configurations look like this:
    // The common pattern: premium is default, economy needs justification
    function selectModel(task) {
      if (task.isSimple()) {
        return "ollama/glm-5:cloud"; // "downgrade" for easy stuff
      }
      return "openai-codex/gpt-5.5"; // default to premium
    }
The logic feels reasonable: complex tasks need the best model, simple tasks can use the cheap one. But the default assumption is that premium compute is the starting point. You need a reason to use the budget option.
This creates three problems that compound over time:
Quota exhaustion. Premium models have limits. Ollama Pro gives you a session cap. Codex Plus has weekly ceilings. When premium is the default, every routine task — "what time is it in Tokyo," "format this as a table," "remind me to check email" — burns quota that should be reserved for tasks where premium compute actually matters. You'd be surprised how fast a session cap evaporates when you're routing every request through GPT by default. By mid-session, you're rationing premium compute for tasks that genuinely need it, except you've already spent the budget on things that didn't.
Distorted complexity assessment. When expensive is the default, there's an incentive to underestimate task complexity. If calling a task "simple" saves money, you'll find yourself calling more things simple than they really are. The system learns to lie about its own cognitive needs to avoid spending. That's a terrible feedback loop. You end up with a cognitive routing layer that's been subtly corrupted by economic pressure. The cognitive_assess function that should return SYSTEM_2_FLARE for a genuinely complex task instead returns SYSTEM_1_INTUITION because the budget can't afford another GPT call. The architecture lies to itself.
Undiscovered capability. Most tasks don't need the things that make GPT expensive: large context windows, high-quality synthesis, nuanced judgment. A routine formatting task or a background indexing operation runs just as well on GLM. But if your default is GPT, you never discover which tasks GLM handles fine, because you never give it the chance. You're operating with incomplete information about your own system's capabilities, and that ignorance costs you every single day.
The fix isn't complicated. Flip the default.
When GLM Wins
Here's the operational reality after running this system for months: GLM-5 handles the majority of tasks perfectly well. Not "acceptably." Not "good enough." Perfectly well. The output is indistinguishable from GPT for the task categories that make up most of an AI assistant's daily workload.
Routine chat. "What's the weather," "convert this CSV to JSON," "explain this error message" — GLM nails these. The delta between GLM and GPT on straightforward queries is negligible. You're paying 10-50x more for output that's indistinguishable in blind comparison. I've run side-by-sides on routine tasks and the results are statistically identical. The only difference is the line item on the invoice.
Throughput over polish. Background tasks, indexing operations, file management, status checks — these need reliability, not elegance. GLM's throughput on these is excellent, and the cost differential is dramatic. When you're running a system that processes dozens of background operations per session — file reads, status checks, cron job summaries, heartbeat evaluations — the cost of routing all of these through GPT is staggering. GLM handles them without breaking a sweat, and the savings compound over days and weeks.
Planning and exploration. This is the counterintuitive one, and it's worth spending time on because it's where the GLM default really shines. In the last post, I mentioned the SYSTEM_2 + GLM combination: complex planning on a budget engine. Here's why it works: FLARE's algorithmic structure compensates for model size.
When cognitive_assess returns SYSTEM_2_FLARE, the planning isn't just "send it to a big model and hope." It's a structured lookahead simulation with reward estimation. The algorithm does the heavy lifting. The model is executing a well-defined computational procedure, not improvising brilliance.
Let me walk through what actually happens when FLARE plans on GLM:
    # What FLARE actually does: algorithmic structure, not model magic
    def flarePlan(taskDescription, currentState):
        # Step 1: Generate candidate actions from current state
        candidates = generateCandidates(currentState)  # GLM does this fine

        # Step 2: Simulate future trajectories for each candidate
        trajectories = {}
        for action in candidates:
            trajectories[action] = simulateForward(action, currentState)
            # Each trajectory is a sequence of predicted states
            # GLM can predict next-states reliably for structured domains

        # Step 3: Estimate reward for each trajectory (backward value propagation)
        rewards = {}
        for action, trajectory in trajectories.items():
            rewards[action] = estimateReward(trajectory)
            # Reward estimation is a scoring function, not creative reasoning
            # GLM evaluates these scores reliably

        # Step 4: Select optimal first action
        bestAction = max(rewards, key=rewards.get)
        return bestAction
Each step in this algorithm is a well-defined computational task. Generate candidates? GLM can list possible actions from a state. Simulate forward? GLM can predict likely next states for structured problems. Estimate rewards? GLM can evaluate trade-offs along a trajectory. The algorithm provides the structure; the model provides the execution. You don't need GPT's creative reasoning for any of these steps — you need reliable execution of a defined procedure.
Concrete example: FLARE planning on GLM for a multi-step project migration task. The simulation builds trajectory trees, estimates rewards via backward propagation, and selects the optimal first action. GLM executes each step of this algorithm reliably. The quality comes from the algorithm's structure, not the model's size.
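To make the migration example tangible, here is a minimal runnable sketch of the same four steps on a toy problem. Everything in it is an assumption for illustration: the migration steps, the prerequisite table, the reward numbers, and the helper names are invented, and the deterministic functions stand in for what would be model calls. It is not FLARE's implementation, only the shape of lookahead plus backward value propagation.

```python
from typing import List, Tuple

State = Tuple[str, ...]  # the migration steps completed so far

# Hypothetical migration steps and their prerequisites (illustrative only).
PREREQS = {
    "backup": (),
    "migrate_schema": ("backup",),
    "copy_data": ("migrate_schema",),
    "switch_traffic": ("copy_data",),
}
GOAL = "switch_traffic"


def candidate_actions(state: State) -> List[str]:
    """Step 1: list every step not done yet (a model call in practice)."""
    return [a for a in PREREQS if a not in state]


def simulate_forward(state: State, action: str) -> State:
    """Step 2: predict the next state; deterministic in this toy domain."""
    return state + (action,)


def step_reward(state: State, action: str) -> float:
    """Score one transition: penalise skipped prerequisites, reward the goal."""
    if any(p not in state for p in PREREQS[action]):
        return -10.0
    return 5.0 if action == GOAL else 1.0


def value(state: State, depth: int, discount: float = 0.9) -> float:
    """Step 3: propagate the best achievable reward back up the trajectory tree."""
    actions = candidate_actions(state)
    if depth == 0 or not actions:
        return 0.0
    return max(
        step_reward(state, a) + discount * value(simulate_forward(state, a), depth - 1, discount)
        for a in actions
    )


def plan_first_action(state: State, depth: int = 3, discount: float = 0.9) -> str:
    """Step 4: choose the first action whose best trajectory scores highest."""
    scores = {
        a: step_reward(state, a) + discount * value(simulate_forward(state, a), depth - 1, discount)
        for a in candidate_actions(state)
    }
    return max(scores, key=scores.get)
```

From an empty state the planner picks "backup", and after the backup it picks "migrate_schema". The ordering falls out of the scoring and the lookahead, which is the point: the structure carries the quality, and each individual call is a small, well-defined question.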
This isn't theoretical. I've been running FLARE on GLM for real planning tasks — task decomposition, sprint planning, multi-step debugging — and it works. Not as elegantly as GPT on a hard judgment call, but reliably enough that you only escalate to GPT when there's a concrete reason.
Session constraints. Both Ollama Pro and Codex Plus have usage limits. Ollama Pro caps per session. Codex Plus has weekly ceilings. These aren't abstract concerns — they're real constraints that shape how you allocate compute. When GLM is your default, those limits become genuinely manageable. A session cap that used to expire in two hours now lasts all day. A weekly ceiling that used to require careful rationing now has comfortable headroom. You're only spending premium budget on tasks that demonstrably need it, and the budget stretches accordingly.
The math is straightforward, with one caveat: the savings are bounded by the escalation rate, not by the price gap. If 70% of tasks stay on GLM (which is conservative) and GPT costs 10-50x more per task, GLM-default spend comes to roughly 31-37% of GPT-default spend, a reduction of about 3x. Escalate only 20% of tasks and it drops to roughly 22-28%, a reduction of about 3.6-4.6x. Even if GLM were free, a 20% escalation rate caps the reduction at 5x. This isn't marginal optimization. It's a structural change in how you allocate resources.
Escalation Triggers
If GLM is the default, when do you escalate to GPT? Not on a gut feeling. Not on a hunch that a task "seems hard." On specific, operational triggers:
    # The GLM default: justify escalation, not fallback
    DEFAULT_MODEL = "ollama/glm-5:cloud"
    PREMIUM_MODEL = "openai-codex/gpt-5.5"
    CONTEXT_THRESHOLD = 100_000  # ~100k tokens

    def selectModel(task, cognition_tag, context):
        # Each escalation trigger is independent and operational.
        # cognition_tag is deliberately unused: model choice ignores cognitive depth.
        # `quota` is assumed to be a module-level premium-budget tracker.

        # Quota gate: with no premium budget left, everything stays on GLM
        if quota.premium_budget <= 0:
            return DEFAULT_MODEL

        # Trigger 1: Large context window genuinely needed
        if context.token_count > CONTEXT_THRESHOLD:
            return PREMIUM_MODEL

        # Trigger 2: User-visible, high-stakes output
        if task.is_high_stakes_visible():
            return PREMIUM_MODEL

        # Trigger 3: Hard judgment calls that benefit from model quality
        if task.needs_hard_judgment():
            return PREMIUM_MODEL

        # Trigger 4: Quota allows (checked above) and task justifies spend
        if task.justifies_premium():
            return PREMIUM_MODEL

        # Default: GLM handles it
        return DEFAULT_MODEL
Let me be specific about what each trigger means — and what it doesn't mean:
Large context. When a task carries enough state — conversation history, codebase context, document references — that the model needs to hold it all in working memory, GLM's smaller context window becomes a real limitation. This isn't about convenience. It's about capability. If the task genuinely requires 150k tokens of context, you need the model with the 200k window.
But "large context" doesn't mean "the task is complex." A 100k-token conversation history that's mostly boilerplate might not actually need GPT's larger window — you could compress the context and stay on GLM. The trigger is about genuine context size requirements, not about using the big window as a proxy for task importance.
User-visible synthesis. The final blog post. The email to a colleague. The commit message that goes into the permanent record. When output quality is directly visible to the user and the stakes are high, the polish premium models provide is worth paying for. Background scaffolding? GLM is fine. The thing the user reads? That's when GPT earns its cost.
But "user-visible" doesn't mean "the output exists." Intermediate reasoning, debug logs, internal state — these are visible in the sense that they're logged somewhere, but the user isn't reading them. The trigger is about direct, high-stakes visibility, not about the mere existence of output.
Hard judgment calls. Nuanced decisions where the difference between 95th percentile and 99th percentile reasoning matters. Architectural choices with long-term consequences. Debugging heisenbugs where pattern recognition in edge cases makes the difference. These are genuine strengths of premium models, and the escalation is honest: you're paying for better judgment, not just a bigger parameter count.
But "hard judgment" doesn't mean "the task is complex." A complex but well-structured planning problem — like FLARE handles — can run on GLM. The judgment trigger fires when the decision requires intuition, nuance, or pattern recognition that benefits from a more capable model, not when the task simply has many steps.
Quota availability. Even when the task might benefit from GPT, if the premium budget is exhausted, GLM it is. This isn't a hack — it's responsible resource management. You prioritize premium compute for the tasks that need it most, and GLM handles everything else without quality complaints.
Notice: none of these triggers say "if the task is complex, use GPT." Complexity is a cognition-layer decision. The model layer makes a separate, operational choice. A complex task can run on GLM (SYSTEM_2 + GLM). A simple task with high-stakes output can justify GPT (SYSTEM_1 + GPT). The triggers are orthogonal to cognitive depth.
This orthogonality is the key insight. When escalation triggers are about operational conditions (context size, output visibility, judgment quality, quota) rather than cognitive complexity, you get a model policy that's both more honest and more effective. You're not routing to GPT because a task is "hard" — you're routing because a specific operational condition is met. The distinction matters because it means you can measure, test, and tune each trigger independently.
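One way to make "measure, test, and tune each trigger independently" concrete is to register the triggers as named predicates and return the name of whichever one fired. This is a minimal sketch, not the routing code from my system: the `Task` fields, the `TRIGGERS` registry, and the `premium_budget` argument are hypothetical, the threshold is an assumed value, and quota is treated as a gate in front of the triggers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

DEFAULT_MODEL = "ollama/glm-5:cloud"
PREMIUM_MODEL = "openai-codex/gpt-5.5"
CONTEXT_THRESHOLD = 100_000  # tokens; assumed value, tune from monitoring data


@dataclass
class Task:
    # Hypothetical fields; a real task object would carry more than this.
    token_count: int = 0
    high_stakes_visible: bool = False
    needs_hard_judgment: bool = False


# Each trigger is a named predicate over the task and nothing else,
# so it can be unit-tested and re-tuned without touching the others.
TRIGGERS: Dict[str, Callable[[Task], bool]] = {
    "large_context": lambda t: t.token_count > CONTEXT_THRESHOLD,
    "high_stakes_output": lambda t: t.high_stakes_visible,
    "hard_judgment": lambda t: t.needs_hard_judgment,
}


def select_model(task: Task, premium_budget: int) -> Tuple[str, Optional[str]]:
    """Return (model, reason). The cognition tag is deliberately not an input."""
    if premium_budget <= 0:
        return DEFAULT_MODEL, None  # quota gate: nothing escalates
    for name, fires in TRIGGERS.items():
        if fires(task):
            return PREMIUM_MODEL, name  # the reason is what gets logged
    return DEFAULT_MODEL, None
```

Returning the reason alongside the model is the design choice that matters here. It gives you a per-trigger count for free, and it makes it obvious that nothing about SYSTEM_1 versus SYSTEM_2 enters the decision.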
The Four Combinations, Revisited
In the first post, I laid out the four combinations of cognition mode and model tier. Now let's talk about the economics — what each combination costs, how often it fires, and why the distribution matters:
SYSTEM_1 + GLM (Default, ~70% of tasks): Routine tasks. Minimal cost. This is the workhorse combination: the vast majority of what an AI assistant does day to day. Quick questions, formatting, lookups, status checks, file operations. GLM handles these without any perceptible quality difference from GPT. The cost per task is a fraction of a cent at most, and nothing while you stay within a free-tier allocation. This is where the GLM default pays for itself.
SYSTEM_1 + GPT (Escalated, ~10%): High-stakes visible output. Premium cost. The task is straightforward — it doesn't need deep planning — but the output matters enough to warrant polish. A commit message for a major release. An email to a stakeholder. A blog post paragraph that needs to be exactly right. The cognition layer says "respond directly," but the model layer says "this output needs to be polished."
SYSTEM_2 + GLM (Default, ~15%): Structured planning. Low cost. FLARE's algorithmic structure compensates for model size, making this combination viable for real planning work. Task decomposition, multi-step debugging, sprint planning, migration strategy. The cognition layer says "plan this carefully," and the model layer says "GLM can execute this plan structure reliably."
SYSTEM_2 + GPT (Escalated, ~5%): Hard judgment plus planning. Premium cost. Reserved for tasks where both deliberation depth and model quality are genuinely needed. Architectural decisions with long-term consequences. Debugging sessions where pattern recognition in edge cases matters. These are the tasks where GPT's premium capabilities actually make a measurable difference.
The numbers are illustrative, not measured. But the distribution pattern is real: most tasks fall into SYSTEM_1 + GLM. A good chunk of planning runs on SYSTEM_2 + GLM because FLARE's structure compensates for model size. The premium model gets used sparingly, for tasks where its specific capabilities — context window, synthesis quality, judgment — are genuinely needed.
This is the economics of the GLM default. You're not depriving yourself of GPT. You're using it when it matters. The result: premium quota lasts longer, costs stay controlled, and the system doesn't lie to itself about complexity to save money.
The financial math is worth making explicit. If you're running a system that processes 100 tasks per day:
- GPT-default with 100% premium routing: 100 premium calls per day
- GLM-default with 20% escalation: 80 budget calls + 20 premium calls per day
- At 10x cost differential: GLM-default costs 28% of GPT-default
- At 50x cost differential: GLM-default costs about 22% of GPT-default
- With the 80 budget calls entirely free: GLM-default costs 20% of GPT-default
At a 20% escalation rate, the GLM default cuts costs by roughly 3.5-5x, and the escalation rate sets the ceiling: the fewer tasks you escalate, the larger the multiple. And that's before accounting for the fact that premium quota exhaustion leads to degraded service during peak usage, which has its own costs in user experience and system reliability.
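The arithmetic behind those bullets is short enough to write down. This is a sketch under one simplifying assumption that is mine, not a measurement: every task costs the same within a tier, so only the escalation rate and the per-task price ratio matter. The function name `relative_cost` is made up for illustration.

```python
def relative_cost(escalation_rate: float, price_ratio: float) -> float:
    """Cost of a GLM-default policy as a fraction of routing everything to GPT.

    escalation_rate: share of tasks sent to the premium model (0.0 to 1.0).
    price_ratio: premium cost per task divided by budget cost per task.
                 Pass float("inf") to model a free budget tier.
    """
    budget_share = (1.0 - escalation_rate) / price_ratio
    return escalation_rate + budget_share


for ratio in (10, 50, float("inf")):
    frac = relative_cost(0.20, ratio)
    print(f"price ratio {ratio}: {frac:.1%} of GPT-default, {1 / frac:.1f}x cheaper")
```

With a 20% escalation rate this prints 28.0% (3.6x), 21.6% (4.6x), and 20.0% (5.0x). The price ratio moves the result only a little. The escalation rate is the lever, which is why the triggers deserve more tuning effort than the vendor comparison.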
Operational Policy: What It Looks Like in Practice
Here's the actual model configuration from my system, exactly as it appears in TOOLS.md:
| Role | Primary | Fallback 1 | Fallback 2 |
|-------------------|------------------------|-----------------------|-------------------------|
| Main session | glm-5.1:cloud (free) | deepseek-v4-pro:cloud | openai-codex/gpt-5.5 |
| Subagents | deepseek-v4-pro:cloud | glm-5.1:cloud (free) | openai-codex/gpt-5.5 |
| System 2 (FLARE) | deepseek-v4-pro:cloud | openai-codex/gpt-5.5 | — |
| Image generation | openai/gpt-image-1 | — | — |
Every primary model except image generation is free-tier, and for text work every paid model is a fallback. The defaults are:
- Main session: GLM-5.1 (free, fast, capable for routine work)
- Subagents: DeepSeek V4 Pro (free, strong for parallel execution)
- System 2 / FLARE planning: DeepSeek V4 Pro (free, handles structured planning well)
- Image generation: OpenAI GPT-Image-1 (paid, but images are rare)
Paid text models only activate when at least one of these holds:
- The primary and first fallback both hit limits
- An escalation trigger fires (large context, high-stakes output, hard judgment)
- The task genuinely justifies premium compute
This isn't a budget constraint dressed up as architecture. It's a deliberate design choice that the GLM default makes possible. When the default is expensive, you need reasons to be cheap. When the default is cheap, you need reasons to be expensive. The latter produces better engineering incentives.
The fallback chain matters too. For the main session it runs GLM-5.1 → DeepSeek V4 Pro → GPT-5.5. The first hop stays on the free tier; only the last hop costs money. Escalation to the paid model is intentional and triggered by specific conditions, not by default. The fallback chain exists for resilience (if GLM is down, DeepSeek takes over), not because every task needs premium compute.
And there's a subtle point about the FLARE configuration: the primary model for System 2 planning is DeepSeek V4 Pro, not GLM. This is intentional. FLARE planning benefits from a model that's strong at structured reasoning, and DeepSeek V4 Pro has proven reliable for this. But it's still a free-tier model. The escalation from DeepSeek to GPT for System 2 tasks follows the same trigger logic: large context, high-stakes output, hard judgment. The default for planning is still free.
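The table above translates directly into data plus a small resolver. A minimal sketch, with the caveat that the role keys, the `PAID_MODELS` set, and the `resolve_model` function are hypothetical names for illustration; only the model identifiers and their order come from the table.

```python
from typing import Iterable, Optional

# Role -> ordered model chain, in the same order as the table.
MODEL_CHAINS = {
    "main_session": ["glm-5.1:cloud", "deepseek-v4-pro:cloud", "openai-codex/gpt-5.5"],
    "subagents": ["deepseek-v4-pro:cloud", "glm-5.1:cloud", "openai-codex/gpt-5.5"],
    "system_2_flare": ["deepseek-v4-pro:cloud", "openai-codex/gpt-5.5"],
    "image_generation": ["openai/gpt-image-1"],
}
PAID_MODELS = {"openai-codex/gpt-5.5", "openai/gpt-image-1"}


def resolve_model(
    role: str,
    unavailable: Iterable[str] = (),
    escalate: bool = False,
) -> Optional[str]:
    """Pick a model for a role.

    unavailable: models that are down or out of quota right now.
    escalate: True when an escalation trigger fired; this skips ahead to
              the first available paid model in the role's chain.
    """
    blocked = set(unavailable)
    chain = [m for m in MODEL_CHAINS[role] if m not in blocked]
    if escalate:
        paid = [m for m in chain if m in PAID_MODELS]
        if paid:
            return paid[0]
    return chain[0] if chain else None
```

Keeping resilience (`unavailable`) and escalation (`escalate`) as separate inputs mirrors the distinction in the prose: falling back because a model is down and escalating because a trigger fired are different events, and they should show up differently in the logs.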
The Monitoring Loop
The GLM default isn't set-and-forget. It's a policy you monitor, measure, and adjust based on real data. Here's what I track:
Context sizes. What's the distribution of token counts across tasks? If I'm consistently hitting the context threshold, maybe the threshold needs adjustment — or maybe I need to restructure how context is managed rather than just escalating to a bigger window. Context management (summarization, compression, selective retrieval) can often reduce the need for GPT's larger window without losing information.
Quota usage. How much premium budget am I using per session? Per week? If I'm consistently hitting caps, are the escalation triggers too loose, or is the task mix genuinely premium-heavy? Quota tracking gives you a direct read on whether the GLM default is actually working as designed.
Task fit by model. Which tasks does GLM handle well? Which ones fail? This is empirical, not theoretical. I don't assume GLM can handle everything. I observe where it succeeds and where it doesn't, and adjust the triggers accordingly. When GLM fails, the failure analysis tells you whether to tighten an escalation trigger or improve the prompting.
Failure modes. When GLM produces bad output, is it because the task needed GPT, or because the prompt was poorly structured? Before escalating, I try restructuring. A well-prompted GLM often matches a poorly-prompted GPT. This isn't a knock on GPT — it's an observation that prompt quality matters, and prompt engineering is cheaper than model escalation.
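The "restructure before you escalate" rule can be sketched as a three-step ladder. This is an illustration under assumptions, not code from my system: the four callables are hypothetical stand-ins for the budget model, the premium model, an output check, and a prompt rewriter, all of which you would supply yourself.

```python
from typing import Callable, Tuple


def run_with_restructure_first(
    prompt: str,
    run_budget: Callable[[str], str],
    run_premium: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    restructure: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (output, path), where path records which attempt succeeded."""
    # Attempt 1: the budget model with the prompt as written.
    output = run_budget(prompt)
    if is_acceptable(output):
        return output, "budget"

    # Attempt 2: same model, better-structured prompt.
    output = run_budget(restructure(prompt))
    if is_acceptable(output):
        return output, "budget_restructured"

    # Attempt 3: escalate only after the cheaper fix has been tried.
    return run_premium(prompt), "premium"
```

The `path` label is the useful by-product. A high share of "budget_restructured" results says the root cause is prompt structure, while a high share of "premium" results says a trigger is missing or too strict.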
The policy evolves based on data. If GLM starts handling more tasks well — which it has, as models improve — the triggers tighten. If a new class of task emerges that GLM consistently struggles with, the triggers loosen. The monitoring loop keeps the system honest.
    # Monitoring metrics that drive policy revision (name -> what it measures)
    metric_descriptions = {
        "context_distribution": "histogram of token counts per task",
        "premium_quota_usage": "remaining budget per session and week",
        "glm_success_rate": "tasks completed satisfactorily on GLM",
        "escalation_rate": "percentage of tasks escalated to GPT",
        "escalation_reasons": "distribution of trigger types",
        "glm_failure_analysis": "root cause when GLM produces bad output",
    }

    # Policy revision triggers, run against the measured values for those keys
    # (rates are fractions between 0 and 1)
    def revise_policy(metrics):
        if metrics["glm_success_rate"] > 0.92:
            tighten_triggers()  # GLM is stronger than expected, conserve more
        if metrics["escalation_rate"] > 0.25:
            investigate_escalation_reasons()  # Too many tasks needing GPT
        if metrics["glm_failure_analysis"].common_root_cause == "prompt_structure":
            improve_prompting()  # Don't escalate; fix the input
The monitoring loop is what turns those thresholds into a feedback cycle. The system adapts to what the data shows rather than enforcing a static policy.
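For the measured side of that loop, a small aggregation over a task log is enough to get started. A minimal sketch, assuming a log of per-task records; the `TaskRecord` fields, the "glm"/"gpt" labels, and the `summarize` function are hypothetical and not a format my system actually emits.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    model: str                # "glm" or "gpt" (hypothetical labels)
    succeeded: bool           # did the output pass whatever check you use
    token_count: int          # context size for this task
    escalation_reason: Optional[str] = None  # set only when escalated


def summarize(records: List[TaskRecord]) -> dict:
    """Compute the rates the policy checks read, from raw task records."""
    glm = [r for r in records if r.model == "glm"]
    escalated = [r for r in records if r.model == "gpt"]
    return {
        "glm_success_rate": sum(r.succeeded for r in glm) / len(glm) if glm else None,
        "escalation_rate": len(escalated) / len(records) if records else None,
        "escalation_reasons": Counter(r.escalation_reason for r in escalated),
        "max_context_tokens": max((r.token_count for r in records), default=0),
    }
```

Rates come back as fractions, or as `None` when there is nothing to divide by, so a policy check should skip an empty window instead of reading it as a zero.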
This is why the separation from the first post matters so much. If cognition and model routing were tangled, I couldn't independently adjust the escalation triggers without affecting the cognitive routing. I couldn't tighten the context threshold without accidentally changing which tasks get SYSTEM_2 planning. The two-layer architecture gives me independent control over both axes, and the monitoring loop is what makes that control effective.
Why Not Just Use the Best Model for Everything?
Because "the best model for everything" isn't how resource allocation works in any other domain, and it shouldn't be how it works here.
You don't use a Formula 1 car for your commute. Not because you can't afford one — because it's the wrong tool for the job. It's expensive to operate, needs specialized maintenance, and the things that make it exceptional on a track (downforce, tire compounds, gear ratios) are irrelevant in traffic. Your commute needs reliability, efficiency, and appropriate performance. A good sedan wins.
Same principle. GPT-5.5 is a Formula 1 car. For tasks that need its specific capabilities — large context, high-quality synthesis, nuanced judgment — it's worth every token. For everything else, you're burning premium resources on work that doesn't benefit from them.
And there's a more subtle cost that people often miss: opportunity cost. Every token you spend on a routine task that GLM could handle is a token you can't spend on a task that genuinely needs GPT. When you hit your quota limit and a critical task comes in that really needs premium compute, the routine tasks you ran on GPT earlier in the session are the reason you're now scrambling.
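The opportunity-cost argument is easy to see in a toy simulation. The session below is invented for illustration (20 tasks, two of which genuinely need premium compute, and a quota of 10 premium calls); the function name and the numbers are assumptions, not measurements.

```python
from typing import List


def premium_needs_served(tasks: List[bool], quota: int, premium_by_default: bool) -> int:
    """Count how many premium-needing tasks actually got premium compute.

    tasks: one flag per task in arrival order, True if it genuinely needs premium.
    quota: premium calls available for the session.
    premium_by_default: True models GPT-default, False models GLM-default.
    """
    served = 0
    for needs_premium in tasks:
        wants_premium = premium_by_default or needs_premium
        if wants_premium and quota > 0:
            quota -= 1
            if needs_premium:
                served += 1
    return served


# 9 routine tasks, 1 critical, 9 more routine, then a second critical task.
session = [False] * 9 + [True] + [False] * 9 + [True]
gpt_default = premium_needs_served(session, quota=10, premium_by_default=True)
glm_default = premium_needs_served(session, quota=10, premium_by_default=False)
```

Under GPT-default the quota is gone after the tenth task, so only one of the two critical tasks gets premium compute. Under GLM-default both do, with eight calls to spare.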
The GLM default isn't about being cheap. It's about being honest. Most AI tasks are commutes, not races. Using the race car for every trip doesn't make you sophisticated. It makes you wasteful.
What Changes When You Flip the Default
The shift from GPT-default to GLM-default is small in code but significant in how the system behaves:
Before (GPT default):
- Default is expensive. Budget is tight. Cheap compute is the exception.
- Tasks get promoted to premium by default, then demoted if they "don't need it."
- The system optimizes for quality ceiling, not throughput floor.
- Quota runs out. The expensive model becomes unavailable. You scramble.
After (GLM default):
- Default is free. Budget is abundant. You ration premium compute.
- Tasks start on the budget model, then escalate if they have a specific reason.
- The system optimizes for appropriate resource allocation.
- Quota lasts. Premium compute is available when you genuinely need it.
The code change is minimal:
    # Before: justify fallback to GLM
    DEFAULT = "openai-codex/gpt-5.5"

    # After: justify escalation to GPT
    DEFAULT = "ollama/glm-5:cloud"
One constant. The entire economic posture of the system changes.
But the behavioral change is larger than the code change. When the default is GLM, you start thinking about premium compute differently. Instead of "why am I not using GPT for this?" you ask "does this task actually need GPT?" The question flips. The burden of proof shifts. And that shift cascades through every routing decision the system makes.
You become more deliberate about escalation triggers. You start measuring whether they're firing correctly. You notice when GPT is being used for tasks that GLM could handle, and you tighten the policy. The system gets more efficient over time, not because you're cutting costs, but because you're making better resource allocation decisions.
The Honest Architecture
There's a deeper point here than cost savings. When the default is expensive, the system develops an incentive to misrepresent task complexity. "This is simple, we can use the cheap model" becomes a cost-saving strategy rather than an honest assessment. The system learns to lie about its own cognitive needs.
Consider what this means in practice. If you're running GPT-default, and you hit your quota limit halfway through a session, what happens? The system starts looking for tasks to downgrade. "Is this really SYSTEM_2, or could it be SYSTEM_1?" "Is this really complex, or can we simplify it?" The cognitive routing layer — which should be making honest assessments about deliberation depth — is now under economic pressure to call things simpler than they are.
That's a corruption of the architecture. The whole point of separating cognition from model routing was to let each layer make its own honest decision. But when the default is expensive, the economic pressure bleeds across the separation boundary. The cognition layer starts optimizing for cost instead of accuracy.
When the default is cheap, that incentive disappears. You don't save money by calling tasks simple — they'd run on GLM either way. You only spend more by explicitly identifying tasks that need premium compute, and you have to justify that spend with a specific reason.
This produces more honest complexity assessment. The cognition layer says "this is SYSTEM_2" because it genuinely needs structured planning, not because you're trying to route it to a premium model. The model layer says "this needs GPT" because the context window is genuinely too large, or the output stakes are genuinely high, or the judgment call is genuinely hard.
The result is an architecture that doesn't lie to itself about complexity to save money, and doesn't waste money on tasks that don't need it. The GLM default makes the system more honest, not less capable.
This is why the separation from the first post isn't just a nice architectural pattern — it's a prerequisite for the GLM default to work. If cognition and model routing were tangled, you couldn't flip the model default without affecting cognitive routing. You couldn't make the model layer honest without also affecting the cognition layer's assessments. The two-layer architecture is what makes the GLM default both possible and principled.
The Workhorse and the Specialist
GLM is the workhorse. It handles most tasks, most of the time, without complaint. It's reliable, it's fast, and it's free. When you run a system that processes hundreds of tasks a day, the workhorse is what keeps the lights on. The workhorse doesn't need to be brilliant — it needs to be dependable, and it is.
GPT is the specialist. You call it in for the jobs that need its specific capabilities — the large context window, the polished synthesis, the nuanced judgment. It's expensive, and it should be. The specialist should be reserved for the tasks that genuinely benefit from specialization.
The GLM default means the workhorse runs the stable. The specialist is on call. You don't keep a surgeon in the emergency room for every patient who walks in. You call the surgeon when the case needs a surgeon. The ER handles everything else.
And when the surgeon is called, it's because there's a specific reason — not because every patient gets the surgeon by default. That reason is documented, measurable, and adjustable. If too many patients are seeing the surgeon, you investigate. If the surgeon is never needed, you question whether you need one at all. The system is honest about resource allocation because the default doesn't create pressure to over-use or under-use either resource.
In the next post, we'll look at hybrid runs — what happens when you combine cheap exploration with expensive polish in a single task. When FLARE plans on GLM and then executes the final synthesis on GPT, you get the best of both worlds: the throughput of the budget model for the bulk of the work, and the quality of the premium model for the finish. That's where the real economics get interesting.