Claude to Qwen and Gemma: Building a Cost-Efficient AI Assistant Stack

Qwen3 and Gemma: How I Migrated My Personal AI Assistant Nyx Off Claude in a Morning

Qwen migration: I moved my assistant Nyx from Claude to Qwen3 and Gemma to slash costs, manage latency, and build a resilient multi-model stack quickly.

Why the migration mattered: an unplanned deadline from Anthropic

On Friday, April 3 at 7:47 PM I received a brief announcement from Anthropic that changed the economics of my personal assistant overnight. The message said that starting April 4 at 12:00 PM PT / 8:00 PM BST, subscription limits for Claude would no longer apply to third‑party harnesses such as OpenClaw. In practice, that meant my Claude‑based agent Nyx — which had been running under my Anthropic Max subscription — would continue to work but only if I accepted an additional pay‑as‑you‑go “extra usage” charge for third‑party integrations. Faced with less than 17 hours to decide, I pivoted: I rebuilt Nyx around a diversified model stack led by Qwen3 and Gemma.

That decision mattered because Nyx is not a toy. It runs on a VPS I control, handles content production, automation flows, analysis, a publication calendar, and dozens of daily tasks. Until the announcement, Claude Sonnet 4.6 was Nyx’s default model and its usage was covered by my Max tier. The new billing arrangement would have added variable API token charges on top of the subscription, potentially hundreds of dollars per month for an always‑on agent. Rather than accept opaque extra costs, I used the notice as an impetus to assemble a multi‑model, cost‑aware stack — and to test alternatives under real‑world conditions.

What Anthropic changed and why

Anthropic’s email made three points clear: first, Pro and Max subscriptions would no longer cover use through third‑party tools (OpenClaw among them) after April 4; second, subscription allowances would continue to apply to Anthropic’s own products such as Claude Code and Claude Cowork; and third, Anthropic would offer a one‑time credit equal to the monthly subscription price (claimable through April 17) plus discounts — up to 30% — for prebuying extra usage bundles.

The company’s engineering rationale, attributed to Boris Cherny (Head of Claude Code at Anthropic), centered on prompt caching: third‑party tools were not taking advantage of the internal prompt cache Claude uses, producing disproportionately high compute costs for Anthropic compared with the usage patterns inside its own products. Anthropic’s move to restrict subscription coverage for third‑party integrations follows other backend limits they had already been applying, including session caps on heavy users.

My starting point: the stakes for Nyx

Nyx runs on my VPS and functions as an autonomous assistant handling many production tasks. Before the change, a Max subscription ($100–$200 per month, depending on tier) covered Sonnet 4.6 as the default model. Post‑announcement, continuing to use Claude through OpenClaw would require paying token costs in addition to the subscription. Community estimates suggested an always‑active agent can burn $50–$200 a month in API usage alone, a substantial added burden on top of the Max fee I was already paying.

The rapid deadline forced a pragmatic choice among three options: enable Anthropic’s extra usage and absorb variable charges, switch to Anthropic’s direct pay‑per‑token API, or migrate to a provider routing layer such as OpenRouter, which offers many models behind a single API key. I chose the third path: build a diversified stack primarily routed through OpenRouter while keeping Anthropic on hand for on‑demand, highest‑quality calls.

How I migrated in a single morning — the practical steps

The migration followed a deliberate, measurable process instead of gut calls. I broke it into four concrete steps: inventory the actual auth and model options, consult unbiased rankings, measure real‑world latency, and compose a task‑aware stack.

Audit of available models and authentication modes

My first priority was to understand what was actually reachable with real credentials, not marketing blurbs. The list I verified included:

Anthropic (direct token) — Sonnet, Opus, Haiku: still available but now subject to pay‑as‑you‑go for third‑party usage.
Google (OAuth) — Gemini 3.1 Pro High and Gemini 3 Flash: accessible via OAuth but previously had production timeouts in my tests.
OpenRouter (API key) — surfaced dozens of models from multiple vendors behind a single integration point, billed by real usage.
Groq (token) — offered fast models but some model IDs were out of date.

This inventory reinforced a theme: availability on paper is different from reliable availability under load, and authentication mode (OAuth vs API key vs token) materially matters for integration reliability.
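
The audit step can be sketched as a small probe loop. This is a minimal illustration, not the script I actually ran: `Candidate`, `audit`, and the injected `probe` callable are hypothetical names, and the model labels are placeholders rather than real provider IDs. The probe is passed in so the provider-specific request (API key, OAuth, or token) stays outside the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    provider: str   # e.g. "OpenRouter"
    auth_mode: str  # "api_key", "oauth", or "token"
    model: str      # label only; real model IDs differ per provider


def audit(candidates: Iterable[Candidate],
          probe: Callable[[Candidate], None]) -> dict[str, str]:
    """Call probe() once per candidate and record 'ok' or the error text.

    probe is a hypothetical hook: it should send one tiny request with
    real credentials and raise on any failure (bad ID, timeout, auth).
    """
    report: dict[str, str] = {}
    for c in candidates:
        key = f"{c.provider}/{c.model} ({c.auth_mode})"
        try:
            probe(c)
            report[key] = "ok"
        except Exception as exc:  # a failed probe is a finding, not a crash
            report[key] = f"failed: {exc}"
    return report


def fake_probe(c: Candidate) -> None:
    """Stand-in probe for the example: pretends Groq has a stale model ID."""
    if c.provider == "Groq":
        raise LookupError("unknown model id")


inventory = [
    Candidate("OpenRouter", "api_key", "Qwen3 235B"),
    Candidate("Groq", "token", "stale-model-id"),
]
report = audit(inventory, fake_probe)
for name, status in report.items():
    print(f"{name}: {status}")
```

Recording failures instead of raising keeps the audit running to the end, so one stale model ID does not hide the status of everything after it.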

Consulting human‑voted rankings for quality signals

Rather than trust vendor claims, I turned to LM Arena, which aggregates millions of blind human pairwise votes between models. The Arena scores I relied on for relevant candidates were:

Gemini 3.1 Pro High — Arena score ~1505; closed‑source; available via OAuth (noted historical timeouts).
Claude Sonnet 4.6 — Arena score ~1460; closed‑source; cost listed as $3/$15 per 1M tokens (input/output).
Gemma 4 31B — Arena score 1450; open‑source (Apache 2.0); cost listed as $0.14 per 1M tokens.
Qwen3 235B — Arena score 1418; open‑source (Apache 2.0); cost listed as $0.07 per 1M tokens.
DeepSeek V3 0324 — Arena score 1377; open‑source (MIT); cost listed as $0.20 per 1M tokens.

Those Arena scores and the cost figures made it clear that some open‑source models were now competitive with proprietary offerings on quality while offering dramatically better price points. Gemini’s top Arena rank was notable, but my earlier production experience with timeouts made it a less attractive backbone for an interactive assistant.
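
To make the price gap concrete, here is a small worked calculation using the per‑million prices listed above. The monthly volume of 50M tokens is an assumption chosen purely for illustration, not a figure I measured, and `monthly_cost` is a hypothetical helper. Claude Sonnet is left out because its price is split between input and output tokens, so a single number would depend on the traffic mix.

```python
# Prices per 1M tokens (USD), as listed in the comparison above.
PRICE_PER_MILLION = {
    "Gemma 4 31B": 0.14,
    "Qwen3 235B": 0.07,
    "DeepSeek V3 0324": 0.20,
}

# Hypothetical monthly volume, chosen only to illustrate the arithmetic.
ASSUMED_TOKENS_PER_MONTH = 50_000_000


def monthly_cost(price_per_million: float, tokens: int) -> float:
    """Cost in USD for `tokens` tokens at a flat per-million price."""
    return price_per_million * tokens / 1_000_000


costs = {
    model: monthly_cost(price, ASSUMED_TOKENS_PER_MONTH)
    for model, price in PRICE_PER_MILLION.items()
}
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f} per month")
```

Under that assumed volume, each of the three models comes to $10 or less per month, which is the scale of difference that made them worth testing.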

Measuring real latency under load — not paper benchmarks

Quality scores tell one part of the story; interactivity depends on latency. I launched subagents via OpenClaw to record cold‑start response times for several candidate models. The median cold latencies I observed were:

DeepSeek V3 0324 — 257 ms
Llama 4 Maverick — 346 ms
Mistral Small 3.1 — 460 ms
Qwen3 235B — 638 ms
Gemma 4 31B — 6.2 seconds

Gemma 4’s Arena score made it attractive for batch work, but its 6.2‑second cold latency rendered it unsuitable for conversational interactivity; I relegated it to offline batch tasks such as SEO analysis and mass processing. Conversely, DeepSeek and Maverick delivered sub‑half‑second responses that are ideal for short interactions and code work. Qwen3 balanced quality and cost with sub‑second responses that remained acceptable for most conversational flows.

These latency measurements were pivotal: two models with similar Arena scores can produce radically different user experiences if one responds in 600 ms and another in 6 seconds. For an assistant meant to converse and iterate quickly, latency is as important as raw score.
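
A median latency measurement of this kind can be sketched in a few lines. This is an illustrative harness under stated assumptions, not the OpenClaw subagent setup I used: `median_latency_ms` is a hypothetical helper, and `fake_model_call` stands in for a real request so the example runs without a network.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(call: Callable[[], object], runs: int = 5) -> float:
    """Time `call` `runs` times; return the median wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_model_call() -> str:
    """Stand-in for a real request; sleeps ~20 ms instead of using a network."""
    time.sleep(0.02)
    return "ok"


latency = median_latency_ms(fake_model_call, runs=5)
print(f"median latency: {latency:.0f} ms")
```

The median is used rather than the mean so a single slow outlier does not dominate the result. Note that back‑to‑back calls like these tend to hit a warm path; a true cold‑start figure needs idle time or a fresh connection between samples.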

Composing the final, task‑aware stack

With quality and latency data in hand, I assembled the stack that Nyx now runs on. The guiding principle was matching each model to tasks where it offered the best cost/performance trade‑off.

Primary model (default): Qwen3 235B via OpenRouter
- Observed latency: ~638 ms
- Arena score: 1418
- Cost: $0.07 per 1M tokens
- Context window: 262k tokens
- Role: default conversational reasoning, content drafting, Spanish fluency and multi‑tasking.

Specialized agents and their assigned models:
- Content and courses: Qwen3 235B — chosen for strong Spanish output and solid general reasoning.
- SEO and bulk analysis: Gemma 4 31B — high Arena score for offline batch processing despite high cold latency.
- n8n integrations and code tasks: DeepSeek V3 0324 — fastest responses (257 ms) and strong code handling.
- Short social comments and microcopy: Mistral Small 3.1 — very fast and extremely low cost (~$0.03 per 1M tokens).
- API exploration and very long context tasks: Llama 4 Maverick — large context (1M tokens) useful for particular experiments.
- Session compaction / summarization: Mistral Small 3.1 — inexpensive for summarizing conversational context.

Anthropic’s Sonnet, Opus and Haiku remain available on demand when I need the highest‑quality, targeted outputs. They are no longer Nyx’s default model but sit in reserve for specific, quality‑sensitive calls.
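
The assignments above amount to a routing table with a default. A minimal sketch, assuming hypothetical task names and using the model names from this article as labels (the real provider model IDs are different and are not shown here):

```python
# Task -> model label, mirroring the assignments described above.
# Task keys are hypothetical names; values are labels, not provider IDs.
ROUTES = {
    "content": "Qwen3 235B",
    "seo_batch": "Gemma 4 31B",
    "code": "DeepSeek V3 0324",
    "microcopy": "Mistral Small 3.1",
    "long_context": "Llama 4 Maverick",
    "compaction": "Mistral Small 3.1",
}

DEFAULT_MODEL = "Qwen3 235B"  # the generalist handles anything unlisted


def pick_model(task: str) -> str:
    """Return the model assigned to `task`, falling back to the default."""
    return ROUTES.get(task, DEFAULT_MODEL)


print(pick_model("code"))       # DeepSeek V3 0324
print(pick_model("smalltalk"))  # Qwen3 235B
```

Keeping the mapping in one plain dictionary means a model swap is a one‑line change rather than an edit to every agent.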

What this migration taught me about vendor risk and open‑source parity

The episode reinforced several concrete lessons that are relevant beyond a single assistant build:

Dependence on a single provider is an operational risk. Anthropic’s change came with a short time horizon — a Friday evening notice with a Saturday‑noon effective date — and if I hadn’t had infrastructure and alternative models available, Nyx could have become nonfunctional for end users the next day.
The open‑source ecosystem has closed much of the quality gap. Gemma 4 (31B) posted an Arena score comparable to many proprietary models from only months earlier, and Qwen3 235B sits within striking distance of Claude Sonnet while costing a fraction per million tokens. That shift changes the calculus for teams balancing cost and quality.
Task specialization is more efficient than one‑size‑fits‑all. Using the cheapest acceptable model for short, repetitive tasks and reserving larger models for complex reasoning reduces cost without degrading results. In my stack, Mistral Small handles microcopy; DeepSeek handles code; Gemma does batch SEO; Qwen3 is the generalist.
Latency is a first‑order variable for interactive assistants. Arena scores around 1400–1450 can feel very different in practice when latency moves from a few hundred milliseconds to multiple seconds.
Anthropic’s move was economically coherent. From a platform‑economics perspective, subsidizing heavy third‑party usage inside flat‑rate consumer subscriptions is hard to sustain, and shifting those costs to pay‑as‑you‑go billing protects the provider’s margin. The change wasn’t surprising in hindsight.

Each of these insights came directly from the constraints and measurements I recorded during the migration; they guided concrete architecture choices rather than abstract best practices.
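
The single‑provider risk in the first lesson is commonly handled with a fallback chain: try the preferred model, and move to the next one on any failure. The sketch below is one way to write that, not a description of Nyx's internals; `complete_with_fallback`, `AllModelsFailed`, and the `send` callable are hypothetical names, and `fake_send` simulates an outage so the example runs offline.

```python
from typing import Callable, Sequence


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain has failed."""


def complete_with_fallback(prompt: str,
                           chain: Sequence[str],
                           send: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) on first success."""
    errors = []
    for model in chain:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # any provider error triggers the next model
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def fake_send(model: str, prompt: str) -> str:
    """Stand-in transport: simulates an outage on the primary model."""
    if model == "Qwen3 235B":
        raise TimeoutError("simulated outage")
    return f"[{model}] {prompt}"


chain = ["Qwen3 235B", "DeepSeek V3 0324", "Claude Sonnet 4.6"]
used, reply = complete_with_fallback("ping", chain, fake_send)
print(used, reply)
```

Putting a model from a different provider at the end of the chain is what protects against a whole provider going away, not just a single model.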

Practical options if you run an assistant that used Claude via third parties

If your automation, agent, or personal assistant depends on Claude through a subscription and a third‑party harness, you essentially have three practical paths:

Enable Anthropic’s “extra usage” on your account to keep existing integrations working. This is the simplest operational path, but it adds a new variable, usage‑based charge on top of your fixed subscription. If you take this route, note Anthropic’s one‑time credit (equivalent to a monthly subscription) that was offered and usable through April 17, plus limited bundle discounts.
Migrate to an aggregator like OpenRouter that exposes many models via an API key and charges by real usage. This takes the fixed subscription out of the third‑party path and gives you the freedom to select models by task and price, but it requires careful model selection and testing.
Use Anthropic’s direct API and abandon the subscription model for a pure pay‑per‑token workflow. That simplifies billing into a single channel but may be more expensive if your agent’s usage is high and continuous.

I chose the OpenRouter‑centered strategy complemented by occasional Anthropic on‑demand calls. That approach let me lower expected monthly spend, target specific models to tasks, and retain access to Claude quality when needed.

Developer and business implications for AI tooling and operations

This migration is a microcosm of broader tendencies in AI deployment that developers and organizations should watch:

Operational control and flexible routing will become routine. Teams that can route calls to different models programmatically — prioritizing latency, price, or quality per task — will extract more value than those tied to a single vendor plan.
Cost transparency matters. As companies unbundle subscription allowances and API billing, predictable cost models will be a priority for production agents. Built‑in monitoring and caps are essential to avoid surprise bills.
Open‑source models are changing procurement strategies. With many open models now competitive on quality and vastly cheaper, procurement can shift from vendor lock‑in toward a mix of open and proprietary models chosen for specific workstreams.
Developer tooling for model selection, latency testing, and task‑based routing will grow in importance. The differentiator is operational tooling to measure quality and latency under real conditions, not just benchmark scores.
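
The point about monitoring and caps can be illustrated with a small spend guard that refuses a call once a monthly budget would be exceeded. This is a sketch under assumptions: `SpendTracker` and `BudgetExceeded` are hypothetical names, the $1.00 cap is arbitrary, and a real deployment would persist the running total and reset it each billing period.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recording a call would push spend past the cap."""


class SpendTracker:
    """Tracks estimated spend against a hard monthly cap (USD)."""

    def __init__(self, monthly_cap_usd: float) -> None:
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def record(self, tokens: int, price_per_million: float) -> float:
        """Add the cost of one call, or raise before exceeding the cap."""
        cost = tokens * price_per_million / 1_000_000
        if self.spent + cost > self.cap:
            raise BudgetExceeded(
                f"call costing ${cost:.2f} would exceed ${self.cap:.2f} cap"
            )
        self.spent += cost
        return self.spent


tracker = SpendTracker(monthly_cap_usd=1.00)  # arbitrary cap for the example
tracker.record(tokens=10_000_000, price_per_million=0.07)  # about $0.70

blocked = False
try:
    tracker.record(tokens=10_000_000, price_per_million=0.07)  # would reach $1.40
except BudgetExceeded:
    blocked = True

print(f"spent so far: ${tracker.spent:.2f}, second call blocked: {blocked}")
```

Checking the cap before adding the cost means a rejected call never changes the running total.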

These implications affect not only hobbyist assistants but also teams that rely on agents for customer support, content pipelines, or automation. The ability to compose models by task (for example, Mistral for compacting sessions, DeepSeek for code, Gemma for bulk SEO) becomes an engineering advantage.

How I verified each tradeoff under pressure

Two constraints shaped the migration: a tight time window and the need for repeatable measurements. I prioritized what could be measured quickly:

Authentication and availability: verified which models were reachable with real API keys, tokens, or OAuth credentials.
Human quality signal: used LM Arena scores to approximate comparative quality.
Cold latency: measured live cold start times for representative prompts via my existing OpenClaw subagents.
Cost per token: referenced the cost columns reported in my inventory for a consistent cost comparison.

This allowed me to pick Qwen3 as the default for most conversational tasks while relegating Gemma to batch work and keeping faster small models for microtasks. These choices were grounded in observable tradeoffs rather than marketing claims.
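
The selection logic can be expressed as a filter followed by a ranking: discard models that miss the latency budget, then take the highest Arena score among the rest. The sketch below uses only the three models for which this article reports a score, a latency, and a price; `Measured` and `best_under` are hypothetical names, and the 1,000 ms interactive budget is an assumption, not a threshold I formally set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measured:
    name: str
    arena: int              # LM Arena score as reported above
    latency_ms: float       # median cold latency as measured above
    usd_per_million: float  # listed price per 1M tokens


MEASURED = [
    Measured("Gemma 4 31B", 1450, 6200, 0.14),
    Measured("Qwen3 235B", 1418, 638, 0.07),
    Measured("DeepSeek V3 0324", 1377, 257, 0.20),
]


def best_under(max_latency_ms: float,
               models: Sequence[Measured]) -> Optional[Measured]:
    """Highest Arena score among models within the latency budget."""
    eligible = [m for m in models if m.latency_ms <= max_latency_ms]
    return max(eligible, key=lambda m: m.arena, default=None)


interactive = best_under(1000, MEASURED)     # assumed interactive budget
batch = best_under(float("inf"), MEASURED)   # no latency constraint
print(interactive.name, "|", batch.name)
```

With these inputs the filter reproduces the split described above: Qwen3 for interactive use and Gemma for batch work.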

If you want to replicate this kind of migration, the practical checklist is straightforward: catalog what models you can actually reach with your credentials; run Arena or human‑judged quality checks where possible; measure latency in the environment your users will experience; and map models to task classes before switching defaults.

Nyx’s migration — which took me the better part of a Saturday morning — was intentionally surgical: swap the default model routing, validate key automations, and keep Anthropic for on‑demand calls. The end state is a multi‑model stack tuned to price, latency, and task suitability.

Looking ahead, the market will likely keep evolving toward composable model stacks and richer routing controls. Providers may respond to competitive pressure by improving latency and price on open models, while platform providers will refine how third‑party usage is billed and cached. For operators of personal assistants and production agents alike, the priorities are clear: measure, diversify, and automate routing so the next policy shift doesn’t become an emergency.