If you’ve ever tried to run a “serious” multimodal model locally, you know the deal. You get one of two experiences.
A slick demo that crawls. Or a model that’s smart… until your machine wheezes, swaps, and taps out.
So Gemma 4 showing up with this very specific energy is kind of refreshing: frontier-ish multimodal capability, open weights, and a loud, explicit “yeah, this is meant to run on-device.”
Gemma 4 from Google DeepMind is a family of open-weight models in four sizes: E2B, E4B, 26B A4B, and 31B. You get Apache 2.0 licensing, multimodal inputs across the lineup, and a real push toward practicality on laptops and even phones. The model card also calls out long context windows: 128K on the small models and 256K on the medium ones. Not a rounding error. Not “marketing long.” Actually workflow-changing.
For the authoritative specs, here’s the source of truth: [Gemma 4 model card].
Key takeaways, the way you’d tell a friend
Gemma 4 comes in four variants: E2B, E4B, 26B A4B, and 31B.
Multimodal is the default. It’s text + image across the family, and E2B/E4B also take audio input. Video is handled the usual practical way: frames.
The context windows are big:
- 128K tokens on E2B/E4B
- 256K tokens on 26B A4B and 31B
The release is open-weight under Apache 2.0, which is the kind of licensing detail that suddenly makes commercial and on-prem conversations way less awkward.
Google and Hugging Face also publicly call out a few architecture moves aimed at efficiency: hybrid attention, Per-Layer Embeddings, and Shared KV Cache.
Benchmarks and positioning, early-days flavor: Hugging Face reports estimated LMArena ~1452 for 31B and ~1441 for 26B MoE as text-only scores in their write-up, and Engadget points to strong Arena leaderboard placements.
What “Gemma 4 has landed” means in boring, verifiable terms
Google’s release notes list a “Release of Gemma 4 in E2B, E4B, 31B and 26B A4B sizes”, last updated 2026-04-02 UTC, on the Gemma releases page. That’s the cleanest canonical “yep, it’s out” you’re going to get.
Source: [Gemma releases page].
And the naming isn’t random, even if it looks like it at first glance.
E2B / E4B
The “E” means “effective” parameters. The model card lists 2.3B effective and 4.5B effective, using techniques like PLE to stay efficient.
26B A4B
The “A” means “active” parameters. It’s a Mixture-of-Experts model with ~4B active during inference even though total parameters are ~26B. In some respects, it can run closer to a 4B model’s cost profile. Not magic. Just MoE doing MoE things.
31B
A classic dense model. Bigger. Heavier. Straightforward.
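To make the “active vs total” distinction concrete, here’s a back-of-envelope sketch. The 26B/4B/31B figures come from the sizes above; the 2-FLOPs-per-active-parameter rule of thumb and bf16 weight storage are my assumptions, not model-card numbers.

```python
# Back-of-envelope: why an MoE with ~4B active params can have per-token
# compute closer to a dense 4B model. bf16 = 2 bytes/param is an assumption.

def moe_profile(total_b, active_b, bytes_per_param=2):
    """Rough per-token forward FLOPs and total weight memory (GB)."""
    # Forward-pass FLOPs per token scale with *active* params (~2 FLOPs/param).
    flops_per_token = 2 * active_b * 1e9
    # But every expert's weights still have to live somewhere:
    # memory scales with *total* params.
    weight_gb = total_b * 1e9 * bytes_per_param / 1e9
    return flops_per_token, weight_gb

moe_flops, moe_mem = moe_profile(total_b=26, active_b=4)
dense_flops, dense_mem = moe_profile(total_b=31, active_b=31)

print(f"26B A4B:   ~{moe_flops/1e9:.0f} GFLOPs/token, ~{moe_mem:.0f} GB weights (bf16)")
print(f"31B dense: ~{dense_flops/1e9:.0f} GFLOPs/token, ~{dense_mem:.0f} GB weights (bf16)")
```

Same memory footprint story as any MoE: you pay for all 26B in RAM/VRAM, but each token only pushes through ~4B of it.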
The parts you’ll actually care about: multimodal, long context, agent workflows
Multimodal support
Per the model card and Hugging Face’s launch post:
All Gemma 4 models support text + image input → text output, and the vision stack supports variable aspect ratio.
E2B and E4B add native audio input, targeting things like ASR and speech translation.
For video understanding, the description is basically “process sequences of frames.” In practice, that’s often exactly what your pipeline is anyway.
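If your pipeline feeds frames, the only real decision is which frames. A minimal uniform-sampling sketch; nothing here is Gemma-specific, and you’d extract the actual frames upstream with ffmpeg or OpenCV:

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick evenly spaced frame indices so a long clip fits a fixed image budget."""
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each segment for more representative frames.
    return [int(step * i + step / 2) for i in range(num_samples)]

# e.g. a 30 fps, 10-second clip (300 frames) down to 8 frames:
print(sample_frame_indices(300, 8))
```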
Hugging Face also mentions a configurable image token budget: 70, 140, 280, 560, 1120 tokens. That’s a big deal if you’ve ever wanted “faster, please” without throwing away the whole vision feature set.
Source: [Welcome Gemma 4: Frontier multimodal intelligence on device].
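To see why the budget matters, some quick arithmetic. The per-image budgets are the ones Hugging Face reports and the 128K context is from the model card; the 8K reserve for prompt and answer is an arbitrary placeholder of mine.

```python
BUDGETS = [70, 140, 280, 560, 1120]  # per-image token options reported by Hugging Face

def images_that_fit(context_tokens, reserve_for_text, per_image):
    """How many images fit once you reserve room for the prompt and the answer."""
    return (context_tokens - reserve_for_text) // per_image

for b in BUDGETS:
    n = images_that_fit(context_tokens=128_000, reserve_for_text=8_000, per_image=b)
    print(f"{b:>5} tokens/image -> up to {n} images in a 128K context")
```

The same knob is also a latency knob: fewer image tokens means less prefill work per image, which is usually where “faster, please” actually comes from.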
Long context, the quiet killer feature
The model card is very direct here:
- E2B/E4B have a 128K context window
- 26B A4B / 31B have a 256K context window
This is the stuff that changes how you work. Big repos. Long docs. Multi-file reasoning. Agent runs that don’t feel like a goldfish memory test. The “here’s the entire log, tell me what happened” debugging move.
Will you still need prompt discipline? Yeah. You can absolutely drown a model in junk at 256K. But the ceiling is way nicer.
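One cheap form of prompt discipline: trim the log to a budget before it ever reaches the model. A rough sketch using a chars-per-token heuristic rather than the real tokenizer; the ≈4 chars/token ratio is an assumption for English text, not a Gemma tokenizer fact.

```python
def tail_to_budget(text, max_tokens, chars_per_token=4):
    """Keep the most recent part of a log that fits the token budget.
    chars_per_token ~= 4 is a rough English-text heuristic, not the real tokenizer."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    trimmed = text[-max_chars:]
    # Cut at the first newline so we don't start mid-line.
    nl = trimmed.find("\n")
    return trimmed[nl + 1:] if nl != -1 else trimmed

log = "\n".join(f"line {i}: ok" for i in range(100_000))
print(len(tail_to_budget(log, max_tokens=256_000)))
```

For real work you’d reach for the actual tokenizer (or retrieval) instead, but a tail-trim like this is a decent first seatbelt for the “here’s the entire log” move.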
Agentic and function-calling workflows
Gemma 4 introduces or strengthens some “build apps, not just chat” controls:
You get native system prompt support via the system role. There’s native function calling for structured tool use. And the model card mentions configurable “thinking modes” for reasoning.
This is the kind of thing you notice immediately when you wire it into an agent loop and stop hand-rolling workarounds.
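The exact wire format Gemma 4’s chat template expects for tools isn’t something I’ll assert here. The sketch below uses the common JSON-schema style most stacks converge on, with a hypothetical get_weather tool, plus the defensive parse step every agent loop ends up needing:

```python
import json

# Hypothetical tool schema; the exact format Gemma 4's chat template expects
# may differ. This is the common JSON-schema style used by most stacks.
tools = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [
    {"role": "system", "content": "You can call tools when useful."},
    {"role": "user", "content": "Is it raining in Oslo?"},
]

# An agent loop then does: 1) generate, 2) parse a tool call out of the reply,
# 3) run the tool, 4) append the result and generate again.
def parse_tool_call(model_reply):
    """Parse a JSON tool call; return None if the model answered in prose."""
    try:
        call = json.loads(model_reply)
        return call if isinstance(call, dict) and "name" in call else None
    except json.JSONDecodeError:
        return None

print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(parse_tool_call("It rains a lot in Oslo."))
```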
Under the hood, in human terms: why it’s faster than it looks
Not everyone needs architecture details. Totally fair. But some of these choices explain why “on-device friendly” is more than a vibes-based promise.
From Hugging Face plus the model card:
Hybrid attention alternates local sliding-window layers and global full-context layers. Small models use 512-token windows, larger ones use 1024. The final layer is global.
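Sliding-window attention is easy to picture as a mask. A tiny pure-Python sketch, with a window of 3 for readability; the 512/1024 windows above are the real numbers:

```python
def causal_mask(seq_len, window=None):
    """True where query position q may attend to key position k.
    window=None -> global causal layer; window=W -> sliding-window layer."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q                             # causal: no future tokens
            if window is not None:
                visible = visible and (q - k < window)   # local band only
            row.append(visible)
        mask.append(row)
    return mask

local = causal_mask(6, window=3)
glob = causal_mask(6)
print("local row 5: ", local[5])   # only sees positions 3..5
print("global row 5:", glob[5])    # sees everything up to 5
```

The local layers keep KV cache and compute bounded; the interleaved global layers (and the global final layer) are what let information still travel across the whole context.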
There’s Dual RoPE + p-RoPE, with separate rotary position setups for sliding vs global layers to keep long-context behavior stable.
Per-Layer Embeddings show up especially for the “effective” models. Think extra embedding tables acting like lightweight per-layer conditioning.
Then there’s Shared KV Cache, where later layers can reuse K/V tensors. Less redundant compute and memory, which matters a lot when context gets long and the KV cache starts acting… let’s say “ambitious.” I’ve had the “why is my VRAM gone?” moment before. This reads like someone else has too.
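Why shared KV cache matters is easiest to see in the standard cache-size formula. The layer count, KV-head count, and head dimension below are placeholders, not Gemma 4’s real config:

```python
def kv_cache_gb(seq_len, n_layers, kv_heads, head_dim, bytes_per=2, shared_layers=0):
    """Standard KV-cache size: 2 tensors (K and V) per *distinct* layer.
    shared_layers = layers that reuse another layer's K/V and store nothing new."""
    distinct = n_layers - shared_layers
    total = 2 * distinct * seq_len * kv_heads * head_dim * bytes_per
    return total / 1e9

# Placeholder shapes (NOT Gemma 4's real config): 48 layers, 8 KV heads, dim 256.
full = kv_cache_gb(256_000, 48, 8, 256, shared_layers=0)
shared = kv_cache_gb(256_000, 48, 8, 256, shared_layers=24)
print(f"no sharing: {full:.1f} GB; half the layers shared: {shared:.1f} GB")
```

Note the `seq_len` factor: at 256K context the cache dwarfs most laptops’ memory unless something (sharing, local windows, fewer KV heads) cuts it down.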
Running Gemma 4 locally with Transformers
The official model card uses a straightforward Hugging Face Transformers flow. Here’s a minimal version you can paste into a venv.
pip install -U transformers torch accelerate

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E2B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain shared KV cache like I'm a GPU."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

out = model.generate(**inputs, max_new_tokens=300)
response = processor.decode(out[0][input_len:], skip_special_tokens=True)
print(response)

If you want the “grown-up, production-ish” route, Google’s Vertex AI docs list Gemma 4 among available open models and outline serving and tuning options:
Use Gemma open models on Vertex AI.
Two workflows I’d actually use (not just talk about)
1) Local offline coding assistant (code + long context)
Engadget calls out using Gemma 4 for offline code generation, which is the real reason a lot of people run open models at all. Privacy and latency. Not vibes.
If you’re resource constrained, start with E4B. When you want stronger reasoning without fully paying the dense 31B cost profile, 26B A4B is a pretty logical next jump.
2) Multimodal UI / OCR parsing (screens, docs, receipts)
Hugging Face shows Gemma 4 doing object detection + “pointing” with bounding boxes and returning JSON-ish outputs without tons of coercion. The model card also lists OCR, document/PDF parsing, and UI understanding as first-class capabilities.
This is perfect for stuff like extracting values from screenshots, parsing receipts and invoices, and even “what button do I click” UI automation assistants. Carefully. Very carefully.
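“JSON-ish without tons of coercion” still deserves a seatbelt. Here’s a small defensive parser for replies that wrap the array in prose; the box schema is a made-up example, not Gemma 4’s documented output format:

```python
import json
import re

def extract_boxes(reply):
    """Pull the first JSON array out of a model reply that may wrap it in prose
    or a markdown fence. The box schema here is a hypothetical example."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

reply = """Here are the buttons I found:
[{"label": "Submit", "box": [120, 480, 260, 520]},
 {"label": "Cancel", "box": [280, 480, 400, 520]}]"""

print(extract_boxes(reply))
```

Returning `[]` on failure (instead of raising) is deliberate: in a UI-automation loop you want “saw nothing, try again” as the failure mode, not a crashed assistant.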
Common mistakes that make people rage-quit
Picking the wrong size is a classic. If you’re on a laptop GPU, try E4B first. Jump straight to 31B, start swapping, then blame the model… yeah, no. That one’s on us.
If your pipeline feels slow, don’t ignore image token budgets. Lower the budget where supported before you declare the model “too heavy.”
And don’t treat 256K context as a license to shovel everything in. You still want summarization, chunking, and retrieval. You can absolutely overwhelm the model with irrelevant text.
License-wise, Apache 2.0 is permissive, but you still own your safety and compliance work. No free passes.
If you’re also thinking about responsible deployment and how people talk themselves into bad decisions during hype cycles, I have a related piece here. AI fearmongering: how to spot it (and what to do instead).
So… should we care that Gemma 4 landed?
Yeah, I think so.
Apache 2.0 open weights, multimodal inputs, native function calling, and 128K/256K context windows that actually map to real hardware tiers: that combination is hard to shrug off. And the architecture choices like PLE, hybrid attention, and shared KV cache aren’t academic. They’re the difference between “weekend science project” and “this could actually live on my machine.”
If you try Gemma 4, I genuinely want to know what you ran it on. CPU? M-series? 4090? Phone? And what broke first.
If you want a practical next step, pick E4B, run a small multimodal task like OCR or UI parsing, then scale up only when you hit a real limit.
Sources
- Hugging Face — Welcome Gemma 4: Frontier multimodal intelligence on device. https://huggingface.co/blog/gemma4
- Google AI for Developers — Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4
- Google AI for Developers — Gemma releases (Gemma 4 release listed; last updated 2026-04-02 UTC). https://ai.google.dev/gemma/docs/releases
- Google Cloud — Use Gemma open models on Vertex AI. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-gemma
- Engadget — Google releases Gemma 4, a family of open models built off of Gemini 3. https://www.engadget.com/ai/google-releases-gemma-4-a-family-of-open-models-built-off-of-gemini-3-160000332.html
- YouTube (Sam Witteveen) — Gemma 4 Has Landed! https://www.youtube.com/watch?v=5aqF1HVpjdc
- YouTube — What’s new in Gemma 4: https://www.youtube.com/watch?v=jZVBoFOJK-Q