AI News · 8 min read · June 6, 2026

Gemma 4 QAT Models: Run Google's Best AI on Your Phone or Laptop (2026)

Google just released Gemma 4 QAT models — compressed AI you can run on a phone or laptop without an internet connection. Here's what QAT means and how to get started.

Google released a major update to Gemma 4 on June 5, 2026: the Gemma 4 QAT models, specially compressed versions of its best open-weight AI that can run on a phone or a standard laptop without a dedicated GPU.

This is a bigger deal than it sounds. Until now, running a capable AI model locally at a usable speed has typically meant a dedicated graphics card with plenty of VRAM, often 16GB or more. The new QAT versions change that equation significantly, bringing Gemma 4's performance within reach of most modern devices.


What Are QAT Models? (Plain-Language Version)

QAT stands for Quantization-Aware Training. Let's break that down without the jargon.

Normal AI models store their numbers (the weights) at high precision, typically 16 bits per value. This is accurate, but it's also heavy. A typical model might need 10–20GB of memory just to load.

Quantization is the process of storing those values with fewer bits, say 4 bits instead of 16, by rounding each one to the nearest of a small set of levels. This dramatically shrinks the model's size: going from 16 bits to 4 cuts memory by about 4x, so a model that needed 12GB of RAM might drop to 3GB.
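The size arithmetic is easy to check for yourself. This is a rough sketch, assuming a hypothetical 6-billion-parameter model (picked only because its 16-bit weights come to 12GB) and counting just the weights, not runtime overhead such as the context cache.

```python
def weights_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9

# Hypothetical 6B-parameter model, used only for illustration.
PARAMS = 6e9
size_16bit = weights_size_gb(PARAMS, 16)
size_4bit = weights_size_gb(PARAMS, 4)
print(f"16-bit: {size_16bit:.1f} GB, 4-bit: {size_4bit:.1f} GB")
```

Real-world files tend to be a little larger than this estimate, since some parts of a model are usually kept at higher precision.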

The problem with most quantization methods is that they are applied after the model is trained, which can hurt quality. QAT (Quantization-Aware Training) solves this by simulating quantization during training, so the model learns to be accurate even in its compressed form instead of being compressed as an afterthought.

The result: a model that's small enough to run on consumer hardware but performs much closer to the full-sized version than regular post-training quantization.
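To make the idea concrete, here is a minimal sketch of the "round, then convert back" step that quantization-aware training typically simulates while the model is learning. It is a generic illustration in plain Python with made-up weights, not Google's actual training recipe; the function name `fake_quantize` is hypothetical.

```python
def fake_quantize(values, bits=4):
    """Snap each value to one of 2**bits evenly spaced levels, then return floats again."""
    lo, hi = min(values), max(values)
    steps = 2 ** bits - 1                      # 15 steps between 16 levels at 4 bits
    scale = (hi - lo) / steps if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]

# Made-up weights, for illustration only.
weights = [-0.82, -0.11, 0.03, 0.47, 0.95]
quantized = fake_quantize(weights, bits=4)

# The rounding error is what the model has to learn to tolerate.
max_error = max(abs(w - q) for w, q in zip(weights, quantized))
print(quantized)
print(f"largest rounding error: {max_error:.4f}")
```

With post-training quantization, the model first sees this rounding error after training is over. With QAT, the error is present during training, so the weights can settle into values that survive the rounding.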


What Google Released

Google dropped two Gemma 4 QAT variants:

Gemma 4 E2B: The phone-sized version. In Google's Gemma naming (introduced with Gemma 3n), the "E" stands for "effective" parameters, so "E2B" means roughly 2 billion effective parameters rather than a 2-bit quantization level. This is designed to run on modern smartphones and tablets. If you've got a recent Android phone or iPhone, you may be able to run a legitimate AI model directly on your device with no cloud connection and no subscription.

Gemma 4 E4B: The laptop version. By the same naming, this is the roughly 4-billion-effective-parameter model, here quantized to 4 bits and designed for MacBooks, Windows laptops, and mini PCs. The E4B hits a sweet spot: it's small enough to fit in standard RAM, but retains most of Gemma 4's capability.

Both variants support the major local AI tools:

  • Ollama (the easiest way to run local models — recommended for beginners)
  • MLX (optimized for Apple Silicon Macs — extremely fast)
  • llama.cpp (advanced, maximum compatibility across hardware)
  • Hugging Face Transformers (for developers)

How to Run Gemma 4 QAT on Your Laptop

If you haven't set up Ollama before, the short version: Ollama is a free tool that makes running AI models on your laptop about as complicated as installing an app. Here's how to get Gemma 4 QAT running:

Step 1: Install Ollama

Go to ollama.com and download the installer for your operating system. Available for Mac, Windows, and Linux. Install it like any other app.

If you've never used a terminal before, the terminal beginners guide covers the basics in about 10 minutes — you'll need it for step 2.

Step 2: Pull the Gemma 4 QAT Model

Open your terminal and run:

ollama pull gemma4:e4b

For the phone-optimized version:

ollama pull gemma4:e2b

Ollama will download the model automatically. The E4B version is roughly 3–4GB. The E2B is smaller still.
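If you're not sure you have room for the download, a quick check like the one below can help. It's a small sketch using Python's standard library; the 4GB figure comes from the estimate above, and the helper names are hypothetical. By default it checks the drive of the current folder, so point `path` at the drive where Ollama stores its models if that's somewhere else.

```python
import shutil

def free_disk_gb(path: str = ".") -> float:
    """Free space on the drive that holds `path`, in GB (1 GB = 1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def has_room_for(download_gb: float, path: str = ".") -> bool:
    """True if the drive has at least `download_gb` of free space."""
    return free_disk_gb(path) >= download_gb

# Upper end of the E4B estimate from this article.
print(f"Free space: {free_disk_gb():.1f} GB")
print("Room for E4B:", has_room_for(4.0))
```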

Step 3: Run It

ollama run gemma4:e4b

You'll get a chat interface in your terminal. Start asking questions. It's running entirely on your machine — no internet required once downloaded.
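Once the model is running, you can also talk to it from a script instead of the terminal chat. The sketch below uses Ollama's local HTTP endpoint on its default port (11434) with only Python's standard library. The model tag `gemma4:e4b` is the one used in this article, and the helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4:e4b") -> urllib.request.Request:
    """Build a POST request asking Ollama for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, model: str = "gemma4:e4b") -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running in the background, `print(ask("Summarize this email in two sentences: ..."))` should print the model's reply. If Ollama isn't running, the call fails with a connection error.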

For Apple Silicon Macs: Use MLX

If you're on an Apple Silicon Mac (M1 or later), the MLX framework is often faster than Ollama for local models because it is built specifically for Apple's chips: it runs on the Mac's GPU through Metal and takes advantage of unified memory.

pip install mlx-lm
mlx_lm.generate --model google/gemma-4-e4b-it-litert --prompt "What are 5 ways to make money with AI?"

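MLX on an M4 MacBook Pro hits speeds of 40–65 tokens per second on the E4B model, fast enough for real conversational use. If the command fails to load the model, check the model's Hugging Face page for an MLX-compatible build; repositories labeled LiteRT are packaged for Google AI Edge rather than MLX.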
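To get a feel for what 40–65 tokens per second means in practice, here is a quick back-of-the-envelope sketch. The 300-token reply length is an assumption for illustration (very roughly a couple of paragraphs), not a measured figure.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce a reply of `num_tokens` at a steady generation speed."""
    return num_tokens / tokens_per_second

REPLY_TOKENS = 300  # hypothetical reply length

slowest = seconds_to_generate(REPLY_TOKENS, 40)  # low end of the quoted range
fastest = seconds_to_generate(REPLY_TOKENS, 65)  # high end of the quoted range
print(f"A {REPLY_TOKENS}-token reply takes about {fastest:.1f} to {slowest:.1f} seconds")
```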


How to Check If Your Laptop Can Handle It

Before downloading, check your available RAM. The Gemma 4 E4B needs about 4–6GB of free RAM to run comfortably. Most laptops sold in the last three years have 8–16GB total, so this is achievable.

To check your VRAM and RAM specs, see the how to check VRAM for AI guide — it covers Mac, Windows, and Linux in plain language.

Quick rule of thumb:

  • 8GB RAM laptop: E2B model, light tasks
  • 16GB RAM laptop: E4B model, comfortable performance
  • M2/M3/M4 Mac: Either model runs well; use MLX for speed
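The rule of thumb above can be written as a tiny helper. This is a sketch under stated assumptions: the 8GB and 16GB cut-offs come from the list, but how to treat in-between machines (say, 12GB) is my own conservative choice, and the function name `suggest_model` is hypothetical.

```python
from typing import Optional

def suggest_model(total_ram_gb: float) -> Optional[str]:
    """Map total laptop RAM to the Ollama tag this article recommends."""
    if total_ram_gb >= 16:
        return "gemma4:e4b"   # comfortable performance
    if total_ram_gb >= 8:
        return "gemma4:e2b"   # light tasks; in-between sizes stay on the smaller model
    return None               # below 8GB, neither model is recommended here

for ram in (4, 8, 12, 16, 32):
    print(f"{ram}GB RAM -> {suggest_model(ram)}")
```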

Why This Matters for Making Money with AI

Running AI locally changes the economics of AI-powered work in three important ways:

No monthly subscription costs. Most capable AI services charge $20–$200/month. Once you've pulled a QAT model, every prompt is free. For someone running an AI-assisted content workflow, automating client tasks, or building a side business, that's a meaningful cost reduction.

Privacy for client work. Sending client data to a third-party API creates legal and practical questions. Local models process everything on your machine — nothing leaves your computer. If you're building AI tools for businesses in healthcare, legal, or finance, this matters enormously.

Offline work. Rural internet, travel, or just spotty Wi-Fi — local AI models work without a connection. Your workflow keeps running regardless.

That said, if you need the very best performance — multimodal analysis, complex coding, high-volume processing — cloud models still win. Tools like CustomGPT let you build custom AI assistants on top of strong cloud models when local performance isn't enough. And for GPU-accelerated cloud compute when you need more power than your laptop offers, Ampere Cloud provides on-demand GPU instances optimized for AI workloads.


Gemma 4 QAT vs. Previous Gemma Versions

If you've been following the Gemma series, here's how the QAT models compare:

Version | Size | Where It Runs | QAT | Speed / Notes
Gemma 4 (full) | 27B parameters | Needs GPU | No | Slow on consumer hardware
Gemma 4 12B | 12B parameters | High-end laptop | No | Needs 12GB+ VRAM
Gemma 4 E4B (QAT) | ~4GB download | Standard laptop | Yes | 40–65 tok/s on M4
Gemma 4 E2B (QAT) | ~2GB download | Phone | Yes | On-device, no cloud

The E4B version is the first Gemma model that genuinely fits into a standard consumer laptop workflow without configuration headaches. Previous versions required either a powerful GPU or sacrificing significant performance.


Gemma 4 QAT + Google AI Edge Gallery

Google also launched Google AI Edge Gallery for macOS alongside the QAT release — a desktop app that lets Mac users browse, download, and run Gemma-based models without touching the terminal.

Think of it as an app store for local AI models. You open the gallery, pick a model, click download, and start chatting. No terminal, no configuration. This is the most beginner-friendly path to running local AI Google has offered yet.

The gallery currently supports a curated set of Gemma models. It's free and available through the Google AI Edge site.


Frequently Asked Questions

What is a QAT model? QAT stands for Quantization-Aware Training. It's a method of compressing AI models by reducing numerical precision, done during the training process so the model stays accurate at its smaller size. QAT models are significantly smaller and faster than standard models while retaining most of their quality.

Can I run Gemma 4 QAT on a Windows laptop? Yes. The E4B model works on Windows via Ollama or llama.cpp. You need about 4–6GB of free RAM. Most laptops with 16GB total RAM will handle it comfortably. Check the VRAM guide to verify your specs.

Can I actually run this on a phone? The E2B model is designed for phones. Android is the most supported platform right now via Google AI Edge on Android. iOS support is more limited. Performance varies by device — newer phones with dedicated AI hardware (Neural Processing Units) handle it best.

Is Gemma 4 QAT better than ChatGPT? No — cloud models like GPT-5, Claude Opus, and Gemini Ultra still significantly outperform any local model of this size. What QAT models offer is a usable, private, zero-cost alternative for everyday tasks. For complex analysis, creative writing, or coding assistance, cloud models are still better. For summarization, Q&A, drafting, and basic reasoning, the E4B is surprisingly capable.

How much storage does Gemma 4 E4B take up? Approximately 3–4GB of disk space. The E2B is around 2GB. Both download through Ollama automatically. You'll need space on your main drive — external drives can work but are slower.

Do I need the internet to run it after downloading? No. Once downloaded, the model runs entirely offline. You only need internet to download the model initially.

Which is better — E2B or E4B? E4B is significantly more capable and is the recommended choice for laptops. E2B is designed for phones and lower-power devices where E4B won't fit. If you have the RAM for E4B, use E4B.

Can I use Gemma 4 QAT for my business? Yes. Gemma models are released under Google's Gemma Terms of Use, which permit commercial use. You can use QAT models in products and services you sell. Review the Gemma license for specifics.

Alex the Engineer

Founder & AI Architect

Senior software engineer turned AI Agency owner. I build massive, scalable AI workflows and share the exact blueprints, financial models, and code I use to generate automated revenue in 2026.
