Local AI · 8 min read · May 8, 2026

How to Install Llama 4 Locally: Step-by-Step for Windows and Mac

Install and run Llama 4 locally on Windows or Mac in 2026 — no subscriptions, no API keys. Complete beginner guide using Ollama with GPU and CPU options.

Meta's Llama 4 is one of the most capable open-weight AI models available today — and you can run it completely free on your own computer. No monthly subscription. No API key. No data sent to any server.

This guide covers exactly how to do it, step by step, for both Windows and Mac. We'll use Ollama — the easiest and most reliable way to run Llama 4 locally in 2026.

Before starting, check two things:

  • Your VRAM — see our VRAM check guide to confirm your GPU has enough headroom (or use the quick command below).
  • Your terminal — you'll need to run a few commands. If you've never done that, read our terminal beginners guide first (takes 5 minutes).
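
If your machine has an NVIDIA card, a quick way to check VRAM from the terminal (this assumes current NVIDIA drivers, which include the nvidia-smi tool):

# Print the GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv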

Which Llama 4 Model Should You Run?

Meta released two versions of Llama 4 worth knowing about:

Model              Size            What You Need            Best For
Llama 4 Scout      17B×16E MoE     8GB+ VRAM (quantized)    Everyday chat, coding help, Q&A
Llama 4 Maverick   17B×128E MoE    16GB+ VRAM               Complex reasoning, longer tasks

Start with Llama 4 Scout. It runs on most modern gaming GPUs (RTX 3060 and up) and even on CPU-only setups (slower but works). Maverick is more powerful but requires significantly more hardware.


Step 1 — Install Ollama

Ollama is a free, open-source tool that lets you download and run AI models with a single command. It handles everything: model downloading, quantization, and a local API you can use from any app.

On Mac

Open Terminal and run:

curl -fsSL https://ollama.com/install.sh | sh

This downloads and installs Ollama automatically. Done in under a minute.

On Windows

Open PowerShell as Administrator and run:

irm https://ollama.com/install.ps1 | iex

Alternatively, download the Windows installer directly from ollama.com. The installer sets up Ollama as a background service that starts automatically.

Windows users: If you have an NVIDIA GPU, make sure you've installed the latest NVIDIA drivers before this step. AMD GPU support on Windows is limited — CPU mode will work, just slower.
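
Whichever route you take, confirm the install worked by opening a new terminal and checking the version:

# Prints the installed version if Ollama is on your PATH
ollama --version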


Step 2 — Download and Run Llama 4 Scout

Once Ollama is installed, open your terminal (Mac: Terminal app, Windows: PowerShell or Command Prompt) and run:

ollama run llama4:scout

What happens next:

  1. Ollama downloads the quantized Scout model (~7GB)
  2. It loads into your GPU (or RAM if CPU-only)
  3. You get a chat prompt in the terminal

The first run takes 5–15 minutes depending on your internet speed. Every run after that starts in seconds because the model is cached locally.

To test it:

>>> What's the difference between RAM and VRAM?

You should see a detailed, accurate answer within a few seconds on GPU, or 30–60 seconds on CPU-only.

To exit: Type /bye and press Enter.
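
You can also pass a prompt directly instead of opening the interactive chat, which is handy for one-off questions or shell scripts:

# One-shot prompt: prints the answer and exits
ollama run llama4:scout "Explain the difference between RAM and VRAM in two sentences."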


Step 3 — Run Llama 4 Maverick (Optional, Requires More VRAM)

If your machine has 16GB+ VRAM and you want the more powerful model:

ollama run llama4:maverick

This pulls the larger Maverick model, which is noticeably better at multi-step reasoning, code generation, and long-document tasks. It's not necessary for everyday use — Scout handles most things well.


Step 4 — Use a Visual Interface (Optional but Recommended)

The terminal chat works, but a proper chat interface is much easier to use day-to-day. The best free option is Open WebUI — it gives you a ChatGPT-style browser interface that connects directly to your local Ollama models.

See our Open WebUI setup guide for installation in under 10 minutes.
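
If you already have Docker installed, the quick start is a single command. The one below is the standard invocation from the Open WebUI project's documentation as of this writing; the linked guide walks through the details:

# Run Open WebUI in Docker, connected to the local Ollama instance
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Once it's running, open http://localhost:3000 in your browser.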

Alternatively, if you'd rather use a desktop app with model management and an interface built in, check our LM Studio tutorial — LM Studio is a good alternative to Ollama for beginners who prefer clicking over typing.


Step 5 — Useful Ollama Commands

Once you have Ollama running, these commands are the ones you'll use most:

# List all models you've downloaded
ollama list

# Run a specific model
ollama run llama4:scout

# Pull a model without running it
ollama pull llama4:scout

# Remove a model to free up disk space
ollama rm llama4:scout

# See what's currently running
ollama ps

To check the Ollama service status on Windows (it runs in the background):

Get-Service ollama
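
On Mac and Linux there's no Windows-style service entry; if the server isn't already running in the background, start it manually:

# Start the Ollama server in the foreground (Ctrl+C to stop)
ollama serve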

How Much RAM/VRAM Do You Need?

This is the most common question. Here's the honest answer:

Your Hardware                   Can You Run Llama 4?        Experience
RTX 3060 (12GB VRAM)            Yes — Scout runs easily     Fast, responsive
RTX 3080/4070 (10–12GB VRAM)    Yes — Scout runs well       Very fast
RTX 4090 (24GB VRAM)            Yes — both models           Excellent
8GB VRAM                        Yes — Scout (quantized)     Good; some slowdown on long prompts
No GPU / CPU only               Yes — Scout (Q4 quant)      Slow (1–3 tokens/sec), but works
Mac M-series (M1–M4)            Yes — Scout and Maverick    Excellent (unified memory)

Apple Silicon Macs (M1 through M4) are particularly good for local AI — the unified memory architecture means 16–32GB RAM functions like VRAM, making them ideal for running larger quantized models without a discrete GPU.
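
To check how much unified memory your Mac has:

# Reports total memory on macOS (e.g., "Memory: 16 GB")
system_profiler SPHardwareDataType | grep "Memory:"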


What Can You Do With Llama 4 Locally?

Once it's running, Llama 4 handles:

  • Coding help — debugging, code review, explaining errors. Scout is solid for Python, JavaScript, and most common languages.
  • Document Q&A — paste in a long document and ask questions about it
  • Writing assistance — drafts, rewrites, summaries
  • Private conversations — nothing leaves your machine. Useful for anything you don't want cloud AI services seeing.
  • Automation scripts — combine with Python and the Ollama API (http://localhost:11434) to build local AI tools; a minimal sketch follows below
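
Here's that native Ollama API called with curl; this is a minimal sketch, and the model tag assumes the Scout model from Step 2:

# Request a single, non-streamed completion from the local server
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Write a one-line Python hello world.",
  "stream": false
}'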

The local API is fully OpenAI-compatible, so any tool or script built for the OpenAI API can be pointed at Ollama with a one-line change.
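
For example, the same local server answers OpenAI-style chat requests through Ollama's compatibility endpoint:

# OpenAI-compatible chat endpoint on the same local server
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'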


Troubleshooting Common Issues

"Ollama is not recognized as a command" Restart your terminal after installation. On Windows, close PowerShell and reopen it.

Download stuck or very slow
The models are 5–10GB files. If the download stalls, press Ctrl+C and rerun ollama pull llama4:scout — it resumes from where it stopped.

Model loads but responses are extremely slow
You're running in CPU-only mode. This is normal — expect 1–3 tokens per second. For GPU acceleration, confirm your drivers are up to date and that Ollama is actually using your GPU: with the model loaded, run ollama ps in a second terminal and check the PROCESSOR column.
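
For example (the output here is illustrative):

# While a model is loaded, check which processor it's serving from
ollama ps
# NAME            SIZE     PROCESSOR    UNTIL
# llama4:scout    8.0 GB   100% GPU     4 minutes from now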

Out of memory error when starting the model
Your VRAM isn't enough for the default quantization level. Try pulling a more aggressively quantized build. Exact tag names vary, so check the model's page on ollama.com, but it typically looks like:

ollama run llama4:scout-q4_0

This reduces quality slightly but uses significantly less VRAM.

Windows: "Access denied" error during install Make sure you're running PowerShell as Administrator. Right-click on PowerShell → "Run as administrator."


FAQ

Q: Is Llama 4 free to use locally?
A: Yes. Meta released Llama 4 under an open license. The Ollama tool is also free and open source. You pay nothing beyond your electricity costs.

Q: Does running Llama 4 locally require internet?
A: Only for the initial download. Once the model is on your machine, it runs completely offline. No internet connection needed.

Q: How is Llama 4 locally different from ChatGPT?
A: ChatGPT runs on OpenAI's servers — your conversations are sent to their systems and subject to their privacy policy. Llama 4 locally runs entirely on your hardware. Nothing is transmitted anywhere. The tradeoff: cloud models like GPT-5.5 are generally more capable on complex tasks, but for everyday use the gap is smaller than most people expect.

Q: Can I use Llama 4 for commercial projects?
A: Check Meta's Llama license. As of 2026, commercial use is permitted for organizations under 700 million monthly active users, with some restrictions. For personal projects and small businesses, you're fine.

Q: What's the difference between Ollama and LM Studio?
A: Both run local models. Ollama is terminal-based and lighter-weight — better for automation and scripting via the local API. LM Studio has a graphical interface and model browser — better for beginners who prefer clicking over commands. Our LM Studio guide covers the alternative setup.

Q: Do I need a special computer to run Llama 4?
A: No. Scout runs on CPU-only if needed — it's slow but functional. Any modern computer from 2020 onward with 16GB RAM can run it. A GPU dramatically improves speed, but it's not required to get started.

Q: Can I run Llama 4 on a laptop?
A: Yes. NVIDIA laptop GPUs (RTX 3060/3070/4060 mobile) work well for Scout. Mac laptops with M-series chips are excellent for local AI — the unified memory gives them an advantage over many desktop setups.


Running Llama 4 locally takes about 15 minutes from start to working chat. It's one of the most practical things you can set up as someone exploring AI tools — a capable, private, free AI assistant that runs entirely on your own hardware.

Once you have it running, try connecting it to Open WebUI for a proper interface, or explore using the local API to build simple automation scripts.

Alex the Engineer

Founder & AI Architect

Senior software engineer turned AI Agency owner. I build massive, scalable AI workflows and share the exact blueprints, financial models, and code I use to generate automated revenue in 2026.
