Configure Local Models
Run AI models locally on your machine for free, with no API keys needed. Local models run entirely on your device, so no data is sent to external servers and everything stays private.
LM Studio
LM Studio is a user-friendly desktop app with downloadable models, perfect for getting started with local AI.
Step 1: Download & Install
Get LM Studio from lmstudio.ai (free for Windows, macOS, and Linux).
Step 2: Download a Model
- Open LM Studio
- Go to the Search tab
- Download a model like “Llama 3.2” or “Qwen 2.5”
Popular models:
- Llama 3.2 3B - Fast, efficient for general tasks
- Qwen 2.5 7B - Strong reasoning capabilities
- Phi-3 Medium - Microsoft’s compact model
Step 3: Start Local Server
- Go to the Developer tab in LM Studio
- Click Start Server (runs on the default port 1234; you can verify it with the quick check below)
- Keep LM Studio running in the background
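To confirm the server is actually listening, you can query its OpenAI-compatible API from a terminal. This is a quick sanity check assuming the default port of 1234:
```sh
# Should return a JSON list of the models LM Studio can serve
curl http://localhost:1234/v1/models
```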
Step 4: Configure Extension
- Open the WebLLM extension sidepanel
- Go to Providers tab
- Click Configure next to LM Studio
- The extension will auto-detect the running server
- Click Test Connection to verify
- Click Save
That’s it! WebLLM will now route requests to your local LM Studio models.
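If you ever want to check the endpoint outside the extension, you can send a chat completion request directly to LM Studio's OpenAI-compatible server. This is a sketch assuming the default port and a loaded model; replace the model value with an identifier reported by /v1/models:
```sh
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```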
Ollama
Ollama is a command-line tool for running LLMs, ideal for developers who prefer terminal-based workflows.
Step 1: Install Ollama
Download from ollama.ai or use the installation script:
```sh
curl -fsSL https://ollama.ai/install.sh | sh
```
Available for macOS, Linux, and Windows (WSL2).
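Once the script finishes, you can confirm the CLI is on your PATH:
```sh
ollama --version
```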
Step 2: Download & Run a Model
Open your terminal and run:
```sh
ollama run llama3.2
```
This downloads the model (if needed) and starts it. Other popular models:
- `ollama run qwen2.5` - Qwen 2.5 (strong reasoning)
- `ollama run phi3` - Microsoft Phi-3 (compact)
- `ollama run codellama` - Code-specialized model
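If you prefer to download a model ahead of time without opening an interactive chat, you can pull it separately:
```sh
# Download without starting a chat session
ollama pull qwen2.5
# Confirm it is installed
ollama list
```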
Step 3: Verify Server is Running
Ollama automatically starts a server on port 11434. Test it with:
```sh
curl http://localhost:11434
```
You should see: Ollama is running
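The server also exposes the list of installed models over HTTP, which can be handy when debugging:
```sh
# Returns a JSON list of locally installed models
curl http://localhost:11434/api/tags
```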
Step 4: Configure Extension
- Open the WebLLM extension sidepanel
- Go to Providers tab
- Click Configure next to Ollama
- Enter server URL: `http://localhost:11434/v1/chat/completions`
- Click Test Connection to verify
- Click Save
Done! Your web pages can now use local Ollama models via WebLLM.
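You can also test the exact URL you entered by sending a request to it directly from a terminal. A minimal sketch, assuming the llama3.2 model from Step 2 is installed:
```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Reply with a single word."}]
      }'
```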
Managing Local Models
Switching Models (LM Studio)
In LM Studio’s Developer tab, you can select which model to use. The extension will use whichever model is currently loaded.
Switching Models (Ollama)
List available models:
```sh
ollama list
```
Switch to a different model:
```sh
ollama run <model-name>
```
Remove a model to free space:
```sh
ollama rm <model-name>
```
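To see which models are currently loaded into memory (and roughly how much memory they occupy), you can also run:
```sh
ollama ps
```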
Performance Tips
- RAM Requirements: Most 7B models need 8GB+ RAM, 3B models work with 4GB+
- GPU Acceleration: Both tools automatically use your GPU if available (NVIDIA, AMD, or Apple Silicon)
- Model Size: Smaller models (1B-3B) are faster but less capable; larger models (7B-70B) are more powerful but slower
Troubleshooting
Extension can’t connect to server:
- Make sure LM Studio/Ollama is running
- Check the server is on the correct port (1234 for LM Studio, 11434 for Ollama); the quick checks below can confirm this
- Verify firewall isn’t blocking localhost connections
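A quick way to test both servers from a terminal (assuming the default ports):
```sh
# LM Studio: should return a JSON model list
curl http://localhost:1234/v1/models
# Ollama: should return "Ollama is running"
curl http://localhost:11434
```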
Model responses are slow:
- Try a smaller model (3B instead of 7B)
- Ensure your GPU is being used (check LM Studio/Ollama logs)
- Close other applications to free RAM
Out of memory errors:
- Switch to a smaller model
- Reduce context length in model settings (for Ollama, see the example below)
- Close other applications
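For Ollama, one way to reduce the context length is to create a variant of a model with a smaller context window via a Modelfile. A sketch, assuming llama3.2 is installed (the variant name here is arbitrary):
```sh
# Define a variant with a 2048-token context window
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 2048
EOF

# Build and run the smaller-context variant
ollama create llama3.2-small-ctx -f Modelfile
ollama run llama3.2-small-ctx
```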
Next Steps
- Provider Configuration - Configure API providers alongside local models
- Routing Strategies - Control which provider handles requests
- API Reference - Learn the full WebLLM API