Configure Local Models
Run AI models locally on your machine for free, with no API keys needed. Local models run entirely on your device, so no data is sent to external servers and everything stays private.
LM Studio
LM Studio is a user-friendly desktop app with downloadable models, perfect for getting started with local AI.
Step 1: Download & Install
Get LM Studio from lmstudio.ai (free for Windows, macOS, and Linux).
Step 2: Download a Model
- Open LM Studio
- Go to the Search tab
- Download a model like “Llama 3.2” or “Qwen 2.5”
Popular models:
- Llama 3.2 3B - Fast, efficient for general tasks
- Qwen 2.5 7B - Strong reasoning capabilities
- Phi-3 Medium - Microsoft’s compact model
Step 3: Start Local Server
- Go to the Developer tab in LM Studio
- Click Start Server (runs on the default port 1234; you can verify it with the quick check below)
- Keep LM Studio running in the background
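To confirm the server is actually listening, you can query its OpenAI-compatible API from a terminal. This is a quick sanity check assuming the default port of 1234:
```sh
# Should return a JSON list of the models LM Studio can serve
curl http://localhost:1234/v1/models
```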
Step 4: Configure Extension
- Open the WebLLM extension sidepanel
- Go to Providers tab
- Click Configure next to LM Studio
- The extension will auto-detect the running server
- Click Test Connection to verify
- Click Save
That’s it! WebLLM will now route requests to your local LM Studio models.
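If you ever want to check the endpoint outside the extension, you can send a chat completion request directly to LM Studio's OpenAI-compatible server. This is a sketch assuming the default port and a loaded model; replace the model value with an identifier reported by /v1/models:
```sh
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```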
Ollama
Ollama is a command-line tool for running LLMs, ideal for developers who prefer terminal-based workflows.
Step 1: Install Ollama
Download from ollama.ai or use the installation script:
```sh
curl -fsSL https://ollama.ai/install.sh | sh
```
Available for macOS, Linux, and Windows (WSL2).
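Once the script finishes, you can confirm the CLI is on your PATH:
```sh
ollama --version
```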
Step 2: Download & Run a Model
Open your terminal and run:
```sh
ollama run llama3.2
```
This downloads the model (if needed) and starts it. Other popular models:
- `ollama run qwen2.5` - Qwen 2.5 (strong reasoning)
- `ollama run phi3` - Microsoft Phi-3 (compact)
- `ollama run codellama` - Code-specialized model
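If you prefer to download a model ahead of time without opening an interactive chat, you can pull it separately:
```sh
# Download without starting a chat session
ollama pull qwen2.5
# Confirm it is installed
ollama list
```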
Step 3: Verify Server is Running
Ollama automatically starts a server on port 11434. Test it with:
```sh
curl http://localhost:11434
```
You should see: Ollama is running
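The server also exposes the list of installed models over HTTP, which can be handy when debugging:
```sh
# Returns a JSON list of locally installed models
curl http://localhost:11434/api/tags
```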
Step 4: Configure Extension
- Open the WebLLM extension sidepanel
- Go to Providers tab
- Click Configure next to Ollama
- Enter server URL: `http://localhost:11434/v1/chat/completions`
- Click Test Connection to verify
- Click Save
Done! Your web pages can now use local Ollama models via WebLLM.
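You can also test the exact URL you entered by sending a request to it directly from a terminal. A minimal sketch, assuming the llama3.2 model from Step 2 is installed:
```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Reply with a single word."}]
      }'
```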
Managing Local Models
Switching Models (LM Studio)
In LM Studio’s Developer tab, you can select which model to use. The extension will use whichever model is currently loaded.
Switching Models (Ollama)
List available models:
```sh
ollama list
```
Switch to a different model:
```sh
ollama run <model-name>
```
Remove a model to free space:
```sh
ollama rm <model-name>
```
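To see which models are currently loaded into memory (and roughly how much memory they occupy), you can also run:
```sh
ollama ps
```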
Performance Tips
- RAM Requirements: Most 7B models need 8GB+ RAM, 3B models work with 4GB+
- GPU Acceleration: Both tools automatically use your GPU if available (NVIDIA, AMD, or Apple Silicon)
- Model Size: Smaller models (1B-3B) are faster but less capable; larger models (7B-70B) are more powerful but slower
Troubleshooting
Extension can’t connect to server:
- Make sure LM Studio/Ollama is running
- Check the server is on the correct port (1234 for LM Studio, 11434 for Ollama); the quick checks below can confirm this
- Verify firewall isn’t blocking localhost connections
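A quick way to test both servers from a terminal (assuming the default ports):
```sh
# LM Studio: should return a JSON model list
curl http://localhost:1234/v1/models
# Ollama: should return "Ollama is running"
curl http://localhost:11434
```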
Model responses are slow:
- Try a smaller model (3B instead of 7B)
- Ensure your GPU is being used (check LM Studio/Ollama logs)
- Close other applications to free RAM
Out of memory errors:
- Switch to a smaller model
- Reduce context length in model settings (for Ollama, see the example below)
- Close other applications
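For Ollama, one way to reduce the context length is to create a variant of a model with a smaller context window via a Modelfile. A sketch, assuming llama3.2 is installed (the variant name here is arbitrary):
```sh
# Define a variant with a 2048-token context window
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 2048
EOF

# Build and run the smaller-context variant
ollama create llama3.2-small-ctx -f Modelfile
ollama run llama3.2-small-ctx
```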
Next Steps
- Provider Configuration - Configure API providers alongside local models
- Routing Strategies - Control which provider handles requests
- API Reference - Learn the full WebLLM API