
Configure Local Models

Run AI models locally on your machine for free, with no API keys needed. Local models run entirely on your device; no data is sent to external servers, so your conversations stay completely private.

LM Studio is a user-friendly desktop app with downloadable models, perfect for getting started with local AI.

Get LM Studio from lmstudio.ai (free for Windows, macOS, and Linux).

Once installed, download a model:

  1. Open LM Studio
  2. Go to the Search tab
  3. Download a model like “Llama 3.2” or “Qwen 2.5”

Popular models:

  • Llama 3.2 3B - Fast, efficient for general tasks
  • Qwen 2.5 7B - Strong reasoning capabilities
  • Phi-3 Medium - Microsoft’s compact model

Next, start the local server:

  1. Go to the Developer tab in LM Studio
  2. Click Start Server (it runs on port 1234 by default; see the check below)
  3. Keep LM Studio running in the background
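
If you want to confirm the server is reachable, you can query its OpenAI-compatible models endpoint (a quick check, assuming the default port 1234):

Terminal window
curl http://localhost:1234/v1/models

A JSON list of your loaded models means the server is up.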

Then connect the WebLLM extension:

  1. Open the WebLLM extension sidepanel
  2. Go to the Providers tab
  3. Click Configure next to LM Studio
  4. The extension will auto-detect the running server
  5. Click Test Connection to verify
  6. Click Save

That’s it! WebLLM will now route requests to your local LM Studio models.
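
LM Studio's local server speaks the OpenAI API format. If you ever want to test it by hand outside the extension, a minimal request looks roughly like this (the model identifier is only an example; use whatever name the Developer tab or the /v1/models response reports for your loaded model):

Terminal window
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b-instruct", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'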


Ollama is a command-line tool for running LLMs, ideal for developers who prefer terminal-based workflows.

Download from ollama.ai or use the installation script:

Terminal window
curl -fsSL https://ollama.ai/install.sh | sh

Available for macOS, Linux, and Windows (WSL2).

Open your terminal and run:

Terminal window
ollama run llama3.2

This downloads the model (if needed) and starts an interactive chat session; a one-off, non-interactive example follows the list below. Other popular models:

  • ollama run qwen2.5 - Qwen 2.5 (strong reasoning)
  • ollama run phi3 - Microsoft Phi-3 (compact)
  • ollama run codellama - Code-specialized model
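
You can also pass a prompt directly on the command line instead of opening an interactive chat (a quick sketch, assuming llama3.2 is already downloaded):

Terminal window
ollama run llama3.2 "Summarize what a local LLM is in one sentence."

Ollama prints the response and exits.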

Ollama automatically starts a server on port 11434. Test it with:

Terminal window
curl http://localhost:11434

You should see: Ollama is running

To connect the WebLLM extension:

  1. Open the WebLLM extension sidepanel
  2. Go to the Providers tab
  3. Click Configure next to Ollama
  4. Enter the server URL: http://localhost:11434/v1/chat/completions (an example request against this endpoint is shown below)
  5. Click Test Connection to verify
  6. Click Save

Done! Your web pages can now use local Ollama models via WebLLM.
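
To see what requests against that endpoint look like, here is a minimal manual call to Ollama's OpenAI-compatible API (llama3.2 is only an example; use any model name that ollama list shows):

Terminal window
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'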


In LM Studio’s Developer tab, you can select which model to use. The extension will use whichever model is currently loaded.

With Ollama, list the models you have downloaded:

Terminal window
ollama list

Switch to a different model:

Terminal window
ollama run <model-name>

Remove a model to free space:

Terminal window
ollama rm <model-name>
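
To download a model ahead of time without starting a chat session, you can also pull it (llama3.2 here is just an example):

Terminal window
ollama pull llama3.2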

Performance notes:

  • RAM Requirements: Most 7B models need 8GB+ RAM; 3B models work with 4GB+
  • GPU Acceleration: Both tools automatically use your GPU if available (NVIDIA, AMD, or Apple Silicon); see the check below
  • Model Size: Smaller models (1B-3B) are faster but less capable; larger models (7B-70B) are more powerful but slower
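
With Ollama, you can confirm GPU use by checking the PROCESSOR column for a loaded model (note that ollama ps only lists models currently loaded into memory):

Terminal window
ollama ps

Output such as "100% GPU" means the model is running on your GPU. LM Studio shows a similar GPU offload setting when you load a model.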

Troubleshooting:

Extension can’t connect to the server:

  • Make sure LM Studio/Ollama is running
  • Check the server is on the correct port (1234 for LM Studio, 11434 for Ollama)
  • Verify firewall isn’t blocking localhost connections

Model responses are slow:

  • Try a smaller model (3B instead of 7B)
  • Ensure your GPU is being used (check LM Studio/Ollama logs)
  • Close other applications to free RAM

Out of memory errors:

  • Switch to a smaller model
  • Reduce context length in model settings (see the example below)
  • Close other applications
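
With Ollama, you can lower the context window for the current session from inside the interactive prompt (2048 is just an example value; LM Studio exposes a similar context-length setting when you load a model):

Terminal window
ollama run llama3.2
>>> /set parameter num_ctx 2048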