Provider Management

WebLLM supports multiple AI providers with automatic fallback. Users configure their preferred providers and priorities, and the extension handles the rest.

  • Anthropic (Claude) - Requires API key from console.anthropic.com
  • OpenAI (GPT) - Requires API key from platform.openai.com
  • Custom OpenAI-compatible - Any API following OpenAI format
  • Local Models - Run entirely in browser via WebGPU/WASM
    • Llama 3.2 1B (~1.2GB)
    • Phi-3 Mini (~2GB)
    • Other ONNX-compatible models

Users configure provider priority in extension settings:

Priority Order (drag to reorder):
1. 🟢 Local Model (Llama 3.2 1B) [Enabled]
2. 🔑 Anthropic API [Enabled]
3. 🔑 OpenAI API [Disabled]
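
Internally, a priority list like this could be represented as a simple ordered configuration. The sketch below is illustrative only; the field names and provider IDs are assumptions, not the extension’s actual schema.

// Hypothetical shape of the stored provider configuration (not the actual schema)
interface ProviderConfig {
  id: string;                 // e.g. 'local-llama-3.2-1b', 'anthropic', 'openai'
  type: 'local' | 'api';
  enabled: boolean;
  priority: number;           // Lower number = tried first
}

const exampleConfig: ProviderConfig[] = [
  { id: 'local-llama-3.2-1b', type: 'local', enabled: true,  priority: 1 },
  { id: 'anthropic',          type: 'api',   enabled: true,  priority: 2 },
  { id: 'openai',             type: 'api',   enabled: false, priority: 3 },
];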

When a request comes in:

  1. Try highest priority provider first
  2. If unavailable (no API key, model not downloaded, rate limited), try next
  3. Continue until success or all providers exhausted
  4. Return error only if all providers fail
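
A minimal sketch of this fallback loop, assuming the Provider interface shown later on this page (LLMRequest and LLMResponse are treated as opaque types here):

// Illustrative fallback loop over providers sorted by user-defined priority
async function executeWithFallback(
  providers: Provider[],
  request: LLMRequest
): Promise<LLMResponse> {
  const errors: string[] = [];
  for (const provider of providers) {
    if (!(await provider.isAvailable())) continue;      // Skip unavailable providers
    try {
      return await provider.execute(request);           // First success wins
    } catch (err) {
      errors.push(`${provider.name}: ${String(err)}`);  // Record the failure, try the next one
    }
  }
  throw new Error(`All providers failed or are unavailable: ${errors.join('; ')}`);
}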

Scenario 1: Local Model Available

  • Request → Local Model (free, instant)
  • API providers never called
  • No cost, maximum privacy

Scenario 2: Local Model Unavailable

  • Request → Local Model (not downloaded) → Skip
  • → Anthropic API (has key) → Success
  • Uses user’s API key, user pays

Scenario 3: All Providers Need Setup

  • Request → Extension prompts user to configure
  • User adds API key or downloads model
  • Request retried automatically

To add an API provider:

  1. Open extension settings
  2. Click “Add Provider” or configure an existing provider
  3. Select the provider (Anthropic, OpenAI, or Custom)
  4. Enter the API key
  5. (Optional) Test the connection
  6. Enable it and set its priority

To download a local model:

  1. Open extension settings
  2. Go to “Model Management”
  3. Browse the available models
  4. Click “Download” (the model is stored in IndexedDB)
  5. Once downloaded, the model appears in the provider list
  6. Enable it and set its priority

Custom OpenAI-compatible providers work with services like:

  • Together AI
  • Anyscale
  • Local LM Studio
  • Self-hosted vLLM

Configuration:

Provider: Custom OpenAI-compatible
Base URL: https://api.together.xyz/v1
API Key: your-api-key
Model ID: meta-llama/Llama-3-8b-chat-hf
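
Under the hood, any OpenAI-compatible endpoint accepts the standard /chat/completions request format. The sketch below shows roughly what such a request looks like, reusing the example base URL and model ID from the configuration above; it is illustrative, not the extension’s actual code.

// Rough shape of an OpenAI-compatible chat completion request
const response = await fetch('https://api.together.xyz/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your-api-key',
  },
  body: JSON.stringify({
    model: 'meta-llama/Llama-3-8b-chat-hf',
    messages: [{ role: 'user', content: 'Summarize this page for me.' }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);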

A provider is considered available if:

For API providers:

  • ✅ API key is configured
  • ✅ Provider is enabled
  • ✅ Not rate-limited
  • ✅ Internet connection available

For local models:

  • ✅ Model is downloaded
  • ✅ Provider is enabled
  • ✅ Sufficient memory available
  • ✅ WebGPU/WASM support detected
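
As a rough illustration, an availability check for an API provider might look like the following. The configuration fields are hypothetical placeholders, not the extension’s actual code.

// Hypothetical availability check for an API provider
async function isApiProviderAvailable(config: {
  apiKey?: string;
  enabled: boolean;
  rateLimitedUntil?: number;   // Timestamp (ms) when the cooldown ends, if any
}): Promise<boolean> {
  if (!config.enabled) return false;       // Provider must be enabled
  if (!config.apiKey) return false;        // API key must be configured
  if (config.rateLimitedUntil && Date.now() < config.rateLimitedUntil) {
    return false;                          // Still cooling down from a rate limit
  }
  return navigator.onLine;                 // Basic connectivity check
}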

The extension handles failures gracefully:

// User doesn't see this complexity
const result = await llm.summarize(text);
// Behind the scenes:
// 1. Try local model → Out of memory
// 2. Try Anthropic → Rate limited
// 3. Try OpenAI → Success ✓

Common reasons for fallback:

  • Local Model

    • Not downloaded
    • Out of memory
    • GPU not available
  • API Provider

    • Invalid API key
    • Rate limit exceeded
    • Network error
    • Insufficient credits

All providers implement the same interface:

interface Provider {
  name: string;
  type: 'api' | 'local';

  // Check if ready to use
  isAvailable(): Promise<boolean>;

  // Execute request
  execute(request: LLMRequest): Promise<LLMResponse>;

  // Streaming support
  stream?(request: LLMRequest): Promise<ReadableStream>;
}
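
For illustration, an API-backed provider might implement this interface roughly as follows. The endpoint, headers, and response fields match Anthropic’s public Messages API, but the class itself, the example model alias, and the request.prompt field are assumptions, not the extension’s actual implementation.

// Sketch of an API-backed provider; LLMRequest's `prompt` field is assumed
class AnthropicProvider implements Provider {
  name = 'Anthropic';
  type = 'api' as const;

  constructor(private apiKey: string, private model = 'claude-3-5-haiku-latest') {}

  async isAvailable(): Promise<boolean> {
    return Boolean(this.apiKey) && navigator.onLine;
  }

  async execute(request: LLMRequest): Promise<LLMResponse> {
    const start = Date.now();
    const res = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': this.apiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model: this.model,
        max_tokens: 1024,
        messages: [{ role: 'user', content: request.prompt }],
      }),
    });
    if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`);
    const data = await res.json();

    return {
      content: data.content[0].text,
      usage: {
        inputTokens: data.usage.input_tokens,
        outputTokens: data.usage.output_tokens,
      },
      metadata: {
        provider: this.name,
        model: this.model,
        latency: Date.now() - start,
      },
    };
  }
}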

Regardless of provider, responses use standard format:

{
  content: string,          // Generated text
  usage: {
    inputTokens: number,
    outputTokens: number,
    cost?: number           // If known
  },
  metadata: {
    provider: string,       // Which provider was used
    model: string,          // Which model
    latency: number         // Response time in ms
  }
}
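
As an illustration of this normalization, a raw OpenAI-style chat completion could be mapped into the standard shape roughly like this; the helper function and its argument names are hypothetical.

// Hypothetical normalization of an OpenAI-style response into the standard format
function normalizeOpenAIResponse(raw: any, model: string, latency: number): LLMResponse {
  return {
    content: raw.choices[0].message.content,
    usage: {
      inputTokens: raw.usage.prompt_tokens,
      outputTokens: raw.usage.completion_tokens,
    },
    metadata: {
      provider: 'OpenAI',   // Which provider was used
      model,
      latency,
    },
  };
}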

When using API providers with known pricing:

// Response includes cost information
const result = await llm.generate(prompt);
console.log(result.usage);
// {
//   inputTokens: 150,
//   outputTokens: 200,
//   cost: 0.0012   // $0.0012
// }
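
Cost can only be reported when the extension knows the provider’s per-token pricing. The arithmetic is straightforward, as in the sketch below; the rates shown are placeholders for illustration, not real prices.

// Illustrative cost calculation; the rates are placeholders, not actual prices
const EXAMPLE_RATES = {
  inputPerMillion: 3.0,    // $ per 1M input tokens (example rate)
  outputPerMillion: 15.0,  // $ per 1M output tokens (example rate)
};

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * EXAMPLE_RATES.inputPerMillion +
    (outputTokens / 1_000_000) * EXAMPLE_RATES.outputPerMillion
  );
}

// 150 input + 200 output tokens at the example rates ≈ $0.00345
console.log(estimateCost(150, 200));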

View accumulated costs in the extension:

  • Per-origin spending
  • Daily/weekly/monthly totals
  • By provider breakdown

Best practices for users:

  1. Start with local models - Free and private
  2. Add an API key as backup - For complex tasks
  3. Monitor costs - Check spending in settings
  4. Revoke unused permissions - Keep control

Best practices for developers:

  1. Respect the user’s choices - Don’t require a specific provider
  2. Handle unavailability - The extension might not be installed
  3. Degrade gracefully - Offer a fallback UX if the LLM is unavailable (see the sketch below)
  4. Be transparent - Tell users what you’ll use AI for
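
For points 2 and 3 in the developer list, a site can feature-detect the extension before relying on it. The check below (window.webllm) is an assumption about how the API is exposed; treat it as a pattern rather than exact code.

// Pattern for graceful degradation; `window.webllm` is a hypothetical entry point
async function summarizeWithFallback(text: string): Promise<string> {
  const llm = (window as any).webllm;    // Assumed global; adjust to the real API surface
  if (llm) {
    try {
      const result = await llm.summarize(text);
      return result.content;
    } catch {
      // Fall through to the non-AI path if the request is refused or fails
    }
  }
  // Fallback UX: show the first few sentences instead of an AI summary
  return text.split('. ').slice(0, 3).join('. ') + '…';
}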

API providers have rate limits:

  • Extension tracks and respects them
  • Shows cooldown timer to user
  • Automatically tries next provider
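
One common way to track this is to record a cooldown deadline when a provider returns a rate-limit response (for example, an HTTP 429 with a Retry-After header) and treat the provider as unavailable until that deadline passes. The sketch below is illustrative, not the extension’s actual bookkeeping.

// Illustrative cooldown tracking keyed by provider name
const cooldowns = new Map<string, number>();   // provider name → timestamp (ms) when usable again

function noteRateLimit(providerName: string, retryAfterSeconds: number): void {
  cooldowns.set(providerName, Date.now() + retryAfterSeconds * 1000);
}

function isCoolingDown(providerName: string): boolean {
  const until = cooldowns.get(providerName);
  return until !== undefined && Date.now() < until;
}

// Remaining seconds for a UI cooldown timer, or 0 if the provider is usable
function cooldownRemaining(providerName: string): number {
  const until = cooldowns.get(providerName) ?? 0;
  return Math.max(0, Math.ceil((until - Date.now()) / 1000));
}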

Per-request token maximums:

  • Local models: Typically 2048-4096 tokens
  • API models: Varies by model (8k-200k tokens)
  • Extension validates before sending

Different models support different context sizes:

  • Extension warns if prompt too long
  • Truncates or chunks if necessary
  • Shows warning to user
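
A simple way to handle this is to estimate the token count (roughly 4 characters per token for English text) and truncate or split the input before sending. The sketch below uses that rough approximation; it is not the extension’s actual tokenizer.

// Rough token estimate (~4 characters per token for English text)
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split text into chunks that each fit within the model's context budget
function chunkForContext(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Example: a 2048-token local model, reserving room for the response
const article = document.body.innerText;   // Placeholder input
const chunks = chunkForContext(article, 1500);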