Provider Management

WebLLM supports multiple AI providers with automatic fallback. Users configure their preferred providers and priorities, and the extension handles the rest.

  • Anthropic (Claude) - Requires API key from console.anthropic.com
  • OpenAI (GPT) - Requires API key from platform.openai.com
  • Custom OpenAI-compatible - Any API following OpenAI format
  • Local Models - Run entirely in browser via WebGPU/WASM
    • Llama 3.2 1B (~1.2GB)
    • Phi-3 Mini (~2GB)
    • Other ONNX-compatible models

Users configure provider priority in extension settings:

Priority Order (drag to reorder):
1. 🟢 Local Model (Llama 3.2 1B) [Enabled]
2. 🔑 Anthropic API [Enabled]
3. 🔑 OpenAI API [Disabled]
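
Internally, a priority list like this could be represented as a simple ordered configuration. The sketch below is illustrative only; the field names and provider IDs are assumptions, not the extension’s actual schema.

// Hypothetical shape of the stored provider configuration (not the actual schema)
interface ProviderConfig {
  id: string;                 // e.g. 'local-llama-3.2-1b', 'anthropic', 'openai'
  type: 'local' | 'api';
  enabled: boolean;
  priority: number;           // Lower number = tried first
}

const exampleConfig: ProviderConfig[] = [
  { id: 'local-llama-3.2-1b', type: 'local', enabled: true,  priority: 1 },
  { id: 'anthropic',          type: 'api',   enabled: true,  priority: 2 },
  { id: 'openai',             type: 'api',   enabled: false, priority: 3 },
];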

When a request comes in:

  1. Try highest priority provider first
  2. If unavailable (no API key, model not downloaded, rate limited), try next
  3. Continue until success or all providers exhausted
  4. Return error only if all providers fail
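
A minimal sketch of this fallback loop, assuming the Provider interface shown later on this page (LLMRequest and LLMResponse are treated as opaque types here):

// Illustrative fallback loop over providers sorted by user-defined priority
async function executeWithFallback(
  providers: Provider[],
  request: LLMRequest
): Promise<LLMResponse> {
  const errors: string[] = [];
  for (const provider of providers) {
    if (!(await provider.isAvailable())) continue;      // Skip unavailable providers
    try {
      return await provider.execute(request);           // First success wins
    } catch (err) {
      errors.push(`${provider.name}: ${String(err)}`);  // Record the failure, try the next one
    }
  }
  throw new Error(`All providers failed or are unavailable: ${errors.join('; ')}`);
}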

Scenario 1: Local Model Available

  • Request → Local Model (free, instant)
  • API providers never called
  • No cost, maximum privacy

Scenario 2: Local Model Unavailable

  • Request → Local Model (not downloaded) → Skip
  • → Anthropic API (has key) → Success
  • Uses user’s API key, user pays

Scenario 3: All Providers Need Setup

  • Request → Extension prompts user to configure
  • User adds API key or downloads model
  • Request retried automatically

To add an API provider:

  1. Open extension settings
  2. Click “Add Provider” or configure an existing provider
  3. Select the provider (Anthropic, OpenAI, or Custom)
  4. Enter the API key
  5. (Optional) Test the connection
  6. Enable it and set its priority

To download a local model:

  1. Open extension settings
  2. Go to “Model Management”
  3. Browse the available models
  4. Click “Download” (the model is stored in IndexedDB)
  5. Once downloaded, the model appears in the provider list
  6. Enable it and set its priority

Custom OpenAI-compatible providers work with services like:

  • Together AI
  • Anyscale
  • Local LM Studio
  • Self-hosted vLLM

Configuration:

Provider: Custom OpenAI-compatible
Base URL: https://api.together.xyz/v1
API Key: your-api-key
Model ID: meta-llama/Llama-3-8b-chat-hf
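
Under the hood, any OpenAI-compatible endpoint accepts the standard /chat/completions request format. The sketch below shows roughly what such a request looks like, reusing the example base URL and model ID from the configuration above; it is illustrative, not the extension’s actual code.

// Rough shape of an OpenAI-compatible chat completion request
const response = await fetch('https://api.together.xyz/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your-api-key',
  },
  body: JSON.stringify({
    model: 'meta-llama/Llama-3-8b-chat-hf',
    messages: [{ role: 'user', content: 'Summarize this page for me.' }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);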

A provider is considered available if:

For API providers:

  • ✅ API key is configured
  • ✅ Provider is enabled
  • ✅ Not rate-limited
  • ✅ Internet connection available

For local models:

  • ✅ Model is downloaded
  • ✅ Provider is enabled
  • ✅ Sufficient memory available
  • ✅ WebGPU/WASM support detected
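
As a rough illustration, an availability check for an API provider might look like the following. The configuration fields are hypothetical placeholders, not the extension’s actual code.

// Hypothetical availability check for an API provider
async function isApiProviderAvailable(config: {
  apiKey?: string;
  enabled: boolean;
  rateLimitedUntil?: number;   // Timestamp (ms) when the cooldown ends, if any
}): Promise<boolean> {
  if (!config.enabled) return false;       // Provider must be enabled
  if (!config.apiKey) return false;        // API key must be configured
  if (config.rateLimitedUntil && Date.now() < config.rateLimitedUntil) {
    return false;                          // Still cooling down from a rate limit
  }
  return navigator.onLine;                 // Basic connectivity check
}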

The extension handles failures gracefully:

// User doesn't see this complexity
const result = await llm.summarize(text);
// Behind the scenes:
// 1. Try local model → Out of memory
// 2. Try Anthropic → Rate limited
// 3. Try OpenAI → Success ✓

Common reasons for fallback:

  • Local Model

    • Not downloaded
    • Out of memory
    • GPU not available
  • API Provider

    • Invalid API key
    • Rate limit exceeded
    • Network error
    • Insufficient credits

All providers implement the same interface:

interface Provider {
  name: string;
  type: 'api' | 'local';

  // Check if ready to use
  isAvailable(): Promise<boolean>;

  // Execute request
  execute(request: LLMRequest): Promise<LLMResponse>;

  // Streaming support
  stream?(request: LLMRequest): Promise<ReadableStream>;
}
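
For illustration, an API-backed provider might implement this interface roughly as follows. The endpoint, headers, and response fields match Anthropic’s public Messages API, but the class itself, the example model alias, and the request.prompt field are assumptions, not the extension’s actual implementation.

// Sketch of an API-backed provider; LLMRequest's `prompt` field is assumed
class AnthropicProvider implements Provider {
  name = 'Anthropic';
  type = 'api' as const;

  constructor(private apiKey: string, private model = 'claude-3-5-haiku-latest') {}

  async isAvailable(): Promise<boolean> {
    return Boolean(this.apiKey) && navigator.onLine;
  }

  async execute(request: LLMRequest): Promise<LLMResponse> {
    const start = Date.now();
    const res = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': this.apiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model: this.model,
        max_tokens: 1024,
        messages: [{ role: 'user', content: request.prompt }],
      }),
    });
    if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`);
    const data = await res.json();

    return {
      content: data.content[0].text,
      usage: {
        inputTokens: data.usage.input_tokens,
        outputTokens: data.usage.output_tokens,
      },
      metadata: {
        provider: this.name,
        model: this.model,
        latency: Date.now() - start,
      },
    };
  }
}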

Regardless of provider, responses use standard format:

{
  content: string,          // Generated text
  usage: {
    inputTokens: number,
    outputTokens: number,
    cost?: number           // If known
  },
  metadata: {
    provider: string,       // Which provider was used
    model: string,          // Which model
    latency: number         // Response time in ms
  }
}
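
As an illustration of this normalization, a raw OpenAI-style chat completion could be mapped into the standard shape roughly like this; the helper function and its argument names are hypothetical.

// Hypothetical normalization of an OpenAI-style response into the standard format
function normalizeOpenAIResponse(raw: any, model: string, latency: number): LLMResponse {
  return {
    content: raw.choices[0].message.content,
    usage: {
      inputTokens: raw.usage.prompt_tokens,
      outputTokens: raw.usage.completion_tokens,
    },
    metadata: {
      provider: 'OpenAI',   // Which provider was used
      model,
      latency,
    },
  };
}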

When using API providers with known pricing:

// Response includes cost information
const result = await llm.generate(prompt);
console.log(result.usage);
// {
//   inputTokens: 150,
//   outputTokens: 200,
//   cost: 0.0012   // $0.0012
// }
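
Cost can only be reported when the extension knows the provider’s per-token pricing. The arithmetic is straightforward, as in the sketch below; the rates shown are placeholders for illustration, not real prices.

// Illustrative cost calculation; the rates are placeholders, not actual prices
const EXAMPLE_RATES = {
  inputPerMillion: 3.0,    // $ per 1M input tokens (example rate)
  outputPerMillion: 15.0,  // $ per 1M output tokens (example rate)
};

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * EXAMPLE_RATES.inputPerMillion +
    (outputTokens / 1_000_000) * EXAMPLE_RATES.outputPerMillion
  );
}

// 150 input + 200 output tokens at the example rates ≈ $0.00345
console.log(estimateCost(150, 200));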

View accumulated costs in the extension:

  • Per-origin spending
  • Daily/weekly/monthly totals
  • By provider breakdown

Best practices for users:

  1. Start with local models - Free and private
  2. Add an API key as backup - For complex tasks
  3. Monitor costs - Check spending in settings
  4. Revoke unused permissions - Keep control

Best practices for developers:

  1. Respect the user’s choices - Don’t require a specific provider
  2. Handle unavailability - The extension might not be installed
  3. Degrade gracefully - Offer a fallback UX if the LLM is unavailable (see the sketch below)
  4. Be transparent - Tell users what you’ll use AI for
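
For points 2 and 3 in the developer list, a site can feature-detect the extension before relying on it. The check below (window.webllm) is an assumption about how the API is exposed; treat it as a pattern rather than exact code.

// Pattern for graceful degradation; `window.webllm` is a hypothetical entry point
async function summarizeWithFallback(text: string): Promise<string> {
  const llm = (window as any).webllm;    // Assumed global; adjust to the real API surface
  if (llm) {
    try {
      const result = await llm.summarize(text);
      return result.content;
    } catch {
      // Fall through to the non-AI path if the request is refused or fails
    }
  }
  // Fallback UX: show the first few sentences instead of an AI summary
  return text.split('. ').slice(0, 3).join('. ') + '…';
}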

API providers have rate limits:

  • Extension tracks and respects them
  • Shows cooldown timer to user
  • Automatically tries next provider
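
One common way to track this is to record a cooldown deadline when a provider returns a rate-limit response (for example, an HTTP 429 with a Retry-After header) and treat the provider as unavailable until that deadline passes. The sketch below is illustrative, not the extension’s actual bookkeeping.

// Illustrative cooldown tracking keyed by provider name
const cooldowns = new Map<string, number>();   // provider name → timestamp (ms) when usable again

function noteRateLimit(providerName: string, retryAfterSeconds: number): void {
  cooldowns.set(providerName, Date.now() + retryAfterSeconds * 1000);
}

function isCoolingDown(providerName: string): boolean {
  const until = cooldowns.get(providerName);
  return until !== undefined && Date.now() < until;
}

// Remaining seconds for a UI cooldown timer, or 0 if the provider is usable
function cooldownRemaining(providerName: string): number {
  const until = cooldowns.get(providerName) ?? 0;
  return Math.max(0, Math.ceil((until - Date.now()) / 1000));
}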

Per-request token maximums:

  • Local models: Typically 2048-4096 tokens
  • API models: Varies by model (8k-200k tokens)
  • Extension validates before sending

Different models support different context sizes:

  • Extension warns if prompt too long
  • Truncates or chunks if necessary
  • Shows warning to user
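
A simple way to handle this is to estimate the token count (roughly 4 characters per token for English text) and truncate or split the input before sending. The sketch below uses that rough approximation; it is not the extension’s actual tokenizer.

// Rough token estimate (~4 characters per token for English text)
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split text into chunks that each fit within the model's context budget
function chunkForContext(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Example: a 2048-token local model, reserving room for the response
const article = document.body.innerText;   // Placeholder input
const chunks = chunkForContext(article, 1500);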