
Browser Extension Architecture

The WebLLM browser extension serves as a polyfill for the future native API, providing standardized AI access before browsers implement native support.
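
For example, a page can feature-detect the API before calling it (a minimal sketch; the action name and fallback are illustrative, not part of a fixed spec):

// Page-side usage sketch: feature-detect before calling
async function summarizePage() {
  if (!('llm' in navigator)) {
    console.warn('WebLLM not available; fall back to a server-side call.');
    return null;
  }
  const result = await navigator.llm.request({
    action: 'summarize',                            // illustrative action name
    prompt: document.body.innerText.slice(0, 4000)
  });
  return result.content;
}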

The popup provides user settings and configuration:

Settings
├── Data Retention
│   ├── Keep history: [7 days | 30 days | 90 days | Forever]
│   ├── Auto-delete after: [checkbox]
│   └── Clear all data now [button]
├── Provider Configuration
│   ├── Priority Order (drag-to-reorder)
│   │   1. 🟢 Local Model (Llama 3.2 1B)
│   │   2. 🔑 Anthropic API (configured)
│   │   3. 🔑 OpenAI API (not configured)
│   └── Add Provider [+]
├── Model Management
│   ├── Download Local Models
│   └── Installed Models
└── Permissions
    ├── Allowed Sites
    └── Blocked Sites
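
The defaults behind this UI might be persisted as a simple object on install (a sketch; field names follow the settings store in the storage schema further below):

// Hypothetical defaults written when the extension is first installed
const DEFAULT_SETTINGS = {
  retentionDays: 30,          // "Keep history: 30 days"
  autoDelete: true,           // "Auto-delete after" checkbox
  defaultProvider: 'local'    // first entry in the priority list
};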

The content script, injected into every web page, provides the navigator.llm API:

// Injected into page context
navigator.llm = {
  async request(config) {
    return new Promise((resolve, reject) => {
      const messageId = crypto.randomUUID();

      // Send to background worker
      window.postMessage({
        type: 'WEBLLM_REQUEST',
        messageId,
        config,
        origin: window.location.origin
      }, '*');

      // Wait for response
      window.addEventListener('message', function handler(event) {
        if (event.data.type === 'WEBLLM_RESPONSE' &&
            event.data.messageId === messageId) {
          window.removeEventListener('message', handler);
          if (event.data.error) reject(event.data.error);
          else resolve(event.data.result);
        }
      });
    });
  }
};
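
The page script above only posts messages; a companion content script (running in the isolated world) would relay them to the background worker, roughly like this (a sketch; the handler wiring is an assumption):

// content script: relay page requests to the background worker and back
window.addEventListener('message', async (event) => {
  // Only accept requests from this page's own window
  if (event.source !== window || event.data?.type !== 'WEBLLM_REQUEST') return;

  const { messageId, config } = event.data;
  try {
    const result = await chrome.runtime.sendMessage({
      type: 'WEBLLM_REQUEST',
      config,
      origin: window.location.origin
    });
    window.postMessage({ type: 'WEBLLM_RESPONSE', messageId, result }, '*');
  } catch (error) {
    window.postMessage({ type: 'WEBLLM_RESPONSE', messageId, error: error.message }, '*');
  }
});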

The background worker is the core orchestrator that:

  • Manages permissions
  • Routes requests to providers
  • Stores conversation history
  • Handles provider fallback
  • Shows usage notifications

class WebLLMService {
  async handleRequest(request, origin) {
    // 1. Check permissions
    const hasPermission = await this.permissionManager.check(origin);
    if (!hasPermission) {
      const granted = await this.requestPermission(origin);
      if (!granted) throw new Error('Permission denied');
    }

    // 2. Show notification
    this.notifyUsage(origin, request.action);

    // 3. Get provider
    const provider = await this.providerManager.getNextAvailable();

    // 4. Execute request
    const response = await provider.execute(request);

    // 5. Store in local DB
    await this.db.store({
      origin,
      request,
      response,
      timestamp: Date.now(),
      provider: provider.name
    });

    // 6. Schedule cleanup
    this.scheduleCleanup();

    return response;
  }
}
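
scheduleCleanup and the retention setting imply a periodic purge of expired rows; a minimal sketch using chrome.alarms, shown as free functions for brevity (the alarm name and interval are assumptions, and the query assumes the Dexie-style storage wrapper sketched in the next section):

// Called from handleRequest (step 6): ensure a recurring cleanup alarm exists
function scheduleCleanup() {
  chrome.alarms.create('webllm-cleanup', { periodInMinutes: 60 });
}

// Registered once at service-worker startup: purge expired conversations
chrome.alarms.onAlarm.addListener(async (alarm) => {
  if (alarm.name !== 'webllm-cleanup') return;
  await db.conversations.where('expiresAt').below(Date.now()).delete();
});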

The local database (IndexedDB) stores:

  • Conversations - Request/response history with expiration
  • Providers - User’s provider configurations
  • Permissions - Per-origin access grants
  • Settings - User preferences

{
  conversations: {
    id: string,
    origin: string,
    timestamp: number,
    messages: Message[],
    provider: string,
    expiresAt: number
  },
  providers: {
    id: string,
    type: 'api' | 'local',
    priority: number,
    config: {
      apiKey?: string,
      baseUrl?: string,
      modelPath?: string
    },
    enabled: boolean
  },
  permissions: {
    origin: string,
    granted: boolean,
    grantedAt: number,
    lastUsed: number
  },
  settings: {
    retentionDays: number,
    autoDelete: boolean,
    defaultProvider: string
  }
}
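
The query style used by the provider manager below suggests a Dexie-like wrapper over IndexedDB; the stores could be declared roughly as follows (a sketch, not the extension's actual code):

import Dexie from 'dexie';

const db = new Dexie('webllm');
db.version(1).stores({
  // Only indexed fields are listed; full objects from the schema above are stored as-is
  conversations: 'id, origin, timestamp, expiresAt',
  providers: 'id, priority, type',
  permissions: 'origin, lastUsed',
  settings: 'id'   // a single row holding the settings object
});

export default db;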

The provider manager handles provider selection and fallback:

class ProviderManager {
  async getNextAvailable() {
    // Get enabled providers ordered by priority
    // (booleans are not valid IndexedDB keys, so filter in memory)
    const providers = await db.providers
      .orderBy('priority')
      .filter((p) => p.enabled)
      .toArray();

    // Try each in order
    for (const provider of providers) {
      if (await this.isAvailable(provider)) {
        return this.instantiate(provider);
      }
    }
    throw new Error('No providers available');
  }

  async isAvailable(provider) {
    if (provider.type === 'local') {
      return await this.checkLocalModel(provider);
    } else {
      return provider.config.apiKey?.length > 0;
    }
  }
}
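
Provider fallback (listed among the background worker's responsibilities above) can then be layered on top of getNextAvailable, for example (a sketch; executeWithFallback is a hypothetical helper shown as a free function):

// Try providers in priority order, falling through to the next one on failure
async function executeWithFallback(manager, request) {
  const providers = await db.providers
    .orderBy('priority')
    .filter((p) => p.enabled)
    .toArray();

  let lastError;
  for (const config of providers) {
    if (!(await manager.isAvailable(config))) continue;
    try {
      return await manager.instantiate(config).execute(request);
    } catch (error) {
      lastError = error;   // remember the failure and try the next provider
    }
  }
  throw lastError ?? new Error('No providers available');
}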

First-time use (no permission or providers configured yet):

  1. User visits a website that uses WebLLM
  2. Website calls llm.summarize(...)
  3. Extension shows permission prompt: “Allow example.com to use WebLLM?” (see the sketch after this list)
  4. If no providers configured, shows setup wizard
  5. User configures first provider (API key or local model)
  6. Request proceeds
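
The permission prompt in step 3 could be implemented with a notification offering Allow/Deny buttons (one possible approach, not necessarily what the extension ships; the icon path is hypothetical):

// Sketch of requestPermission (would live on WebLLMService in practice)
function requestPermission(origin) {
  return new Promise((resolve) => {
    const id = `webllm-perm-${origin}`;
    chrome.notifications.create(id, {
      type: 'basic',
      iconUrl: 'icons/icon-48.png',        // hypothetical asset path
      title: 'WebLLM permission request',
      message: `Allow ${origin} to use WebLLM?`,
      buttons: [{ title: 'Allow' }, { title: 'Deny' }],
      requireInteraction: true
    });
    chrome.notifications.onButtonClicked.addListener(function handler(notifId, buttonIndex) {
      if (notifId !== id) return;
      chrome.notifications.onButtonClicked.removeListener(handler);
      resolve(buttonIndex === 0);           // first button = Allow
    });
  });
}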

Subsequent use (permission already granted):

  1. Website calls API
  2. Small notification appears: “example.com is using WebLLM”
  3. Request processed automatically
  4. Notification auto-dismisses after 5 seconds

The notification is styled as a small toast overlay:

.webllm-notification {
  position: fixed;
  top: 16px;
  right: 16px;
  background: white;
  border: 1px solid #e0e0e0;
  border-radius: 8px;
  padding: 12px 16px;
  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
  z-index: 999999;
  animation: slideIn 0.3s ease;
}
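
The toast itself (step 2 of the flow above) would be rendered by the content script, since the background worker has no access to the page's DOM; a minimal sketch using the class defined above:

// Show "example.com is using WebLLM" and remove it after 5 seconds
function showUsageToast(origin, action) {
  const el = document.createElement('div');
  el.className = 'webllm-notification';
  el.textContent = `${origin} is using WebLLM (${action})`;
  document.body.appendChild(el);
  setTimeout(() => el.remove(), 5000);
}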

The extension can run AI models locally using:

  • WebGPU - Hardware-accelerated inference in browser
  • WASM - CPU-based fallback
  • ONNX Runtime - Optimized model execution
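
WebGPU support can be probed up front so the execution-provider list prefers the GPU and falls back to WASM otherwise (a sketch using the standard navigator.gpu API; the helper name is an assumption):

// Choose ONNX Runtime execution providers based on what the browser supports
async function pickExecutionProviders() {
  if (navigator.gpu) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return ['webgpu', 'wasm'];   // GPU first, CPU fallback
  }
  return ['wasm'];
}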

To install a local model:

  1. User opens extension settings
  2. Selects “Download Local Models”
  3. Sees available models (e.g., “Llama 3.2 1B - 1.2GB”)
  4. Clicks download, model downloads to IndexedDB (sketched below)
  5. Model becomes available in provider list
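
Step 4, the download into IndexedDB, might look roughly like this (a sketch; the models store and URL field are hypothetical):

// Fetch the model file and persist it for later sessions
async function downloadModel(model) {
  const response = await fetch(model.url);            // e.g. a CDN-hosted .onnx file
  if (!response.ok) throw new Error(`Download failed: ${response.status}`);
  const data = await response.arrayBuffer();
  await db.models.put({ id: model.id, name: model.name, data });   // hypothetical 'models' store
}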

Once downloaded, the local provider runs the model with ONNX Runtime:

class LocalModelProvider {
  constructor(config) {
    this.modelPath = config.modelPath;
    this.session = null;
  }

  async initialize() {
    // Load ONNX model
    this.session = await ort.InferenceSession.create(this.modelPath, {
      executionProviders: ['webgpu', 'wasm']
    });
  }

  async execute(request) {
    if (!this.session) await this.initialize();

    // Tokenize
    const tokens = await this.tokenize(request.prompt);

    // Run inference
    const outputs = await this.session.run({
      input_ids: new ort.Tensor('int64', tokens, [1, tokens.length])
    });

    // Decode
    const text = await this.decode(outputs.logits);

    return { content: text, usage: { inputTokens: tokens.length } };
  }
}

Content script isolation:

  • Runs in isolated world (no access to page globals)
  • Only exposes navigator.llm API
  • Cannot access page’s API keys or sensitive data

API key storage:

  • Stored in encrypted IndexedDB
  • Only accessible to background worker
  • Never exposed to content scripts or web pages

Site permissions:

  • Separate permissions per origin
  • Can revoke access anytime
  • Blocked sites list prevents abuse

Request safeguards:

  • Validate all inputs before processing
  • Rate limiting per origin (sketched below)
  • Maximum token limits
  • Timeout enforcement
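
Per-origin rate limiting can be a simple sliding window maintained by the background worker (a sketch; the limits are illustrative):

// Allow at most MAX_REQUESTS per origin per minute (illustrative values)
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 20;
const recentRequests = new Map();   // origin -> timestamps of recent requests

function checkRateLimit(origin) {
  const now = Date.now();
  const times = (recentRequests.get(origin) ?? []).filter((t) => now - t < WINDOW_MS);
  if (times.length >= MAX_REQUESTS) throw new Error('Rate limit exceeded');
  times.push(now);
  recentRequests.set(origin, times);
}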

Request batching:

  • Queue rapid requests
  • Batch when possible
  • Reduce API calls

Caching:

  • Cache common requests (within retention period)
  • Deduplicate identical prompts (see sketch below)
  • Cache model outputs
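
Deduplicating identical prompts can reuse a stored response while it is still within the retention window (a sketch; the requestKey field is a hypothetical addition to the conversations store):

// Return a cached response for an identical request from the same origin, if any
async function findCachedResponse(origin, request) {
  const key = JSON.stringify({ action: request.action, prompt: request.prompt });
  const hit = await db.conversations
    .where('origin').equals(origin)
    .filter((c) => c.requestKey === key && c.expiresAt > Date.now())
    .first();
  return hit ? hit.response : null;
}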

Lazy model loading:

  • Load providers on demand
  • Initialize local models only when needed
  • Unload unused models after timeout