AI Experiments in the Browser

The browser is becoming a legitimate ML inference environment. That sentence would have sounded absurd five years ago. Today, with WebGPU shipping in major browsers and quantized model formats that compress a 7B-parameter model into something that fits in a few gigabytes of GPU memory, it's increasingly real.
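
As a rough check on the "few gigabytes" figure, weight storage is approximately parameter count times bits per weight. The sketch below is only that arithmetic: it ignores activations, any KV cache, and quantization metadata such as scales, so real files run somewhat larger. The function name is illustrative.

```typescript
// Rough weight-storage estimate: parameters × bits per weight, in gigabytes.
// Ignores activations, KV cache, and quantization metadata (scales, zero points).
function weightSizeGB(paramCount: number, bitsPerWeight: number): number {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e9; // decimal gigabytes
}

const params7B = 7e9;
console.log(weightSizeGB(params7B, 32)); // fp32: 28
console.log(weightSizeGB(params7B, 16)); // fp16: 14
console.log(weightSizeGB(params7B, 4));  // 4-bit: 3.5
```

Under these assumptions, only the 4-bit case lands in the "few gigabytes" range.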

Over the last six months, I've been running experiments — mostly in the context of MeshLearn's on-device recommendation system, but also as standalone explorations. Here's what I've learned.

The Stack That Actually Works

I've tried three primary approaches:

TensorFlow.js: The most mature ecosystem, with great documentation. Backend options include WebGL (widely supported), WebGPU (faster, newer), and WASM (CPU fallback). The main limitation is that models must be converted from the original TensorFlow SavedModel format to the TF.js model format, which adds friction.
ONNX Runtime Web: Increasingly my preference. ONNX is a cross-framework interchange format, so I can export from PyTorch or TensorFlow and run the same model everywhere. The WebGPU execution provider is fast, and the WebAssembly SIMD backend is a surprisingly good CPU fallback.
Transformers.js (Hugging Face): For NLP tasks specifically, this is now the easiest path. It offers pre-quantized models from the Hugging Face Hub and a familiar API if you've used the Python transformers library. I ran a sentiment classification model at ~20ms per inference on a mid-range laptop.
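
All three libraries expose some version of the same backend choice (WebGPU, WebGL, WASM). A minimal sketch of the fallback order, assuming feature detection has already happened elsewhere. The type and function names are hypothetical and not taken from any of the libraries above.

```typescript
type Backend = "webgpu" | "webgl" | "wasm";

// Support flags detected at startup. In a browser these might come from
// checks such as `"gpu" in navigator` (WebGPU) or a successful
// `canvas.getContext("webgl2")` call. They are passed in here so the
// selection logic can run and be tested outside a browser.
interface BackendSupport {
  webgpu: boolean;
  webgl: boolean;
}

// Prefer the fastest available backend; WASM is the CPU fallback.
function pickBackend(support: BackendSupport): Backend {
  if (support.webgpu) return "webgpu";
  if (support.webgl) return "webgl";
  return "wasm";
}

console.log(pickBackend({ webgpu: false, webgl: true })); // "webgl"
```

One caveat: the presence of `navigator.gpu` does not guarantee a usable adapter, since `requestAdapter()` can resolve to null, so a real check would likely need that extra step.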

What WebGPU Changes

The arrival of WebGPU in Chrome 113 and Firefox Nightly is genuinely significant. WebGL was designed for graphics, not compute — it works for ML inference but it's a workaround. WebGPU exposes compute shaders directly, which is the right primitive for matrix multiplication at scale.

In my benchmarks, the same ONNX model running with WebGPU is roughly 3–4× faster than WebGL on an M1 MacBook. On machines with dedicated GPUs, the gap is larger.
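
Numbers like these depend heavily on how the timing is summarized. The source does not describe its method, so the following is one reasonable approach rather than the one used here: discard the first few runs as warm-up (early GPU runs often include one-off setup such as shader compilation) and report the median of the rest. The sample values are made up.

```typescript
// Summarize per-inference latencies (ms): drop warm-up runs, report the median.
function medianLatencyMs(samplesMs: number[], warmupRuns: number): number {
  // slice() copies, so the caller's array is not reordered by sort().
  const steady = samplesMs.slice(warmupRuns).sort((a, b) => a - b);
  if (steady.length === 0) {
    throw new Error("need at least one sample after warm-up");
  }
  const mid = Math.floor(steady.length / 2);
  return steady.length % 2 === 1
    ? steady[mid]
    : (steady[mid - 1] + steady[mid]) / 2;
}

// Hypothetical samples: the first two runs include one-off setup cost.
const samples = [410, 95, 50, 47, 49, 48, 46];
console.log(medianLatencyMs(samples, 2)); // 48
```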

Benchmark: Sentiment Classification

Model: DistilBERT-base-uncased-finetuned-sst-2 (quantized INT8, 67MB)
WebGL backend: ~185ms per inference
WebGPU backend: ~48ms per inference
WebAssembly SIMD (CPU): ~320ms per inference
Tested on M1 MacBook Air, Chrome 124
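
For reference, the ratios implied by the figures above can be computed directly. This is plain arithmetic on the reported latencies; the helper name is illustrative.

```typescript
// Latencies reported above (ms per inference, M1 MacBook Air, Chrome 124).
const latencyMs = { webgl: 185, webgpu: 48, wasmSimd: 320 };

// How many times faster `fastMs` is than `slowMs`, rounded to one decimal.
function speedup(slowMs: number, fastMs: number): number {
  return Math.round((slowMs / fastMs) * 10) / 10;
}

console.log(speedup(latencyMs.webgl, latencyMs.webgpu));    // 3.9
console.log(speedup(latencyMs.wasmSimd, latencyMs.webgpu)); // 6.7
console.log(speedup(latencyMs.wasmSimd, latencyMs.webgl));  // 1.7
```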

The Size Problem

The biggest practical barrier to client-side ML isn't performance — it's model size. Even quantized models are large by web standards. A quantized DistilBERT is 67MB. A quantized Whisper base is 150MB. Users aren't used to web applications that require a 150MB download before they do anything useful.

Approaches that help:

Progressive loading: Load a smaller, faster model first and use it for the common case. Load the larger model in the background for the edge cases that need it.
Model caching with the Cache API: Once downloaded, a model should not need to download again. The Cache API stores complete responses keyed by request, and entries persist across sessions until the browser evicts them under storage pressure or the user clears site data.
Task-specific distillation: Train a tiny model (10MB or less) on your specific task using a large model as teacher. For narrow tasks, a well-distilled small model can match or even outperform a much larger general-purpose model.
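
The caching approach above comes down to a cache-first lookup. Here is a minimal sketch, assuming a small storage interface of my own invention (`ModelStore`) rather than any library's API. In a browser it could wrap the Cache API (`caches.open`, `cache.match`, `cache.put`); an in-memory Map stands in here so the logic runs anywhere.

```typescript
// Hypothetical storage interface. A browser implementation could wrap the
// Cache API; the in-memory version below is only for illustration.
interface ModelStore {
  get(url: string): Promise<Uint8Array | undefined>;
  put(url: string, bytes: Uint8Array): Promise<void>;
}

type Downloader = (url: string) => Promise<Uint8Array>;

// Cache-first: return stored bytes if present, otherwise download and store.
async function loadModelBytes(
  url: string,
  store: ModelStore,
  download: Downloader
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached;
  const bytes = await download(url);
  await store.put(url, bytes);
  return bytes;
}

// In-memory stand-in for a persistent cache.
function memoryStore(): ModelStore {
  const entries = new Map<string, Uint8Array>();
  return {
    get: async (url) => entries.get(url),
    put: async (url, bytes) => {
      entries.set(url, bytes);
    },
  };
}
```

With this shape, a second call for the same URL never reaches the downloader. A production version would probably also key entries by model version so that updated weights replace stale ones.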

The Privacy Angle

This is the part that actually excites me most about browser-side ML. When inference happens on the client, user data never leaves the device. For applications where the input is sensitive — health data, private communications, financial information — this is not just a privacy win, it's a competitive advantage.

MeshLearn's recommendation engine works this way by design. Your study patterns, your completion rates, your engagement signals: they all stay on your phone. The model improves via federated learning, in which devices share model updates rather than data, so the raw data never moves. That's a meaningful commitment that purely cloud-based EdTech cannot make.
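
The post does not say which aggregation scheme MeshLearn uses. As an illustration only, the sketch below assumes plain federated averaging (FedAvg): each device sends back its locally trained weights and an example count, and the server takes a weighted mean. All names are hypothetical.

```typescript
// One client's contribution: locally trained weights plus the number of
// local examples used. Raw training data is never part of the update.
interface ClientUpdate {
  weights: number[];
  numExamples: number;
}

// Federated averaging: example-count-weighted mean of client weights.
function federatedAverage(updates: ClientUpdate[]): number[] {
  if (updates.length === 0) throw new Error("no client updates");
  const dim = updates[0].weights.length;
  const total = updates.reduce((sum, u) => sum + u.numExamples, 0);
  if (total <= 0) throw new Error("total example count must be positive");
  const averaged = new Array<number>(dim).fill(0);
  for (const u of updates) {
    if (u.weights.length !== dim) throw new Error("weight shape mismatch");
    const share = u.numExamples / total;
    u.weights.forEach((w, i) => {
      averaged[i] += share * w;
    });
  }
  return averaged;
}

const merged = federatedAverage([
  { weights: [1, 2], numExamples: 1 },
  { weights: [3, 6], numExamples: 3 },
]);
console.log(merged); // [2.5, 5]
```

The second client contributes three of the four examples, so the result sits three quarters of the way toward its weights.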

What I'll Keep Exploring

The area I'm most excited about right now is retrieval-augmented generation (RAG) in the browser. The pieces are coming together: small embedding models (all-MiniLM-L6-v2 is 23MB quantized), client-side vector stores like LanceDB-WASM, and small language models that can run inference at reasonable speeds on modern hardware.
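
The retrieval half of that pipeline is simple at small scale. A minimal sketch, assuming embeddings have already been computed and the corpus is small enough for brute-force search. A real vector store would add indexing, and the toy vectors below are invented for illustration.

```typescript
interface Doc {
  text: string;
  embedding: number[]; // e.g. produced by a sentence-embedding model
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine similarity between two equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Brute-force retrieval: score every document, return the k best matches.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.doc);
}

// Toy 2-D embeddings; all-MiniLM-L6-v2 produces 384-dimensional vectors.
const docs: Doc[] = [
  { text: "photosynthesis notes", embedding: [1, 0] },
  { text: "calculus cheatsheet", embedding: [0, 1] },
  { text: "plant biology quiz", embedding: [0.9, 0.1] },
];
console.log(topK([1, 0.05], docs, 2).map((d) => d.text));
// ["photosynthesis notes", "plant biology quiz"]
```

The retrieved texts would then be placed into the prompt of the small language model, which is the generation half of RAG.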

The vision: a web application that can perform sophisticated contextual retrieval and generation entirely on the client, with no API calls, no data leaving the device, and no usage costs. For certain categories of application, this is now buildable.