Notes from six months of running machine learning models directly in the browser using TensorFlow.js, ONNX Runtime Web, and WebGPU — what works, what doesn't, and what genuinely surprised me.
The browser is becoming a legitimate ML inference environment. That sentence would have sounded absurd five years ago. Today, with WebGPU shipping in major browsers and quantized model formats that compress a 7B-parameter model into a few gigabytes of GPU memory (about 3.5 GB of weights at 4 bits per parameter), it's increasingly real.
Over the last six months, I've been running experiments — mostly in the context of MeshLearn's on-device recommendation system, but also as standalone explorations. Here's what I've learned.
I've tried three primary approaches:
The arrival of WebGPU in Chrome 113 and Firefox Nightly is genuinely significant. WebGL was designed for graphics, not compute: ML libraries have to pack tensors into textures and express kernels as fragment shaders, which works for inference but is a workaround. WebGPU exposes compute shaders directly, which is the right primitive for matrix multiplication at scale.
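Since WebGPU isn't available everywhere yet, an app usually needs a fallback order. Here's a minimal sketch of one way to pick it. The `Capabilities` type, the function names, and the WebGPU → WebGL → WebAssembly ordering are my own illustrative assumptions, not part of any library.

```typescript
// Hypothetical helper types and names, for illustration only.
type Backend = "webgpu" | "webgl" | "wasm";

interface Capabilities {
  hasWebGPU: boolean;
  hasWebGL: boolean;
}

// Probe capabilities from a navigator-like object and an (optional) WebGL context.
// Note: "gpu" in navigator only says the API is exposed; requestAdapter() can
// still resolve to null, so a real app should be ready to fall back at runtime.
function detectCapabilities(nav: object, webglContext: unknown): Capabilities {
  return {
    hasWebGPU: "gpu" in nav,
    hasWebGL: webglContext != null,
  };
}

// Return backends in preference order, always ending with the CPU fallback.
function preferredBackends(caps: Capabilities): Backend[] {
  const order: Backend[] = [];
  if (caps.hasWebGPU) order.push("webgpu");
  if (caps.hasWebGL) order.push("webgl");
  order.push("wasm");
  return order;
}
```

In a browser, this might be called as `preferredBackends(detectCapabilities(navigator, document.createElement("canvas").getContext("webgl2")))`. The resulting list is shaped so it could be handed to a runtime's backend or execution-provider option, but check the provider names your runtime actually accepts before relying on it.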
In my benchmarks, the same ONNX model running with WebGPU is roughly 3–4× faster than WebGL on an M1 MacBook. On machines with dedicated GPUs, the gap is larger.
Model: DistilBERT-base-uncased-finetuned-sst-2 (quantized INT8, 67MB)
WebGL backend: ~185ms per inference
WebGPU backend: ~48ms per inference
WebAssembly SIMD (CPU): ~320ms per inference
Tested on M1 MacBook Air, Chrome 124
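To make the ratios explicit, here's the arithmetic on the numbers above as a small sketch. The helper names are mine; the latencies are the ones just listed.

```typescript
// Per-inference latencies from the benchmark above, in milliseconds.
const latencyMs = { webgl: 185, webgpu: 48, wasmSimd: 320 };

// How many times faster the candidate is than the baseline.
function speedup(baselineMs: number, candidateMs: number): number {
  return baselineMs / candidateMs;
}

// Sequential inferences per second at a given latency.
function inferencesPerSecond(ms: number): number {
  return 1000 / ms;
}

const gpuVsWebgl = speedup(latencyMs.webgl, latencyMs.webgpu);   // ≈ 3.9×
const gpuVsWasm = speedup(latencyMs.wasmSimd, latencyMs.webgpu); // ≈ 6.7×
const webgpuRate = inferencesPerSecond(latencyMs.webgpu);        // ≈ 21 per second

console.log(gpuVsWebgl.toFixed(1), gpuVsWasm.toFixed(1), webgpuRate.toFixed(0));
```

So for this model, WebGPU sits at the upper end of the 3–4× range against WebGL, and the gap against the CPU path is closer to 7×.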
The biggest practical barrier to client-side ML isn't performance — it's model size. Even quantized models are large by web standards. A quantized DistilBERT is 67MB. A quantized Whisper base is 150MB. Users aren't used to web applications that require a 150MB download before they do anything useful.
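To put those sizes in user-facing terms, here's a rough download-time estimate. The connection speeds are assumptions I picked for illustration, and the calculation ignores latency, compression, and caching.

```typescript
// Estimated transfer time in seconds for a file of `sizeMB` megabytes
// over a link of `mbps` megabits per second (1 byte = 8 bits).
function downloadSeconds(sizeMB: number, mbps: number): number {
  if (mbps <= 0) throw new Error("bandwidth must be positive");
  return (sizeMB * 8) / mbps;
}

// Model sizes from the text; bandwidths are illustrative assumptions.
const distilbertAt10 = downloadSeconds(67, 10);  // 53.6 s
const whisperAt10 = downloadSeconds(150, 10);    // 120 s
const whisperAt50 = downloadSeconds(150, 50);    // 24 s

console.log(distilbertAt10, whisperAt10, whisperAt50);
```

Even on a decent connection, that's tens of seconds before the first inference, which is why the first-load experience matters so much.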
Approaches that help:
This is the part that actually excites me most about browser-side ML. When inference happens on the client, user data never leaves the device. For applications where the input is sensitive — health data, private communications, financial information — this is not just a privacy win, it's a competitive advantage.
MeshLearn's recommendation engine works this way by design. Your study patterns, your completion rates, your engagement signals: they all stay on your phone. The model improves via federated learning, which shares model updates rather than the data they were computed from, so the raw data never moves. That's a meaningful commitment that purely cloud-based EdTech cannot make.
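For readers who haven't seen federated learning up close, here's a minimal sketch of the aggregation step in federated averaging (FedAvg), the textbook version of the idea. This is an assumption for illustration: it is not a description of MeshLearn's actual pipeline, and the `ClientUpdate` type and function name are hypothetical.

```typescript
// One client's contribution: locally trained weights plus how many
// local examples they were trained on. Raw examples are never included.
interface ClientUpdate {
  weights: number[];
  sampleCount: number;
}

// Weighted average of client weights, proportional to local sample counts.
function federatedAverage(updates: ClientUpdate[]): number[] {
  if (updates.length === 0) throw new Error("no client updates");
  const dim = updates[0].weights.length;
  const total = updates.reduce((sum, u) => sum + u.sampleCount, 0);
  if (total <= 0) throw new Error("total sample count must be positive");

  const averaged = new Array<number>(dim).fill(0);
  for (const u of updates) {
    if (u.weights.length !== dim) throw new Error("weight shape mismatch");
    const share = u.sampleCount / total;
    for (let i = 0; i < dim; i++) {
      averaged[i] += share * u.weights[i];
    }
  }
  return averaged;
}

const merged = federatedAverage([
  { weights: [1, 2], sampleCount: 1 },
  { weights: [3, 4], sampleCount: 3 },
]);
console.log(merged); // [2.5, 3.5]
```

The server only ever sees the `weights` arrays and counts. Production systems typically layer secure aggregation or differential privacy on top, since weight updates can still leak information.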
The area I'm most excited about right now is retrieval-augmented generation (RAG) in the browser. The pieces are coming together: small embedding models (all-MiniLM-L6-v2 is 23MB quantized), client-side vector stores like LanceDB-WASM, and small language models that can run inference at reasonable speeds on modern hardware.
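The retrieval half of that pipeline is simpler than it sounds. Here's a minimal sketch of brute-force top-k search by cosine similarity over an in-memory array. It's a stand-in for a real vector store, not LanceDB's API; the `Doc` type and function names are hypothetical, and the 2-D vectors are toy values (all-MiniLM-L6-v2 produces 384-dimensional embeddings).

```typescript
// A stored chunk of text with its precomputed embedding.
interface Doc {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query and keep the k best matches.
function topK(query: number[], docs: Doc[], k: number): { doc: Doc; score: number }[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const docs: Doc[] = [
  { id: "a", text: "WebGPU compute shaders", embedding: [1, 0] },
  { id: "b", text: "Sourdough starter tips", embedding: [0, 1] },
  { id: "c", text: "GPU-friendly baking", embedding: [1, 1] },
];

const hits = topK([1, 0], docs, 2);
console.log(hits.map((h) => h.doc.id)); // ["a", "c"]
```

A linear scan like this is fine for a few thousand chunks. Past that, an indexed vector store starts to earn its download size. The retrieved `text` fields would then be prepended to the prompt for the local language model.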
The vision: a web application that can perform sophisticated contextual retrieval and generation entirely on the client, with no API calls, no data leaving the device, and no usage costs. For certain categories of application, this is now buildable.
I love talking about ML in constrained environments. Let's compare notes.