Deep Dive · 10 min read · June 10, 2026

How AI Background Removal Works: ISNet, ONNX Runtime, and WebAssembly Explained

The two-second background removal you see in your browser hides a deep technology stack. Here is exactly what ISNet, ONNX Runtime Web, and WebAssembly do, and why it all runs without uploading a single pixel.

ZeroPNG Team

Editorial

The "Magic" Behind the Two-Second Cut

You drop a photo into a background remover, wait two seconds, and the subject is cleanly isolated with a transparent background. No drawing tools, no manual selection, no Photoshop expertise required. It looks effortless, because the interface hides a sophisticated technology stack running silently inside your browser tab.

This post is the technical explanation of how it actually works: the neural network, the model format, the browser runtime, and the execution engine that make server-grade AI inference possible on your own device, without uploading a single pixel.

Step 1: The Neural Network - ISNet

At the core of every AI background remover is a deep learning model trained to do one thing precisely: separate a foreground subject from its background at the pixel level. The model ZeroPNG uses is called ISNet (IS-Net, short for intermediate supervision network).

Unlike simple edge-detection algorithms or threshold-based approaches, ISNet draws on semantic context. It has been trained on a large set of labelled images and has learned to recognize where a person ends and the background begins, even when colors are similar, edges are complex (hair, fur, semi-transparent fabric), or the subject is non-human.

What Is Image Segmentation?

Image segmentation is the process of classifying every pixel in an image into one of two (or more) categories. In background removal, the two categories are straightforward: subject (keep) and background (remove). The neural network outputs a matte - a grayscale mask the same dimensions as the input image, where white pixels represent the subject and black pixels represent the background. Pixels at the edge of the subject (the wisps of hair against a bright sky, for example) receive an intermediate grey value representing partial opacity, producing smooth anti-aliased edges rather than a jagged cutout.

This mask is then composited with the original image: subject pixels keep their original RGBA values, background pixels become fully transparent, and edge pixels have their alpha channel set proportionally to the mask value.
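
The compositing step described above can be sketched in a few lines. This is a minimal illustration, not ZeroPNG's or the library's actual code: the function name `applyMatte` and the assumption that the matte holds one value per pixel in the range 0 to 1 are ours.

```javascript
// Sketch: apply a grayscale matte to RGBA pixel data.
// `rgba` is laid out as R,G,B,A per pixel (the layout of ImageData.data).
// `matte` holds one value per pixel in [0, 1]: 1 = subject, 0 = background.
function applyMatte(rgba, matte) {
  if (rgba.length !== matte.length * 4) {
    throw new Error('matte must have exactly one value per pixel');
  }
  const out = new Uint8ClampedArray(rgba); // copy, so the input is left untouched
  for (let i = 0; i < matte.length; i++) {
    // Scale the existing alpha by the matte value; the RGB values stay as they are.
    out[i * 4 + 3] = Math.round(rgba[i * 4 + 3] * matte[i]);
  }
  return out;
}

// Three opaque red pixels: subject, hair edge, background.
const demoPixels = new Uint8ClampedArray([255, 0, 0, 255, 255, 0, 0, 255, 255, 0, 0, 255]);
const demoResult = applyMatte(demoPixels, [1, 0.5, 0]);
// The alpha values are now 255 (kept), 128 (half transparent), and 0 (removed).
```

The edge pixel keeps its color and only loses opacity, which is what produces the soft transition around hair and fur.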

How ISNet Compares to Other Segmentation Models

| Model | Strength | Limitation |
| --- | --- | --- |
| ISNet | High-accuracy isolation, handles hair and fur well | Heavier than simpler models |
| U²-Net | Good general segmentation | Less precise on fine edges |
| MODNet | Fast portrait matting | Struggles with non-human subjects |
| SAM (Meta) | Universal, interactive segmentation | Far too large for real-time browser use |

ISNet hits the right balance for a general-purpose web tool: accurate enough for portraits, animals, and product photos, small enough to run in a browser after quantization.

The Quantized Variant: isnet_quint8

The ISNet model ships in three sizes:

  • isnet - Full 32-bit precision, ~170 MB, highest quality
  • isnet_fp16 - 16-bit half precision, ~80 MB, the library default
  • isnet_quint8 - 8-bit integer quantized, ~30 MB, fastest download and inference

ZeroPNG uses isnet_quint8. Quantization is the process of converting a model's weights from 32-bit floating point values to 8-bit integers. This compresses the model to roughly one-fifth of its original size and speeds up CPU inference significantly, at the cost of only a marginal and typically imperceptible drop in accuracy for standard photographs. For a web tool where the model must be downloaded over the network before it can run, the quantized variant is the only practical choice that keeps the first-run experience tolerable.
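
To make the idea of quantization concrete, here is a minimal sketch of affine 8-bit quantization for a single array of weights. It illustrates the general technique only: the exact scheme used to produce isnet_quint8 is not described here, and the names `quantizeUint8` and `dequantize` are hypothetical.

```javascript
// Sketch: map float weights onto 256 integer levels using a scale and a zero point.
function quantizeUint8(weights) {
  let min = 0; // starting both bounds at 0 keeps 0.0 inside the representable range
  let max = 0;
  for (const w of weights) {
    if (w < min) min = w;
    if (w > max) max = w;
  }
  const scale = (max - min) / 255 || 1; // fall back to 1 when every weight is zero
  const zeroPoint = Math.round(-min / scale);
  const q = Uint8Array.from(weights, (w) =>
    Math.min(255, Math.max(0, Math.round(w / scale) + zeroPoint))
  );
  return { q, scale, zeroPoint };
}

// Recover approximate float values from the stored integers.
function dequantize({ q, scale, zeroPoint }) {
  return Float32Array.from(q, (v) => (v - zeroPoint) * scale);
}

const floatWeights = new Float32Array([-1, -0.25, 0, 0.5, 1]);
const packed = quantizeUint8(floatWeights);
const restored = dequantize(packed);
// floatWeights.byteLength is 20, packed.q.byteLength is 5: one byte per weight instead of four.
// Each restored value differs from the original by about one quantization step at most.
```

Storing one byte per weight instead of four is where most of the size reduction comes from; the small rounding error per weight is the accuracy cost mentioned above.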

Step 2: The Model Format - ONNX

A neural network trained in PyTorch, TensorFlow, or any other deep learning framework produces a model file tied to that specific framework. To run a PyTorch model, you need PyTorch. On a Python server, this is trivial. In a browser, it is impractical: browsers do not ship a Python runtime, let alone the framework's native libraries.

ONNX (Open Neural Network Exchange) solves this. It is an open, platform-agnostic format for representing a neural network as a standardized computation graph: a serialized description of every operation, every layer, and every weight value in the network, in a format that any compliant runtime can parse and execute.

Think of ONNX as a PDF for neural networks. The researcher trains using PyTorch (or whatever framework they prefer). The trained model is exported to .onnx. The end user's runtime reads that file, whether on a server, on a phone, or in a browser, without caring what framework created it.

The ISNet model in ZeroPNG's background remover is an .onnx binary that describes the exact operations, layer shapes, and quantized weight values of the network in runtime-agnostic form. This is what gets downloaded to your browser on first use.

Step 3: The Runtime - ONNX Runtime Web

Having an ONNX model file is only half the story. You still need an engine to parse that file and execute the computation graph. That engine is ONNX Runtime, an open-source high-performance inference library built by Microsoft.

ONNX Runtime is available for every major deployment environment: C++ and Python for servers, Java and Kotlin for Android, Swift for iOS, and JavaScript for the browser. The browser-specific build is called ONNX Runtime Web.

ONNX Runtime Web exposes a JavaScript API that handles the full inference lifecycle:

  1. Load the .onnx file from network or browser cache
  2. Create an InferenceSession - compile the computation graph for the available backend
  3. Prepare input tensors: raw pixel data from your image, resized and normalized to float32
  4. Run the forward pass - every layer of ISNet executes in sequence
  5. Read the output tensor - the raw segmentation mask as float values
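
The five steps above map onto a small amount of code. The sketch below assumes `ort` is the onnxruntime-web module (obtained with something like `import * as ort from 'onnxruntime-web'`) and passes it in as a parameter so the snippet has no imports of its own. The function name `runSegmentation`, the model URL, and the `[1, 3, size, size]` input shape are assumptions for illustration, not taken from ZeroPNG's source.

```javascript
// Sketch of the inference lifecycle using the ONNX Runtime Web API.
// `pixels` is a Float32Array that has already been resized and normalized.
async function runSegmentation(ort, modelUrl, pixels, size) {
  // Steps 1 and 2: load the model and build a session for the WASM backend.
  const session = await ort.InferenceSession.create(modelUrl, {
    executionProviders: ['wasm'],
  });

  // Step 3: wrap the pixel data as a [batch, channels, height, width] tensor.
  const input = new ort.Tensor('float32', pixels, [1, 3, size, size]);

  // Step 4: run the forward pass. Input and output names come from the model file.
  const results = await session.run({ [session.inputNames[0]]: input });

  // Step 5: read the first output, which holds the mask as float values.
  return results[session.outputNames[0]].data;
}
```

Reading the tensor names from `session.inputNames` and `session.outputNames` avoids hard-coding names that differ from one exported model to another.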

Internally, ONNX Runtime Web supports multiple execution backends. The application lists them in order of preference, and the runtime uses the first one that initializes successfully:

  • WebAssembly (WASM) - CPU-based inference, supported in every modern browser
  • WebGL - GPU-accelerated by running tensor operations as WebGL shader programs, significantly faster on capable hardware
  • WebGPU - Next-generation GPU compute, available in Chrome 113+ and Edge

With WASM as the last entry, the runtime falls back gracefully through this list, which means the tool still produces a result regardless of the user's hardware, even on older devices that only have WASM available.
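
One way to build such a preference list is to detect capabilities first and always end with WASM. The helper below is a hypothetical sketch, not the library's logic; the strings 'webgpu', 'webgl', and 'wasm' are the backend names ONNX Runtime Web accepts in its `executionProviders` session option.

```javascript
// Sketch: turn capability flags into an ordered backend preference list.
function preferredBackends({ hasWebGPU, hasWebGL }) {
  const order = [];
  if (hasWebGPU) order.push('webgpu');
  if (hasWebGL) order.push('webgl');
  order.push('wasm'); // CPU fallback, always last so there is always something to run
  return order;
}

// In a browser, the flags might be derived like this (an assumption, shown as comments
// because `navigator` and `document` only exist inside a page):
//   hasWebGPU: 'gpu' in navigator
//   hasWebGL:  !!document.createElement('canvas').getContext('webgl2')
```

Whether a given backend can actually run a particular model depends on which operators that backend supports, so the CPU entry at the end is what guarantees a result.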

Step 4: The Execution Engine - WebAssembly

WebAssembly (WASM) is a binary instruction format that modern browsers execute at near-native speed. It was designed to allow code written in compiled languages such as C, C++, and Rust to run inside the browser's sandbox without any plugins or extensions.

Before WebAssembly, every computation in the browser had to go through JavaScript. JavaScript is a dynamically typed language, excellent for building user interfaces, but even with modern JIT compilers it is poorly suited to the billions of floating-point multiplications that a single neural network inference pass requires. Running ISNet in plain JavaScript could take minutes per image rather than seconds.
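
To see what "binary instruction format" means in practice, the sketch below hand-encodes about the smallest useful WebAssembly module, a single exported function that adds two 32-bit integers, and runs it through the standard WebAssembly JavaScript API. It is unrelated to ONNX Runtime's actual binary, which is compiled from C++ and is far larger.

```javascript
// A complete WebAssembly module, byte by byte.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // one function, using type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);

// Synchronous compilation is fine for a module this small.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule);

const sum = wasmInstance.exports.add(2, 40); // 42
```

The browser validates and compiles these bytes to machine code before running them, which is why the same mechanism scales up to an entire inference engine.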

WebAssembly changes this: ONNX Runtime's core inference engine is written in C++ and compiled to WASM. The browser executes the same optimized, low-level arithmetic that would run on a server, just inside your tab, on your hardware, with your data never leaving your device.

The performance gap is not marginal. WebAssembly typically executes within roughly 10–30% of the speed of equivalent native code. JavaScript for the same compute-intensive workload can be 10–100× slower. That difference is what separates a two-second result from a three-minute wait.

The Full Pipeline: What Happens When You Drop a Photo

Putting it all together, here is the exact sequence ZeroPNG's background remover follows:

  1. First visit only: The isnet_quint8.onnx file (~30 MB) is fetched from IMG.LY's CDN and stored in the browser's cache. This is a one-time download.
  2. Model compilation: ONNX Runtime Web reads the cached model and compiles an InferenceSession, preparing the computation graph for the chosen backend. This is the "Compiling AI neural net…" step shown in the progress bar.
  3. Pre-processing: Your image is decoded into raw pixel data, resized to the model's fixed 1024×1024 input dimensions, and converted to a float32 tensor with values normalized to the [0, 1] range.
  4. Inference: The input tensor passes through all layers of ISNet (stacked convolutional blocks joined by skip connections), producing an output tensor that maps each pixel to a subject-probability score.
  5. Post-processing: The mask tensor is resized back to the original image dimensions. Each pixel's alpha channel is set proportionally to its mask score, producing smooth, anti-aliased edges around the subject.
  6. Output: The result is encoded as a PNG file, a lossless and universally supported format with a full alpha channel, and presented for download. All six steps happen entirely on your device.
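
The pre-processing in step 3 is mostly a change of memory layout. A minimal sketch follows, assuming the image has already been resized (for example by drawing it onto a canvas) and that the model expects planar channel data scaled to [0, 1] as described above. The function name `rgbaToPlanarFloat32` is hypothetical, and the library's real normalization constants may differ.

```javascript
// Sketch: convert interleaved RGBA bytes (the layout of ImageData.data) into a
// planar float32 tensor ordered [channels, height, width]. Alpha is dropped.
function rgbaToPlanarFloat32(rgba, width, height) {
  const pixelCount = width * height;
  if (rgba.length !== pixelCount * 4) {
    throw new Error('rgba length does not match width * height * 4');
  }
  const tensor = new Float32Array(3 * pixelCount);
  for (let i = 0; i < pixelCount; i++) {
    tensor[i] = rgba[i * 4] / 255;                      // red plane
    tensor[pixelCount + i] = rgba[i * 4 + 1] / 255;     // green plane
    tensor[2 * pixelCount + i] = rgba[i * 4 + 2] / 255; // blue plane
  }
  return tensor;
}

// A 2x1 image: one pure red pixel, one pure green pixel.
const tinyImage = new Uint8ClampedArray([255, 0, 0, 255, 0, 255, 0, 255]);
const tinyTensor = rgbaToPlanarFloat32(tinyImage, 2, 1);
// tinyTensor is [1, 0, 0, 1, 0, 0]: the red plane, then the green plane, then the blue plane.
```

For a 1024×1024 input this produces 3 × 1024 × 1024 float values, which is the buffer handed to the runtime as the input tensor.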

The Library Behind It: @imgly/background-removal

ZeroPNG uses the @imgly/background-removal npm package by IMG.LY, which wraps the entire ISNet + ONNX Runtime + WebAssembly stack into a single async function call. The configuration in ZeroPNG's source is deliberately minimal:

import { removeBackground } from '@imgly/background-removal'

const blob = await removeBackground(imageFile, {
  model: 'isnet_quint8',
  progress: (key, current, total) => {
    // update UI progress indicator
  }
})

That single removeBackground() call manages model fetching, cache management, ONNX Runtime session initialization, tensor preparation, inference, mask compositing, and PNG encoding. The complexity is real. The API surface is minimal.
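
The progress callback is where the download and compile stages become visible to the user. As a small sketch of what the "update UI progress indicator" placeholder might do, the helper below turns the callback's three arguments into a label. The name `formatProgress` and the label format are hypothetical, and the key is simply whatever string the library reports for the current stage.

```javascript
// Sketch: format the (key, current, total) values from the progress callback.
function formatProgress(key, current, total) {
  // Guard against a zero or missing total so the label never shows NaN.
  const percent = total > 0 ? Math.round((current / total) * 100) : 0;
  return `${key}: ${percent}%`;
}

// Possible use inside the configuration (statusEl is an assumed DOM element):
//   progress: (key, current, total) => {
//     statusEl.textContent = formatProgress(key, current, total);
//   }
```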

Performance and Caching

The first run requires a ~30 MB model download. On a typical broadband connection this takes 3–8 seconds. That download happens exactly once. After the first use, the browser caches both the model file and the ONNX Runtime WASM binaries. Every subsequent run loads from cache, which makes it fast and lets the tool work fully offline.
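
The 3–8 second estimate follows from simple arithmetic on file size and link speed. The sketch below ignores latency and protocol overhead, and the link speeds are example values rather than measurements.

```javascript
// Sketch: ideal transfer time for a file of `sizeMB` megabytes over a `linkMbps` link.
function downloadSeconds(sizeMB, linkMbps) {
  const megabits = sizeMB * 8; // 1 byte = 8 bits
  return megabits / linkMbps;
}

const fastLink = downloadSeconds(30, 80); // 3 seconds at 80 Mbps
const slowLink = downloadSeconds(30, 30); // 8 seconds at 30 Mbps
```

On slower mobile connections the same calculation gives proportionally longer waits, which is why the one-time nature of the download matters.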

Inference time once the model is loaded is typically 2–5 seconds for a standard photograph on a modern device. Lower-end hardware takes longer. WebGL and WebGPU acceleration reduce this substantially when available. The quantized variant is specifically chosen to keep this latency acceptable across the widest possible range of devices.

Why This Architecture Matters Beyond One Tool

The ISNet + ONNX + WebAssembly pattern is not specific to background removal. It is a general blueprint for running other neural networks (object detection, style transfer, super-resolution, OCR, image classification) directly in the browser. A model that can be exported to ONNX, and that is small enough to download, can be shipped to a user's device and executed without a server.

Every capability that migrates from server-side to client-side becomes private by default, free by default, and offline-capable by default. The browser has quietly become a serious AI runtime. ZeroPNG's background remover is one example of what that makes possible.

See It Running in Your Browser

ZeroPNG's AI Background Remover runs ISNet directly in your browser tab. No uploads, no account, no watermark. Drop any photo and the model does the rest, entirely on your device.

Remove Background Free →
