目前在 Chrome 和 Edge Canary 中可用 (runtime flag)
WebNN op | WebNN Spec / Web Platform Tests | TensorFlow Lite for TensorFlow.js XNNPack/CPU backend | External Delegate | Execution Provider |
---|---|---|---|---|
clamp | 📈 🧪 | ✅ clamp | ✅ ReluN1To1 ✅ Relu6 | ✅ Clip |
concat | 📈 🧪 | ✅ concatenate2 ✅ concatenate3 ✅ concatenate4 | ✅ Concatenation | ✅ Concat |
conv2d | 📈 🧪 | ✅ convolution_2d | ✅ Conv2d ✅ DepthwiseConv2d | ✅ Conv |
convTranspose2d | 📈 🧪 | ✅ deconvolution_2d | ✅ TransposeConv ✅ Convolution2DTransposeBias | ✅ ConvTranspose |
add element-wise binary | 📈 🧪 | ✅ add2 | ✅ Add | ✅ Add |
sub element-wise binary | 📈 🧪 | ✅ subtract | ✅ Sub | ✅ Sub |
mul element-wise binary | 📈 🧪 | ✅ multiply2 | ✅ Mul | ✅ Mul |
div element-wise binary | 📈 🧪 | ✅ divide | ✅ Div | ✅ Div |
max element-wise binary | 📈 🧪 | ✅ maximum2 | ✅ Maximum | ✅ Max |
min element-wise binary | 📈 🧪 | ✅ minimum2 | ✅ Minimum | ✅ Min |
abs element-wise unary | 📈 🧪 | ✅ abs | ✅ Abs | ✅ Abs |
ceil element-wise unary | 📈 🧪 | ✅ ceiling | ✅ Ceil | ✅ Ceil |
floor element-wise unary | 📈 🧪 | ✅ floor | ✅ Floor | ✅ Floor |
neg element-wise unary | 📈 🧪 | ✅ negate | ✅ Neg | ✅ Neg |
elu | 📈 🧪 | ✅ elu | ✅ Elu | ✅ Elu |
gemm | 📈 🧪 | ✅ fully_connected | ✅ FullyConnected | ✅ Gemm |
WebNN op | WebNN Spec / Web Platform Tests | TensorFlow Lite for TensorFlow.js XNNPack/CPU backend | External Delegate | Execution Provider |
---|---|---|---|---|
hardSwish | 📈 🧪 | ✅ hardswish | ✅ HardSwish | ✅ HardSwish |
leakyRelu | 📈 🧪 | ✅ leaky_relu | ✅ LeakyRelu | ✅ LeakyRelu |
pad | 📈 🧪 | ✅ static_constant_pad | ✅ Pad | ✅ Pad |
averagePool2d (pooling) | 📈 🧪 | ✅ average_pooling_2d | ✅ AveragePool2d ✅ Mean | ✅ GlobalAveragePool ✅ AveragePool |
maxPool2d (pooling) | 📈 🧪 | ✅ max_pooling_2d | ✅ MaxPool2d | ✅ GlobalMaxPool ✅ MaxPool |
prelu | 📈 🧪 | ✅ prelu | ✅ Prelu | ✅ Prelu |
relu | 📈 🧪 | ✅ clamp | ✅ Relu | ✅ Relu |
resample2d | 🚀🚀 | ✅ static_resize_bilinear_2d | ✅ ResizeBilinear | ✅ Resize |
reshape | 📈 🧪 | ✅ static_reshape | ✅ Reshape | ✅ Reshape |
sigmoid | 📈 🧪 | ✅ sigmoid | ✅ Logistic | ✅ Sigmoid |
split | 📈 🧪 | ✅ even_split2 ✅ even_split3 ✅ even_split4 ✅ static_slice (uneven split) | ✅ Split | ✅ Split |
slice | 📈 🧪 | ✅ static_slice | ✅ Slice ✅ StridedSlice | ✅ Slice |
softmax | 📈 🧪 | ✅ softmax | ✅ Softmax | ✅ Softmax |
transpose | 📈 🧪 | ✅ static_transpose | ✅ Transpose | ✅ Transpose |
78 | 29 | 35 | 34 | 68 |
// Obtain a WebNN context, preferring low power consumption.
const context = await navigator.ml.createContext({powerPreference: 'low-power'});
// Graph topology being built:
//   constant1 ---+
//                +--- Add ---> sum1 ---+
//   input1 ------+                     |
//                                      +--- Mul ---> product
//   constant2 ---+                     |
//                +--- Add ---> sum2 ---+
//   input2 ------+
// Every operand is a 4-D tensor of shape [1, 2, 2, 2] (8 elements).
const TENSOR_SHAPE = [1, 2, 2, 2];
const ELEMENT_COUNT = 8;
const builder = new MLGraphBuilder(context);
// One shared MLOperandDescriptor for all operands in the graph.
const operandDesc = {dataType: 'float32', dimensions: TENSOR_SHAPE};
// Two constant operands, each filled with the value 0.5.
const halfValues1 = new Float32Array(ELEMENT_COUNT).fill(0.5);
const constant1 = builder.constant(operandDesc, halfValues1);
const halfValues2 = new Float32Array(ELEMENT_COUNT).fill(0.5);
const constant2 = builder.constant(operandDesc, halfValues2);
// Two graph inputs; their values are supplied at compute() time.
const input1 = builder.input('input1', operandDesc);
const input2 = builder.input('input2', operandDesc);
// sum1 = constant1 + input1, sum2 = constant2 + input2.
const sum1 = builder.add(constant1, input1);
const sum2 = builder.add(constant2, input2);
// product = sum1 * sum2 — the single graph output, exported as 'output'.
const product = builder.mul(sum1, sum2);
// Compile the constructed graph.
const graph = await builder.build({'output': product});
// Input buffers filled with 1; output buffer receives the result.
const inputBuffer1 = new Float32Array(ELEMENT_COUNT).fill(1);
const inputBuffer2 = new Float32Array(ELEMENT_COUNT).fill(1);
const outputBuffer = new Float32Array(ELEMENT_COUNT);
// Execute the compiled graph: each element is (0.5 + 1) * (0.5 + 1) = 2.25.
const inputs = {
  'input1': inputBuffer1,
  'input2': inputBuffer2,
};
const outputs = {'output': outputBuffer};
const results = await context.compute(graph, inputs, outputs);
console.log('Output value: ' + results.outputs.output);
// Output value: 2.25,2.25,2.25,2.25,2.25,2.25,2.25,2.25
// Import the named `env` export as well — the original referenced a global
// `ort` object that was never imported, so `ort.env.*` was undefined here.
import { InferenceSession, env } from "onnxruntime-web";
// ...
// Initialize the ONNX model using the WebAssembly (CPU) execution provider.
// The wasm backend flags must be set before the session is created.
const initModel = async () => {
  env.wasm.numThreads = 1; // raise (e.g. to 4) for multi-threaded inference
  env.wasm.simd = true;    // enable SIMD-accelerated wasm kernels
  env.wasm.proxy = true;   // run the wasm backend inside a web worker
  const options: InferenceSession.SessionOptions = {
    // provider name: wasm, webnn
    // deviceType: cpu, gpu
    // powerPreference: default, high-performance
    executionProviders:
      [{ name: "wasm"}], // WebAssembly CPU
  };
  // ...
};
// Run inference and read the first output tensor.
const results = await model.run(feeds);
const output = results[model.outputNames[0]];
WebAssembly 后端
// Import the named `env` export as well — the original used `env.wasm.*`
// below without ever importing `env`, so it was undefined.
import { InferenceSession, env } from "onnxruntime-web";
// ...
// Initialize the ONNX model using the WebNN execution provider (GPU device).
// The wasm backend flags must be set before the session is created.
const initModel = async () => {
  env.wasm.numThreads = 1; // raise (e.g. to 4) for multi-threaded inference
  env.wasm.simd = true;    // enable SIMD-accelerated wasm kernels
  env.wasm.proxy = true;   // run the wasm backend inside a web worker
  const options: InferenceSession.SessionOptions = {
    // provider name: wasm, webnn
    // deviceType: cpu, gpu
    // powerPreference: default, high-performance
    executionProviders:
      [{ name: "webnn", deviceType: "gpu", powerPreference: 'default' }],
  };
  // ...
};
// Run inference and read the first output tensor.
const results = await model.run(feeds);
const output = results[model.outputNames[0]];
WebNN 后端