WebNN: The Future of On-Device Inference on the Web

Ningxin Hu (胡宁馨) ningxin.hu@intel.com
Belem Zhang (张敏) belem.zhang@intel.com
Intel SATG, Web Platform Engineering
November 2023

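Advantages of WebML client-side inference

Privacy

Sensor data from cameras, microphones, and similar devices stays on the device

Offline

Once the initial resources are cached, inference no longer depends on the network

Latency

No round trip to the cloud; inference runs in real time in the browser

Cost

No cloud compute required

Zero install

Runs in the browser with nothing extra to install, and is easy to share

Cross-platform

AI applications run on almost every platform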
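
The offline point can be illustrated with the browser Cache API. The following is a minimal sketch, not part of the original slides: `loadModelBytes` and the cache name are hypothetical, and the cache storage and fetch function are passed in as arguments so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: return model bytes from the cache when present,
// otherwise fetch them once and store a copy for later offline use.
// `cacheStorage` is expected to behave like the browser's `caches` object
// and `fetchFn` like `fetch`; both are injected for testability.
async function loadModelBytes(url, cacheStorage, fetchFn, cacheName = 'webnn-models') {
  const cache = await cacheStorage.open(cacheName);
  let response = await cache.match(url);
  if (!response) {
    response = await fetchFn(url);
    if (!response.ok) {
      throw new Error(`Failed to fetch ${url}: ${response.status}`);
    }
    // Store a clone, because a Response body can be read only once.
    await cache.put(url, response.clone());
  }
  return response.arrayBuffer();
}
```

In a browser this would be called as `loadModelBytes(modelUrl, caches, fetch)`, where `modelUrl` is whatever model file the application ships.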

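WebML client-side inference

Diverse client AI scenarios, served by multiple compute units

Workload profiles
  • Bursty: latency sensitive
  • Sustained: power sensitive
  • Periodic: throughput sensitive

Compute units
  • CPU: ubiquitous; low latency for single inference tasks
  • GPU: highly parallel, suited to large batch sizes; integrates with 3D, rendering, and media pipelines
  • NPU: dedicated low-power AI accelerator; high performance per watt, improving power efficiency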
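
One way to read the pairing of workload profiles (latency, power, and throughput sensitive) with compute units is as a choice of WebNN context options. The sketch below is illustrative only: the mapping is an assumption, `contextOptionsFor` is a hypothetical helper, and the `'npu'` device type in particular is assumed rather than confirmed by the slides.

```javascript
// Illustrative mapping from workload profile to WebNN context options.
// The mapping itself and the 'npu' device type are assumptions.
const CONTEXT_OPTIONS_BY_WORKLOAD = {
  bursty: { deviceType: 'cpu', powerPreference: 'default' },            // latency sensitive
  sustained: { deviceType: 'npu', powerPreference: 'low-power' },       // power sensitive
  periodic: { deviceType: 'gpu', powerPreference: 'high-performance' }, // throughput sensitive
};

function contextOptionsFor(workload) {
  const options = CONTEXT_OPTIONS_BY_WORKLOAD[workload];
  if (!options) {
    throw new Error(`Unknown workload profile: ${workload}`);
  }
  // Return a copy so callers cannot mutate the shared table.
  return { ...options };
}

// In a browser with WebNN enabled, the options would be passed on as:
//   const context = await navigator.ml.createContext(contextOptionsFor('sustained'));
```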

What web developers are asking for

The web needs its own neural networks specification to leverage Apple Silicon, Tensor Cores, and others.
Delighted to find the working drafts of WebNN. Incredible new power unlocked for the free, open and competitive Web!
Native Tensor support! Would be amazing to have Tensor objects and ops built into Chrome, and available as an “ML API”.
Although some scientific computing libraries exist for JS/TS, having built-in support would be far more desirable!
If go through the code of utils, maths, audio, tensor in JS, it is annoying that I had to implement these ops myself in JS.
llama2-7b in the browser – using WebNN – is going to be 🔥🔥 on-device, local ML 💪 cc @xenovacom

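Introduction to WebNN

An emerging W3C web standard API
A unified abstraction for neural networks
Access to AI hardware accelerators through native ML APIs
Near-native AI inference performance with reliable results
Currently available in Chrome and Edge Canary (behind a runtime flag)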
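
Because WebNN currently sits behind a runtime flag, pages generally need to detect it before use. A minimal sketch, assuming the `navigator.ml.createContext` entry point used later in this deck; `hasWebNN` and `tryCreateContext` are hypothetical helper names.

```javascript
// Returns true when the given navigator-like object exposes the WebNN entry
// point (navigator.ml.createContext). Passing the object in keeps the check
// usable outside a browser.
function hasWebNN(nav) {
  return Boolean(nav && nav.ml && typeof nav.ml.createContext === 'function');
}

// Hypothetical helper: try to create a context and return null when WebNN
// is missing or context creation fails (for example, when the flag is off).
async function tryCreateContext(nav, options = {}) {
  if (!hasWebNN(nav)) {
    return null;
  }
  try {
    return await nav.ml.createContext(options);
  } catch (err) {
    return null;
  }
}
```

A page could call `tryCreateContext(navigator)` and fall back to another backend, such as WebAssembly, when the result is `null`.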

The WebNN specification

WebNN Spec
  • Drafted by the W3C Web Machine Learning Working Group
  • Co-edited by Intel and Microsoft

WebNN specification progress

Delivered
  • March 2023: W3C Candidate Recommendation (CR)
  • 60 CNN/RNN operations; float16/32, int32/uint32, int8/uint8
  • Image classification: SqueezeNet, MobileNet, ResNet
  • Object detection: TinyYOLO
  • Noise suppression: RNNoise, NSNet
In progress
  • Late 2023: W3C Candidate Recommendation update
  • 18 to 22 Transformer ops; int64/uint64; NPU and quantization (TBD)
Target models
  • Text-to-image: Stable Diffusion unet/VAE/text encoder
  • Image segmentation: Segment Anything decoder
  • Speech-to-text: Whisper Tiny
  • Text-to-text generation (encoder-decoder): T5 and M2M100
  • Text generation (decoder): LLaMA
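
The delivered data types correspond naturally to JavaScript typed arrays when allocating input and output buffers. The sketch below is an illustration under assumptions: `allocateBuffer` is a hypothetical helper, and storing float16 values as raw 16-bit words in a `Uint16Array` is an assumption made here, not something the slides specify.

```javascript
// Typed array used to hold each delivered data type on the JavaScript side.
const TYPED_ARRAY_BY_DATA_TYPE = {
  float32: Float32Array,
  float16: Uint16Array, // assumption: raw 16-bit storage for half floats
  int32: Int32Array,
  uint32: Uint32Array,
  int8: Int8Array,
  uint8: Uint8Array,
};

// Number of elements in a tensor with the given dimensions.
function elementCount(dimensions) {
  return dimensions.reduce((count, dim) => count * dim, 1);
}

// Hypothetical helper: allocate a zero-filled buffer for a tensor.
function allocateBuffer(dataType, dimensions) {
  const ArrayType = TYPED_ARRAY_BY_DATA_TYPE[dataType];
  if (!ArrayType) {
    throw new Error(`Unsupported data type: ${dataType}`);
  }
  return new ArrayType(elementCount(dimensions));
}
```

For example, `allocateBuffer('float32', [1, 2, 2, 2])` yields an 8-element `Float32Array`, matching the tensor shape used in the code example later in the deck.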

WebNN architecture

WebNN implementation in Chromium

Neural Network Hardware Abstraction Layer / Microsoft Compute Driver Model / Windows Display Driver Model

WebNN operator implementation status (partial)

W3C WebNN Spec | Web Platform Tests | XNNPack/CPU backend (Chrome Dev, Edge Canary; Windows, Linux) | External Delegate (Lite for TensorFlow.js) | Execution Provider
clamp | 📈 🧪 | clamp | ReluN1To1, Relu6 | Clip
concat | 📈 🧪 | concatenate2, concatenate3, concatenate4 | Concatenation | Concat
conv2d | 📈 🧪 | convolution_2d | Conv2d, DepthwiseConv2d | Conv
convTranspose2d | 📈 🧪 | deconvolution_2d | TransposeConv, Convolution2DTransposeBias | ConvTranspose
add (element-wise binary) | 📈 🧪 | add2 | Add | Add
sub (element-wise binary) | 📈 🧪 | subtract | Sub | Sub
mul (element-wise binary) | 📈 🧪 | multiply2 | Mul | Mul
div (element-wise binary) | 📈 🧪 | divide | Div | Div
max (element-wise binary) | 📈 🧪 | maximum2 | Maximum | Max
min (element-wise binary) | 📈 🧪 | minimum2 | Minimum | Min
abs (element-wise unary) | 📈 🧪 | abs ✅ | Abs | Abs
ceil (element-wise unary) | 📈 🧪 | ceiling ✅ | Ceil | Ceil
floor (element-wise unary) | 📈 🧪 | floor ✅ | Floor | Floor
neg (element-wise unary) | 📈 🧪 | negate ✅ | Neg | Neg
elu | 📈 🧪 | elu | Elu | Elu
gemm | 📈 🧪 | fully_connected | FullyConnected | Gemm
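
The clamp row shows several framework activations (ReluN1To1, Relu6, Clip) landing on a single WebNN operator. A plain JavaScript reference, written here only to show why that mapping works, might look like the following; it is a sketch of the arithmetic, not the Chromium implementation, and the helper names are hypothetical.

```javascript
// Reference (non-accelerated) clamp over a Float32Array.
// The option names mirror WebNN's clamp options (minValue, maxValue).
function clamp(values, { minValue = -Infinity, maxValue = Infinity } = {}) {
  return values.map((v) => Math.min(Math.max(v, minValue), maxValue));
}

// Relu6 and ReluN1To1 are clamp with fixed bounds.
const relu6 = (values) => clamp(values, { minValue: 0, maxValue: 6 });
const reluN1To1 = (values) => clamp(values, { minValue: -1, maxValue: 1 });
```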

WebNN operator implementation status (partial)

W3C WebNN Spec | Web Platform Tests | XNNPack/CPU backend (Chrome Dev, Edge Canary; Windows, Linux) | External Delegate (Lite for TensorFlow.js) | Execution Provider
hardSwish | 📈 🧪 | hardswish | HardSwish | HardSwish
leakyRelu | 📈 🧪 | leaky_relu | LeakyRelu | LeakyRelu
pad | 📈 🧪 | static_constant_pad | Pad | Pad
averagePool2d (pooling) | 📈 🧪 | average_pooling_2d | AveragePool2d, Mean | GlobalAveragePool, AveragePool
maxPool2d (pooling) | 📈 🧪 | max_pooling_2d | MaxPool2d | GlobalMaxPool, MaxPool
prelu | 📈 🧪 | prelu | Prelu | Prelu
relu | 📈 🧪 | clamp | Relu | Relu
resample2d | 🚀🚀 | static_resize_bilinear_2d | ResizeBilinear | Resize
reshape | 📈 🧪 | static_reshape | Reshape | Reshape
sigmoid | 📈 🧪 | sigmoid | Logistic | Sigmoid
split | 📈 🧪 | even_split2, even_split3, even_split4, static_slice (uneven split) | Split | Split
slice | 📈 🧪 | static_slice | Slice, StridedSlice | Slice
softmax | 📈 🧪 | softmax | Softmax | Softmax
transpose | 📈 🧪 | static_transpose | Transpose | Transpose
Total: 78 | 29 | 35 | 34 | 68
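
The split row suggests that an even split into two to four parts uses a dedicated XNNPack operator, while an uneven split falls back to one static_slice per output. The sketch below illustrates that decision under those assumptions; `planSplit` is a hypothetical planner and does not reproduce Chromium's actual code.

```javascript
// Hypothetical planner: decide how a split along one axis of size `axisSize`
// could be lowered. `splits` is either a number of equal parts or an array
// of part sizes.
function planSplit(axisSize, splits) {
  const sizes = Array.isArray(splits)
    ? splits
    : Array(splits).fill(axisSize / splits);
  const total = sizes.reduce((sum, size) => sum + size, 0);
  if (total !== axisSize || sizes.some((s) => !Number.isInteger(s) || s <= 0)) {
    throw new Error('Split sizes must be positive integers that add up to the axis size');
  }
  const isEven = sizes.every((size) => size === sizes[0]);
  if (isEven && sizes.length >= 2 && sizes.length <= 4) {
    return { kind: `even_split${sizes.length}`, parts: sizes.length };
  }
  // Otherwise emit one static_slice per output.
  let offset = 0;
  const slices = sizes.map((size) => {
    const slice = { offset, size };
    offset += size;
    return slice;
  });
  return { kind: 'static_slice', slices };
}
```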

WebNN implementation status (DirectML)

  • 40 ops are supported so far
  • Transformer ops are being developed in step with the spec
  • Adaptation work for WebNN NPU support is under way

WebNN programming model

WebNN code example

										
const context = await navigator.ml.createContext({powerPreference: 'low-power'});

// The following code builds a graph as:
// constant1 ---+
//              +--- Add ---> intermediateOutput1 ---+
// input1    ---+                                    |
//                                                   +--- Mul ---> output
// constant2 ---+                                    |
//              +--- Add ---> intermediateOutput2 ---+
// input2    ---+

// Use tensors in 4 dimensions.
const TENSOR_DIMS = [1, 2, 2, 2];
const TENSOR_SIZE = 8;

const builder = new MLGraphBuilder(context);

// Create MLOperandDescriptor object.
const desc = {dataType: 'float32', dimensions: TENSOR_DIMS};

// constant1 is a constant MLOperand with the value 0.5.
const constantBuffer1 = new Float32Array(TENSOR_SIZE).fill(0.5);
const constant1 = builder.constant(desc, constantBuffer1);

// input1 is one of the input MLOperands. Its value will be set before execution.
const input1 = builder.input('input1', desc);

// constant2 is another constant MLOperand with the value 0.5.
const constantBuffer2 = new Float32Array(TENSOR_SIZE).fill(0.5);
const constant2 = builder.constant(desc, constantBuffer2);

// input2 is another input MLOperand. Its value will be set before execution.
const input2 = builder.input('input2', desc);

// intermediateOutput1 is the output of the first Add operation.
const intermediateOutput1 = builder.add(constant1, input1);

// intermediateOutput2 is the output of the second Add operation.
const intermediateOutput2 = builder.add(constant2, input2);

// output is the output MLOperand of the Mul operation.
const output = builder.mul(intermediateOutput1, intermediateOutput2);

// Compile the constructed graph.
const graph = await builder.build({'output': output});

// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(TENSOR_SIZE).fill(1);
const inputBuffer2 = new Float32Array(TENSOR_SIZE).fill(1);
const outputBuffer = new Float32Array(TENSOR_SIZE);

// Execute the compiled graph with the specified inputs.
const inputs = {
  'input1': inputBuffer1,
  'input2': inputBuffer2,
};
const outputs = {'output': outputBuffer};
const results = await context.compute(graph, inputs, outputs);

console.log('Output value: ' + results.outputs.output);
// Output value: 2.25,2.25,2.25,2.25,2.25,2.25,2.25,2.25
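
The expected output can be checked without WebNN by evaluating the same graph, (constant1 + input1) * (constant2 + input2), in plain JavaScript. This is a reference sketch for verification only; the `add` and `mul` helpers are local stand-ins and not WebNN calls.

```javascript
// Element-wise helpers over equally sized Float32Arrays.
const add = (a, b) => a.map((value, i) => value + b[i]);
const mul = (a, b) => a.map((value, i) => value * b[i]);

const TENSOR_SIZE = 8;
const constant1 = new Float32Array(TENSOR_SIZE).fill(0.5);
const constant2 = new Float32Array(TENSOR_SIZE).fill(0.5);
const input1 = new Float32Array(TENSOR_SIZE).fill(1);
const input2 = new Float32Array(TENSOR_SIZE).fill(1);

// (0.5 + 1) * (0.5 + 1) = 2.25 for every element.
const expected = mul(add(constant1, input1), add(constant2, input2));
console.log('Expected value: ' + expected);
// Expected value: 2.25,2.25,2.25,2.25,2.25,2.25,2.25,2.25
```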
								
							

Integrating WebNN with mainstream JavaScript ML frameworks

Code example: WebNN integration with ONNXRuntime Web

										
import { env, InferenceSession } from "onnxruntime-web";

// ...

// Initialize the ONNX model
const initModel = async () => {
  env.wasm.numThreads = 1; // 4
  env.wasm.simd = true;
  env.wasm.proxy = true;

  const options: InferenceSession.SessionOptions = {
    // provider name: wasm, webnn
    // deviceType: cpu, gpu
    // powerPreference: default, high-performance
    executionProviders:
      [{ name: "wasm" }], // WebAssembly CPU
  };

  // ...
};

const results = await model.run(feeds);
const output = results[model.outputNames[0]];

WebAssembly backend
									
import { env, InferenceSession } from "onnxruntime-web";

// ...

// Initialize the ONNX model
const initModel = async () => {
  env.wasm.numThreads = 1; // 4
  env.wasm.simd = true;
  env.wasm.proxy = true;

  const options: InferenceSession.SessionOptions = {
    // provider name: wasm, webnn
    // deviceType: cpu, gpu
    // powerPreference: default, high-performance
    executionProviders:
      [{ name: "webnn", deviceType: "gpu", powerPreference: "default" }],
  };

  // ...
};

const results = await model.run(feeds);
const output = results[model.outputNames[0]];

WebNN backend
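
The two variants above differ only in the `executionProviders` entry, so the choice can be factored into a small helper. The sketch below is an assumption-based illustration: `buildSessionOptions` is a hypothetical name, and it relies on the understanding that onnxruntime-web tries the listed providers in order, so placing "wasm" last acts as a CPU fallback.

```javascript
// Hypothetical helper that builds onnxruntime-web session options.
// Providers are listed in priority order; "wasm" is always kept last so
// that inference can still run when the WebNN provider is unavailable.
function buildSessionOptions({ useWebNN, deviceType = 'gpu', powerPreference = 'default' }) {
  const executionProviders = [];
  if (useWebNN) {
    executionProviders.push({ name: 'webnn', deviceType, powerPreference });
  }
  executionProviders.push({ name: 'wasm' });
  return { executionProviders };
}
```

The result would be passed as the second argument to `InferenceSession.create`, alongside the application's own model URL.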

WebNN XNNPack/CPU performance data (normalized)

MediaPipe models; higher is better
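
Normalized, higher-is-better figures of this kind are commonly derived by dividing a baseline latency by each backend's latency. The slides do not state the exact method, so the following is only a sketch under that assumption; the function name and input shape are hypothetical, and no measured numbers are implied.

```javascript
// Hypothetical helper: convert per-backend latencies (milliseconds) into
// scores relative to a baseline backend. A score of 2 means "twice as fast
// as the baseline", so higher is better.
function normalizeToBaseline(latenciesMs, baselineName) {
  const baseline = latenciesMs[baselineName];
  if (!(baseline > 0)) {
    throw new Error(`Missing or invalid baseline latency: ${baselineName}`);
  }
  const scores = {};
  for (const [name, latency] of Object.entries(latenciesMs)) {
    scores[name] = baseline / latency;
  }
  return scores;
}
```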

W3C Machine Learning for the Web

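Community Group

Discusses and explores new ideas, and incubates new proposals for machine learning inference

126 participants representing 39 organizations

Working Group

Standardizes Web APIs for machine learning inference, based on proposals incubated in the Community Group

43 participants (including 3 Invited Experts) representing 17 organizations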

Local Peer-to-Peer API

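  • Transfers data, messages, or files over close-range communication
  • Hides the complexity of the underlying peer-to-peer technologies
  • Entered incubation in the W3C Web Incubator CG on November 10, 2023
  • Implementation based on Open Screen, QUICHE, and others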

Thank you!

https://webnn.dev
WebNN discussion group
Belem Zhang (张敏)