-
Notifications
You must be signed in to change notification settings - Fork 1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending some comments, feel free to merge and send another PR to core
Reviewed 17 of 17 files at r1.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @annxingyuan)
src/backend_webgpu.ts, line 1 at r1 (raw file):
import {DataMover, DataType, KernelBackend, Rank, ShapeMap, Tensor, tensor1d, Tensor3D, util} from '@tensorflow/tfjs-core';
make sure you get a license on this and also the rest of the files
src/backend_webgpu.ts, line 27 at r1 (raw file):
compileOpts: any; private binaryCache: {[key: string]: any};
can you type this, even if you have to make custom interfaces?
src/backend_webgpu.ts, line 62 at r1 (raw file):
private setBufferData(buffer: any, data: Float32Array|Int32Array|Uint8Array) { buffer.setSubData(0, data.slice().buffer);
why do you have to slice?
src/backend_webgpu.ts, line 176 at r1 (raw file):
} batchMatMul(
just curious, why do we need batchmatmul? it's done so nbd, but would be good to just sort by ops needed for posenet for now
src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):
this.userCode = ` #version 450 layout(std140, set = 0, binding = 0) buffer A {
I think it probably makes sense to move this to a utility (analogous to shader compiler) that automatically splices this in. otherwise we're gonna have a lot of repetitive code
src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):
void main() { uint index = gl_GlobalInvocationID.x;
one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.
src/kernels/webgpu_math.ts, line 14 at r1 (raw file):
source, shaderKind, 'file', 'main', compileOptions); const error = result.GetErrorMessage(); if (error) {
!= null
src/kernels/webgpu_math.ts, line 17 at r1 (raw file):
console.error(error); } const code = result.GetBinary().slice(0).buffer;
why a slice(0)?
src/kernels/webgpu_math.ts, line 29 at r1 (raw file):
export function makeShaderKey(program: WebGPUProgram): string { const keyUserCode = program.userCode; let key = program.constructor.name;
why do you need the constructor name? for now just user the code itself? also make sure you use user code and not all the headers (to make this faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @annxingyuan)
src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):
this.userCode = ` #version 450 #define TS ${TS}
can this be a uniform so that it's not part of the shader source and cache key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)
src/backend_webgpu.ts, line 62 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
why do you have to slice?
This is only really necessary if the ArrayBufferView points at a subrange of the ArrayBuffer.
We're fixing this in Chrome so that you can pass an ArrayBufferView in here instead.
src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
can this be a uniform so that it's not part of the shader source and cache key?
No, it's part of the layout (...) in
declaration just below which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.
It could be const uint TS = ${TS};
though, which I think works and is slightly nicer.
Might also want to expand it out to TileSize
.
src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
I think it probably makes sense to move this to a utility (analogous to shader compiler) that automatically splices this in. otherwise we're gonna have a lot of repetitive code
Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually set
s (bind groups), possibly sizes of inputs/outputs.
src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):
#version 450 layout(std140, set = 0, binding = 0) buffer A { float dataA[];
I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.
src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.
IIU, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.
src/kernels/webgpu_math.ts, line 17 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
why a slice(0)?
Same reason as above, except this one is strictly necessary. The ArrayBufferView returned by GetBinary() is a view into the entire WASM heap. If you don't slice, then .buffer will be the entire WASM heap.
This is also getting fixed in Chrome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the huge cleanup, looks way better than what I put in! :)
I would expect the performance of tile size = 1 to ~match webgl in the ideal case (and tiles closer to ~4x4 or 4x8 to be faster), but it's really hard to say. We're going to need to look at it a little more deeply to see what is going on. (For example I'm not sure if GPUBuffers are getting reused; if they're not then the allocations are going to hurt a lot.)
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(still figuring out how to use reviewable)
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)
src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):
Previously, kainino0x (Kai Ninomiya) wrote…
No, it's part of the
layout (...) in
declaration just below which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.It could be
const uint TS = ${TS};
though, which I think works and is slightly nicer.Might also want to expand it out to
TileSize
.
Ah got it - sounds like that won't change much so this program is agnostic of input dimensions (so different shaped matmuls will hit our shader binary cache). Thanks!
src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):
Previously, kainino0x (Kai Ninomiya) wrote…
Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually
set
s (bind groups), possibly sizes of inputs/outputs.
in webgl world we have information about all the variables and their inputs in the GPGPUProgram here so we could programmatically splice this together, but it's fine to wait, just a suggestion
src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):
Previously, kainino0x (Kai Ninomiya) wrote…
IIU, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.
sg!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?
I don't have a very good grasp on what happens during warmup in WebGL. I wouldn't expect it to be necessary in WebGPU. E.g. requestDevice won't resolve until it's ready (although I don't think this is true today), pipelines are created ahead of time so they don't incur implicit compilation times (we will be adding a way to know when pipeline compilation is completed and won't stall things - but it's not in yet).
The WebGL2 example calls memoryBarrierShared() before barrier() - however when I try calling memoryBarrierShared() I no longer get accurate buffer reads - I commented out the line in the kernel here.
This is strange. memoryBarrierShared(); barrier();
seems it should work. I would only expect memoryBarrierShared to break things if there were already a race between multiple threads writing into the same location in shared memory. We'll need to look more closely at this (and try it on some other platforms).
Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?
There's currently a limit on the size of setSubData and map calls because we haven't fully implemented it yet. That limit is about 16MB (though it might be somewhat unreliable in that Chromium build).
When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?
Nx1x1, with any tilesize, should AFAIK give ~best performance for ops like elementwise multiplication.
In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?
IIRC, the only buffer with a map usage, and the only buffer that gets mapped, is the one in getBufferData. So I wouldn't expect anything to be wrong here. This might be a bug on our side.
Are there heuristics for choosing a tile size? (32 caused the browser to crash)
I don't think we have done an in-depth investigation into this. However, there are limits (which we will need to validate but don't currently). In Vulkan,
maxComputeWorkGroupInvocations
must be at least 128maxComputeWorkGroupSize
must be at least (128,128,64)maxComputeSharedMemorySize
must be at least 16384
which means if you have X*Y*Z <= 128
, X <= 128
, Y < 128
, and Z < 64
, you should be safe. These rules are probably true across other backends too.
In Metal, the lowest values across hardware are:
- Maximum threads per threadgroup: 512
- Maximum total threadgroup memory allocation: 16352 B
So you'll also need to stay under 16352 bytes of shared memory per threadgroup.
Is there an example somewhere of uploading uniforms with WebGPU? I'm using storage buffers for everything in this PR but it probably makes sense to upload things like shared dimension size as a uniform? What are the tradeoffs / things to consider when uploading as uniform vs ssbo?
Uniforms might be better optimized because they're readonly. Adding a uniform buffer is very similar to a storage buffer. It just needs the GPUBufferUsage.UNIFORM
usage, and you bind it as "uniform-buffer"
instead of "storage-buffer"
. Example.
There are only uniform buffers, no individual uniforms.
Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.
I don't think we have any metrics, and it probably depends on the platform too. I don't think it will hurt too much to ignore this for now.
Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)
src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
in webgl world we have information about all the variables and their inputs in the GPGPUProgram here so we could programmatically splice this together, but it's fine to wait, just a suggestion
Yeah, I think eventually you'll be able to generate this. Definitely a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 11 of 18 files reviewed, 7 unresolved discussions (waiting on @kainino0x and @nsthorat)
src/backend_webgpu.ts, line 1 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
make sure you get a license on this and also the rest of the files
Done
src/backend_webgpu.ts, line 27 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
can you type this, even if you have to make custom interfaces?
Done
src/backend_webgpu.ts, line 176 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
just curious, why do we need batchmatmul? it's done so nbd, but would be good to just sort by ops needed for posenet for now
For compatibility with the backend interface.
src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):
Previously, kainino0x (Kai Ninomiya) wrote…
I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.
Done
src/kernels/webgpu_math.ts, line 14 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
!= null
Done
src/kernels/webgpu_math.ts, line 29 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
why do you need the constructor name? for now just user the code itself? also make sure you use user code and not all the headers (to make this faster)
Done
If I understand the code correctly pipeline compilation happens on the first run. If that's the case we should do a warmup run first, and then do the benchmark. Also the benchmark does a readback after every matrix multiplication. This means the cost of readbacks (and previously pipelined operations) are taken into account in the result. To help remove this we could do 100 matrix multiplications on the GPU (maybe?). @annxingyuan if your laptop has two GPUs I believe WebGPU will currently use the integrated one while WebGL will use the discrete one, leading to the performance difference. We'll see about fixing WebGPU to use the discrete one for the next Chromium build.
This should be possible to work around by making multiple setSubData calls. This limitation should disappear eventually.
+1 because cachelines would be looking like Nx1x1 pieces of the tensor themselves.
I commented on the other PR that
The easy way to do fake eager mode would be to have a backend-global command buffer and encode commands in it as they come in. Then when a readback is requested you can flush all pending operations with a call to
|
Additions
compileAndRun
pipeline added to backend to mimic WebGLBenchmarks
I benchmarked matMul on input shapes 500 x 500 and 500 x 500 (code) - here are the results (in ms):
WebGL backend:
Average:
12.25
Min:
10.20
WebGPU tile size 2:
Average:
129.65
Min:
122.66
WebGPU tile size 16:
Average:
40.61
Min:
36.07
Tile size 32 caused the browser to crash.
Questions
memoryBarrierShared()
beforebarrier()
- however when I try callingmemoryBarrierShared()
I no longer get accurate buffer reads - I commented out the line in the kernel here.Buffer does not have map usage
error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?submit
on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.This change is