This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Add matMul, refactoring. #2

Merged: 24 commits into master from add_op, Apr 13, 2019

Conversation

annxingyuan
Owner

@annxingyuan annxingyuan commented Apr 10, 2019

Additions

  • matMul kernel using shared memory
  • modularization of kernels
  • compileAndRun pipeline added to backend to mimic WebGL

Benchmarks

I benchmarked matMul multiplying two 500 x 500 matrices (code) - here are the results (in ms):

WebGL backend:
Average: 12.25
Min: 10.20

WebGPU tile size 2:
Average: 129.65
Min: 122.66

WebGPU tile size 16:
Average: 40.61
Min: 36.07

Tile size 32 caused the browser to crash.

Questions

  • Do the benchmarking results align with your expectations? I haven't included all the optimizations from the WebGL2 example yet, but I believe this implementation does use shared memory (here's the reference).
  • In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?
  • The WebGL2 example calls memoryBarrierShared() before barrier() - however when I try calling memoryBarrierShared() I no longer get accurate buffer reads - I commented out the line in the kernel here.
  • Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?
  • When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?
  • In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?
  • Are there heuristics for choosing a tile size? (32 caused the browser to crash)
  • Is there an example somewhere of uploading uniforms with WebGPU? I'm using storage buffers for everything in this PR but it probably makes sense to upload things like shared dimension size as a uniform? What are the tradeoffs / things to consider when uploading as uniform vs ssbo?
  • Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.


Collaborator

@nsthorat nsthorat left a comment


Sending some comments, feel free to merge and send another PR to core

Reviewed 17 of 17 files at r1.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @annxingyuan)


src/backend_webgpu.ts, line 1 at r1 (raw file):

import {DataMover, DataType, KernelBackend, Rank, ShapeMap, Tensor, tensor1d, Tensor3D, util} from '@tensorflow/tfjs-core';

make sure you get a license on this and also the rest of the files


src/backend_webgpu.ts, line 27 at r1 (raw file):

  compileOpts: any;

  private binaryCache: {[key: string]: any};

can you type this, even if you have to make custom interfaces?


src/backend_webgpu.ts, line 62 at r1 (raw file):

  private setBufferData(buffer: any, data: Float32Array|Int32Array|Uint8Array) {
    buffer.setSubData(0, data.slice().buffer);

why do you have to slice?


src/backend_webgpu.ts, line 176 at r1 (raw file):

  }

  batchMatMul(

just curious, why do we need batchMatMul? It's already done, so no big deal, but it would be good to prioritize the ops PoseNet needs for now


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

    this.userCode = `
      #version 450
      layout(std140, set = 0, binding = 0) buffer A {

I think it probably makes sense to move this to a utility (analogous to the shader compiler) that automatically splices this in. Otherwise we're gonna have a lot of repetitive code


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

      void main() {
        uint index = gl_GlobalInvocationID.x;

one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.


src/kernels/webgpu_math.ts, line 14 at r1 (raw file):

          source, shaderKind, 'file', 'main', compileOptions);
      const error = result.GetErrorMessage();
      if (error) {

!= null


src/kernels/webgpu_math.ts, line 17 at r1 (raw file):

        console.error(error);
      }
      const code = result.GetBinary().slice(0).buffer;

why a slice(0)?


src/kernels/webgpu_math.ts, line 29 at r1 (raw file):

export function makeShaderKey(program: WebGPUProgram): string {
  const keyUserCode = program.userCode;
  let key = program.constructor.name;

why do you need the constructor name? for now just use the code itself? also make sure you use the user code and not all the headers (to make this faster)

Collaborator

@nsthorat nsthorat left a comment


Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @annxingyuan)


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

    this.userCode = `
      #version 450
      #define TS ${TS}

can this be a uniform so that it's not part of the shader source and cache key?

Contributor

@kainino0x kainino0x left a comment


Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)


src/backend_webgpu.ts, line 62 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why do you have to slice?

This is only really necessary if the ArrayBufferView points at a subrange of the ArrayBuffer.
We're fixing this in Chrome so that you can pass an ArrayBufferView in here instead.
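
A minimal illustration of the distinction (values hypothetical):

```ts
// A TypedArray can view just a subrange of a larger ArrayBuffer.
const heap = new ArrayBuffer(64);
const view = new Float32Array(heap, 16, 4);   // 4 floats at byte offset 16

console.log(view.buffer.byteLength);          // 64 - .buffer is the whole heap
console.log(view.slice().buffer.byteLength);  // 16 - slice() copies only the view
```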


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can this be a uniform so that it's not part of the shader source and cache key?

No, it's part of the layout(...) in declaration just below, which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.

It could be const uint TS = ${TS}; though, which I think works and is slightly nicer.

Might also want to expand it out to TileSize.
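
A sketch of what that could look like in the kernel's userCode, following the suggestions above (the tileSize variable name is illustrative):

```ts
this.userCode = `
  #version 450
  const uint TileSize = ${tileSize};
  layout(local_size_x = TileSize, local_size_y = TileSize) in;
  shared float tileA[TileSize][TileSize];
  shared float tileB[TileSize][TileSize];
  // ... rest of the matMul kernel unchanged ...
`;
```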


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I think it probably makes sense to move this to a utility (analogous to the shader compiler) that automatically splices this in. Otherwise we're gonna have a lot of repetitive code

Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually sets (bind groups), possibly sizes of inputs/outputs.
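
For the record, a minimal sketch of such a utility that ignores those complications (function shape hypothetical, mirroring the declarations in this PR):

```ts
function makeShaderHeader(bufferNames: string[]): string {
  const decls = bufferNames.map((name, i) => `
    layout(std140, set = 0, binding = ${i}) buffer ${name} {
      float data${name}[];
    };`);
  return `#version 450\n${decls.join('\n')}`;
}

// e.g. makeShaderHeader(['A', 'B', 'Out']) produces the three bindings that
// each binary kernel currently writes out by hand.
```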


src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):

      #version 450
      layout(std140, set = 0, binding = 0) buffer A {
        float dataA[];

I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.

IIUC, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.


src/kernels/webgpu_math.ts, line 17 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why a slice(0)?

Same reason as above, except this one is strictly necessary. The ArrayBufferView returned by GetBinary() is a view into the entire WASM heap. If you don't slice, then .buffer will be the entire WASM heap.

This is also getting fixed in Chrome.
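
In code terms (just restating the explanation above):

```ts
const binary = result.GetBinary();  // a typed-array view into the entire WASM heap
// binary.buffer would alias the whole heap; slice(0) copies out only the
// SPIR-V words, so .buffer on the copy is exactly the shader binary.
const code = binary.slice(0).buffer;
```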

Contributor

@kainino0x kainino0x left a comment


Thanks for the huge cleanup, looks way better than what I put in! :)

I would expect the performance of tile size = 1 to ~match webgl in the ideal case (and tiles closer to ~4x4 or 4x8 to be faster), but it's really hard to say. We're going to need to look at it a little more deeply to see what is going on. (For example I'm not sure if GPUBuffers are getting reused; if they're not then the allocations are going to hurt a lot.)

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)

Contributor

@kainino0x kainino0x left a comment


:lgtm: (still figuring out how to use reviewable)

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)

Collaborator

@nsthorat nsthorat left a comment


Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

No, it's part of the layout(...) in declaration just below, which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.

It could be const uint TS = ${TS}; though, which I think works and is slightly nicer.

Might also want to expand it out to TileSize.

Ah got it - sounds like that won't change much, so this program is agnostic of input dimensions (different shaped matMuls will hit our shader binary cache). Thanks!


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually sets (bind groups), possibly sizes of inputs/outputs.

in the WebGL world we have information about all the variables and their inputs in the GPGPUProgram here, so we could programmatically splice this together - but it's fine to wait, just a suggestion


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

IIUC, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.

sg!

Contributor

@kainino0x kainino0x left a comment


In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?

I don't have a very good grasp on what happens during warmup in WebGL, and I wouldn't expect it to be necessary in WebGPU. E.g. requestDevice won't resolve until the device is ready (although I don't think this is true today), and pipelines are created ahead of time so they don't incur implicit compilation costs (we will be adding a way to know when pipeline compilation has completed so it won't stall things - but it's not in yet).

The WebGL2 example calls memoryBarrierShared() before barrier() - however when I try calling memoryBarrierShared() I no longer get accurate buffer reads - I commented out the line in the kernel here.

This is strange. memoryBarrierShared(); barrier(); seems like it should work. I would only expect memoryBarrierShared to break things if there were already a race between multiple threads writing into the same location in shared memory. We'll need to look more closely at this (and try it on some other platforms).
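
For reference, the pattern under discussion inside the tile loop (a sketch; the shared-memory loads and the accumulation are elided as comments):

```ts
const tileLoopSketch = `
  for (uint t = 0u; t < numTiles; t++) {
    // each thread loads one element of A and one of B into the shared tiles
    memoryBarrierShared();  // make the shared-memory writes visible
    barrier();              // wait until every thread in the group has written
    // accumulate partial dot products from the shared tiles
    barrier();              // don't overwrite the tiles while others still read
  }`;
```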

Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?

There's currently a limit on the size of setSubData and map calls because we haven't fully implemented it yet. That limit is about 16MB (though it might be somewhat unreliable in that Chromium build).

When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?

Nx1x1, with any tile size, should AFAIK give ~best performance for ops like elementwise multiplication.
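
So an elementwise kernel can use a flat dispatch along x, e.g. (sketch; outputSize and pass are hypothetical names, and workGroupSize must match the shader's local_size_x):

```ts
const workGroupSize = 128;
const numGroups = Math.ceil(outputSize / workGroupSize);
pass.dispatch(numGroups, 1, 1);  // Nx1x1, per the advice above
```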

In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?

IIRC, the only buffer with a map usage, and the only buffer that gets mapped, is the one in getBufferData. So I wouldn't expect anything to be wrong here. This might be a bug on our side.

Are there heuristics for choosing a tile size? (32 caused the browser to crash)

I don't think we have done an in-depth investigation into this. However, there are limits (which we will need to validate but don't currently). In Vulkan,

  • maxComputeWorkGroupInvocations must be at least 128
  • maxComputeWorkGroupSize must be at least (128,128,64)
  • maxComputeSharedMemorySize must be at least 16384

which means if you have X*Y*Z <= 128, X <= 128, Y <= 128, and Z <= 64, you should be safe. These rules are probably true across other backends too.

In Metal, the lowest values across hardware are:

  • Maximum threads per threadgroup: 512
  • Maximum total threadgroup memory allocation: 16352 B

So you'll also need to stay under 16352 bytes of shared memory per threadgroup.
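
Putting those floors together for this PR's matMul (which allocates two TS x TS float tiles of shared memory), a hedged guard might look like:

```ts
function tileSizeWithinGuaranteedLimits(ts: number): boolean {
  const invocations = ts * ts;          // X*Y*Z with Z = 1
  const sharedBytes = 2 * ts * ts * 4;  // tileA + tileB, 4 bytes per float
  return invocations <= 128 && sharedBytes <= 16352;
}
// Note even TS = 16 (256 invocations) exceeds the guaranteed Vulkan minimum,
// so the tile sizes that work here depend on the actual device's limits.
```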

Is there an example somewhere of uploading uniforms with WebGPU? I'm using storage buffers for everything in this PR but it probably makes sense to upload things like shared dimension size as a uniform? What are the tradeoffs / things to consider when uploading as uniform vs ssbo?

Uniforms might be better optimized because they're readonly. Adding a uniform buffer is very similar to a storage buffer. It just needs the GPUBufferUsage.UNIFORM usage, and you bind it as "uniform-buffer" instead of "storage-buffer". Example.
There are only uniform buffers, no individual uniforms.
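
A sketch against the API surface this PR already uses (setSubData and 2019-era usage flags; buffer name and binding index hypothetical):

```ts
// A small uniform buffer holding the shared dimension.
const dimsBuffer = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.TRANSFER_DST,
});
dimsBuffer.setSubData(0, new Uint32Array([sharedDim]).buffer);
// Bind it as 'uniform-buffer' instead of 'storage-buffer'; in GLSL:
//   layout(std140, set = 0, binding = 3) uniform Dims { uint sharedDim; };
```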

Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.

I don't think we have any metrics, and it probably depends on the platform too. I don't think it will hurt too much to ignore this for now.

Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

in webgl world we have information about all the variables and their inputs in the GPGPUProgram here so we could programmatically splice this together, but it's fine to wait, just a suggestion

Yeah, I think eventually you'll be able to generate this. Definitely a good idea.

Owner Author

@annxingyuan annxingyuan left a comment


Reviewable status: 11 of 18 files reviewed, 7 unresolved discussions (waiting on @kainino0x and @nsthorat)


src/backend_webgpu.ts, line 1 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

make sure you get a license on this and also the rest of the files

Done


src/backend_webgpu.ts, line 27 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can you type this, even if you have to make custom interfaces?

Done


src/backend_webgpu.ts, line 176 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

just curious, why do we need batchMatMul? It's already done, so no big deal, but it would be good to prioritize the ops PoseNet needs for now

For compatibility with the backend interface.


src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.

Done


src/kernels/webgpu_math.ts, line 14 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

!= null

Done


src/kernels/webgpu_math.ts, line 29 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why do you need the constructor name? for now just use the code itself? also make sure you use the user code and not all the headers (to make this faster)

Done

@annxingyuan annxingyuan merged commit 8d40819 into master Apr 13, 2019
@annxingyuan annxingyuan deleted the add_op branch April 13, 2019 19:28
@Kangz

Kangz commented Apr 15, 2019

In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?

I don't have a very good grasp on what happens during warmup in WebGL, and I wouldn't expect it to be necessary in WebGPU. E.g. requestDevice won't resolve until the device is ready (although I don't think this is true today), and pipelines are created ahead of time so they don't incur implicit compilation costs (we will be adding a way to know when pipeline compilation has completed so it won't stall things - but it's not in yet).

If I understand the code correctly, pipeline compilation happens on the first run. If that's the case we should do a warmup run first, and then do the benchmark. Also, the benchmark does a readback after every matrix multiplication, so the cost of readbacks (and previously pipelined operations) is included in the result. To help remove this we could do e.g. 100 matrix multiplications on the GPU per readback (maybe?).
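
In benchmark-loop form, that suggestion might look like (sketch; matMul and data() stand in for this repo's op and readback):

```ts
await matMul(a, b).data();   // warmup: the first run compiles the pipeline
const start = performance.now();
for (let i = 0; i < 99; i++) {
  matMul(a, b);              // enqueue GPU work only - no readback
}
await matMul(a, b).data();   // a single readback at the end flushes the queue
const avgMs = (performance.now() - start) / 100;
```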

@annxingyuan if your laptop has two GPUs I believe WebGPU will currently use the integrated one while WebGL will use the discrete one, leading to the performance difference. We'll see about fixing WebGPU to use the discrete one for the next Chromium build.

Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?

There's currently a limit on the size of setSubData and map calls because we haven't fully implemented it yet. That limit is about 16MB (though it might be somewhat unreliable in that Chromium build).

This should be possible to work around by making multiple setSubData calls. This limitation should disappear eventually.
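
A sketch of that workaround (chunk size hypothetical; buffer left as any, matching this PR's typings):

```ts
const CHUNK_BYTES = 15 * 1024 * 1024;  // stay safely under the ~16MB limit
function setBufferDataChunked(buffer: any, data: Float32Array) {
  const bytes = new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  for (let offset = 0; offset < bytes.byteLength; offset += CHUNK_BYTES) {
    buffer.setSubData(offset, bytes.slice(offset, offset + CHUNK_BYTES).buffer);
  }
}
```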

When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?

Nx1x1, with any tile size, should AFAIK give ~best performance for ops like elementwise multiplication.

+1, because cache lines would map onto Nx1x1 pieces of the tensor themselves.

In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?

IIRC, the only buffer with a map usage, and the only buffer that gets mapped, is the one in getBufferData. So I wouldn't expect anything to be wrong here. This might be a bug on our side.

I commented on the other PR that info.buffer.unmap shouldn't be needed. I believe that's what causes the validation error.

Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.

I don't think we have any metrics, and it probably depends on the platform too. I don't think it will hurt too much to ignore this for now.

The easy way to do fake eager mode would be to have a backend-global command buffer and encode commands in it as they come in. Then when a readback is requested you can flush all pending operations with a call to submit(). Hopefully that's much easier to implement than a full retained mode.
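
A minimal sketch of that scheme (assuming the 2019 getQueue()/finish() API; the class shape is hypothetical):

```ts
class LazySubmitBackend {
  private encoder: any;
  constructor(private device: any) {
    this.encoder = device.createCommandEncoder({});
  }
  // Ops record into the shared encoder instead of submitting one-by-one.
  enqueue(record: (encoder: any) => void) {
    record(this.encoder);
  }
  // A readback flushes everything recorded so far with a single submit().
  flushBeforeRead() {
    this.device.getQueue().submit([this.encoder.finish()]);
    this.encoder = this.device.createCommandEncoder({});
  }
}
```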

