This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Add matMul, refactoring. #2

Merged: 24 commits into master from add_op, Apr 13, 2019

Conversation

annxingyuan
Owner

@annxingyuan annxingyuan commented Apr 10, 2019

Additions

  • matMul kernel using shared memory
  • modularization of kernels
  • compileAndRun pipeline added to backend to mimic WebGL

Benchmarks

I benchmarked matMul multiplying two 500 x 500 matrices (code) - here are the results (in ms):

WebGL backend:
Average: 12.25
Min: 10.20

WebGPU tile size 2:
Average: 129.65
Min: 122.66

WebGPU tile size 16:
Average: 40.61
Min: 36.07

Tile size 32 caused the browser to crash.

Questions

  • Do the benchmarking results align with your expectations? I haven't included all the optimizations from the WebGL2 example yet, but I believe this implementation does use shared memory (here's the reference).
  • In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?
  • The WebGL2 example calls memoryBarrierShared() before barrier() - however when I try calling memoryBarrierShared() I no longer get accurate buffer reads - I commented out the line in the kernel here.
  • Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?
  • When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?
  • In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?
  • Are there heuristics for choosing a tile size? (32 caused the browser to crash)
  • Is there an example somewhere of uploading uniforms with WebGPU? I'm using storage buffers for everything in this PR but it probably makes sense to upload things like shared dimension size as a uniform? What are the tradeoffs / things to consider when uploading as uniform vs ssbo?
  • Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.


Collaborator

@nsthorat nsthorat left a comment


Sending some comments, feel free to merge and send another PR to core

Reviewed 17 of 17 files at r1.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @annxingyuan)


src/backend_webgpu.ts, line 1 at r1 (raw file):

import {DataMover, DataType, KernelBackend, Rank, ShapeMap, Tensor, tensor1d, Tensor3D, util} from '@tensorflow/tfjs-core';

make sure you get a license on this and also the rest of the files


src/backend_webgpu.ts, line 27 at r1 (raw file):

  compileOpts: any;

  private binaryCache: {[key: string]: any};

can you type this, even if you have to make custom interfaces?


src/backend_webgpu.ts, line 62 at r1 (raw file):

  private setBufferData(buffer: any, data: Float32Array|Int32Array|Uint8Array) {
    buffer.setSubData(0, data.slice().buffer);

why do you have to slice?


src/backend_webgpu.ts, line 176 at r1 (raw file):

  }

  batchMatMul(

just curious, why do we need batchMatMul? It's already done, so no big deal, but it would be good to prioritize the ops PoseNet needs for now


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

    this.userCode = `
      #version 450
      layout(std140, set = 0, binding = 0) buffer A {

I think it probably makes sense to move this to a utility (analogous to the shader compiler) that automatically splices this in. Otherwise we're gonna have a lot of repetitive code


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

      void main() {
        uint index = gl_GlobalInvocationID.x;

one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.


src/kernels/webgpu_math.ts, line 14 at r1 (raw file):

          source, shaderKind, 'file', 'main', compileOptions);
      const error = result.GetErrorMessage();
      if (error) {

!= null


src/kernels/webgpu_math.ts, line 17 at r1 (raw file):

        console.error(error);
      }
      const code = result.GetBinary().slice(0).buffer;

why a slice(0)?


src/kernels/webgpu_math.ts, line 29 at r1 (raw file):

export function makeShaderKey(program: WebGPUProgram): string {
  const keyUserCode = program.userCode;
  let key = program.constructor.name;

why do you need the constructor name? for now just use the code itself? also make sure you use the user code and not all the headers (to make this faster)

Collaborator

@nsthorat nsthorat left a comment


Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @annxingyuan)


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

    this.userCode = `
      #version 450
      #define TS ${TS}

can this be a uniform so that it's not part of the shader source and cache key?

Contributor

@kainino0x kainino0x left a comment


Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)


src/backend_webgpu.ts, line 62 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why do you have to slice?

This is only really necessary if the ArrayBufferView points at a subrange of the ArrayBuffer.
We're fixing this in Chrome so that you can pass an ArrayBufferView in here instead.
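
A minimal illustration of the distinction (values hypothetical):

```ts
// A TypedArray can view just a subrange of a larger ArrayBuffer.
const heap = new ArrayBuffer(64);
const view = new Float32Array(heap, 16, 4);   // 4 floats at byte offset 16

console.log(view.buffer.byteLength);          // 64 - .buffer is the whole heap
console.log(view.slice().buffer.byteLength);  // 16 - slice() copies only the view
```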


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can this be a uniform so that it's not part of the shader source and cache key?

No, it's part of the layout(...) in declaration just below, which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.

It could be const uint TS = ${TS}; though, which I think works and is slightly nicer.

Might also want to expand it out to TileSize.
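
A sketch of what that could look like in the kernel's userCode, following the suggestions above (the tileSize variable name is illustrative):

```ts
this.userCode = `
  #version 450
  const uint TileSize = ${tileSize};
  layout(local_size_x = TileSize, local_size_y = TileSize) in;
  shared float tileA[TileSize][TileSize];
  shared float tileB[TileSize][TileSize];
  // ... rest of the matMul kernel unchanged ...
`;
```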


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I think it probably makes sense to move this to a utility (analogous to the shader compiler) that automatically splices this in. Otherwise we're gonna have a lot of repetitive code

Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually sets (bind groups), possibly sizes of inputs/outputs.
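
For the record, a minimal sketch of such a utility that ignores those complications (function shape hypothetical, mirroring the declarations in this PR):

```ts
function makeShaderHeader(bufferNames: string[]): string {
  const decls = bufferNames.map((name, i) => `
    layout(std140, set = 0, binding = ${i}) buffer ${name} {
      float data${name}[];
    };`);
  return `#version 450\n${decls.join('\n')}`;
}

// e.g. makeShaderHeader(['A', 'B', 'Out']) produces the three bindings that
// each binary kernel currently writes out by hand.
```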


src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):

      #version 450
      layout(std140, set = 0, binding = 0) buffer A {
        float dataA[];

I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

one of the most important things we did in the WebGL world was make a logical sampling layer. It may feel like a premature thing, but this may speed up op-writing, especially if you decide to do things like squarification.

IIUC, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.


src/kernels/webgpu_math.ts, line 17 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why a slice(0)?

Same reason as above, except this one is strictly necessary. The ArrayBufferView returned by GetBinary() is a view into the entire WASM heap. If you don't slice, then .buffer will be the entire WASM heap.

This is also getting fixed in Chrome.
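
In code terms (just restating the explanation above):

```ts
const binary = result.GetBinary();  // a typed-array view into the entire WASM heap
// binary.buffer would alias the whole heap; slice(0) copies out only the
// SPIR-V words, so .buffer on the copy is exactly the shader binary.
const code = binary.slice(0).buffer;
```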

Contributor

@kainino0x kainino0x left a comment


Thanks for the huge cleanup, looks way better than what I put in! :)

I would expect the performance of tile size = 1 to ~match webgl in the ideal case (and tiles closer to ~4x4 or 4x8 to be faster), but it's really hard to say. We're going to need to look at it a little more deeply to see what is going on. (For example I'm not sure if GPUBuffers are getting reused; if they're not then the allocations are going to hurt a lot.)

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)

Contributor

@kainino0x kainino0x left a comment


:lgtm: (still figuring out how to use reviewable)

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @annxingyuan and @nsthorat)

Collaborator

@nsthorat nsthorat left a comment


Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)


src/kernels/matmul_webgpu.ts, line 16 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

No, it's part of the layout(...) in declaration just below, which must be static. That said, it's unlikely that it will be different for different invocations of an op, so it shouldn't hurt caching.

It could be const uint TS = ${TS}; though, which I think works and is slightly nicer.

Might also want to expand it out to TileSize.

Ah got it - sounds like that won't change much, so this program is agnostic of input dimensions (different shaped matMuls will hit our shader binary cache). Thanks!


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

Probably best to wait on this until we figure out how to abstract the logical sampling layer (see below), IMO. BTW, it will be nontrivial since it needs to deal with variations in number of inputs, number of outputs, binding numbers, eventually sets (bind groups), possibly sizes of inputs/outputs.

in the WebGL world we have information about all the variables and their inputs in the GPGPUProgram here, so we could programmatically splice this together - but it's fine to wait, just a suggestion


src/kernels/multiply_webgpu.ts, line 26 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

IIUC, it would be pretty trivial to do this. But it might require some changes to the way data is fed into the shaders in order to be abstractable into a function, so I'd advocate trying to do that after this patch.

sg!

Contributor

@kainino0x kainino0x left a comment


In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?

I don't have a very good grasp on what happens during warmup in WebGL, and I wouldn't expect it to be necessary in WebGPU. E.g. requestDevice won't resolve until the device is ready (although I don't think this is true today), and pipelines are created ahead of time so they don't incur implicit compilation costs (we will be adding a way to know when pipeline compilation has completed so it won't stall things - but it's not in yet).

The WebGL2 example calls memoryBarrierShared() before barrier() - however when I try calling memoryBarrierShared() I no longer get accurate buffer reads - I commented out the line in the kernel here.

This is strange. memoryBarrierShared(); barrier(); seems like it should work. I would only expect memoryBarrierShared to break things if there were already a race between multiple threads writing into the same location in shared memory. We'll need to look more closely at this (and try it on some other platforms).
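
For reference, the pattern under discussion inside the tile loop (a sketch; the shared-memory loads and the accumulation are elided as comments):

```ts
const tileLoopSketch = `
  for (uint t = 0u; t < numTiles; t++) {
    // each thread loads one element of A and one of B into the shared tiles
    memoryBarrierShared();  // make the shared-memory writes visible
    barrier();              // wait until every thread in the group has written
    // accumulate partial dot products from the shared tiles
    barrier();              // don't overwrite the tiles while others still read
  }`;
```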

Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?

There's currently a limit on the size of setSubData and map calls because we haven't fully implemented it yet. That limit is about 16MB (though it might be somewhat unreliable in that Chromium build).

When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?

Nx1x1, with any tile size, should AFAIK give ~best performance for ops like elementwise multiplication.
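
So an elementwise kernel can use a flat dispatch along x, e.g. (sketch; outputSize and pass are hypothetical names, and workGroupSize must match the shader's local_size_x):

```ts
const workGroupSize = 128;
const numGroups = Math.ceil(outputSize / workGroupSize);
pass.dispatch(numGroups, 1, 1);  // Nx1x1, per the advice above
```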

In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?

IIRC, the only buffer with a map usage, and the only buffer that gets mapped, is the one in getBufferData. So I wouldn't expect anything to be wrong here. This might be a bug on our side.

Are there heuristics for choosing a tile size? (32 caused the browser to crash)

I don't think we have done an in-depth investigation into this. However, there are limits (which we will need to validate but don't currently). In Vulkan,

  • maxComputeWorkGroupInvocations must be at least 128
  • maxComputeWorkGroupSize must be at least (128,128,64)
  • maxComputeSharedMemorySize must be at least 16384

which means if you have X*Y*Z <= 128, X <= 128, Y <= 128, and Z <= 64, you should be safe. These rules are probably true across other backends too.

In Metal, the lowest values across hardware are:

  • Maximum threads per threadgroup: 512
  • Maximum total threadgroup memory allocation: 16352 B

So you'll also need to stay under 16352 bytes of shared memory per threadgroup.
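
Putting those floors together for this PR's matMul (which allocates two TS x TS float tiles of shared memory), a hedged guard might look like:

```ts
function tileSizeWithinGuaranteedLimits(ts: number): boolean {
  const invocations = ts * ts;          // X*Y*Z with Z = 1
  const sharedBytes = 2 * ts * ts * 4;  // tileA + tileB, 4 bytes per float
  return invocations <= 128 && sharedBytes <= 16352;
}
// Note even TS = 16 (256 invocations) exceeds the guaranteed Vulkan minimum,
// so the tile sizes that work here depend on the actual device's limits.
```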

Is there an example somewhere of uploading uniforms with WebGPU? I'm using storage buffers for everything in this PR but it probably makes sense to upload things like shared dimension size as a uniform? What are the tradeoffs / things to consider when uploading as uniform vs ssbo?

Uniforms might be better optimized because they're readonly. Adding a uniform buffer is very similar to a storage buffer. It just needs the GPUBufferUsage.UNIFORM usage, and you bind it as "uniform-buffer" instead of "storage-buffer". Example.
There are only uniform buffers, no individual uniforms.
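
A sketch against the API surface this PR already uses (setSubData and 2019-era usage flags; buffer name and binding index hypothetical):

```ts
// A small uniform buffer holding the shared dimension.
const dimsBuffer = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.TRANSFER_DST,
});
dimsBuffer.setSubData(0, new Uint32Array([sharedDim]).buffer);
// Bind it as 'uniform-buffer' instead of 'storage-buffer'; in GLSL:
//   layout(std140, set = 0, binding = 3) uniform Dims { uint sharedDim; };
```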

Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.

I don't think we have any metrics, and it probably depends on the platform too. I don't think it will hurt too much to ignore this for now.

Reviewable status: 11 of 18 files reviewed, 11 unresolved discussions (waiting on @annxingyuan, @kainino0x, and @nsthorat)


src/kernels/multiply_webgpu.ts, line 15 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

in webgl world we have information about all the variables and their inputs in the GPGPUProgram here so we could programmatically splice this together, but it's fine to wait, just a suggestion

Yeah, I think eventually you'll be able to generate this. Definitely a good idea.

Owner Author

@annxingyuan annxingyuan left a comment


Reviewable status: 11 of 18 files reviewed, 7 unresolved discussions (waiting on @kainino0x and @nsthorat)


src/backend_webgpu.ts, line 1 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

make sure you get a license on this and also the rest of the files

Done


src/backend_webgpu.ts, line 27 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can you type this, even if you have to make custom interfaces?

Done


src/backend_webgpu.ts, line 176 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

just curious, why do we need batchMatMul? It's already done, so no big deal, but it would be good to prioritize the ops PoseNet needs for now

For compatibility with the backend interface.


src/kernels/multiply_webgpu.ts, line 16 at r1 (raw file):

Previously, kainino0x (Kai Ninomiya) wrote…

I'd rename A and dataA here to match the naming convention of the other shader, just for consistency.

Done


src/kernels/webgpu_math.ts, line 14 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

!= null

Done


src/kernels/webgpu_math.ts, line 29 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

why do you need the constructor name? for now just use the code itself? also make sure you use the user code and not all the headers (to make this faster)

Done

@annxingyuan annxingyuan merged commit 8d40819 into master Apr 13, 2019
@annxingyuan annxingyuan deleted the add_op branch April 13, 2019 19:28
@Kangz

Kangz commented Apr 15, 2019

In WebGL we include a warmup run before benchmarking - how should we think about warmup in WebGPU?

I don't have a very good grasp on what happens during warmup in WebGL, and I wouldn't expect it to be necessary in WebGPU. E.g. requestDevice won't resolve until the device is ready (although I don't think this is true today), and pipelines are created ahead of time so they don't incur implicit compilation costs (we will be adding a way to know when pipeline compilation has completed so it won't stall things - but it's not in yet).

If I understand the code correctly, pipeline compilation happens on the first run. If that's the case we should do a warmup run first, and then do the benchmark. Also, the benchmark does a readback after every matrix multiplication, so the cost of readbacks (and previously pipelined operations) is included in the result. To help remove this we could do e.g. 100 matrix multiplications on the GPU per readback (maybe?).
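
In benchmark-loop form, that suggestion might look like (sketch; matMul and data() stand in for this repo's op and readback):

```ts
await matMul(a, b).data();   // warmup: the first run compiles the pipeline
const start = performance.now();
for (let i = 0; i < 99; i++) {
  matMul(a, b);              // enqueue GPU work only - no readback
}
await matMul(a, b).data();   // a single readback at the end flushes the queue
const avgMs = (performance.now() - start) / 100;
```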

@annxingyuan if your laptop has two GPUs I believe WebGPU will currently use the integrated one while WebGL will use the discrete one, leading to the performance difference. We'll see about fixing WebGPU to use the discrete one for the next Chromium build.

Anything larger than 600x600 matMul caused the browser to crash - are there limits to input sizes we should be aware of?

There's currently a limit on the size of setSubData and map calls because we haven't fully implemented it yet. That limit is about 16MB (though it might be somewhat unreliable in that Chromium build).

This should be possible to work around by making multiple setSubData calls. This limitation should disappear eventually.
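
A sketch of that workaround (chunk size hypothetical; buffer left as any, matching this PR's typings):

```ts
const CHUNK_BYTES = 15 * 1024 * 1024;  // stay safely under the ~16MB limit
function setBufferDataChunked(buffer: any, data: Float32Array) {
  const bytes = new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  for (let offset = 0; offset < bytes.byteLength; offset += CHUNK_BYTES) {
    buffer.setSubData(offset, bytes.slice(offset, offset + CHUNK_BYTES).buffer);
  }
}
```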

When setting the dimensions of the dispatch, should we in general be trying to squarify the thread groups for performance? What would be the optimal dispatch dimensions for something like elementwise multiplication?

Nx1x1, with any tile size, should AFAIK give ~best performance for ops like elementwise multiplication.

+1, because cache lines would map onto Nx1x1 pieces of the tensor themselves.

In the browser I get Buffer does not have map usage error after the second matMul run and then with every subsequent run - do you see any issues with my benchmarking code that could explain this?

IIRC, the only buffer with a map usage, and the only buffer that gets mapped, is the one in getBufferData. So I wouldn't expect anything to be wrong here. This might be a bug on our side.

I commented on the other PR that info.buffer.unmap shouldn't be needed. I believe that's what causes the validation error.

Corentin - from our discussion over email it sounded like there's not much reason to share command encoders, but am I correct to say that we should try to batch calls to submit on the device queue for performance? If so, do you have any metrics to characterize the performance benefits we could expect from batching? Everything in TF.js is eager right now, so I just want to understand how to prioritize adding support for something like a graph mode.

I don't think we have any metrics, and it probably depends on the platform too. I don't think it will hurt too much to ignore this for now.

The easy way to do fake eager mode would be to have a backend-global command buffer and encode commands in it as they come in. Then when a readback is requested you can flush all pending operations with a call to submit(). Hopefully that's much easier to implement than a full retained mode.
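
A minimal sketch of that scheme (assuming the 2019 getQueue()/finish() API; the class shape is hypothetical):

```ts
class LazySubmitBackend {
  private encoder: any;
  constructor(private device: any) {
    this.encoder = device.createCommandEncoder({});
  }
  // Ops record into the shared encoder instead of submitting one-by-one.
  enqueue(record: (encoder: any) => void) {
    record(this.encoder);
  }
  // A readback flushes everything recorded so far with a single submit().
  flushBeforeRead() {
    this.device.getQueue().submit([this.encoder.finish()]);
    this.encoder = this.device.createCommandEncoder({});
  }
}
```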

