This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Dynamic custom operator GPU support #17270

Merged: 16 commits, Jan 31, 2020

@rondogency (Contributor) commented Jan 11, 2020

Description

Add GPU support for custom operators.
This is a continuation of the custom operators project; initial CPU support was implemented in #15921.

Design

Working backward from the user, here is an example of a custom operator forward function for GPU.

Notice that the function interface is the same as for a CPU operator. The input and output tensors are already on the GPU, so you don't need to memcpy them to the GPU.

You need the CUDA stream from OpResource to launch your CUDA kernel on the correct GPU. mx_stream_t is defined as cudaStream_t when you compile the code with NVCC.

MXReturnValue forwardGPU(std::map<std::string, std::string> attrs,
                         std::vector<MXTensor> inputs,
                         std::vector<MXTensor> outputs,
                         OpResource res) {
    ... 
    mx_stream_t cuda_stream = res.get_cuda_stream();
    gpu_forward<<<grid, block, 0, cuda_stream>>>(out_data, in_data, N);
    return MX_SUCCESS;
}
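The `grid` and `block` arguments in the launch above are elided in the PR description. As a minimal host-side sketch of one common way to choose them, assuming a 1-D element-wise kernel that handles one element per thread: `LaunchDims`, `launch_dims`, and the default of 256 threads per block are illustrative assumptions and are not part of lib_api.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of MXNet): 1-D launch dimensions for an
// element-wise kernel that processes one element per thread.
struct LaunchDims {
    int64_t grid;   // number of thread blocks
    int64_t block;  // threads per block
};

inline LaunchDims launch_dims(int64_t n, int64_t threads_per_block = 256) {
    assert(n >= 0 && threads_per_block > 0);
    // Ceiling division so that a partially filled final block covers the tail.
    int64_t grid = (n + threads_per_block - 1) / threads_per_block;
    return LaunchDims{grid, threads_per_block};
}
```

Because the last block may be only partly used, the kernel still needs a bounds check such as `if (i < N)` before it touches memory. For N = 0 the helper returns a grid of 0, which CUDA rejects as an invalid launch configuration, so the launch should be skipped in that case.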

The user then registers a single operator with both CPU and GPU computation logic by specifying the device type in the register function. This registration works for any context because the user only passes in a string.

REGISTER_OP(my_op)
.setForward(forwardCPU, "cpu")
.setForward(forwardGPU, "gpu")
.setBackward(backwardCPU, "cpu")
.setBackward(backwardGPU, "gpu");
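To illustrate why passing the device type as a string keeps registration open to any context, here is a minimal self-contained sketch of a builder-style registry keyed by that string. It is not the lib_api.h implementation: `OpRegistration`, `Status`, `ForwardFn`, and the toy forward functions are hypothetical stand-ins chosen only to show the chaining and dispatch pattern.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real types are defined in MXNet's lib_api.h.
enum class Status { Fail, Success };
using ForwardFn = Status (*)(std::string* ran_on);

// Sketch of a registration object that stores one function per device string.
class OpRegistration {
 public:
    explicit OpRegistration(std::string name) : name_(std::move(name)) {}

    // Returning *this is what allows chained .setForward(...) calls.
    OpRegistration& setForward(ForwardFn fn, const std::string& ctx) {
        assert(fn != nullptr);
        forward_[ctx] = fn;
        return *this;
    }

    // Run the function registered for `ctx`; fail if none was registered.
    Status forward(const std::string& ctx, std::string* ran_on) const {
        auto it = forward_.find(ctx);
        if (it == forward_.end()) return Status::Fail;
        return it->second(ran_on);
    }

    const std::string& name() const { return name_; }

 private:
    std::string name_;
    std::map<std::string, ForwardFn> forward_;
};

// Toy implementations that only record which device path was taken.
inline Status forwardCPU(std::string* ran_on) { *ran_on = "cpu"; return Status::Success; }
inline Status forwardGPU(std::string* ran_on) { *ran_on = "gpu"; return Status::Success; }
```

With this shape, supporting another device type would mean registering another string key rather than changing the registration interface, which matches the motivation stated above.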

The user then compiles the library much like a CPU operator library. Python usage is exactly the same as for CPU custom operators:

import os
import mxnet as mx
mx.library.load(os.path.abspath('libmy_op_lib.so'))
a = mx.nd.array([[1,2,3],[4,5,6]], ctx=mx.gpu())
b = mx.nd.array([[7],[8],[9]], ctx=mx.gpu())
mx.nd.my_op(a,b)

This PR does not:

  • Support placing the CPU function in one library and the GPU function in another. Both CPU and GPU functions have to be in the same library.
  • Make inferShape and inferType context aware
  • Support dynamic loading of custom contexts

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add Fcompute registration, and pass NDArray context to custom library in c_api.cc
  • Add context info to MXTensor class in lib_api.h
  • Add lib_custom_op/relu.cu example file containing the full registration of the custom operator "my_relu", with both CPU and GPU kernel functions in that file
  • Modify lib_custom_op/Makefile to compile the .cu file into the custom library using nvcc
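As a rough sketch of what carrying context information on each tensor makes possible, the following checks that all tensors handed to an operator live on the same device before any kernel is launched. `DeviceContext`, `TensorView`, and `same_device` are simplified, hypothetical stand-ins written for this illustration; they are not the actual MXTensor definition in lib_api.h.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a context-carrying tensor.
struct DeviceContext {
    std::string dev_type;  // e.g. "cpu" or "gpu"
    int dev_id;            // device index within that type
};

struct TensorView {
    void* data;                  // device or host pointer, not dereferenced here
    std::vector<int64_t> shape;  // tensor dimensions
    DeviceContext ctx;           // where `data` lives
};

// True when every tensor reports the same device type and id.
// An empty list is trivially consistent.
inline bool same_device(const std::vector<TensorView>& tensors) {
    for (const TensorView& t : tensors) {
        assert(!t.ctx.dev_type.empty());  // the context must have been filled in
        if (t.ctx.dev_type != tensors.front().ctx.dev_type ||
            t.ctx.dev_id != tensors.front().ctx.dev_id) {
            return false;
        }
    }
    return true;
}
```

A forward function could run a check like this on its inputs and outputs and return a failure status instead of launching a kernel on mismatched devices.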

Comments

@eric-haibin-lin (Member) left a comment

Some interfaces require more discussion.

@rondogency (Contributor, Author) commented

@ptrendx thanks for your comments! I have resolved them, and I would appreciate it if you could take another quick look and approve this.

@samskalicky (Contributor) commented

@rondogency looks like the Windows build/test is working now with those CMake changes:

test_operator_gpu.test_custom_op_gpu ...
MXNet version 10600 supported
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:286: Found 2 operators in library
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:350: 	Op[0] my_relu
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:350: 	Op[1] my_state_relu
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:785: Found 0 partitioners in library
ok (0.6834s)

Now we just need to work through the flaky tests

@wkcn (Member) left a comment

Great! LGTM. Thank you!

@wkcn added the pr-awaiting-merge label (Review and CI is complete. Ready to Merge) and removed the pr-awaiting-review label (PR is waiting for code review) on Jan 31, 2020
@rondogency (Contributor, Author) commented

@wkcn Thanks! Can you help me to merge it?

@wkcn merged commit a726c40 into apache:master on Jan 31, 2020
@wkcn (Member) commented Jan 31, 2020

Merged. : )

@rondogency deleted the custom_op_gpu branch on February 3, 2020 23:25
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* poc gpu customop end to end

* add backward and device id

* clear up customop makefile

* new fcomp register

* new setforward to pass custom context to c_api

* resolve sam comment: add cond register and fix setforward char

* tmp stateful op

* passing ctx of stateful op

* add gpu alloc and refactor all fcomp

* resolve sam comments and refactor alloc

* add gpu check to pass cpu build

* add unittest and resolve ptrend comments

* add cmake and jenkins

* fix windows

* windows gpu cmake build fix

* remove verbose
leezu pushed a commit that referenced this pull request Apr 8, 2020
Add random number generator support for custom operator libraries.

Design: MXNet passes its initialized and seeded random number generator states, located on CPU and GPU, to the custom library. The user can then use those seeded states to generate deterministic values from a given seed passed to MXNet. The workflow is:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not make the custom library generate exactly the same sequence of random numbers as MXNet.

This is a continuation of the custom operator project #15921 and #17270
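The reproducibility property in the workflow above can be sketched in isolation. The following is a minimal illustration that uses the C++ standard library's `std::mt19937` as a stand-in generator; it does not use MXNet's random states, and `custom_random_op` is a hypothetical function that takes a raw seed purely for demonstration, whereas the design above passes already-seeded states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for a custom random operator: fills `out` with `n`
// values drawn from a generator constructed from the given seed.
inline void custom_random_op(uint32_t seed, std::size_t n,
                             std::vector<uint32_t>* out) {
    assert(out != nullptr);
    std::mt19937 gen(seed);  // same seed => same internal state
    out->clear();
    out->reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out->push_back(static_cast<uint32_t>(gen()));
    }
}
```

Calling the function twice with the same seed yields identical vectors, which is the `r1 == r2` property the commit message describes; matching MXNet's own random sequence is a separate matter that the message explicitly leaves out of scope.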
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Apr 15, 2020
pengzhao-intel pushed a commit that referenced this pull request Apr 16, 2020
…18069)

* Dynamic subgraph compile support (#17623)

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above

* Adding sparse support to MXTensor for custom operators (#17569)

* Added enum for sparse storage

* Add structure for Dense and Sparse

* redesign the data structure for MXSparse

* pull out aux data from sparse NDArray

* Added more sparse arguments to API interface

* Passed sparse from c_api to lib_api.h and set in MXTensor

* Fix indent

* fix segfault

* Fix NDArray to MXTensor errors

* Add a sample of sparse(CSR) transpose

* Make CSR transpose temporarily work by hardcoding

* Fixed sparse output size(Refined)

* Add tests for symbolic and stateful ops

* Added a sample for row sparse transpose

* Added real row sparse transpose

* Fix output size issue by adding lambda for CheckAndAlloc()

* Fix mixed storage formats error

* Added infer storage type function

* resolve comments

* Set inferSType as optional function

* Resolve comments

* Add error messages

* Resolve comments

* verify transpose ops results

* fix sanity check

* update MX_LIBRARY_VERSION to 5

* Custom Operator Random Number Generator Support (#17762)

Add random number generator support for custom operator libraries.

Design: MXNet passes its initialized and seeded random number generator states, located on CPU and GPU, to the custom library. The user can then use those seeded states to generate deterministic values from a given seed passed to MXNet. The workflow is:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not make the custom library generate exactly the same sequence of random numbers as MXNet.

This is a continuation of the custom operator project #15921 and #17270

Co-authored-by: guanxinq <58794120+guanxinq@users.noreply.github.com>
Co-authored-by: Ziyi Mu <ziyi.mu@columbia.edu>
pengzhao-intel pushed a commit that referenced this pull request Apr 16, 2020
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
6 participants