Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for duplicate subgraph inputs/outputs #16131

Merged
merged 34 commits into from
Sep 10, 2020

Conversation

samskalicky
Copy link
Contributor

@samskalicky samskalicky commented Sep 9, 2019

Description

Fixes #16108

Subgraph API can create duplicated tensors for both inputs to a subgraph and outputs from a subgraph. This happens when there are multiple nodes in a subgraph that consume the same input, or when there are multiple nodes that consume a single subgraph output. This results in tensor duplication, and causes OOM due to excessive memory usage.

Fix for duplicate inputs

Currently, inputs to a subgraph are duplicated when there are multiple nodes in a subgraph consuming the same input (see diagram below). In this PR we change the behavior of CutGraphInputs function to store a NodeEntry for each input to the subgraph in a map, and re-use that node for each node in the subgraph that consumes that input (see diagram below). This prevents input nodes from being duplicated, and results in less copies of tensors at runtime and a much smaller memory footprint.

duplicateInputs

Fix for duplicate outputs

Similar to the inputs, when an output from a subgraph is consumed by multiple nodes outside the subgraph, the output is duplicated for each consumer (see diagram below). In this PR we change the behavior of CreateSubgraphNode to collapse duplicate output dependencies to a single output by checking if neighboring output entries are the same. If so we do not add another output, preventing duplicates.

Then in the ConnectSubgraphOutputs function we create dependencies from each node outside the subgraph that consumes a subgraph output. If neighboring nodes share the same subgraph output, they point to the same subgraph output.

duplicateOutputs

Example

Added an example graph with duplicate inputs/outputs to the test_subgraph_op.py. Heres the original graph partitioned before this PR. Notice that it has two input nodes: data0 and data1 that both come from the same top level data. Similarly, notice that it has two outputs coming from the same node _plus0.

{
  "nodes": [
    {
      "op": "null", 
      "name": "data", 
      "attrs": {"__shape__": "(1,)"}, 
      "inputs": []
    }, 
    {
      "op": "_CachedOp", 
      "name": "_CachedOp0", 
      "inputs": [[0, 0, 0], [0, 0, 0]], 
      "subgraphs": [
        {
          "nodes": [
            {
              "op": "null", 
              "name": "data0", 
              "inputs": []
            }, 
            {
              "op": "sin", 
              "name": "sin0", 
              "inputs": [[0, 0, 0]]
            }, 
            {
              "op": "null", 
              "name": "data1", 
              "inputs": []
            }, 
            {
              "op": "sin", 
              "name": "sin1", 
              "inputs": [[2, 0, 0]]
            }, 
            {
              "op": "elemwise_add", 
              "name": "_plus0", 
              "inputs": [[1, 0, 0], [3, 0, 0]]
            }
          ], 
          "arg_nodes": [0, 2], 
          "node_row_ptr": [0, 1, 2, 3, 4, 5], 
          "heads": [[4, 0, 0], [4, 0, 0]]
        }
      ]
    }, 
    {
      "op": "cos", 
      "name": "cos0", 
      "inputs": [[1, 0, 0]]
    }, 
    {
      "op": "cos", 
      "name": "cos1", 
      "inputs": [[1, 1, 0]]
    }, 
    {
      "op": "elemwise_sub", 
      "name": "_minus0", 
      "inputs": [[2, 0, 0], [3, 0, 0]]
    }
  ], 
  "arg_nodes": [0], 
  "node_row_ptr": [0, 1, 3, 4, 5, 6], 
  "heads": [[4, 0, 0]], 
  "attrs": {"mxnet_version": ["int", 10700]}
}

Heres the same graph partitioned after this PR. Notice that now there is only 1 input node data0 and both sin ops use it. Also notice that now there is only 1 output from the subgraph, and both cos ops use it.

{
  "nodes": [
    {
      "op": "null", 
      "name": "data", 
      "attrs": {
        "__profiler_scope__": "<unk>:", 
        "__shape__": "(1,)"
      }, 
      "inputs": []
    }, 
    {
      "op": "_CachedOp", 
      "name": "_CachedOp0", 
      "inputs": [[0, 0, 0]], 
      "subgraphs": [
        {
          "nodes": [
            {
              "op": "null", 
              "name": "data0", 
              "inputs": []
            }, 
            {
              "op": "sin", 
              "name": "sin0", 
              "attrs": {"__profiler_scope__": "<unk>:"}, 
              "inputs": [[0, 0, 0]]
            }, 
            {
              "op": "sin", 
              "name": "sin1", 
              "attrs": {"__profiler_scope__": "<unk>:"}, 
              "inputs": [[0, 0, 0]]
            }, 
            {
              "op": "elemwise_add", 
              "name": "_plus0", 
              "attrs": {"__profiler_scope__": "<unk>:"}, 
              "inputs": [[1, 0, 0], [2, 0, 0]]
            }
          ], 
          "arg_nodes": [0], 
          "node_row_ptr": [0, 1, 2, 3, 4], 
          "heads": [[3, 0, 0]]
        }
      ]
    }, 
    {
      "op": "cos", 
      "name": "cos0", 
      "attrs": {"__profiler_scope__": "<unk>:"}, 
      "inputs": [[1, 0, 0]]
    }, 
    {
      "op": "cos", 
      "name": "cos1", 
      "attrs": {"__profiler_scope__": "<unk>:"}, 
      "inputs": [[1, 0, 0]]
    }, 
    {
      "op": "elemwise_sub", 
      "name": "_minus0", 
      "attrs": {"__profiler_scope__": "<unk>:"}, 
      "inputs": [[2, 0, 0], [3, 0, 0]]
    }
  ], 
  "arg_nodes": [0], 
  "node_row_ptr": [0, 1, 2, 3, 4, 5], 
  "heads": [[4, 0, 0]], 
  "attrs": {"mxnet_version": ["int", 20000]}
}

@samskalicky
Copy link
Contributor Author

@zheng-da @reminisce could you please review?

@samskalicky
Copy link
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Sep 10, 2019
@samskalicky
Copy link
Contributor Author

@ZhennanQin for review. Any suggestions on how to better fix the problem?

@samskalicky samskalicky changed the title [WIP] fix for duplicate subgraph inputs/outputs Fix for duplicate subgraph inputs/outputs Sep 11, 2019
@samskalicky
Copy link
Contributor Author

samskalicky commented Sep 17, 2019

@ZhennanQin ive made the changes, do they still look good? If so we can start going through and updating all other subgraph properties.

@ZhennanQin
Copy link
Contributor

LGTM. Please coordinate all backends with new ConnectSubgraphOutputs().

@samskalicky samskalicky changed the title Fix for duplicate subgraph inputs/outputs Fix for duplicate subgraph inputs/outputs Sep 18, 2019
@samskalicky samskalicky changed the title Fix for duplicate subgraph inputs/outputs [WIP] Fix for duplicate subgraph inputs/outputs Sep 18, 2019
@samskalicky samskalicky changed the title [WIP] Fix for duplicate subgraph inputs/outputs Fix for duplicate subgraph inputs/outputs Sep 18, 2019
@DickJC123
Copy link
Contributor

I've noticed that MXNet does not preserve what I'm calling 'output independence' during it's memory planning (issue forthcoming). While I'm talking here about the model as a whole, I've noticed many graph passes are applied to subgraphs as well, so perhaps the issue is relevent to this PR. Any update on status here?

@samskalicky
Copy link
Contributor Author

Hi @DickJC123 sorry I havent been able to prioritize finishing off this PR. IMO the problem is similar and endemic in mxnet but not related to this PR. Please do file an issue at the very least for posterity.

@samskalicky
Copy link
Contributor Author

samskalicky commented Feb 15, 2020

@mseth10 looks like we made an aux an arg somehow, thats why the MKLDNN tests are failing:

in_args_map arg_names arg_names
data data data
conv1_weight conv1_weight conv1_weight
bn1_gamma bn1_gamma bn1_gamma
bn1_beta bn1_beta bn1_beta
conv2_weight conv2_weight conv2_weight
bn2_gamma bn2_gamma bn2_gamma
bn2_beta bn2_beta bn2_beta
bn2_moving_mean

So this is why the error says

MXNetError: Check failed: arg_names.size() == in_args_map.size() (8 vs. 7) : 

Theres one extra name in arg_names for bn2_moving_mean

@HahTK HahTK mentioned this pull request Aug 18, 2020
@szha szha added this to the v1.8.0 milestone Aug 23, 2020
@@ -537,10 +537,11 @@ void FindOutputEntries(nnvm::Graph* g,
*/
void CutGraphInputs(const std::vector<nnvm::NodeEntry*> &input_entries,
std::vector<nnvm::NodeEntry> *orig_entries,
std::vector<nnvm::NodeEntry> *unique_inputs,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an alternative implementation that does not need t track the actual unique inputs separately.
A counter would allow us to correctly modify orig_entries into unique_entries.
This keeps the function signature unchanged and minimizes changes elsewhere.

It is already working in a private build. I add it here for consideration if desired

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @HahTK, but we cant modify orig_entries since we need it unmodified for ReattachGraphInputs:
/~https://github.com/apache/incubator-mxnet/blob/e249d71ad4621afcba7f0f3af77095a3f9a4bc83/src/operator/subgraph/build_subgraph.cc#L578
this is used to reject a subgraph from the reviewSubgraph API:
/~https://github.com/apache/incubator-mxnet/blob/e249d71ad4621afcba7f0f3af77095a3f9a4bc83/src/operator/subgraph/build_subgraph.cc#L657
the example you provided doesnt work in the latest version of MXNet where you can reject a subgraph by returning nullptr from CreateSubgraphNode here: /~https://github.com/apache/incubator-mxnet/blob/e249d71ad4621afcba7f0f3af77095a3f9a4bc83/src/operator/subgraph/build_subgraph.cc#L630-L633

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change was tested on 1.5.1 and it seems like newer versions of MXNet have new features that needs orig_entries to be preserved. So acknowledged !

Copy link
Contributor

@waytrue17 waytrue17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

Copy link
Contributor

@mseth10 mseth10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix.

@samskalicky samskalicky merged commit 6b01dc2 into apache:master Sep 10, 2020
samskalicky added a commit to samskalicky/incubator-mxnet that referenced this pull request Sep 10, 2020
* fix for duplicate inputs

* fixed error

* fixed whitespace

* Remove duplicate outputs from subgraphs

* changed subgraph to create map of outputs

* added static_cast

* changed map<int,v> to vector

* sanity fix

* sanity2

* updated backends with new connectSubgraphOutputs API

* fixed map creation logic

* added updates for reattach function

* creating node only if it is not an input to subgraph

* creating object based on var_name only

* updating ConnectSubgraphOutputs for mkldnn_elemwisemul_post_quantize_property.h

* add debug prints to debug error in CI

* remove prints

* added prints to debug in the CI

* revert changes

* reverted changes

* deduplicaated inputs to subgraph

* deduplicated subgraph inputs

* simplified inputs

* cleaned up

* deduplicate outputs

* cleand up

* added deduplication to subgraph node outputs

* fixed prev compare

* fixed issue with inputs and added test

* fixd whitespace, removed prints

Co-authored-by: Sam Skalicky <sskalic@amazon.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-72.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-92-239.ec2.internal>
Co-authored-by: Manu Seth <22492939+mseth10@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-220.us-west-2.compute.internal>
@samskalicky
Copy link
Contributor Author

@mseth10 looks like we made an aux an arg somehow, thats why the MKLDNN tests are failing:

in_args_map arg_names arg_names
data data data
conv1_weight conv1_weight conv1_weight
bn1_gamma bn1_gamma bn1_gamma
bn1_beta bn1_beta bn1_beta
conv2_weight conv2_weight conv2_weight
bn2_gamma bn2_gamma bn2_gamma
bn2_beta bn2_beta bn2_beta
bn2_moving_mean
So this is why the error says

MXNetError: Check failed: arg_names.size() == in_args_map.size() (8 vs. 7) : 

Theres one extra name in arg_names for bn2_moving_mean

For the record @mseth10 i finally found the issue that caused this "making an aux an arg" problem: #19112 (comment)

samskalicky added a commit that referenced this pull request Sep 15, 2020
…9112)

* Fix for duplicate subgraph inputs/outputs (#16131)

* fix for duplicate inputs

* fixed error

* fixed whitespace

* Remove duplicate outputs from subgraphs

* changed subgraph to create map of outputs

* added static_cast

* changed map<int,v> to vector

* sanity fix

* sanity2

* updated backends with new connectSubgraphOutputs API

* fixed map creation logic

* added updates for reattach function

* creating node only if it is not an input to subgraph

* creating object based on var_name only

* updating ConnectSubgraphOutputs for mkldnn_elemwisemul_post_quantize_property.h

* add debug prints to debug error in CI

* remove prints

* added prints to debug in the CI

* revert changes

* reverted changes

* deduplicaated inputs to subgraph

* deduplicated subgraph inputs

* simplified inputs

* cleaned up

* deduplicate outputs

* cleand up

* added deduplication to subgraph node outputs

* fixed prev compare

* fixed issue with inputs and added test

* fixd whitespace, removed prints

Co-authored-by: Sam Skalicky <sskalic@amazon.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-72.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-92-239.ec2.internal>
Co-authored-by: Manu Seth <22492939+mseth10@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-220.us-west-2.compute.internal>

* added flag to enable dedupe ondemand

* fixed dedup logic

* improved dedup logic

* fixed sanity

* propogated option

* check option in custom subgraph prop

* fixed options map

* fixed missing

* added dedup to subgraph_prop base class for testing

* added test for dedup

* added comments

Co-authored-by: Sam Skalicky <sskalic@amazon.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-72.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-92-239.ec2.internal>
Co-authored-by: Manu Seth <22492939+mseth10@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-220.us-west-2.compute.internal>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subgraph API creates duplicate inputs and outputs
8 participants