Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Simplify creation of NodeEntry instances and use emplace_back #14095

Merged
merged 4 commits into from
May 23, 2019

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Feb 8, 2019

Description

Reduce copying of shared_ptr. In a microbenchmark is 10% faster. But the main motivation is simplifying creation of nnvm graph on operators.

Optimize move semantics of NodeEntry
/~https://github.com/dmlc/tvm/pull/2576
Making copies of shared_ptr is more expensive than moving.
This PR reduces lock contention by using move semantics in NNVM nodes
making also more convenient to construct NodeEntry classes in the code
due to the added ctors.

Update NDarray with NodeEntry constructors and refine initializer lists.

Sync gradient.cc with tvm

imperative.cc will be addressed in a different refactor.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@larroy larroy requested a review from szha as a code owner February 8, 2019 12:57
@vandanavk
Copy link
Contributor

@mxnet-label-bot add [pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Feb 8, 2019
@ptrendx
Copy link
Member

ptrendx commented Feb 8, 2019

Does this change improve the time of scheduling launches to the engine?

@larroy
Copy link
Contributor Author

larroy commented Feb 8, 2019

@ptrendx thanks for your comment, would still need to measure

@larroy larroy changed the title [Don't merge] Optimize move semantics of NodeEntry [Don't merge][Review] Optimize move semantics of NodeEntry Feb 11, 2019
@larroy larroy force-pushed the node_ptr branch 2 times, most recently from b4f1e57 to ca7dde0 Compare February 13, 2019 23:16
@abhinavs95
Copy link
Contributor

@larroy Could you please fix the CI failures?

@piyushghai
Copy link
Contributor

@larroy Is this PR still WIP ?

@larroy larroy requested a review from anirudh2290 as a code owner April 11, 2019 01:22
@larroy larroy changed the title [Don't merge][Review] Optimize move semantics of NodeEntry Optimize move semantics of NodeEntry Apr 11, 2019
@larroy larroy force-pushed the node_ptr branch 2 times, most recently from 43b24ab to 9046556 Compare April 13, 2019 01:50
@larroy larroy changed the title Optimize move semantics of NodeEntry [Don't merge] Optimize move semantics of NodeEntry Apr 13, 2019
@larroy larroy force-pushed the node_ptr branch 2 times, most recently from 966fb9e to 287c7e3 Compare April 17, 2019 01:25
@larroy
Copy link
Contributor Author

larroy commented Apr 17, 2019

@ptrendx suggestions on measurements?

@larroy larroy force-pushed the node_ptr branch 2 times, most recently from aab4834 to 1b93548 Compare April 19, 2019 20:53
.gitmodules Outdated Show resolved Hide resolved
dev_menu.py Outdated Show resolved Hide resolved
Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch. Could you please report the difference that this change is making through experiment?

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

@szha what would you suggest?

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

@szha the initial motivation is to lean out node creation and for correctness since there are difficult to catch bugs depending on Node initialization.

Copy link
Contributor

@apeforest apeforest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering my question. LGTM!

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

@szha Performance is the same, so motivation is correctness and ease of graph building:

Tested in a tight loop:

TEST(NodeTestX, NodeTest) {
    using namespace nnvm;
    using namespace std;
    using namespace std::chrono;
    vector<nnvm::NodeEntry> v;
    nnvm::NodePtr ng = nnvm::Node::Create();
    ng->attrs.op = Op::Get("_zeros_without_dtype");
    ng->attrs.name = "zeros_without_dtype";
#if 0
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    for(size_t i = 0; i < 10000000; ++i) {
        v.push_back(NodeEntry{ng, 0, 0});
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << duration;
#endif
#if 1
    auto t1 = high_resolution_clock::now();
    for(size_t i = 0; i < 10000000; ++i) {
        v.emplace_back(ng, 0, 0);
    }
    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << duration;
#endif
}
piotr@ec2 cpu:0: ~/mxnet [master]> build/tests/mxnet_unit_tests --gtest_filter="NodeTest*"

Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1659863[       OK ] NodeTestX.NodeTest (1918 ms)
[----------] 1 test from NodeTestX (1918 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1919 ms total)
[  PASSED  ] 1 test.

Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1717020[       OK ] NodeTestX.NodeTest (1985 ms)
[----------] 1 test from NodeTestX (1985 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1985 ms total)
[  PASSED  ] 1 test.

Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says it's optimization so I'm expecting some performance difference. I'm not sure what you mean by "correctness" or "ease". Was it not working as expected? If so, is there a test for it?

The "ease of building graph" is probably easier for people to appreciate. It would be great if the change in recommended way of constructing nodes could be documented somewhere.

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

Moving a shared pointer is faster than copying. When using move in the microbenchmark is 10% faster. Overall is not much and in the big picture won't make a difference, but these small inefficiencies compound together and make the code in general slower in a way that you can't profile as it's scatered all over.

What changes exactly are you suggesting? to the title of the PR? is not clear to me, please help clarify.

It can also cause contention between threads due to the atomic lock.

Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1655650[       OK ] NodeTestX.NodeTest (1804 ms)
[----------] 1 test from NodeTestX (1804 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1804 ms total)
[  PASSED  ] 1 test.

@szha
Copy link
Member

szha commented May 20, 2019

The information in the PR suggests this patch is premature optimization as it's not making any difference in speed. If any of the symptoms can be captured in a test, please do so. Otherwise, if it's a usability improvement, then documenting the suggested use would be helpful.

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

I just provided a microbenchmark where you can see it's 10% faster. Usually in C++ we try to avoid doing more work than necessary such as copying a shared_ptr when is not necessary. Do you disagree with that? It creates inefficiencies compounded that don't show in a profiler. It also makes it easier and less verbose to add to an nnvm graph.

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

I don't think is premature optimization, is generally agreed that is better to emplace_back rather than push_back and create additional copies, it also makes less verbose to add nodes. Why you call it a premature optimization? What exact changes you are suggesting?

@szha
Copy link
Member

szha commented May 20, 2019

I'm worried that it would be premature because of the lack of the measurable improvements in the problem space what mxnet concerns.

@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

Would you be willing to merge since it simplifies creating graphs on operators if you don't find the performance improvement justified?

#14992
#14613

@larroy larroy changed the title Optimize move semantics of NodeEntry Simplify creation of NodeEntry instances and use emplace_back May 20, 2019
@larroy
Copy link
Contributor Author

larroy commented May 20, 2019

@szha I changed the title according to your request, are we good to merge?

@szha
Copy link
Member

szha commented May 21, 2019

Thanks for documenting the change. This PR needs one more rebase.

larroy added 2 commits May 21, 2019 14:16
apache/tvm#2576
Making copies of shared_ptr is more expensive than moving.
This PR reduces lock contention by using move semantics in NNVM nodes
making also more convenient to construct NodeEntry classes in the code
due to the added ctors.

Update NDarray with NodeEntry constructors and refine initializer lists.

Sync gradient.cc with tvm
@larroy
Copy link
Contributor Author

larroy commented May 22, 2019

@szha Any more comments, or are we ready to go?

@szha szha merged commit 038b9fb into apache:master May 23, 2019
@larroy larroy deleted the node_ptr branch May 25, 2019 01:58
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…#14095)

* Optimize move semantics of NodeEntry
apache/tvm#2576
Making copies of shared_ptr is more expensive than moving.
This PR reduces lock contention by using move semantics in NNVM nodes
making also more convenient to construct NodeEntry classes in the code
due to the added ctors.

Update NDarray with NodeEntry constructors and refine initializer lists.

Sync gradient.cc with tvm

* Remove additional calls to NodeEntry in emplace_back

* refine patch

* Fix lint
@larroy larroy mentioned this pull request Jul 18, 2019
7 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backend Issues related to the backend of MXNet pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants