[RFC] Unified API for Distributed Data Parallel Training #16795
Comments
In the Limitation section, I suppose you meant 'use case 1, 3, 4', right?
I did mean use cases 2, 3, 4. I have not discussed problem 1 in much detail. Horovod uses mpirun to set up connections and launch processes, while BytePS/P3 and the native kvstore currently use the …
Would it make sense to add optional support for sparse ndarrays and gradient compression in …
I do expect the API to change in the future. Currently @szhengac, @zhongyuchen and I are exploring APIs for gradient compression with a few algorithms, and we may bring the best practices back to MXNet.
…pe of kvstore backend (#17555)
* Add Byteps backend for kvstore
* Add a temp launcher for byteps backend
* make the test fit for byteps kvstore.
* final workable test
* Remove trashy print and logs
* correct comment
* add hostfile for ci test
* add ci test for byteps kvstore
* add visibile devices for byteps-kvstore ci test
* add licenses for tools/byteps_launcher.py
* syntax error
* pylint error (remove unused import like logging)
* pylint error
* pylint error
* enable launching without hostfile (local byteps)
* 1. rename byteps_kvstore.py to byteps.py; 2. shorten the launch option to ; 3. add instruction for -H and -SH options for launch; 4. add documentation for byteps kvstore in kvstore/base.py: create(name='local')
* edit documentation of KVStoreBase::is_capable(capability); reture fasle for BytePS(KVStoreBase):is_capable(any).
* pylint error
* remove an error of arg.byteps
* use --env option to set workers' environment
* error in byteps-launcher.py
* remove the unpurposed editing mistake in runtime_functions.sh
* disable cpu support for byteps kvstore.
* 1. format the document to avoid julia doc build error; 2. little change to nightly test; 3. add byteps copy right declararation in byteps_launcher.py 4. if args.byteps == True ===> if args.byteps
* remove the --scheduler_ip and --scheduler_port options in launch.py
* 1. maintain the origin value of broadcast and pushpull 2. optimize when out = value or [out]=value 3. add some missing documentation to avoid doc building error.
* Add bytePS to CI
* add dependency
* +integrationtest_ubuntu_gpu_byteps
* add byteps pipeline
* disable a few tests
* remove more tests
* fix permission
* remove apt-get
* fix python path
* improve logging
* fix printns
* add back CI

Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-16.ec2.internal>
Co-authored-by: Piyush Ghai <ghai.8@osu.edu>
Co-authored-by: eric-haibin-lin <linhaibin.eric@gmail.com>
Co-authored-by: eric-haibin-lin <--global>
Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com>
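For readers skimming the thread, the interface these commit messages refer to can be pictured roughly as follows. This is only a minimal sketch in plain Python and does not use MXNet: the method names `broadcast`, `pushpull`, `is_capable` and the factory `create(name=...)` are taken from the commit message above, while the class `ToyKVStore`, the `"toy"` backend name, the argument shapes, and the in-place behaviour when `out` is omitted are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Tensor = List[float]  # stand-in for an ndarray; an assumption of this sketch


class ToyKVStore:
    """Hypothetical single-process stand-in for a kvstore backend."""

    def __init__(self) -> None:
        self._store: Dict[str, Tensor] = {}

    @staticmethod
    def is_capable(capability: str) -> bool:
        # Like the BytePS backend described above, report no optional capability.
        return False

    def broadcast(self, key: str, value: Tensor, out: List[Tensor]) -> None:
        # Record the initial value and copy it into every output buffer.
        self._store[key] = list(value)
        for target in out:
            target[:] = self._store[key]

    def pushpull(self, key: str, value: List[Tensor],
                 out: Optional[List[Tensor]] = None) -> None:
        # Sum the per-device tensors element-wise before writing anything.
        summed = [sum(column) for column in zip(*value)]
        self._store[key] = summed
        # Assumed behaviour: with no `out`, results overwrite `value` in place;
        # with `out`, the inputs in `value` are left untouched.
        targets = value if out is None else out
        for target in targets:
            target[:] = summed


_BACKENDS = {"toy": ToyKVStore}  # hypothetical registry


def create(name: str = "toy") -> ToyKVStore:
    # Look up a backend by name, in the spirit of create(name='local').
    return _BACKENDS[name]()


if __name__ == "__main__":
    kv = create("toy")
    g0, g1 = [1.0, 2.0], [3.0, 4.0]
    kv.pushpull("w", [g0, g1])
    print(g0, g1)
```

Training code written against such an interface only needs `create`, `broadcast`, and `pushpull`, which is what would let the launcher and communication backend be swapped without touching the training loop.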
…xnet as new type of kvstore backend (apache#17555)" This reverts commit c244f9f.
* [MXNET-#16795] Byteps-KVStore: Intergrate Byteps into mxnet as new type of kvstore backend (#17555)
* Add Byteps backend for kvstore
* Add a temp launcher for byteps backend
* make the test fit for byteps kvstore.
* final workable test
* Remove trashy print and logs
* correct comment
* add hostfile for ci test
* add ci test for byteps kvstore
* add visibile devices for byteps-kvstore ci test
* add licenses for tools/byteps_launcher.py
* syntax error
* pylint error (remove unused import like logging)
* pylint error
* pylint error
* enable launching without hostfile (local byteps)
* 1. rename byteps_kvstore.py to byteps.py; 2. shorten the launch option to ; 3. add instruction for -H and -SH options for launch; 4. add documentation for byteps kvstore in kvstore/base.py: create(name='local')
* edit documentation of KVStoreBase::is_capable(capability); reture fasle for BytePS(KVStoreBase):is_capable(any).
* pylint error
* remove an error of arg.byteps
* use --env option to set workers' environment
* error in byteps-launcher.py
* remove the unpurposed editing mistake in runtime_functions.sh
* disable cpu support for byteps kvstore.
* 1. format the document to avoid julia doc build error; 2. little change to nightly test; 3. add byteps copy right declararation in byteps_launcher.py 4. if args.byteps == True ===> if args.byteps
* remove the --scheduler_ip and --scheduler_port options in launch.py
* 1. maintain the origin value of broadcast and pushpull 2. optimize when out = value or [out]=value 3. add some missing documentation to avoid doc building error.
* Add bytePS to CI
* add dependency
* +integrationtest_ubuntu_gpu_byteps
* add byteps pipeline
* disable a few tests
* remove more tests
* fix permission
* remove apt-get
* fix python path
* improve logging
* fix printns
* add back CI

Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-16.ec2.internal>
Co-authored-by: Piyush Ghai <ghai.8@osu.edu>
Co-authored-by: eric-haibin-lin <linhaibin.eric@gmail.com>
Co-authored-by: eric-haibin-lin <--global>
Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com>

* fix byteps logging and declare tensor
* check exceptions and return -1
* print logging in CI
* Update byteps.py
* Update runtime_functions.sh
* add numa dependency
* pin dependency
* Update runtime_functions.sh
* Update Dockerfile.build.ubuntu
* Update runtime_functions.sh
* Update runtime_functions.sh
* Update runtime_functions.sh
* Update runtime_functions.sh
* Update Jenkins_steps.groovy
* remove launcher. use bpslauncher instead.

Co-authored-by: Chaokun Chang <33217209+ChaokunChang@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-16.ec2.internal>
Co-authored-by: Piyush Ghai <ghai.8@osu.edu>
Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-37-108.ec2.internal>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-164.ec2.internal>
commit a77f774ed179786fc8429d913a2da1d942528de9 Author: Leonard Lausen <lausen@amazon.com> Date: Fri Jul 17 05:01:17 2020 +0000 Remove NNPACK integration (#18722) commit 3ef00b8840c05c49118705f6fd9663ebb951f3a1 Author: Andrei Ivanov <andrey.ivanov@gmail.com> Date: Thu Jul 16 16:57:58 2020 -0700 Refactoring of Pooled Storage Manager classes (#18582) * Refactoring of Pooled Storage Manager classes * Adding test for new functionality * Fixing compilation problems which appear for MXNET_USE_CUDA=0 * Fixing compilation problems for WINDOWS and ANDROID * Fixing compilation problems which appear for WINDOWS and __APPLE__ * Fixing lint problems * test_dataloader_context(): Bypassing custom_dev_id pinned mem test on system with GPUs < 2. * Fixing compilation for Android. Elimination of unused includes. * Fixing problems with CPUPinned Storage Manager which appears when MXNET_USE_CUDA = 0 * Removing test_bucketing.py * Imroving CPU_Pinned Pooled Storage Manager case. * Fixing lint problem * The GPU profiling commands calls moved into mutex area * Fixing lint problem * Improved reporting regarding the Storage Manager used. 
* Fixing lint problem * Trigger CI * Removing some comments, as suggested by @szha * Trigger CI * Trigger CI Co-authored-by: andreii <andreii@nvidia.com> commit 2abf0b8c2b3361c73c9dfdeabdb8a88278b693d0 Author: Leonard Lausen <lausen@amazon.com> Date: Thu Jul 16 17:41:22 2020 +0000 Initialize docker cache in build.py for docker-compose containers (#18724) commit 37bdf0bf981d11a89bd248b02f473211d57bc9c6 Author: JackieWu <wkcn@live.cn> Date: Fri Jul 17 01:25:01 2020 +0800 [MXNET-1453] Support the intput whose dimension is greater than 6 for Transpose and Rollaxis (#18707) * support 6+ dims for transpose * test over * reorder code * fix transposeex commit 8198442f0c7bde0fc47f507c3f81a0b5cf0a5235 Author: AntiZpvoh <59728467+AntiZpvoh@users.noreply.github.com> Date: Thu Jul 16 15:01:59 2020 +0800 [numpy] symbolic advanced indexing (#18319) * add ndarray and boolean indexing for numpy symbol * fix sanity and unit test * ensure consistency between the imperative and symbolic interface * Update python/mxnet/numpy/multiarray.py and add new test Co-authored-by: Leonard Lausen <leonard@lausen.nl> * Don't rely on indexing_key_expand_implicit_axes for deciding if _npi.advanced_indexing_multiple is applicable * fix sanity Co-authored-by: Leonard Lausen <lausen@amazon.com> commit 690132516a0a99337625248772fd44930686a82b Author: 蔡舒起 <867907127@qq.com> Date: Thu Jul 16 10:12:20 2020 +0800 Add the newest mxnet discuss version. Add d2l.ai (#18663) * Add the newest mxnet discuss version. 
Add d2l.ai * delete [] and insert old version commit e2366e9102e6862416bf998af52baaa5e9c0a31b Author: Leonard Lausen <lausen@amazon.com> Date: Wed Jul 15 22:01:36 2020 +0000 Refactor scope functionality in Python API (#18619) * Refactor scope functionality in Python API - Remove deprecated metaclass functionality - Remove global state in naming - Switch from threading.local to asyncio compatible contextvars - Stop exposing UUIDs in parameter name * Fix dependencies * Fixes * Fixes * Fix * Fix after merge master commit 12ec04611c78a603c03707488d66bdbbedf0d536 Author: Chaitanya Prakash Bapat <chai.bapat@gmail.com> Date: Wed Jul 15 13:59:34 2020 -0700 Migrate from private to public jetson toolchain files (#18677) commit 0dc30a2c170fd0aa369d325a1feae6aad75a52c2 Author: Leonard Lausen <lausen@amazon.com> Date: Wed Jul 15 01:02:36 2020 +0000 Enable GPU Memory profiler tests (#18701) * Enable GPU Memory profiler tests Previously tests are not run as test_profiler.py was not taken into account on GPU CI runs and some tests were marked for being skipped if run on a CPU-only machine. 
* Disable broken tests commit d512814c2981f9bfb23937064634982ca97d0338 Author: Leonard Lausen <lausen@amazon.com> Date: Wed Jul 15 00:57:38 2020 +0000 Disable test coverage in MKL builds (#18443) * Disable test coverage in MKL builds * Enable test parallelization * Set OMP_NUM_THREADS * Fix * Fix unpack_and_init commit d8430b6b412e637d07b291dbee1350df7168234d Author: Leonard Lausen <lausen@amazon.com> Date: Wed Jul 15 00:53:49 2020 +0000 Set CMAKE_CUDA_COMPILER in aarch64-linux-gnu-toolchain.cmake (#18713) CMAKE_CUDA_HOST_COMPILER will be reset if CMAKE_CUDA_COMPILER is not set as of cmake 3.17.3 See https://gitlab.kitware.com/cmake/cmake/-/issues/20826 commit f125f5fd9ff91e9a70e5add3735c32d4e3bf9cd0 Author: Yang Shi <yangshia@amazon.com> Date: Tue Jul 14 14:29:14 2020 -0700 Fix all anchor shifts on website (#18674) commit 7c9c4fc3d3ef66310537c0bc6810a90af551a63e Author: Yang Shi <yangshia@amazon.com> Date: Tue Jul 14 14:28:17 2020 -0700 Merge content from numpy.mxnet.io into mxnet official website (#18691) commit 7f7e1c5a714262e8cd1015716258416e6ce1ff3e Author: Serge Panev <spanev@nvidia.com> Date: Tue Jul 14 14:12:00 2020 -0700 Add better partial args/aux handling in symbol optimize_for (#18350) * Add missing args/aux support in optimize_for and deferred inference option Signed-off-by: Serge Panev <spanev@nvidia.com> * Add input shape_dict, type_dict and stype_dict to optimize_for Signed-off-by: Serge Panev <spanev@nvidia.com> * Remove warnings for Werror Signed-off-by: Serge Panev <spanev@nvidia.com> * Address PR comments Signed-off-by: Serge Panev <spanev@nvidia.com> commit 9d623926d4857a2cfa32515b58cd1398371f97f3 Author: Yang Shi <yangshia@amazon.com> Date: Mon Jul 13 15:54:51 2020 -0700 Fix python micro-site table of content bugs (#18664) * update footer style * add compiled css of footer styles changes * add same style for footer2 * more fix to the toc commit 8ebb5372c3ad414cde096fb82de8be14cb748b11 Author: Sheng Zha <szha@users.noreply.github.com> Date: Mon 
Jul 13 13:17:12 2020 -0700 add 'needs triage' label to new bug reports (#18696) commit 9c5b95a9c5d6f83a067504fb47fac4e3aed27e81 Author: Serge Panev <spanev@nvidia.com> Date: Mon Jul 13 11:45:29 2020 -0700 Partition API adding and deleting new params to Block and Symbol (#18405) * Add deleting of args aux aux to Partition API Signed-off-by: Serge Panev <spanev@nvidia.com> * Delete args from Block.params Signed-off-by: Serge Panev <spanev@nvidia.com> * Fix to use arg/auxdict when optimize_for is called in HybridBlock Signed-off-by: Serge Panev <spanev@nvidia.com> * Address PR comments Signed-off-by: Serge Panev <spanev@nvidia.com> commit 19e373daac76b466cf11b5d31fa5d5e2eb518a21 Author: Leonard Lausen <lausen@amazon.com> Date: Sat Jul 11 09:09:51 2020 -0700 Fix scipy dependency in probability module (#18689) * Fix scipy dependency in probability module * Fix copy-paste error * dtype='float32' for digamma and gammaln commit a9b16f7024878611b236c9f3734ccd37a5a35d38 Author: JackieWu <wkcn@live.cn> Date: Sat Jul 11 02:59:21 2020 +0800 change bn test (#18688) commit beafba76395e75c093f99d20ac62e38f48e91012 Author: JackieWu <wkcn@live.cn> Date: Thu Jul 9 08:01:35 2020 +0800 [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (#18504) * fix batch norm when fix_gamma is True * support gradient accumulation for batch norm * mkldnn batchnorm support grad add * unittest for bn * fix bn arg * fix lint * fix mkldnn * fix mkldnn bn * fix grad when fixing gamma * fix naive gpu bn * fix lint * invoke mkldnn and cudnn batchnorm when axis != 1 * backport 18500 * change condition * fix * fix * add mkldnn_off for bn * remove mkldnn_off * recover save_000800.json * cast commit 348ab4d8d77359bf60d97a0befbd9086fd52ee49 Author: Yang Shi <yangshia@amazon.com> Date: Tue Jul 7 15:06:34 2020 -0700 fix broken installation widget - remove empty entries (#18661) commit b4b8b805fe94a6df905c6eae7f6c1f83cfea9b73 Author: Xi Wang <xidulu@gmail.com> Date: Wed Jul 8 01:22:05 2020 +0800 
Gluon.probability (#18403) * package created * mvn WIP * normal wip, to be tested * update * docstring added, normal mostly done * add test file * Bernoulli WIP * bernoulli wip * bernoulli doc done * dense variational WIP * add kl infra * implement normal kl method * refactor kl * add not implemented handling, rename kl_storage * add abstract method and Categorical class * rewrite logit2prob prob2logit for multiclass support * normal broadcast_to implemented * categorical mostly done * update distributions/utils.py * add dot ahead of import * fix normal F * bernoulli, normal brief tests implemented * add hybridize tests * transformation infras done * affine transformation, implemented tested * add tests cases * add sum_right_most * fix get F bug * compose transform implemented, tested * fix * add event_dim * fetch mvn from upstremm * clean code, implement normal cdf and tests * constraint in bernoulli done * fix constraint * finish half normal * add cached_property * add test on cached_property * add more features to distribution and constratins * change constraint * fix bernoulli * add independent * add independent tests * update naming of cached_property * revert * add constraints * add Cat * add Stack for imperative mode * add Stack for imperative mode * add bernoulli entropy * categorical WIP * categorical sampling implemented * finish categorical log_prob, sampling * enumerate_support finished * polish StochasticBlock, add test * add test for stochastic sequential * clean loss list in __call__ * fix affine, implement sigmoid, softmax * add gumbel, relaxed bernoulli * relaxed one-hot sampling implemented * gamma done * gamma, dirichlet implemented * beta done * gumbel softmax log-likelihood implemented * refactor tests, implement exponential, fix compose transform * weibull implemented, transformed distribution cdf icdf added * pareto implemented * uniform wip * uniform done * rewrite lgamma, implement chi2 * fix chi2 scale * F distributiion done * t 
implemented * fix tiny problem * cauchy done * add half cauchy * multinomial done, tests to be added * add multinomial test * MVN done, tests todo * mvn polished * fix a few precison issues * add erf, erfinv unified api and learnable transform * fix mvn attribute check * MVN done * poisson done * hack poisson for size support * geometric finished * negative binomial done * binomial done * implement some kl * add more kl * refactor kl test * add more kl * binomial kl todo * change constraint logical op implement * implement gamma entropy * finish beta dirchlet entropy * finishi all entropy * kl finished * add constraint test * domain map done * remove bayesian dense * fix tiny problems * add kl uniform normal * add kl tests * acquire patch from upstream * add some doc * finish doc * refactor kl test(WIP) * add more kl, fix float32 underflow issue * make sampling more stable * handle inconsistent mode * replace boolean idx with np.where * fix file name * add more doc * add constraint check * add half_normal/cauchy pdf cdf support check * fix import problem * change nosetest to pytest * remove buggy lines * change alias register path * attempt to fix ci * fix lint, change a few tests * fix lint * modify hybrid sequential * fix lint * change import order * add test gluon probability v2 * fix hybridize flag * change implementation of stochastic block * fix lint * fix comments * fix block * modify domain map * add raises for improper add_loss * add raises for improper add_loss * add extra cases * change collectLoss decorator to mandatory * skip stochastic block tests * remove test cases * put gpu tests back * add test_gluon_stochastic_block back * remove export test * put a test back * tiny refactor * add memory leak flag * small changes Co-authored-by: Zheng <shzheng@a483e789dd93.ant.amazon.com> commit 54c0155b7581f5e10b1469a17ddf127d3c75e156 Author: Yang Shi <yangshia@amazon.com> Date: Mon Jul 6 17:01:42 2020 -0700 User Feedback Widget (#18639) * user feedback widget 
implementation * add user feedback widget to python docs site * update margin * add apache license * one more license * turn off feedback widget on python site * update copy * format * add event value field * turn on widget on Python site commit 646288716cbba482d4ede0fb4f6141b2ea505090 Author: Yiyan66 <57363390+Yiyan66@users.noreply.github.com> Date: Sat Jul 4 09:13:41 2020 +0800 [numpy] Fix less/greater bug with scalar input (#18642) * fix ffi * fix less/greater error * back * submodule * fixed Co-authored-by: Ubuntu <ubuntu@ip-172-31-8-94.us-east-2.compute.internal> commit d1b0a09669d1fa17b12a9acee887672d1e621523 Author: Yiyan66 <57363390+Yiyan66@users.noreply.github.com> Date: Fri Jul 3 15:10:55 2020 +0800 [numpy] FFI flip, rollaxis, stack (#18614) * flip * rollaxis * stack * fixed * retrigger ci Co-authored-by: Ubuntu <ubuntu@ip-172-31-18-97.us-east-2.compute.internal> commit c519e0e2db54fb8ad74e0e44d586720bf4023490 Author: Leonard Lausen <lausen@amazon.com> Date: Thu Jul 2 18:21:08 2020 -0700 Mark test_get_symbol as garbage_expected (#18595) commit d1b2cd9d8ada39ab4f16caff4ac43337476f2efc Author: Leonard Lausen <lausen@amazon.com> Date: Thu Jul 2 18:20:48 2020 -0700 build.py --no-pull (#18589) Add --no-pull option which disables overwriting the local docker cache based on CI docker cache. It is useful when locally changing Dockerfiles. 
commit 0c8b6b2405e8313db3cf1a6f374a945d3c871b26 Author: Yang Shi <yangshia@amazon.com> Date: Thu Jul 2 13:15:54 2020 -0700 Clipboard refactor (#18605) * refactor clipboard * make lang getter more extensible * trigger ci commit a8c8dea67593df7f1d2061893dddfdeee4750d9f Author: Tao Lv <tao.a.lv@intel.com> Date: Wed Jul 1 22:53:54 2020 +0800 update to onednn v1.4 (#18273) commit 9a122cac5e1317ccca2dea6884253ce32ac3671a Author: bgawrych <bartlomiej.gawrych@intel.com> Date: Wed Jul 1 16:43:06 2020 +0200 Fix softmax, logsoftmax failed on empty ndarray (#18602) * Fix failing empty array (log_)softmax * Modify test for npx (log_)softmax commit 37bed6e3af794624d651e888101eceb30c27c001 Author: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com> Date: Wed Jul 1 16:39:22 2020 +0200 Fix BatchNorm backward synchronization (#18644) * Add test for BatchNorm running variables synchronization * Fix BatchNorm backward synchronization It fixes issue #18610 commit 21581060d2f967cc2faeb5a76979cdffbf578657 Author: XIAO-XIA <47599701+XIAO-XIA@users.noreply.github.com> Date: Tue Jun 30 14:16:20 2020 +0800 [Numpy] FFI: tril_indices (#18546) * add numpy tril_indices ffi * Update src/api/operator/numpy/np_matrix_op.cc Co-authored-by: Haozheng Fan <hzfan9@outlook.com> Co-authored-by: Haozheng Fan <hzfan9@outlook.com> commit 638622f37dcc4ef4b36dcabfd3d7a695fdb7d4c9 Author: Rohit Kumar Srivastava <srivastava.141@osu.edu> Date: Mon Jun 29 14:36:42 2020 -0700 Improve performance of broadcast_axis on CPU (#17882) * adding comments explaining code optimizations * fixing broadcast_axis kernel to int32 * fixing slice_axis kernel to int32 * combining CPU and GPU implementation method signatures and cleaned up code * adding new broadcast_axis to np_matmul Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu> commit becb9ca694f51fdc0583d58429ccc943e6462810 Author: Sheng Zha <szha@users.noreply.github.com> Date: Mon Jun 29 12:16:16 2020 -0700 Remove mention of nightly in pypi (#18635) 
commit b12abbfb356be93f8c24d427c72448f91d1980ec Author: ciyong <ciyong.chen@intel.com> Date: Mon Jun 29 11:14:34 2020 +0800 Enhance license checker to cover multiple license header and md files (#18633) commit d6c35785a870ac6e0b42903d7e27de2c9a6efdbe Author: Shuai Zheng <szhengac@users.noreply.github.com> Date: Sat Jun 27 13:25:03 2020 -0700 Add LANS optimizer (#18620) * add lans optimizer * fix * fix Co-authored-by: Zheng <shzheng@a483e789dd93.ant.amazon.com> commit 8ee460077b8e8f2d7a1dd96efca1751fc337cb63 Author: Yang Shi <yangshia@amazon.com> Date: Fri Jun 26 11:22:15 2020 -0700 fix contrib interleaved_matmul_selfatt_valatt not render correctly (#18621) commit ecbda07c7bf8ce671744f0e9d361a1e8b5b744da Author: Yang Shi <yangshia@amazon.com> Date: Thu Jun 25 11:11:00 2020 -0700 fix julia api redirect (#18613) commit c9dcdd11853e8600879615c8d8be0aa5cdf851cf Author: Yang Shi <yangshia@amazon.com> Date: Thu Jun 25 11:02:09 2020 -0700 add version check on installation guide (#18587) commit e4c93e3e3a68559cb38e4ff92c9e0bf9c9cdd0bf Author: Shuai Zheng <szhengac@users.noreply.github.com> Date: Wed Jun 24 22:03:39 2020 -0700 add epsilon to adamax (#18532) Co-authored-by: Ubuntu <ubuntu@ip-172-31-92-136.ec2.internal> commit 3f555f850f4eef897bbafcb61df726491954ffbb Author: Leonard Lausen <lausen@amazon.com> Date: Wed Jun 24 19:41:34 2020 -0700 Update disclaimer wording (#18616) commit 1fcc7ea8b8f5dfebd3f5440ffe9e0c7d4b13b90f Author: RuRo <andrey.stotskiy@tevian.ru> Date: Wed Jun 24 12:03:20 2020 +0300 use new mxnet.gluon.block APIs (#18601) commit acf2d27efe583ceb0f6b5253f0ac78ad6bf00e8e Author: acphile <phile_999@126.com> Date: Wed Jun 24 10:25:44 2020 +0800 Update tutorials (#18609) Update docs according to new Block APIs (#18413) commit 4b86c32832a994e76b97dfc58c8a672db87e721d Author: mk-61 <56651474+mk-61@users.noreply.github.com> Date: Tue Jun 23 13:49:06 2020 -0700 Allow input reordering duing Gluon / CachedOp graph transformations (#17949) * Initial commit of input 
reordering in Gluon * Add test for Gluon input reorder * Fix backward in CachedOp for input reordering * Fix test_input_reorder for backward pass * Fix merge error in NaiveCachedOp * Include correct header for std::iota Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com> commit 74fcb9938a14ec80f0c690b5a58a700537a621c5 Author: Yang Shi <yangshia@amazon.com> Date: Mon Jun 22 18:54:05 2020 -0700 redirect api reference on v-master to v1.6 (#18607) * redirect api reference on v-master to v1.6 * update R docs commit 56cfd9c272e81988682db6fde1b9205becc6a235 Author: Ram Rachum <ram@rachum.com> Date: Mon Jun 22 21:23:04 2020 +0300 Use chain.from_iterable in artifact_repository.py (#18578) commit 2fbec60e0da8832d71f7e3f93d4407dbca745e51 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Sun Jun 21 23:02:13 2020 -0700 graph executor c api removal (#18598) * add default ctx to cachedop fwd * add test * perl fix * initial commit * update sparse tests * add aux_states * fix aux-state type * fix some tests * fix check symbolic forwrad/backward * fix symbolic grad check * arg_dict fixes * support init ops * support forward only graph * fix check symbolic backward stype * add missing file * replace extension test bind * replace bind with _bind * simplify backward_mul implementation * small fix * drop contrib.sparseembedding * remove simple_bind in test sparse ops * use simple_bind * replave simple bind in quantization * fix aux index * update amp simple_bind calls * drop ifft * fix a bug found in subgraph op * add aux_array method * replace symbols * minor fix * fix executor default context * fix import * bug fix for nd.where * add subgraph test * fix forward grad req * fix batch dot dtype * remove unused code * fix slice dtype * fix attach grad * remove tests for non-existing sparse ops * MXCachedOpGetOptimizedSymbol * fix foreach test * enhance err msg * skip failed test * add docs * add docs * fix lint * fix lint, remove quantization * fix lint * fix lint * fix lint * 
fix build and import * fix import * remove scala, R, julia, perl bindings * remove cpp, matlab bindings * fix perl call * fix test * remove perl binding * remove reshape test * fix profiler, trt * remove tensorrt test * remove quantization tests * fix import * fix conflcit * fix lint * skip buggy test * remove clojure * remove executor c api * remove amalgamation * fix build * move executor folder * fix import * fix lint * fix cpp pcakge * fix predict cpp * fix cpp make * remove jnilint * remove cpp package tset * remove julia test pipeline * disable numpy tests * disable compat test for delete Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit c1098aa33d6795f84a19601d0319d5bb8e19f317 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Sat Jun 20 14:49:58 2020 -0700 Switch to cached op in the testing suite (#18579) * add default ctx to cachedop fwd * add test * perl fix * initial commit * update sparse tests * add aux_states * fix aux-state type * fix some tests * fix check symbolic forwrad/backward * fix symbolic grad check * arg_dict fixes * support init ops * support forward only graph * fix check symbolic backward stype * add missing file * replace extension test bind * replace bind with _bind * simplify backward_mul implementation * small fix * drop contrib.sparseembedding * remove simple_bind in test sparse ops * use simple_bind * replave simple bind in quantization * fix aux index * update amp simple_bind calls * drop ifft * fix a bug found in subgraph op * add aux_array method * replace symbols * minor fix * fix executor default context * fix import * bug fix for nd.where * add subgraph test * fix forward grad req * fix batch dot dtype * remove unused code * fix slice dtype * fix attach grad * remove tests for non-existing sparse ops * MXCachedOpGetOptimizedSymbol * fix foreach test * enhance err msg * skip failed test * add docs * add docs * fix lint * fix lint, remove 
quantization * fix lint * fix lint * fix lint * fix build and import * fix import * fix perl call * fix test * remove perl binding * remove reshape test * fix profiler, trt * remove tensorrt test * remove quantization tests * fix import * fix conflcit * fix lint * skip buggy test Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit c1b96f562f55dfa024ac941d7b104f00e239ee0f Author: Leonard Lausen <lausen@amazon.com> Date: Fri Jun 19 14:46:27 2020 -0700 cmake: x86 options only on x86 and remove manual specification on CI (#18588) Use CMAKE_SYSTEM_PROCESSOR to detect target architecture and make x86 related options available only when compiling for x86. Remove the code turning these options manually off on CI. Remove ANDROID cmake option which was used to decide if -lpthread needs to be specified explicitly (on most Linux systems) or not (on Android). Instead auto-detect the behavior. commit 041bd3016375c6bdadddc9e9f43655923ee739bf Author: RuRo <andrey.stotskiy@tevian.ru> Date: Fri Jun 19 21:56:05 2020 +0300 [MXNET-889] Implement ONNX export for gluon LSTM. (#17734) * implement onnx translations for _full type nodes * implement onnx translations for _rnn_param_concat * implement onnx translations for RNN (LSTM mode) * implement node export unittest for gluon.LSTM commit bf0753702b37cc932baf417be2af2e7abe034bab Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Fri Jun 19 10:20:55 2020 -0700 Link GluonCV object detection tutorial for Jetson (#18530) * add object detection tutorial for Jetson * adding GluonCV in title * cross reference gluoncv turorial commit cb54a4a99463b23b8abaa2629661954c4ba3c60b Author: acphile <phile_999@126.com> Date: Fri Jun 19 14:31:08 2020 +0800 Simplify mxnet.gluon Block APIs (#18413) ## Motivations Currently the implementation of mxnet.gluon.block is not so pythonic and there are many redundancies ### 1. 
overlaps between Block._params and Block._reg_params when we want to self-define a model, we currently need to use the code as follows: ``` class Net(nn.HybridBlock): def __init__(self, **kwargs): super(HybridNet, self).__init__(**kwargs) with self.name_scope(): self.hidden1 = nn.Dense(256, activation='relu') self.a=self.params.get('a', shape=(1, )) ``` There are several shortcomings when using this form of registration: a. adding parameter ‘a’ will lead to double recordings in both self._params and self._reg_params, which is a redundancy. And there is also a discrepancy in Block: i. In the method “collect_params”, we use “_params” to get all parameters ii. while in the method “_collect_params_with_prefix” (and methods “load_parameters” accordingly), we use “_reg_params” to get all parameters. b. Currently if we do not use “with self.name_scope():” for children blocks, it will lead to wrong name scopes. For the following example, we actually can not get the parameters of self.hidden1 from the result of collect_params ``` class HybridNet(nn.HybridBlock): def __init__(self, **kwargs): super(HybridNet, self).__init__(**kwargs) self.hidden1 = nn.Dense(256, activation='relu') with self.name_scope(): self.hidden2 = nn.Dense(10, activation='relu') def hybrid_forward(self, F, x): x = self.hidden2(self.hidden1(x)) return x >>> net = HybridNet() >>> net.initialize() >>> print(net.collect_params()) hybridnet0_ ( Parameter dense0_weight (shape=(256, -1), dtype=float32) Parameter dense0_bias (shape=(256,), dtype=float32) Parameter hybridnet0_dense0_weight (shape=(10, -1), dtype=float32) Parameter hybridnet0_dense0_bias (shape=(10,), dtype=float32) ) ``` From the above example we can also find that the parameter names are not related to the attributes’ names, which is not straightforward. In all, we find that using name_scope and ParameterDict is not user-friendly. 
Thus we plan to remove such redundancies and simplify the definitions of children blocks and parameters, like: ``` class Net(nn.HybridBlock): def __init__(self, **kwargs): super(HybridNet, self).__init__(**kwargs) self.hidden1 = nn.Dense(256, activation='relu') self.a=gluon.parameter.Parameter(name="a", shape=(1, )) ``` ### 2. parameter sharing Currently, we use parameter “params” in the definition of Block for parameter sharing. It means before the __init__ of Block, shared parameters already recorded in self._params.shared. And currently Block forbids overriding parameters. We think that this is not convenient. A most common way to share parameter is like what Pytorch does, like ``` self.hidden1.weight=self.hidden2.weight ``` But note that in the case where we have a HybridBlock and the block has been hybridized, then we shouldn't allow overriding the parameter but ask the user to unhybridize the Block first. To further allow sharing parameters recursively, we plan to add an API: ``` def share_parameters(self, params : Dict): ``` We plan to use the structured based form (like what is used in “_collect_params_with_prefix()”) to represent each parameter recursively. For example, we denote “self.hidden1.weight” as “hidden_weight” In all, we plan to make the following improvements: 1. remove parameters “prefix” and “params” in the “\_\_init\_\_" function. 2. remove the use of self._params(ParameterDict) in Block 3. allow parameter attribute overriding in non-hydridization case. 4. add the method “share_parameters" to recursively share parameters in children blocks. ## Parameter naming Once a parameter is created, `param.name` would not be changed in the following operations. It is in the form of `param_{uuid4}_{name}`, where `name` is from `__init __` parameter. Here `name` is optional, default `weight`. It is mainly used to denote which default initialization should be used. We use `param.name` as the name of a parameter's symbol representation. 
## collect_params() It returns a `dict`, where the keys are structural names of parameters, like `{'hidden1.weight': Parameter (shape=(3, -1), dtype=float32), 'hidden1.bias': Parameter (shape=(3,), dtype=float32)}` Note that we use `.` as the linking character again because the structured based naming scheme is no longer used in the symbol representation. ## Save and Load For `HybridBlock`, there are two ways to save and load parameters: ### save_parameters() and load_parameters() In `save_parameters()`, we use `structural name` to save parameters, and they should be loaded by `load_parameters()`, which loads parameters based on a model's structure. ### HybridBlock.export and SymbolBlock.imports In `export`, we only save parameters using `param.name` without `structural name`. The param file should be loaded in SymbolBlock.imports. ## SymbolBlock When using `SymbolBlock.imports`, keys in `self.param` would be the loaded parameters' names `param.name`. While in `SymbolBlock(outputs, inputs, params=None)`, if you provide like `params=net.collect_params()`, keys in `self.param` would be structural names of `net`'s parameters (keys in net.collect_params() ). It is often used in this situation that a `SymbolBlock` is a children block of another `HybridBlock`. Otherwise, keys in `self.param` would be the loaded parameters' names `param.name`. commit 55856066b4b6242f233cc31da8970c91f06d4bc0 Author: ciyong <ciyong.chen@intel.com> Date: Fri Jun 19 06:23:07 2020 +0800 Add KEY for Ciyong Chen (#18577) commit e96fbeb3adb78d4300f5f10cc22531583914e590 Author: Leonard Lausen <lausen@amazon.com> Date: Thu Jun 18 15:20:14 2020 -0700 Update cmake/upstream/FindCUDAToolkit.cmake (#18528) Previously MXNet includes a hotfix for a cross-compiling bug in upstream FindCUDAToolkit.cmake. Upstream has fixed the bug now in their master branch. Replace MXNet's fix by the upstream fix to avoid diverging from upstream. 
See https://gitlab.kitware.com/cmake/cmake/-/issues/20572 commit 14aeb384a51c9e420c349f42cea001f0a5ef5dfe Author: RuRo <andrey.stotskiy@tevian.ru> Date: Fri Jun 19 01:16:12 2020 +0300 Add parameter name to AssertionError for deferred shape inference (#18537) commit 9591436967347cc8e34a01e126b696b3447f8081 Author: Johannes Czech <QueensGambit@users.noreply.github.com> Date: Thu Jun 18 07:33:08 2020 +0200 [Numpy] Bugfix of slice operator export (MXNet to ONNX) v2 (#18535) * fixed get_inputs() for onnx slice operator export * added unit test for onnx slice operator export * implement get_inputs with_shapes helper * update slice ops to use with_shapes * added verbose parameter for get_outputs() Co-authored-by: Andrey Stotskiy <andrey.stotskiy@tevian.ru> commit 92971b822dd0151aadba965c0c6b8b22cb82bf76 Author: Neutron3529 <qweytr_1@163.com> Date: Thu Jun 18 13:30:10 2020 +0800 fix misbehave of KLDivLoss (#18423) * fix misbehave of KLDivLoss In the current version of KLDivLoss, the return value is not the same value calculated by SoftmaxCrossEntropyLoss, which is not documented. It may due to the incorrect settings which using mean rather than sum dealing with the return value. I provide a fix of this setting, which will keep the return value of `KLDivLoss` and SoftmaxCrossEntropyLoss` almost the same when `from_logits=False` and `sparse_label=False` are set to these functions seperately. Now, the behave of KLDivLoss is exactly the same to what the document say. ``` import mxnet as mx a=mx.nd.array([[-1,1],[1,-1]]) b=mx.nd.array([1,0]).one_hot(2) TrueLoss=mx.gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=False) FalseLoss=mx.gluon.loss.KLDivLoss(from_logits=False) c=TrueLoss(a,b) d=FalseLoss(a,b)*a.shape[-1] assert((c-d).abs().sum()==0 and a.shape[-1]==2) ``` * update sdml loss the current version of SDMLLoss told us to `multiply for the number of labels` but actually it `multiply batch_size`. 
After this PR, it is no need to `multiply batch_size` or `multiply the number of labels` any more. * remove outdated comment commit b9118d9bfa0b34307c53456ea6af3927e57b8635 Author: Yang Shi <ys2843@nyu.edu> Date: Wed Jun 17 13:00:04 2020 -0700 fix contribute page anchor position shifted (#18571) Co-authored-by: Yang Shi <yangshia@amazon.com> commit eddd27d375ee403a026e3262264485c83161787f Author: Yang Shi <ys2843@nyu.edu> Date: Wed Jun 17 11:59:41 2020 -0700 add FAQ redirect rules (#18552) Co-authored-by: Yang Shi <yangshia@amazon.com> commit 103d839aa8477419ddc82f09e2ddb246e24a8d3d Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Tue Jun 16 16:52:46 2020 -0700 Test CD mxnet_lib/static and python/pypi stages on CI (#18559) * add cd mxnet_lib/static stages to ci * add cd pypi packaging stage to ci * removing existing cmake static compile stages in favor of other added stages * pass mxnet_variant correctly commit 8039377e6630bcb00c5a95abdaf0851803686bc6 Author: JiangZhaoh <54654391+JiangZhaoh@users.noreply.github.com> Date: Wed Jun 17 01:45:30 2020 +0800 add op npx.index_update (#18545) * add op npx.index_update * remove debug comment * change eps * fix stupid error * add blank line in docs * gpu temporary space request alignment * fix test error Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-85.us-west-2.compute.internal> commit 72a54e7a5f427dc73fbd1cb826ff944d9aa82573 Author: andevellicus <762254+andevellicus@users.noreply.github.com> Date: Mon Jun 15 22:13:13 2020 -0400 Julia: fix deprecation in visualize.jl (#18515) * Update visualize.jl matchall has been deprecated as of Julia 1.3. Changes made to fix. 
* Cleaned * Update julia/src/visualize.jl * Update julia/src/visualize.jl Co-authored-by: Iblis Lin <iblis@hs.ntnu.edu.tw> commit e8fce62b369dac627dec23d730661624ec79b957 Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Mon Jun 15 18:42:51 2020 -0700 Skip flaky test_gpu_memory_profiler_gluon on cd pipeline (#18565) commit 1b02225fefd8ccc93bc73223f0d3cde103fad661 Author: Chaitanya Prakash Bapat <chai.bapat@gmail.com> Date: Mon Jun 15 11:45:03 2020 -0700 Add comments to init.py (#18327) commit cc6c64909afd78c6b5b63ee1215922e8da589c20 Author: Chaitanya Prakash Bapat <chai.bapat@gmail.com> Date: Mon Jun 15 08:55:14 2020 -0700 [OpPerf] Add example of using opperf with internal op locally (#18324) * add example of using opperf with internal op locally * split diff to old and new code for readability * mx.nd.copyto doesnt exist & website title shows ndarray instead of symbol * Revert "mx.nd.copyto doesnt exist & website title shows ndarray instead of symbol" This reverts commit 118b0900a58586aca84ec5c853d00cf687615853. 
commit af1b45ba3590b21014c55c58838c3e04b3f2cea3 Author: Chaitanya Prakash Bapat <chai.bapat@gmail.com> Date: Sun Jun 14 22:45:57 2020 -0700 Create config.yml (#18553) Add options for stackoverflow and discuss to issue_template & disable blank issue commit da252734c70164a0983404de076464ba7a526a60 Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Sat Jun 13 18:30:29 2020 -0700 remove dependency on train_mnist.py script (#18550) * remove dependency on train_mnist.py script * remove image classification tests from nightly commit 09cf48a24682e308b552a7fa70a816c024308438 Author: Leonard Lausen <lausen@amazon.com> Date: Sat Jun 13 16:31:59 2020 -0700 Use correct array type for outputs in HybridBlock.forward (#18554) commit f1f3f44166e2e47afad6c65025fb48dd47efeb65 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Sat Jun 13 10:10:25 2020 -0700 Remove the deprecated BatchNorm_v1 op (#18538) * remove batchnorm_v1 * fix gpu build Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit 97d4ba5a133f93ff6075dcde3ef842b23d498a12 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Fri Jun 12 16:52:47 2020 -0700 Remove XXOutput loss operators (#18531) * remove xxOutput operators used in Module * remove SVMOutput * remove RegressionOutput in language binding * remove more examples * fix scala, perl * remove spark examples * remove softmaxoutput op * remove more tests * remove more SoftmaxOutput related code * remove MAERegression * remove symbol.Softmax * fix perl test count * fix failing tests * remove mlp cpu test * fix scala test * remove tests/examples relying on imagenet-1k pretrained symbolic models * fix scala build * remove MultiTaskSuite for scala * fix cpp build * fix scale, clojure test * fix scala and python test * fix scala and clojure test * remove clojure test * remove clojure test * remove test_forward for python * remove clj viz test * remove viz tests * remove clj 
tutorail test * remove bert test * remove clj tests * remove clj multi-label test * remove module mlp test for clh * remove module test for clj * rm ./contrib/clojure-package/test/org/apache/clojure_mxnet/ndarray_api_test.clj * remove clj tests * rm test_mkldnn_model Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit 1bf881f381f91b157a26d9beddcaa8f4960cc038 Author: Yang Shi <ys2843@nyu.edu> Date: Thu Jun 11 14:01:17 2020 -0700 Fix Slow Site Loading Speed part2 (#18512) * host JQuery locally * defer time consuming scripts * defer more render-blocking script * move general version dropdown css from head to scss * update quotation mark * add cache control * add licenses info to jquery * remove jquery from github # Conflicts: # docs/static_site/src/assets/js/jquery-3.3.1.min.js * load jquery based on env * update wget jquery command Co-authored-by: Yang Shi <yangshia@amazon.com> commit a361f33497c8e87a4eab48a666fcb4a586a607b1 Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Thu Jun 11 09:17:44 2020 -0700 revert changes causing cd failures (#18533) Reverting the following changes to cd_unittest_ubuntu causing CD pipeline failures: The first change was using Naive Engine for operator tests, which causes timeout failures in CD Added here: 10b6b48 Second change was running integrationtest_ubuntu_gpu_byteps as part of cu* CD tests, added here: e28e9fe commit 743bbcbc7c8c85661a146d94ebd3196306650677 Author: Yijun Chen <chenyijun0902@gmail.com> Date: Thu Jun 11 23:22:56 2020 +0800 unify impl (#18523) commit fb73de7582de4e622299a4ad045e25f771568193 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Wed Jun 10 19:54:25 2020 -0700 remove mx.module.* APIs for MXNet 2.0 (#18525) * remove Module tests * remove APIs relying on module * remove docs and tools using mx.module * remove executor manager * remove ssd and ncf examples * add back grad compression api doc * fix lint * 
add back cpredict exmaple * fix resnet memory test * remove tests * remove tests/python/tensorrt/test_tensorrt_lenet5.py since it depends on a model traiend by mx.Module * skip flaky test * fix quantization test * remove subgraph tests Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit 26f44b71d8de84bbc88af496ae0aeb7ce535312d Author: Serge Panev <spanev@nvidia.com> Date: Wed Jun 10 10:41:50 2020 -0700 Add backward Type inference to main NN operators (#18378) * Add backward Type inference to main DNN operators Signed-off-by: Serge Panev <spanev@nvidia.com> * Add comments Signed-off-by: Serge Panev <spanev@nvidia.com> commit b6b40878f0aba2ba5509f3f3a4cd517a654847ce Author: Leonard Lausen <lausen@amazon.com> Date: Tue Jun 9 22:05:16 2020 -0700 Consolidate installation instructions on website and add disclaimer for non-ASF ressources (#18487) * Update website with disclaimer for non-ASF ressources * Integrate Windows instructions to build_from_source.md * Remove master version from selector * Update Download links * Update get_started/download.md per Release Download Page policy commit cf3984bf5c67cb7d1feeb5b3cb55a41ca995e5c8 Author: Yiyan66 <57363390+Yiyan66@users.noreply.github.com> Date: Wed Jun 10 05:56:13 2020 +0800 [numpy] fix op repeat with list input (#18371) * except .h * except storage * repeat * change fwd * delete * codecov Co-authored-by: Ubuntu <ubuntu@ip-172-31-18-97.us-east-2.compute.internal> commit 028d01d5fb4867988a5ca50634562c1f4e75ca6f Author: Sam Skalicky <samskalicky@gmail.com> Date: Mon Jun 8 10:42:09 2020 -0700 Drop list support in optimize_for (#18483) * initial commit * fixed typos * changed warning to exception * updated subgraph_op unittests commit 2d58ff5512e27e7a12ae9c9335d2554ee0b2bc1f Author: JackieWu <wkcn@live.cn> Date: Tue Jun 9 01:41:35 2020 +0800 [Bug Fixed] Fix batch norm when grad_req is `add` (#18500) * fix batch norm when fix_gamma is True * 
support gradient accumulation for batch norm * mkldnn batchnorm support grad add * unittest for bn * fix bn arg * fix lint * fix mkldnn * fix mkldnn bn * fix grad when fixing gamma * fix naive gpu bn * fix lint * fix cudnn bn * fix flag * fix lint * fix testcase * fix * use @pytest.mark.parametrize * combination * remove redundant test in batchnorm * npx.batch_norm test * try to fix test * reduce the number of tests for batchnorm * fix commit 992ed3c1ea449fdb1f4f7010dfd05d00ae88a020 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Mon Jun 8 10:39:56 2020 -0700 remove mx.rnn APIs (#18507) * remove mx.rnn APIs * fix test * update test Co-authored-by: Ubuntu <ubuntu@ip-172-31-37-108.ec2.internal> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit e3493e7b47ddcaa6974280ee432c82eb89d0f756 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Sun Jun 7 18:20:46 2020 -0700 remove tools dependent on mx.module APIs (#18508) * remove tools depending on mx.module * remove caffe converter and coreml tools Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> commit 5df002567dd2e9ebcfeb620a9ba55adbded743da Author: Przemyslaw Tredak <ptredak@nvidia.com> Date: Fri Jun 5 19:55:06 2020 -0700 Fix race condition in FusedOp (#18498) commit a1db5b29451938e84ade0e768c3b93b8fd71ad15 Author: Leonard Lausen <lausen@amazon.com> Date: Fri Jun 5 16:40:22 2020 -0700 Update .codecov.yml (#18497) commit 644b69d01e5b037c3d7b0bd61d282f406c01b759 Author: Mosalam Ebrahimi <hesham.ebrahimi@gmail.com> Date: Fri Jun 5 13:52:01 2020 -0700 Fix typo (#18496) commit deae9b88c1724e056a4e7dc21f04b58c28304111 Author: RuRo <andrey.stotskiy@tevian.ru> Date: Fri Jun 5 23:18:16 2020 +0300 Fix tests for ONNX version 1.5.0 bump (#18054) * implement onnx translation helpers * bump onnx version to 1.5 * add export only test cases for topk and slice_axis commit 4be095500de74ff95ed18ebdf695eae171375818 Author: ciyong <ciyong.chen@intel.com> Date: Sat Jun 6 03:44:04 2020 +0800 Julia: remove 
downloading of the non-ASF binary build (#18489) commit 24d88a2cdec3e0ab8f4fe0e436eb0015e9ccfd47 Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Fri Jun 5 09:45:31 2020 -0700 Update Jetson installation guide (#18485) * add config Makefile for jetson * modify jetson install guide commit 7054e42c0786a2b8223b5183b852f68e72822a76 Author: Manu Seth <22492939+mseth10@users.noreply.github.com> Date: Fri Jun 5 09:40:44 2020 -0700 Add image classification tutorial for jetson (#18434) * add image classification tutorial for jetson * update code to use gluon model zoo; update doc * referencing MXNet official website for Jetson installation guide commit a156ed8e37e17f79cf0383dd9b0e1427309ad127 Author: Yang Shi <ys2843@nyu.edu> Date: Fri Jun 5 09:38:02 2020 -0700 Revert installation dropdown change (#18488) This broke the version selector. Co-authored-by: Yang Shi <yangshia@amazon.com> commit b07152244c311b9270b448b6629f8ae470f3fab1 Author: Leonard Lausen <lausen@amazon.com> Date: Thu Jun 4 17:44:52 2020 -0700 Update website instructions for compiling for / on Raspberry Pi. (#18472) * Update ci/README.md * Update raspberry pi instructions commit e28e9fec9bba07708ed0093c882b8070a96dfdd5 Author: Haibin Lin <linhaibin.eric@gmail.com> Date: Thu Jun 4 14:20:52 2020 -0700 BytePS trainer + tests (#18032) * [MXNET-#16795] Byteps-KVStore: Intergrate Byteps into mxnet as new type of kvstore backend (#17555) * Add Byteps backend for kvstore * Add a temp launcher for byteps backend * make the test fit for byteps kvstore. * final workable test * Remove trashy print and logs * correct comment * add hostfile for ci test * add ci test for byteps kvstore * add visibile devices for byteps-kvstore ci test * add licenses for tools/byteps_launcher.py * syntax error * pylint error (remove unused import like logging) * pylint error * pylint error * enable launching without hostfile (local byteps) * 1. rename byteps_kvstore.py to byteps.py; 2. shorten the launch option to ; 3. 
add instruction for -H and -SH options for launch; 4. add documentation for byteps kvstore in kvstore/base.py: create(name='local') * edit documentation of KVStoreBase::is_capable(capability); reture fasle for BytePS(KVStoreBase):is_capable(any). * pylint error * remove an error of arg.byteps * use --env option to set workers' environment * error in byteps-launcher.py * remove the unpurposed editing mistake in runtime_functions.sh * disable cpu support for byteps kvstore. * 1. format the document to avoid julia doc build error; 2. little change to nightly test; 3. add byteps copy right declararation in byteps_launcher.py 4. if args.byteps == True ===> if args.byteps * remove the --scheduler_ip and --scheduler_port options in launch.py * 1. maintain the origin value of broadcast and pushpull 2. optimize when out = value or [out]=value 3. add some missing documentation to avoid doc building error. * Add bytePS to CI * add dependency * +integrationtest_ubuntu_gpu_byteps * add byteps pipeline * disable a few tests * remove more tests * fix permission * remove apt-get * fix python path * improve logging * fix printns * add back CI Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-16.ec2.internal> Co-authored-by: Piyush Ghai <ghai.8@osu.edu> Co-authored-by: eric-haibin-lin <linhaibin.eric@gmail.com> Co-authored-by: eric-haibin-lin <--global> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> * fix byteps logging and declare tensor * check exceptions and return -1 * print logging in CI * Update byteps.py * Update runtime_functions.sh * add numa dependency * pin dependency * Update runtime_functions.sh * Update Dockerfile.build.ubuntu * Update runtime_functions.sh * Update runtime_functions.sh * Update runtime_functions.sh * Update runtime_functions.sh * Update Jenkins_steps.groovy * remove launcher. use bpslauncher instead. 
Co-authored-by: Chaokun Chang <33217209+ChaokunChang@users.noreply.github.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-39-16.ec2.internal> Co-authored-by: Piyush Ghai <ghai.8@osu.edu> Co-authored-by: Lin <haibilin@a483e7be4c92.ant.amazon.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-37-108.ec2.internal> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-81-80.ec2.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-164.ec2.internal> commit 7cc6700fdd5e9f6837389155b63c2911652d2c91 Author: Yang Shi <ys2843@nyu.edu> Date: Thu Jun 4 13:29:08 2020 -0700 Add Developer Guide Docs to MXNet Website (#18474) * init dev guide * move dev guide above FAQ * update format and images * hoist git docs and fix styles * use relative urls * remove useless code block * use consistent url and file name * update heading * add apache license header * init dev guide * move dev guide above FAQ * update format and images * hoist git docs and fix styles * use relative urls * remove useless code block * use consistent url and file name * update heading * add apache license header * update doc - git clone recursive * reviewing the dev guide - proof reading and text edits Co-authored-by: Yang Shi <yangshia@amazon.com> Co-authored-by: Talia Chopra <chopt@amazon.com>
Background
Data parallel training is the most common distributed training technique when it comes to multiple GPUs or multiple hosts. Currently, several communication backends provide functionalities for communicating tensors across devices/hosts for data parallel training. For MXNet users, there are a few options:
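These different implementations provide different APIs:

| Backend | High level APIs | Low level APIs |
| --- | --- | --- |
| Native KVStore | `mx.gluon.Trainer` | `kv.push`, `kv.pull`, `kv.init` |
| Horovod | `hvd.init()`, `hvd.DistributedTrainer` | `hvd.broadcast`, `hvd.allreduce` |
| BytePS | `bps.init()`, `bps.DistributedTrainer` | `byteps_declare_tensor`, `byteps_push_pull` |
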
Here, "high level APIs" refers to the APIs a typical novice user uses for a distributed training job. To communicate tensors not managed by a `Trainer` or `DistributedTrainer`, users can fall back to the low level APIs to send or receive a custom tensor.
Problem Statement
We sometimes want to switch easily between these distributed communication backends and compare which one performs best in a particular distributed training environment. Because the implementations expose different APIs, trying each of them requires substantial changes to user code. It typically involves custom logic to:
Proposal
My proposal is to provide a unified API to allow custom communication backends as plugins for MXNet, so that no new user code is required to switch between these backends.
Specifically, a communication backend provider implements the following Python APIs.
class `AbstractKVStore`:
* `broadcast`: broadcast the `tensor` at `root_rank` to all ranks.
* `pushpull`: push `tensor` and pull into `output`. When the optimizer is not set, it performs a summation of `tensor` from all ranks. The result of the summation is then pulled back to the `output` tensor.
A communication backend provider can implement these APIs and register a new KVStore in MXNet via `mx.kv.register()`. MXNet users only need to interact with the following MXNet APIs:
Limitation
The unified interface does not support advanced features such as sparse ndarrays or gradient compression, which are less mature and not provided by all communication backends.
The above proposal targets use cases 2, 3, and 4 in the problem statement. It can be extended to tackle use case 1 as well if the feedback is positive.
@ymjiang @apeforest @anandj91 @rich-junwang