Low CPU usage of MXNet in subprocesses #13593
Comments
@TaoLv to help look at this issue.
@YutingZhang Thanks for reporting this issue! @anirudh2290 @apeforest @azai91 @samskalicky please take a look here.
Hi @YutingZhang, please try:
Please let me know if it works for you. Thanks.
Related issue: #12255
The limit of 1 thread per worker is set deliberately, to avoid thread contention. Per offline discussion, I think a good solution is to use an environment variable to control the number of threads each worker can use (which currently defaults to 1).
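One possible shape for that control, sketched in Python. The helper function is hypothetical; `MXNET_MP_WORKER_NTHREADS` is the variable name mentioned later in this thread, and the fallback behavior here is an assumption:

```python
import os

def worker_nthreads(default=1):
    """Hypothetical helper: read the per-worker thread limit from the
    environment, falling back to the conservative default of 1."""
    raw = os.environ.get("MXNET_MP_WORKER_NTHREADS", str(default))
    try:
        # Clamp to at least 1 thread so a bad value cannot disable the worker.
        return max(1, int(raw))
    except ValueError:
        return default
```

Each data-loader worker would read this limit once at startup and size its thread pool accordingly.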
@zhreshold this would also require a rebuild with modified
@anirudh2290 Yes, I mean a PR is required to address this issue.
Thanks everyone for discussing and solving the issue!
@zhreshold I tried the latest version of mxnet, and do
@YutingZhang MXNET_MP_WORKER_NTHREADS can only control how many MXNet operators run in parallel; in the case of some transformations, it might not be able to parallelize as many ops as possible. Due to an OpenMP bug, OpenMP is disabled in the workers, so unfortunately that is the case. You might want to enable OpenCV multithreading in each worker, since image decoding and transformation are likely the most time-consuming parts of the worker process.
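Re-enabling OpenCV threading inside each worker could look roughly like the sketch below. The initializer name is hypothetical and the thread count is illustrative; `cv2.setNumThreads` is OpenCV's real API for sizing its internal thread pool:

```python
def opencv_worker_init(num_threads=4):
    """Hypothetical initializer to run at the start of each worker process:
    re-enable OpenCV's internal thread pool, which may be limited to 1."""
    try:
        import cv2
        cv2.setNumThreads(num_threads)
        return True
    except ImportError:
        # OpenCV is not installed; nothing to configure.
        return False
```

A worker pool could then be created with `multiprocessing.Pool(initializer=opencv_worker_init)` so every worker runs it before processing batches.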
@pengzhao-intel @TaoLv @anirudh2290 @zhreshold Thank you everyone for your help, and happy new year! This problem seems more complicated than it first appeared (it may have been multiple problems from the beginning). @zhreshold's fix solved the problem in most cases. Code (one-line difference):
Launch 10 workers ( ). But running it only in the main process is fine. By the way, another issue I found with
@YutingZhang thanks for the case, we will look into the issue.
@YutingZhang If you just want to utilize 100% CPU for each process, please try . If you want to enable OpenMP multi-threading to utilize >100% CPU for each process, you need to make the change below to MXNet: Then you can use . If you don't want to change MXNet and just want to increase the efficiency of the MKL dot, you can try
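For the environment-variable route, the key detail is that thread-count variables must be set before the library that reads them is imported. A minimal sketch, assuming the standard OpenMP and MKL variables apply (the values here are illustrative, not recommendations):

```python
import os

# These must be set BEFORE `import mxnet` (or numpy linked against MKL),
# because OpenMP and MKL read them once at library load time.
os.environ["OMP_NUM_THREADS"] = "8"   # OpenMP threads per process
os.environ["MKL_NUM_THREADS"] = "8"   # MKL threads for BLAS calls such as dot()

# Only after this point should the compute library be imported:
# import mxnet as mx
```

Setting these in the shell (`OMP_NUM_THREADS=8 python3 script.py`) achieves the same thing without touching the script.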
@zhreshold do you know the background on why the thread number is fixed to 1 in the worker process, as the line below shows?
Got some info from @YutingZhang: #13449 #12380. Thanks a lot.
@pengzhao-intel The thread limit is set to 1 according to this comment: #13606 (comment). If you have a better understanding of the problem, please let me know.
@YutingZhang For example,
MXNet has low CPU usage when running CPU operations in multi-process scenarios. Specifically, for MXNet computation in a subprocess, MXNet can use only 1 or 2 CPUs to do its job. This issue shows different behavior for different variants of MXNet (see below) and on different machines ...
This issue is critical because it slows down the multiprocess object-detection data-loading in gluoncv very significantly, making Faster-RCNN training in gluoncv unusable.
This is tested on the 20181207 version, and other versions (e.g., 1.3.1) show similar problems.
Code to reproduce the issue
Filename:
mxnet_cpu_test.py
Detailed experiments:
Run in the main process:
python3 mxnet_cpu_test.py --num-workers=0
Working fine for all mxnet variants (GPU or CPU-only).
Run in two subprocesses
--
mxnet-cu90
on p3.16x: python3 mxnet_cpu_test.py --num-workers=2
It uses only 2 CPUs per subprocess.
--
mxnet-mkl
on p3.16x: python3 mxnet_cpu_test.py --num-workers=2
Same here. It uses only 2 CPUs per subprocess.
--
mxnet-mkl
on CPU-only machine c5.18x: python3 mxnet_cpu_test.py --num-workers=2
Even worse. It uses only 1.5 CPUs per subprocess.
-- However, for vanilla CPU-version
mxnet
on c5.18x: python3 mxnet_cpu_test.py --num-workers=2
It is working better. At least, it uses 5 CPUs per subprocess.
-- Weirdly, still vanilla CPU-version
mxnet
but on GPU machine p3.16x: python3 mxnet_cpu_test.py --num-workers=2
It works worse, i.e., 2 CPUs per subprocess.
This problem seems related to how MXNet manages threads per subprocess. If I do not import mxnet in the main process and instead import mxnet in each subprocess: python3 mxnet_cpu_test.py --num-workers=2 --late-import
Then everything is working fine.
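The original mxnet_cpu_test.py is not reproduced above. A hypothetical reconstruction, based only on the flags described (--num-workers, --late-import), might look like the sketch below; the workload (a repeated large dot product) is an assumption chosen purely to keep the CPU busy so thread usage is visible in `top`:

```python
import argparse
import multiprocessing as mp

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Hypothetical repro for MXNet #13593")
    p.add_argument("--num-workers", type=int, default=0,
                   help="0 runs in the main process; N spawns N subprocesses")
    p.add_argument("--late-import", action="store_true",
                   help="import mxnet inside each subprocess instead of the parent")
    return p.parse_args(argv)

def heavy_compute():
    # Large repeated matrix multiply; an assumed stand-in for the real workload.
    import mxnet as mx  # imported here so --late-import defers it to the child
    a = mx.nd.random.uniform(shape=(2000, 2000))
    for _ in range(20):
        a = mx.nd.dot(a, a.transpose())
        a.wait_to_read()

def main():
    args = parse_args()
    if not args.late_import:
        import mxnet  # noqa: F401  -- import in the parent, as in the failing cases
    if args.num_workers == 0:
        heavy_compute()
    else:
        procs = [mp.Process(target=heavy_compute) for _ in range(args.num_workers)]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()

if __name__ == "__main__":
    main()
```

Watching per-process CPU usage in `top` or `htop` while this runs would show the symptom described above: full usage with --num-workers=0 or --late-import, but only 1-2 CPUs per subprocess otherwise.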