
MKLDNN Perplexity Issue #13515

Closed
azai91 opened this issue Dec 3, 2018 · 13 comments

@azai91
Contributor

azai91 commented Dec 3, 2018

The recent upgrade to 0.17.1 (#13369) has addressed the throughput issue. However, the perplexity of the LSTM increases dramatically (https://www.dropbox.com/s/lnp1dc9uvwhfcqh/Screenshot%202018-12-03%2011.22.10.png?dl=0).

Minimum reproducible example

  1. Clone the repo at git@github.com:mseth10/deeplearning-benchmark.git
  2. Run `python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 0 --epochs 30 --kv-store local`
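A minimal way to check whether the regression comes from the MKL-DNN code path is to run the reproducer twice on CPU and compare the final reported validation perplexity. This is a hedged sketch: it assumes the build honors the MXNET_MKLDNN_ENABLED runtime switch, drops `--gpus` so the CPU path is exercised, and the log-parsing pattern is an assumption about the `Validation-perplexity=` lines printed by lstm_bucketing.py.

```python
# Hedged sketch: run the reproducer twice on CPU, with and without MKL-DNN,
# and compare the last validation perplexity reported in the log.
import os
import re
import subprocess
import sys

CMD = [sys.executable, "word_language_model/lstm_bucketing.py",
       "--num-hidden", "650", "--num-embed", "650",
       "--epochs", "30", "--kv-store", "local"]

def final_val_ppl(mkldnn_flag):
    # MXNET_MKLDNN_ENABLED must be set before mxnet is imported, so set it
    # in the child process environment rather than in this process.
    env = dict(os.environ, MXNET_MKLDNN_ENABLED=mkldnn_flag)
    run = subprocess.run(CMD, env=env, capture_output=True, text=True, check=True)
    ppls = re.findall(r"Validation-perplexity=([\d.]+)", run.stdout + run.stderr)
    return float(ppls[-1])

print("MKL-DNN on :", final_val_ppl("1"))
print("MKL-DNN off:", final_val_ppl("0"))
```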
@vrakesh
Contributor

vrakesh commented Dec 3, 2018

@mxnet-label-bot add [MKLDNN,Performance]

@lupesko
Contributor

lupesko commented Dec 4, 2018

This is no longer about performance; it is an issue with model perplexity.
@mxnet-label-bot remove [Performance]

@harshp8l
Contributor

harshp8l commented Dec 4, 2018

@mxnet-label-bot remove [Performance]

@lupesko
Contributor

lupesko commented Dec 4, 2018

Adding @pengzhao-intel - can you guys look into it on the Intel side and discuss with the MKLDNN team?

@TaoLv
Member

TaoLv commented Dec 4, 2018

@azai91 @lupesko Since the word "perplexity" is more about the model, is it possible for you to provide a reproducer and narrow the problem down to an MKL-DNN operator? It seems some operator doesn't generate correct results, but we haven't observed that in MXNet CI.
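A minimal sketch for narrowing this down, assuming the MXNET_MKLDNN_ENABLED switch is available in the build and that a seeded Gluon LSTM forward pass exercises the suspect path (the shapes and layer choice below are illustrative, not taken from the benchmark): run the same forward pass with MKL-DNN enabled and disabled in two subprocesses and compare the outputs.

```python
# Hedged sketch: the same seeded forward pass, run once with MKL-DNN enabled
# and once with it disabled, then diffed. MXNET_MKLDNN_ENABLED has to be set
# before mxnet is imported, hence the subprocesses.
import os
import subprocess
import sys

import numpy as np

SNIPPET = """
import mxnet as mx, numpy as np
mx.random.seed(42); np.random.seed(42)
data = mx.nd.random.uniform(shape=(35, 32, 650))   # (seq_len, batch, input), 'TNC' layout
lstm = mx.gluon.rnn.LSTM(650, num_layers=2)
lstm.initialize(mx.init.Xavier())
np.save("lstm_out_{tag}.npy", lstm(data).asnumpy())
"""

for tag, flag in (("on", "1"), ("off", "0")):
    env = dict(os.environ, MXNET_MKLDNN_ENABLED=flag)
    subprocess.run([sys.executable, "-c", SNIPPET.format(tag=tag)], env=env, check=True)

out_on = np.load("lstm_out_on.npy")
out_off = np.load("lstm_out_off.npy")
print("max abs diff:", np.abs(out_on - out_off).max())
```

If the build also supports MXNET_MKLDNN_DEBUG=1, enabling it is another option for having MKL-DNN operator outputs checked against the reference implementation at runtime.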

@pengzhao-intel
Contributor

pengzhao-intel commented Dec 4, 2018

@azai91 @lupesko
I am trying with 1.4.x and the latest master, and the PPL is smaller than the threshold (203) in my local training.

commit d772a4b
Author: Anirudh Subramanian anirudh2290@apache.org
Date: Mon Dec 3 15:39:53 2018 -0800

2018-12-04 12:27:47,825 Epoch[29] Train-perplexity=141.206610
2018-12-04 12:27:47,825 Epoch[29] Time cost=174.113
2018-12-04 12:27:53,691 Epoch[29] Validation-perplexity=189.776305

commit a29185a
Author: Aaron Markham markhama@amazon.com
Date: Wed Nov 28 17:46:10 2018 -0800

2018-12-04 12:50:37,061 Epoch[29] Train-perplexity=143.439135
2018-12-04 12:50:37,061 Epoch[29] Time cost=200.166
2018-12-04 12:50:42,988 Epoch[29] Validation-perplexity=189.409277
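For context on the numbers above: assuming the script uses MXNet's standard perplexity metric, the reported value is just the exponential of the mean per-token cross-entropy, so a small shift in loss shows up as a large shift in PPL.

```python
import math

def perplexity(total_cross_entropy, num_tokens):
    """Perplexity = exp(average negative log-likelihood per token)."""
    return math.exp(total_cross_entropy / num_tokens)

# A validation perplexity of ~189.8 corresponds to a mean loss of ~5.25 nats.
print(perplexity(5.246 * 1000, 1000))  # ~189.8
```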

@Vikas-kum
Contributor

Vikas-kum commented Dec 4, 2018

@pengzhao-intel
We see the regression happening for the LSTM on PTB data with symbolic training. Did you use the same dataset with symbolic training?
The graph shows that training and validation perplexity increased starting from 11/28, which matches the date of the fix for the LSTM regression issue.
We do see an improvement in speed. There is a huge drop in training time, which I am skeptical about.
[Screenshot: training and validation perplexity graph, 2018-12-04 11:36 AM]

@lupesko
Contributor

lupesko commented Dec 4, 2018

@Vikas89 there are different ways to implement and train an LSTM-based language model on the PTB dataset. Can you please share the training script?

@pengzhao-intel
Contributor

pengzhao-intel commented Dec 5, 2018

@Vikas89 please share how to reproduce the issue.
I can't reproduce it locally. Did you change the example recently? What kind of machine are you using?
Data: deeplearning-benchmark/word_language_model/get_ptb_data.sh
CMD: `python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --epochs 30`
Repo: /~https://github.com/mseth10/deeplearning-benchmark
CI:
commit 15da80293ce953ef796503bfb257e6efb7db8dba
Merge: db0c4b7 7601496
Author: Manu Seth 22492939+mseth10@users.noreply.github.com
Date: Wed Nov 28 19:54:29 2018 -0700
Merge pull request #2 from anirudh2290/fix_num_gpus
Fiix task_config_template.cfg

@Vikas-kum
Contributor

@mseth10 please share the test training script, the steps to run it, and the instance type (I think it's c5.18x, but please confirm) with Patric.

@Vikas-kum
Contributor

@pengzhao-intel Manu has started the training script to verify.

I think there was a bug introduced in our benchmarking setup, which was reporting results after only 5 epochs. We fixed the benchmark setup and the graphs seem to be back to normal. We will observe a few more runs and then close this.
Thanks guys for jumping in. This is all looking really good.
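A hypothetical guard against this kind of harness bug (a sketch, not the actual benchmark code) is to verify that the training log contains the configured number of epochs before a result is published:

```python
import re

def last_reported_epoch(log_text):
    """Highest Epoch[N] index found in an MXNet training log."""
    epochs = [int(n) for n in re.findall(r"Epoch\[(\d+)\]", log_text)]
    return max(epochs) if epochs else -1

def assert_run_complete(log_text, expected_epochs):
    # Epochs are 0-indexed in the log, so a 30-epoch run should end at Epoch[29].
    done = last_reported_epoch(log_text)
    if done != expected_epochs - 1:
        raise RuntimeError(
            "run ended at epoch %d, expected %d" % (done, expected_epochs - 1))
```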

@pengzhao-intel
Contributor

Good to know the problem is fixed :)

Feel free to ping me if you need anything from our side.

@Vikas-kum
Contributor

@azai91 Please close this issue; the benchmark is back to normal.

@azai91 azai91 closed this as completed Dec 7, 2018