
MKLDNN Perplexity Issue #13515

Closed
azai91 opened this issue Dec 3, 2018 · 13 comments

@azai91
Contributor

azai91 commented Dec 3, 2018

The recent upgrade to 0.17.1 (#13369) has addressed the throughput issue. However, the perplexity of the LSTM increases dramatically (https://www.dropbox.com/s/lnp1dc9uvwhfcqh/Screenshot%202018-12-03%2011.22.10.png?dl=0).

Minimum reproducible example

  1. Clone the repo at git@github.com:mseth10/deeplearning-benchmark.git
  2. Run `python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 0 --epochs 30 --kv-store local`
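A minimal way to check whether the regression comes from the MKL-DNN code path is to run the reproducer twice on CPU and compare the final reported validation perplexity. This is a hedged sketch: it assumes the build honors the MXNET_MKLDNN_ENABLED runtime switch, drops `--gpus` so the CPU path is exercised, and the log-parsing pattern is an assumption about the `Validation-perplexity=` lines printed by lstm_bucketing.py.

```python
# Hedged sketch: run the reproducer twice on CPU, with and without MKL-DNN,
# and compare the last validation perplexity reported in the log.
import os
import re
import subprocess
import sys

CMD = [sys.executable, "word_language_model/lstm_bucketing.py",
       "--num-hidden", "650", "--num-embed", "650",
       "--epochs", "30", "--kv-store", "local"]

def final_val_ppl(mkldnn_flag):
    # MXNET_MKLDNN_ENABLED must be set before mxnet is imported, so set it
    # in the child process environment rather than in this process.
    env = dict(os.environ, MXNET_MKLDNN_ENABLED=mkldnn_flag)
    run = subprocess.run(CMD, env=env, capture_output=True, text=True, check=True)
    ppls = re.findall(r"Validation-perplexity=([\d.]+)", run.stdout + run.stderr)
    return float(ppls[-1])

print("MKL-DNN on :", final_val_ppl("1"))
print("MKL-DNN off:", final_val_ppl("0"))
```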
@vrakesh
Contributor

vrakesh commented Dec 3, 2018

@mxnet-label-bot add [MKLDNN,Performance]

@lupesko
Contributor

lupesko commented Dec 4, 2018

This is no longer about performance; it is an issue with model perplexity.
@mxnet-label-bot remove [Performance]

@harshp8l
Contributor

harshp8l commented Dec 4, 2018

@mxnet-label-bot remove [Performance]

@lupesko
Contributor

lupesko commented Dec 4, 2018

Adding @pengzhao-intel - can you guys look into it on the Intel side and discuss with the MKLDNN team?

@TaoLv
Member

TaoLv commented Dec 4, 2018

@azai91 @lupesko Since the word "perplexity" is more about the model, is it possible for you to provide a reproducer and narrow the problem down to an MKL-DNN operator? It seems some operator doesn't generate correct results, but we haven't observed that in MXNet CI.
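A minimal sketch for narrowing this down, assuming the MXNET_MKLDNN_ENABLED switch is available in the build and that a seeded Gluon LSTM forward pass exercises the suspect path (the shapes and layer choice below are illustrative, not taken from the benchmark): run the same forward pass with MKL-DNN enabled and disabled in two subprocesses and compare the outputs.

```python
# Hedged sketch: the same seeded forward pass, run once with MKL-DNN enabled
# and once with it disabled, then diffed. MXNET_MKLDNN_ENABLED has to be set
# before mxnet is imported, hence the subprocesses.
import os
import subprocess
import sys

import numpy as np

SNIPPET = """
import mxnet as mx, numpy as np
mx.random.seed(42); np.random.seed(42)
data = mx.nd.random.uniform(shape=(35, 32, 650))   # (seq_len, batch, input), 'TNC' layout
lstm = mx.gluon.rnn.LSTM(650, num_layers=2)
lstm.initialize(mx.init.Xavier())
np.save("lstm_out_{tag}.npy", lstm(data).asnumpy())
"""

for tag, flag in (("on", "1"), ("off", "0")):
    env = dict(os.environ, MXNET_MKLDNN_ENABLED=flag)
    subprocess.run([sys.executable, "-c", SNIPPET.format(tag=tag)], env=env, check=True)

out_on = np.load("lstm_out_on.npy")
out_off = np.load("lstm_out_off.npy")
print("max abs diff:", np.abs(out_on - out_off).max())
```

If the build also supports MXNET_MKLDNN_DEBUG=1, enabling it is another option for having MKL-DNN operator outputs checked against the reference implementation at runtime.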

@pengzhao-intel
Contributor

pengzhao-intel commented Dec 4, 2018

@azai91 @lupesko
I am trying with 1.4.x and the latest master, and the PPL is smaller than the threshold (203) in my local training.

commit d772a4b
Author: Anirudh Subramanian anirudh2290@apache.org
Date: Mon Dec 3 15:39:53 2018 -0800

2018-12-04 12:27:47,825 Epoch[29] Train-perplexity=141.206610
2018-12-04 12:27:47,825 Epoch[29] Time cost=174.113
2018-12-04 12:27:53,691 Epoch[29] Validation-perplexity=189.776305

commit a29185a
Author: Aaron Markham markhama@amazon.com
Date: Wed Nov 28 17:46:10 2018 -0800

2018-12-04 12:50:37,061 Epoch[29] Train-perplexity=143.439135
2018-12-04 12:50:37,061 Epoch[29] Time cost=200.166
2018-12-04 12:50:42,988 Epoch[29] Validation-perplexity=189.409277
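For context on the numbers above: assuming the script uses MXNet's standard perplexity metric, the reported value is just the exponential of the mean per-token cross-entropy, so a small shift in loss shows up as a large shift in PPL.

```python
import math

def perplexity(total_cross_entropy, num_tokens):
    """Perplexity = exp(average negative log-likelihood per token)."""
    return math.exp(total_cross_entropy / num_tokens)

# A validation perplexity of ~189.8 corresponds to a mean loss of ~5.25 nats.
print(perplexity(5.246 * 1000, 1000))  # ~189.8
```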

@Vikas-kum
Contributor

Vikas-kum commented Dec 4, 2018

@pengzhao-intel
We see the regression happening for the LSTM on PTB data with symbolic training. Did you use the same dataset with symbolic training?
The graph shows that training and validation perplexity increased starting from 11/28, which matches the date of the fix for the LSTM regression issue.
We do see an improvement in speed. There is a huge drop in training time, which I am skeptical about.
[Screenshot: training and validation perplexity graph, 2018-12-04 11:36 AM]

@lupesko
Contributor

lupesko commented Dec 4, 2018

@Vikas89 there are different ways to implement and train an LSTM-based language model on the PTB dataset. Can you please share the training script?

@pengzhao-intel
Contributor

pengzhao-intel commented Dec 5, 2018

@Vikas89 please share how to reproduce the issue.
I can't reproduce it locally. Did you change the example recently? What kind of machine are you using?
Data: deeplearning-benchmark/word_language_model/get_ptb_data.sh
CMD: `python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --epochs 30`
Repo: /~https://github.com/mseth10/deeplearning-benchmark
CI:
commit 15da80293ce953ef796503bfb257e6efb7db8dba
Merge: db0c4b7 7601496
Author: Manu Seth 22492939+mseth10@users.noreply.github.com
Date: Wed Nov 28 19:54:29 2018 -0700
Merge pull request #2 from anirudh2290/fix_num_gpus
Fiix task_config_template.cfg

@Vikas-kum
Contributor

@mseth10 please share the test training script, the steps to run it, and the instance type (I think it's c5.18x, but please confirm) with Patric.

@Vikas-kum
Contributor

@pengzhao-intel Manu has started the training script to verify.

I think there was a bug introduced in our benchmarking setup, which was reporting results after only 5 epochs. We fixed the benchmark setup and the graphs seem to be back to normal. We will observe a few more runs and then close this.
Thanks guys for jumping in. This is all looking really good.
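A hypothetical guard against this kind of harness bug (a sketch, not the actual benchmark code) is to verify that the training log contains the configured number of epochs before a result is published:

```python
import re

def last_reported_epoch(log_text):
    """Highest Epoch[N] index found in an MXNet training log."""
    epochs = [int(n) for n in re.findall(r"Epoch\[(\d+)\]", log_text)]
    return max(epochs) if epochs else -1

def assert_run_complete(log_text, expected_epochs):
    # Epochs are 0-indexed in the log, so a 30-epoch run should end at Epoch[29].
    done = last_reported_epoch(log_text)
    if done != expected_epochs - 1:
        raise RuntimeError(
            "run ended at epoch %d, expected %d" % (done, expected_epochs - 1))
```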

@pengzhao-intel
Contributor

Good to know the problem is fixed :)

Feel free to ping me if you need anything from our side.

@Vikas-kum
Contributor

@azai91 Please close this issue; the benchmark is back to normal.

@azai91 azai91 closed this as completed Dec 7, 2018