-
Notifications
You must be signed in to change notification settings - Fork 520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rabit harden] fix model recovery tests failures #510
Conversation
tracker/dmlc_tracker/local.py
Outdated
@@ -8,38 +8,52 @@ | |||
import logging | |||
from threading import Thread | |||
from . import tracker | |||
import pdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. fixed
tracker/dmlc_tracker/local.py
Outdated
num_retry = env.get('DMLC_NUM_ATTEMPT', 0) | ||
|
||
#overwrite default num of retry with commandline value | ||
for parm in cmd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -11,6 +11,8 @@ | |||
#include <dmlc/logging.h> | |||
#include <string> | |||
#include <mutex> | |||
#include <utility> | |||
#include <memory> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from cpplint failure https://travis-ci.org/dmlc/dmlc-core/jobs/503196406
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this refers to a succeeded job?
Added. |
@CodingCat note that the master branch is diverging from the one used by mxnet (tracked in #507). cc @hcho3 who's looking into it. |
@szha thanks! |
I still recommend you point to this branch in dmlc/rabit#81 for cross-validation, and it will prevent the error result by the different environment in your laptop and rabit |
merged, thanks |
if num_retry >= 0: | ||
# failure trail increase by 1 and restart failed worker | ||
for arg in cmd: | ||
if arg.startswith('rabit_num_trial'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have missed this here, I don't think we should leak more rabit stuff in dmlc-core (hopefully we should rename RabitTracker), essentially they are two different projects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Described on PR item 3, the semantic of rabit_num_trial is to capture the number of worker failures caused by mock. v.s DMLC_NUM_ATTEMPT is setting number of retries in total local tracker can do. The issue really some how up to date rabit_num_trial needs to feed back into rabit init in order to make rabit behavior as expected. Otherwise it will be infinite crash...
/~https://github.com/dmlc/rabit/blob/master/src/allreduce_mock.h#L164
One previous attempt was to reuse DMLC_NUM_ATTEMPT as replacement of rabit_num_trial. The problem is rabit_num_trial starts with 0 and DMLC_NUM_ATTEMPT needs to >= number of crashes for tests to pass. That basically broke tests in rabit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, in this case, why we need to pass a value as rabit_num_trial
?
we should remove it from both of dmlc-core and rabit, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can restarted rabit mock worker know which trail it is running and decide if it will crash or not. There needs something pass via arguments...
/~https://github.com/dmlc/rabit/blob/master/src/allreduce_mock.h#L52
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest add another argument called
DMLC_WORKER_MAX_RETRY along with DMLC_NUM_ATTEMPT and remove rabit_num_trial from core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just DMLC_MAX_RETRY and consume rabit_num_trial from rabit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rabit tests has been failing since early 2018.
According to @tqchen this is largely due to tracker semantic issue. As I have been working on enable checkpoint at rabit could work on GCP. I need to make sure those tests can pass before introducing extra features.
Here is some of my observations to tracker/rabit interaction.
Rabit mock relay on command line to pass on which number of trail it runs on as a factor to simulate worker crash (exit(-2)). When worker actually dead and local tracker subprocess exit. worker were restarted without update rabit_num_trail setting. (this is essentially infinite loop of crash)
Total number of restart a worker in local tracker is controlled by DMLC_NUM_ATTEMPT, by default it was set to zero. So when a rabit unit test kick off like following, total number of retry is still set to zero (no retry) and local tracker throw exception.
Rabit Init has a overwrite code which use value from DMLC_NUM_ATTEMPT to overwrite value from rabit_num_trial. I am not sure why since num_trail is incremental and DMLC_NUM_ATTEMPT if set to greater than 0 will skip most of mock failures.
model_recover_10_10k: ../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=2 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0 rabit_num_trial=0 DMLC_NUM_ATTEMPT=10