Very strange phenomenon: divergence problem when training RPN and RCNN jointly? #10

Open
huangh12 opened this issue Jun 20, 2018 · 0 comments

huangh12 commented Jun 20, 2018

Hello, I have spent weeks struggling with a divergence problem in your code. Simply put, the loss always diverges after some iterations (usually near the end of the first epoch). Please see the following snippet of my training log (training on the whole DOTA dataset):

training on this roidb: /data/dota/DOTA_split/Images/P1054__1__0___19.png
Epoch[0] Batch [18150]  Speed: 4.22 samples/sec Train-RPNAcc=0.968750,  RPNLogLoss=0.061721,    RPNL1Loss=0.159727,    RCNNAcc=0.781250, RCNNLogLoss=0.386619,     RCNNL1Loss=0.023027,

training on this roidb: /data/dota/DOTA_split/Images/P0144__1__0___0.png
Epoch[0] Batch [18151]  Speed: 5.03 samples/sec Train-RPNAcc=0.972656,  RPNLogLoss=0.072231,    RPNL1Loss=0.023806,    RCNNAcc=0.867188, RCNNLogLoss=0.316344,     RCNNL1Loss=0.020055,

training on this roidb: /data/dota/DOTA_split/Images/P0605__1__0___61.png
Epoch[0] Batch [18152]  Speed: 4.65 samples/sec Train-RPNAcc=0.785156,  RPNLogLoss=6.829865,    RPNL1Loss=3.960626,    RCNNAcc=0.960938, RCNNLogLoss=1.082687,     RCNNL1Loss=0.018287,

training on this roidb: /data/dota/DOTA_split/Images/P1994__1__0___0.png
Epoch[0] Batch [18153]  Speed: 4.55 samples/sec Train-RPNAcc=0.738281,  RPNLogLoss=4.315770,    RPNL1Loss=16.474699,   RCNNAcc=0.968750, RCNNLogLoss=1.007381,     RCNNL1Loss=0.105764,

training on this roidb: /data/dota/DOTA_split/Images/P0562__1__0___0.png
Epoch[0] Batch [18154]  Speed: 4.46 samples/sec Train-RPNAcc=0.500000,  RPNLogLoss=15.975729,   RPNL1Loss=4.111494,    RCNNAcc=0.750000, RCNNLogLoss=1.387988,     RCNNL1Loss=0.069026,

training on this roidb: /data/dota/DOTA_split/Images/P0555__1__0___0.png
Epoch[0] Batch [18155]  Speed: 4.75 samples/sec Train-RPNAcc=0.394531,  RPNLogLoss=19.518009,   RPNL1Loss=29.258448,   RCNNAcc=0.945312, RCNNLogLoss=2.378677,     RCNNL1Loss=1.344344,

training on this roidb: /data/dota/DOTA_split/Images/P2257__1__0___57.png
Epoch[0] Batch [18156]  Speed: 4.66 samples/sec Train-RPNAcc=0.890625,  RPNLogLoss=3.525833,    RPNL1Loss=45.849377,   RCNNAcc=0.968750, RCNNLogLoss=0.585869,     RCNNL1Loss=0.166008,

training on this roidb: /data/dota/DOTA_split/Images/P0477__1__0___191.png
Epoch[0] Batch [18157]  Speed: 4.58 samples/sec Train-RPNAcc=0.414062,  RPNLogLoss=18.888393,   RPNL1Loss=173.241898,  RCNNAcc=0.898438, RCNNLogLoss=3.158766,     RCNNL1Loss=0.109512,

You can see that when training on P0605__1__0___61.png, the loss suddenly increases, and after some more iterations it becomes NaN. I inspected the annotations of P0605__1__0___61 but found nothing strange. I tried training only on these problematic-looking images by adding the code block below in DOTA.py:

        # for debug only: restrict training to the problematic-looking images
        print 'debug the input images'
        self.image_set_index = ['P1451__1__1175___2772', 'P1994__1__0___0', 'P0562__1__0___0',
                                'P0555__1__0___0', 'P2257__1__0___57', 'P0477__1__0___191',
                                'P0522__1__0___837', 'P0605__1__0___61']
        # end of debug-only block

However, the loss stays normal when training only on these images.
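
For reference, the kind of sanity check I ran over the annotations looks roughly like the sketch below. It is only a sketch: I am assuming each roidb entry carries a 'boxes' array of [x1, y1, x2, y2] rows plus 'width' and 'height' fields, which may not match this repo exactly. Degenerate or out-of-image boxes are a common cause of huge bbox regression targets, and hence of the RPN L1 loss blowing up, which is why I checked for them first.

    import numpy as np

    def find_suspicious_boxes(roidb, min_size=1.0):
        """Return (image, box indices) pairs whose boxes look degenerate or out of bounds."""
        suspicious = []
        for entry in roidb:
            boxes = np.asarray(entry['boxes'], dtype=np.float64)
            if boxes.size == 0:
                continue
            widths = boxes[:, 2] - boxes[:, 0]
            heights = boxes[:, 3] - boxes[:, 1]
            # near-zero width/height boxes make the regression targets explode
            degenerate = (widths < min_size) | (heights < min_size)
            # boxes reaching outside the image are another common cause of huge L1 targets
            outside = ((boxes[:, 0] < 0) | (boxes[:, 1] < 0) |
                       (boxes[:, 2] > entry['width']) | (boxes[:, 3] > entry['height']))
            flagged = np.where(degenerate | outside)[0]
            if flagged.size > 0:
                suspicious.append((entry['image'], flagged))
        return suspicious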

I also tried training RPN and RCNN separately, and there is no divergence problem in that case.
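
Assuming the training code is MXNet-based (which the log format suggests), I am also experimenting with gradient clipping as a workaround for the joint training, since clip_gradient is a standard option of MXNet optimizers. The snippet below is only a sketch; the learning rate, momentum and weight decay values are placeholders, not the repo's defaults.

    import mxnet as mx

    # clip_gradient is a standard MXNet optimizer option; it clips every gradient
    # element into [-clip_gradient, clip_gradient] before the weight update.
    optimizer_params = {
        'learning_rate': 0.0005,   # placeholder values, not the repo's defaults
        'momentum': 0.9,
        'wd': 0.0005,
        'clip_gradient': 5.0,
    }
    optimizer = mx.optimizer.SGD(**optimizer_params)

I have not yet confirmed whether this actually prevents the blow-up.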

This has troubled me for many days; could you please help me with it?
Many thanks.
