Very strange phenomenon: divergence problem when training RPN and RCNN jointly? #10

Open
huangh12 opened this issue Jun 20, 2018 · 0 comments

huangh12 commented Jun 20, 2018

Hello, I have spent weeks struggling with a divergence problem in your code. Simply put, the loss always diverges after some iterations (usually near the end of the first epoch). Please see the following snippet of my training log (training on the whole DOTA dataset):

training on this roidb: /data/dota/DOTA_split/Images/P1054__1__0___19.png
Epoch[0] Batch [18150]  Speed: 4.22 samples/sec Train-RPNAcc=0.968750,  RPNLogLoss=0.061721,    RPNL1Loss=0.159727,    RCNNAcc=0.781250, RCNNLogLoss=0.386619,     RCNNL1Loss=0.023027,

training on this roidb: /data/dota/DOTA_split/Images/P0144__1__0___0.png
Epoch[0] Batch [18151]  Speed: 5.03 samples/sec Train-RPNAcc=0.972656,  RPNLogLoss=0.072231,    RPNL1Loss=0.023806,    RCNNAcc=0.867188, RCNNLogLoss=0.316344,     RCNNL1Loss=0.020055,

training on this roidb: /data/dota/DOTA_split/Images/P0605__1__0___61.png
Epoch[0] Batch [18152]  Speed: 4.65 samples/sec Train-RPNAcc=0.785156,  RPNLogLoss=6.829865,    RPNL1Loss=3.960626,    RCNNAcc=0.960938, RCNNLogLoss=1.082687,     RCNNL1Loss=0.018287,

training on this roidb: /data/dota/DOTA_split/Images/P1994__1__0___0.png
Epoch[0] Batch [18153]  Speed: 4.55 samples/sec Train-RPNAcc=0.738281,  RPNLogLoss=4.315770,    RPNL1Loss=16.474699,   RCNNAcc=0.968750, RCNNLogLoss=1.007381,     RCNNL1Loss=0.105764,

training on this roidb: /data/dota/DOTA_split/Images/P0562__1__0___0.png
Epoch[0] Batch [18154]  Speed: 4.46 samples/sec Train-RPNAcc=0.500000,  RPNLogLoss=15.975729,   RPNL1Loss=4.111494,    RCNNAcc=0.750000, RCNNLogLoss=1.387988,     RCNNL1Loss=0.069026,

training on this roidb: /data/dota/DOTA_split/Images/P0555__1__0___0.png
Epoch[0] Batch [18155]  Speed: 4.75 samples/sec Train-RPNAcc=0.394531,  RPNLogLoss=19.518009,   RPNL1Loss=29.258448,   RCNNAcc=0.945312, RCNNLogLoss=2.378677,     RCNNL1Loss=1.344344,

training on this roidb: /data/dota/DOTA_split/Images/P2257__1__0___57.png
Epoch[0] Batch [18156]  Speed: 4.66 samples/sec Train-RPNAcc=0.890625,  RPNLogLoss=3.525833,    RPNL1Loss=45.849377,   RCNNAcc=0.968750, RCNNLogLoss=0.585869,     RCNNL1Loss=0.166008,

training on this roidb: /data/dota/DOTA_split/Images/P0477__1__0___191.png
Epoch[0] Batch [18157]  Speed: 4.58 samples/sec Train-RPNAcc=0.414062,  RPNLogLoss=18.888393,   RPNL1Loss=173.241898,  RCNNAcc=0.898438, RCNNLogLoss=3.158766,     RCNNL1Loss=0.109512,

You can see that when training on P0605__1__0___61.png, the loss suddenly increases, and after some more iterations it becomes NaN. I inspected the annotations of P0605__1__0___61 but found nothing strange. I tried training only on these problematic-looking images by adding the code block below in DOTA.py:

        # for debug only: restrict training to the problematic-looking images
        print 'debug the input images'
        self.image_set_index = ['P1451__1__1175___2772', 'P1994__1__0___0', 'P0562__1__0___0',
                                'P0555__1__0___0', 'P2257__1__0___57', 'P0477__1__0___191',
                                'P0522__1__0___837', 'P0605__1__0___61']
        # end of debug-only block

However, the loss stays normal when training only on these images.
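
For reference, the kind of sanity check I ran over the annotations looks roughly like the sketch below. It is only a sketch: I am assuming each roidb entry carries a 'boxes' array of [x1, y1, x2, y2] rows plus 'width' and 'height' fields, which may not match this repo exactly. Degenerate or out-of-image boxes are a common cause of huge bbox regression targets, and hence of the RPN L1 loss blowing up, which is why I checked for them first.

    import numpy as np

    def find_suspicious_boxes(roidb, min_size=1.0):
        """Return (image, box indices) pairs whose boxes look degenerate or out of bounds."""
        suspicious = []
        for entry in roidb:
            boxes = np.asarray(entry['boxes'], dtype=np.float64)
            if boxes.size == 0:
                continue
            widths = boxes[:, 2] - boxes[:, 0]
            heights = boxes[:, 3] - boxes[:, 1]
            # near-zero width/height boxes make the regression targets explode
            degenerate = (widths < min_size) | (heights < min_size)
            # boxes reaching outside the image are another common cause of huge L1 targets
            outside = ((boxes[:, 0] < 0) | (boxes[:, 1] < 0) |
                       (boxes[:, 2] > entry['width']) | (boxes[:, 3] > entry['height']))
            flagged = np.where(degenerate | outside)[0]
            if flagged.size > 0:
                suspicious.append((entry['image'], flagged))
        return suspicious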

I also tried training RPN and RCNN separately, and there is no divergence problem in that case.
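
Assuming the training code is MXNet-based (which the log format suggests), I am also experimenting with gradient clipping as a workaround for the joint training, since clip_gradient is a standard option of MXNet optimizers. The snippet below is only a sketch; the learning rate, momentum and weight decay values are placeholders, not the repo's defaults.

    import mxnet as mx

    # clip_gradient is a standard MXNet optimizer option; it clips every gradient
    # element into [-clip_gradient, clip_gradient] before the weight update.
    optimizer_params = {
        'learning_rate': 0.0005,   # placeholder values, not the repo's defaults
        'momentum': 0.9,
        'wd': 0.0005,
        'clip_gradient': 5.0,
    }
    optimizer = mx.optimizer.SGD(**optimizer_params)

I have not yet confirmed whether this actually prevents the blow-up.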

This has troubled me for many days; could you please help me with it?
Many thanks.
