Hello, I have spent weeks struggling with a divergence problem in your code. Simply put, the loss always diverges after some number of iterations (usually near the end of the first epoch). Please see the following snippet of my training log (training on the whole DOTA dataset):
training on this roidb: /data/dota/DOTA_split/Images/P1054__1__0___19.png
Epoch[0] Batch [18150] Speed: 4.22 samples/sec Train-RPNAcc=0.968750, RPNLogLoss=0.061721, RPNL1Loss=0.159727, RCNNAcc=0.781250, RCNNLogLoss=0.386619, RCNNL1Loss=0.023027,
training on this roidb: /data/dota/DOTA_split/Images/P0144__1__0___0.png
Epoch[0] Batch [18151] Speed: 5.03 samples/sec Train-RPNAcc=0.972656, RPNLogLoss=0.072231, RPNL1Loss=0.023806, RCNNAcc=0.867188, RCNNLogLoss=0.316344, RCNNL1Loss=0.020055,
training on this roidb: /data/dota/DOTA_split/Images/P0605__1__0___61.png
Epoch[0] Batch [18152] Speed: 4.65 samples/sec Train-RPNAcc=0.785156, RPNLogLoss=6.829865, RPNL1Loss=3.960626, RCNNAcc=0.960938, RCNNLogLoss=1.082687, RCNNL1Loss=0.018287,
training on this roidb: /data/dota/DOTA_split/Images/P1994__1__0___0.png
Epoch[0] Batch [18153] Speed: 4.55 samples/sec Train-RPNAcc=0.738281, RPNLogLoss=4.315770, RPNL1Loss=16.474699, RCNNAcc=0.968750, RCNNLogLoss=1.007381, RCNNL1Loss=0.105764,
training on this roidb: /data/dota/DOTA_split/Images/P0562__1__0___0.png
Epoch[0] Batch [18154] Speed: 4.46 samples/sec Train-RPNAcc=0.500000, RPNLogLoss=15.975729, RPNL1Loss=4.111494, RCNNAcc=0.750000, RCNNLogLoss=1.387988, RCNNL1Loss=0.069026,
training on this roidb: /data/dota/DOTA_split/Images/P0555__1__0___0.png
Epoch[0] Batch [18155] Speed: 4.75 samples/sec Train-RPNAcc=0.394531, RPNLogLoss=19.518009, RPNL1Loss=29.258448, RCNNAcc=0.945312, RCNNLogLoss=2.378677, RCNNL1Loss=1.344344,
training on this roidb: /data/dota/DOTA_split/Images/P2257__1__0___57.png
Epoch[0] Batch [18156] Speed: 4.66 samples/sec Train-RPNAcc=0.890625, RPNLogLoss=3.525833, RPNL1Loss=45.849377, RCNNAcc=0.968750, RCNNLogLoss=0.585869, RCNNL1Loss=0.166008,
training on this roidb: /data/dota/DOTA_split/Images/P0477__1__0___191.png
Epoch[0] Batch [18157] Speed: 4.58 samples/sec Train-RPNAcc=0.414062, RPNLogLoss=18.888393, RPNL1Loss=173.241898, RCNNAcc=0.898438, RCNNLogLoss=3.158766, RCNNL1Loss=0.109512,
You can see that when training on P0605__1__0___61.png the loss suddenly increases, and after a few more iterations it becomes NaN. I inspected the annotations of P0605__1__0___61 but found nothing strange. I then tried training only on these problematic-looking images by adding the block below in DOTA.py:
# for debug only: restrict training to the images around the divergence point
print 'debug the input images'
self.image_set_index = ['P1451__1__1175___2772', 'P1994__1__0___0', 'P0562__1__0___0',
                        'P0555__1__0___0', 'P2257__1__0___57', 'P0477__1__0___191',
                        'P0522__1__0___837', 'P0605__1__0___61']
# end for debug only
However, when training only on these images, the loss stays normal.
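For reference, a scripted check for degenerate polygons in the split label files (zero-area or far out-of-image boxes, which can blow up the L1 loss) would look roughly like the sketch below; the labelTxt directory and the 1024x1024 tile size are assumptions about my split layout, not taken from the repo.

# Hedged sketch, not from the repo: scan DOTA-style label files for degenerate
# polygons. Each object line is assumed to be
# "x1 y1 x2 y2 x3 y3 x4 y4 category difficult"; header lines such as
# 'imagesource:' / 'gsd:' are skipped. LABEL_DIR and the 1024x1024 tile size
# are assumptions about my split layout.
import glob
import os

LABEL_DIR = '/data/dota/DOTA_split/labelTxt'  # assumed location of the split labels

def check_label_file(path, img_w=1024, img_h=1024):
    suspicious = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            parts = line.strip().split()
            if len(parts) < 9:
                continue  # skip 'imagesource'/'gsd' header lines
            try:
                coords = [float(v) for v in parts[:8]]
            except ValueError:
                continue
            xs, ys = coords[0::2], coords[1::2]
            w, h = max(xs) - min(xs), max(ys) - min(ys)
            if w <= 1 or h <= 1:
                suspicious.append((lineno, 'near-zero box'))
            if min(xs) < -10 or min(ys) < -10 or max(xs) > img_w + 10 or max(ys) > img_h + 10:
                suspicious.append((lineno, 'outside image'))
    return suspicious

for label_path in sorted(glob.glob(os.path.join(LABEL_DIR, '*.txt'))):
    issues = check_label_file(label_path)
    if issues:
        print '%s: %s' % (os.path.basename(label_path), issues)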
I also tried training RPN and RCNN separately, and there is no divergence problem in that setting; only the end-to-end joint training diverges.
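Since only the joint training diverges, one mitigation I am aware of is gradient clipping, so that a single bad batch cannot produce an arbitrarily large update. Below is a minimal sketch assuming MXNet's standard optimizer API; the concrete values, and whether the training script already sets clip_gradient somewhere, are assumptions on my side.

# Hedged sketch, not from the training script: build an SGD optimizer with
# clip_gradient set so each gradient element is capped to [-5, 5].
# All values here are placeholders.
import mxnet as mx

optimizer = mx.optimizer.SGD(
    learning_rate=0.001,
    momentum=0.9,
    wd=0.0005,
    clip_gradient=5.0,  # element-wise cap on gradients; None disables clipping
)
# The optimizer would then be handed to the existing Module, e.g.
#   mod.init_optimizer(optimizer=optimizer)
# or the same keys passed via optimizer_params= in mod.fit(...).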
This has troubled me for many days; could you please help me with it?
Many thanks.