This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

SoftmaxOutput crashes with normalization "valid" #14301

Closed
ashokei opened this issue Mar 2, 2019 · 7 comments · Fixed by #14302
@ashokei
Contributor

ashokei commented Mar 2, 2019

Description

Environment info (Required)

Ubuntu 16.04, default build. Run the script below:

import numpy as np
import mxnet as mx
xpu = mx.cpu()
x = mx.sym.Variable('x')
label = mx.sym.Variable('label')
x_nd = mx.nd.array([[1, 6, 4, 2],[1, 6, 4, 2]], ctx=xpu)
grad_x = mx.nd.zeros((2,4), ctx=xpu)
label_nd = mx.nd.array([1,1], ctx=xpu)

sym = mx.sym.SoftmaxOutput(data=x, label=label, ignore_label=0,
                           use_ignore=True, normalization="valid")
ex = sym.bind(ctx=xpu, args={'x': x_nd, 'label': label_nd},
              args_grad={'x': grad_x})

ex.forward(is_train=True)
softmax_out = ex.outputs[0].asnumpy()
ex.backward(is_train=True)

MXNet commit hash:
fb4f9d5

Build config:
make

Error Message:

terminated by signal SIGSEGV (Address boundary error)
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@wkcn
Member

wkcn commented Mar 2, 2019

Thanks for your report!

I reproduced the bug on MXNet fb4f9d5.

It is strange that the address of ctx.requested[softmaxout_enum::kTempSpace] is 0 in src/operator/softmax_output-inl.h: ctx.requested.size() is 0 in Backward.

The bug is that BackwardResource is not called when ex.backward(is_train=True) is invoked.
I do not know why SoftmaxOutput is not a legacy operator.

@wkcn wkcn added the Bug label Mar 2, 2019
@DickJC123
Contributor

I too recently saw an issue with Softmax that generated a segfault. This behavior began with the Softmax operator changes introduced by #13699 and occurs when the framework is compiled with USE_MKLDNN=0. The failing test is with sockeye:

test/integration/test_constraints_int.py::test_constraints[--encoder rnn --decoder rnn --num-layers 1 --rnn-cell-type lstm --rnn-num-hidden 8 --num-embed 4  --rnn-attention-type mlp --rnn-attention-num-hidden 8 --loss cross-entropy --optimized-metric perplexity --max-updates 2 --checkpoint-frequency 2 --optimizer adam --initial-learning-rate 0.01 --batch-type sentence  --decode-and-evaluate 0-2-10] ./test.sh: line 3:    62 Segmentation fault      python setup.py test
++ RV=139

Perhaps you could verify that your fix corrects this behavior?

@wkcn
Member

wkcn commented Mar 4, 2019

@DickJC123
Hi! I tested Sockeye on my laptop against MXNet master, built with make -j 5 USE_OPENCV=1 USE_BLAS=openblas USE_MKLDNN=0 USE_CPP_PACKAGE=1.

All tests pass except test/unit/test_inference.py::test_topk_func:

test/unit/test_inference.py::test_topk_func[1-5-200] FAILED              [ 60%]
test/unit/test_inference.py::test_topk_func[5-5-200] FAILED              [ 60%]
test/unit/test_inference.py::test_topk_func[1-1-200] PASSED              [ 60%]
test/unit/test_inference.py::test_topk_func[5-1-200] PASSED              [ 60%]
test/unit/test_inference.py::test_topk_func[10-10-100] FAILED            [ 60%]

There is no SoftmaxOutput with normalization="valid" in Sockeye, so it could not have triggered the bug in this issue.

@anirudhacharya
Member

I can also confirm this issue; it happens only with normalization="valid", and only during the Executor.backward call. For instance, this sample code works fine:

import mxnet as mx
import numpy as np

xpu = mx.cpu()
x_nd = mx.nd.array([[1, 6, 4, 2],[1, 6, 4, 2]], ctx=xpu)
grad_x = mx.nd.zeros((2,4), ctx=xpu)
label_nd = mx.nd.array([1,1], ctx=xpu)

x_nd.attach_grad()

with mx.autograd.record():
    y = mx.nd.SoftmaxOutput(data=x_nd, label=label_nd, ignore_label=0, use_ignore=True) #, normalization="valid")

y.backward()
print(x_nd.grad)

So the bug is in the gradient calculation of SoftmaxOutput when normalization="valid".
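For reference, the gradient that the crashing backward pass is expected to produce can be computed by hand. The sketch below is plain NumPy following the documented semantics (`softmax_output_grad` is a hypothetical helper, not part of MXNet): rows whose label equals ignore_label get a zero gradient, and normalization="valid" divides by the count of non-ignored labels rather than the batch size:

```python
import numpy as np

def softmax_output_grad(x, label, ignore_label=0, normalization="valid"):
    """Reference gradient of SoftmaxOutput w.r.t. x (documented semantics)."""
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    # Cross-entropy gradient: softmax minus one-hot label.
    grad = p.copy()
    grad[np.arange(len(label)), label.astype(int)] -= 1.0
    # Rows whose label is ignored contribute nothing.
    valid = label != ignore_label
    grad[~valid] = 0.0
    if normalization == "valid":
        grad /= max(valid.sum(), 1)   # divide by count of non-ignored labels
    elif normalization == "batch":
        grad /= len(label)            # divide by batch size instead
    return grad

x = np.array([[1., 6., 4., 2.], [1., 6., 4., 2.]])
label = np.array([1, 1])
print(softmax_output_grad(x, label))
```

With the labels [1, 1] from the reproduction script, both rows are valid, so the per-sample gradients are divided by 2.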

@fhieber
Contributor

fhieber commented Mar 4, 2019

@wkcn Sockeye can use 'valid' normalization in its use of the SoftmaxOutput operator; see here.
The failure you are observing in test/unit/test_inference.py::test_topk_func is related to #13862, which is still an open problem.

@wkcn
Member

wkcn commented Mar 4, 2019

@fhieber Sorry, I overlooked that.
@anirudhacharya #14302 will address the problem.
