[mvapich-discuss] stack smashing detected

Thiago Ize thiago at sci.utah.edu
Mon Dec 13 18:36:49 EST 2010


In case anyone else ever comes across this issue, I just wanted to point out 
that the problem was that some of the nodes had a different version of 
libssl, which is what caused the stack smashing.  mvapich had nothing to do 
with it.
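
For anyone debugging something similar: a quick way to find the mismatch is 
to compare the openssl package, and the libssl that ssh actually loads, 
across all nodes.  A sketch of such a check (the node names are only 
examples, and the rpm query assumes a RHEL-style system like ours):

  # run from a known-good node; print openssl version and ssh's libssl per node
  for n in node1 node2 node3 node4 node5 node6; do
    echo "== $n =="
    ssh $n 'rpm -q openssl; ldd /usr/bin/ssh | grep libssl'
  done

Any node whose output differs from the rest is a likely culprit.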

Thiago

Thiago Ize wrote:
> So, this isn't really an issue with the mpi program itself, since the program 
> never gets executed; it appears to fail before that point.  A few weeks 
> ago this worked on all our nodes, but something changed and now 
> about half the nodes have this problem.  I'm guessing this is more 
> likely an mpirun_rsh, mpispawn, or ssh issue.  Here's all I get in terms 
> of info:
>
>   thiago at node5 $ mpirun_rsh -np 1 node5 mpi_trivial
>   node5 works
> Probably because it's all local, so ssh isn't involved
>
>   thiago at node5 $ mpirun_rsh -np 1 node6 mpi_trivial
>   *** stack smashing detected ***: /usr/bin/ssh terminated
>   Error in init phase...wait for cleanup! (0/1 mpispawn connections)
> When it has to connect to another node, it fails
>
>   thiago at node5 $ mpirun_rsh -np 1 FAKENODE mpi_trivial
>   *** stack smashing detected ***: /usr/bin/ssh terminated
>   Error in init phase...wait for cleanup! (0/1 mpispawn connections)
> FAKENODE is not a real node, so clearly the problem is not with 
> mpi_trivial itself, which was never run.
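>
> (One way to confirm that ssh itself is what's blowing up, independent of 
> MVAPICH, would be to invoke it by hand from the same node, e.g.:
>
>   thiago at node5 $ ssh node6 hostname
>
> If that command also reports the stack smash, mpirun_rsh is just the 
> messenger.)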
>
> Another data point is that unlike mvapich, OpenMPI doesn't appear to 
> have this issue.
>   thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node5 mpi_trivial
>   node5 works
>   thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node6 mpi_trivial
>   node6 works
>   thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host FAKENODE mpi_trivial
>   ssh: Could not resolve hostname FAKENODE: Name or service not known
>   <snip>
>
> It could be that both OpenMPI and mvapich are correct and that mvapich 
> is merely exposing a bug in our system, or mvapich may have some subtle 
> bug in the loader.
>
> Thiago
>
> Sayantan Sur wrote:
>> Hi Thiago,
>>
>> I just tried it out, and it seems to work:
>>
>> head$ ./bin/mpirun_rsh -np 1 node133 ./examples/cpi
>> Process 0 of 1 is on node133.cluster
>> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
>> wall clock time = 0.000320
>>
>> Are there any specific mpi programs with which you see this more
>> often? Also, do you have any error output from when this happens?
>>
>> Thanks.
>>
>> On Tue, Nov 9, 2010 at 4:40 PM, Thiago Ize <thiago at sci.utah.edu> wrote:
>>   
>>> I'm the one who found that problem.  I've had this happen with several
>>> versions: the mvapich that comes with the system and the mvapich2-1.5.1 that I
>>> downloaded and compiled myself.
>>>
>>> The 1.5.1 version works if I run locally, probably because it's not using
>>> ssh?  But if I try to run on a remote node I get the same error.  For
>>> example:
>>> node1 $ mpirun_rsh -np 1 node1 mpiProgram -> works
>>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> fails
>>>
>>> Also, if I go to a node where this still works, I can still run on the "bad"
>>> remote nodes.  For example:
>>> nodeGood $ mpirun_rsh -np 1 node2 mpiProgram -> works
>>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> does not
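>>>
>>> (Since the failure follows the launching node rather than the target node, 
>>> a useful check would be to compare what ssh links against on a good and a 
>>> bad node, e.g. with something like the loop below, stripping the varying 
>>> load addresses so the outputs are comparable:
>>>
>>>   $ for n in nodeGood node1; do
>>>       echo "== $n =="
>>>       ssh $n 'ldd /usr/bin/ssh' | sed 's/ (0x[0-9a-f]*)//'
>>>     done
>>>
>>> Any difference in the resolved libraries would point at the culprit.)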
>>>
>>> Thiago
>>>
>>> Sayantan Sur wrote:
>>>
>>> Hi Nick,
>>>
>>> On Tue, Nov 9, 2010 at 2:13 PM, Nick Rathke <nick at sci.utah.edu> wrote:
>>>
>>>
>>> Hi,
>>>
>>> We have a small 64-node cluster running RHEL 5.4 and mvapich 1.2.0, and we
>>> have started getting the error "*** stack smashing detected ***:
>>> /usr/bin/ssh terminated" on some of our nodes but not others, even though
>>> all of the nodes are identical.
>>>
>>> I have been searching the web for this error but haven't found anything that
>>> would help me debug it or even tell whether this is an mvapich or an ssh error.
>>>
>>> Any thoughts would be greatly appreciated.
>>>
>>>
>>>
>>> Just wondering if you saw this with any older MVAPICH version (say,
>>> MVAPICH-1.1) or some of the newer MVAPICH2 releases?
>>>
>>> If you could try MVAPICH2-1.5.1 and see if this error persists, that
>>> would be great.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Nick Rathke
>>> Scientific Computing and Imaging Institute
>>> IT Manager and Sr. Systems Administrator
>>> nick at sci.utah.edu
>>> www.sci.utah.edu