[mvapich-discuss] stack smashing detected

Thiago Ize thiago at sci.utah.edu
Tue Nov 9 17:25:35 EST 2010


So, this isn't really an issue with the MPI program, since the program 
never gets executed; the failure happens before that point.  A few weeks 
ago this worked on all our nodes, but something changed and now 
about half the nodes have this problem.  I'm guessing this is more likely 
an mpirun_rsh, mpispawn, or ssh issue.  Here's all I get in terms of info:

  thiago at node5 $ mpirun_rsh -np 1 node5 mpi_trivial
  node5 works
This probably works because everything runs locally.

  thiago at node5 $ mpirun_rsh -np 1 node6 mpi_trivial
  *** stack smashing detected ***: /usr/bin/ssh terminated
  Error in init phase...wait for cleanup! (0/1 mpispawn connections)
When it has to connect to another node, it fails.

  thiago at node5 $ mpirun_rsh -np 1 FAKENODE mpi_trivial
  *** stack smashing detected ***: /usr/bin/ssh terminated
  Error in init phase...wait for cleanup! (0/1 mpispawn connections)
FAKENODE is not a real node, so clearly the problem is not with the 
actual mpi_trivial, which was never run.
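
One way to separate ssh from mpirun_rsh would be to run plain ssh from the same node and look at its exit status. The following is only a sketch, not something taken from the runs above: classify_status and probe_host are hypothetical helper names, and it assumes a POSIX shell and an OpenSSH client (BatchMode and ConnectTimeout are OpenSSH options). A stack-protector abort normally raises SIGABRT, which a shell reports as 128 + 6 = 134, while ssh itself exits with 255 for its own errors.

```shell
#!/bin/sh
# classify_status CODE (hypothetical helper): interpret the exit status of
# "ssh HOST true" as seen by the calling shell.
classify_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "ok"                        # remote command ran and succeeded
    elif [ "$code" -eq 255 ]; then
        echo "ssh-error"                 # ssh's own error, e.g. unresolved host
    elif [ "$code" -gt 128 ]; then
        echo "signal $((code - 128))"    # local ssh killed by a signal (6 = SIGABRT)
    else
        echo "remote-exit $code"         # remote command returned non-zero
    fi
}

# probe_host HOST (hypothetical helper): run plain ssh with no MPI involved.
# BatchMode avoids password prompts; ConnectTimeout bounds the wait.
probe_host() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
    status=$?
    echo "$1: $(classify_status "$status")"
}

# Example use from the failing node (host names as in this message):
#   for h in node5 node6 FAKENODE; do probe_host "$h"; done
```

If probe_host reports a signal for node6 or FAKENODE, ssh is crashing without any mvapich involvement. If it reports ok or ssh-error instead, the way mpirun_rsh invokes ssh (its arguments or environment) would be the next thing to compare.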

Another data point is that unlike mvapich, OpenMPI doesn't appear to 
have this issue.
  thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node5 mpi_trivial
  node5 works
  thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node6 mpi_trivial
  node6 works
  thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host FAKENODE mpi_trivial
  ssh: Could not resolve hostname FAKENODE: Name or service not known
  <snip>

It could be that both OpenMPI and mvapich are correct and mvapich is 
merely exposing a bug in our system, or it could be that mvapich has 
some subtle bug in the loader.
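
To check the first possibility, the ssh client on a working node could be compared with the one on a failing node, since the nodes are supposed to be identical. A rough sketch, assuming a Linux node with md5sum and ldd available; ssh_fingerprint is a hypothetical helper name and the output file names in the comments are placeholders:

```shell
#!/bin/sh
# ssh_fingerprint [BINARY] (hypothetical helper): print a checksum of the ssh
# binary plus the shared libraries it resolves to on this node.
ssh_fingerprint() {
    bin=${1:-/usr/bin/ssh}
    md5sum "$bin"
    # Strip the load addresses so that address randomization does not show
    # up as a difference between nodes.
    ldd "$bin" 2>/dev/null | sed 's/ (0x[0-9a-f]*)$//' | sort
}

# Example use: run once on a good node and once on a bad node, then diff:
#   ssh_fingerprint > /tmp/ssh-fingerprint.$(hostname).txt
#   diff /tmp/ssh-fingerprint.nodeGood.txt /tmp/ssh-fingerprint.node5.txt
```

An empty diff would suggest the binary and its libraries match, so the difference between nodes lies elsewhere, for example in configuration or environment.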

Thiago

Sayantan Sur wrote:
> Hi Thiago,
>
> I just tried it out, and it seems to work:
>
> head$ ./bin/mpirun_rsh -np 1 node133 ./examples/cpi
> Process 0 of 1 is on node133.cluster
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000320
>
> Are there any specific mpi programs with which you see this more
> often? Also, do you have any error output of when this happens?
>
> Thanks.
>
> On Tue, Nov 9, 2010 at 4:40 PM, Thiago Ize <thiago at sci.utah.edu> wrote:
>   
>> I'm the one who found that problem.  I've had this happen with several
>> versions: the mvapich that comes with the system and the mvapich2-1.5.1 that I
>> downloaded and compiled myself.
>>
>> The 1.5.1 version works if I run locally, probably because it's not using
>> ssh?  But if I try to run on a remote node I get the same error.  For
>> example:
>> node1 $ mpirun_rsh -np 1 node1 mpiProgram -> works
>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> fails
>>
>> Also, if I go to a node where this still works, I can still run on the "bad"
>> remote nodes.  For example:
>> nodeGood $ mpirun_rsh -np 1 node2 mpiProgram -> works
>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> does not
>>
>> Thiago
>>
>> Sayantan Sur wrote:
>>
>> Hi Nick,
>>
>> On Tue, Nov 9, 2010 at 2:13 PM, Nick Rathke <nick at sci.utah.edu> wrote:
>>
>>
>> Hi,
>>
>> We have a small 64-node cluster running RHEL 5.4 and mvapich 1.2.0, and we
>> have started getting the error "*** stack smashing detected ***:
>> /usr/bin/ssh terminated" on some of our nodes but not others, even though
>> all of the nodes are identical.
>>
>> I have been searching the web for this error but haven't found anything that
>> would help me debug this or even tell whether this is an mvapich or ssh error.
>>
>> Any thoughts would be greatly appreciated.
>>
>>
>>
>> Just wondering if you saw this with any older MVAPICH version (say,
>> MVAPICH-1.1) or some of the newer MVAPICH2 releases?
>>
>> If you could try MVAPICH2-1.5.1 and see if this error persists, that
>> would be great.
>>
>> Thanks.
>>
>>
>>
>> Nick Rathke
>> Scientific Computing and Imaging Institute
>> IT Manager and Sr. Systems Administrator
>> nick at sci.utah.edu
>> www.sci.utah.edu