[mvapich-discuss] stack smashing detected
Thiago Ize
thiago at sci.utah.edu
Tue Nov 9 17:25:35 EST 2010
So, this isn't really an issue with the MPI program, since the program
never gets executed; the failure happens before that point. A few weeks
ago this worked on all our nodes, but something changed and now about
half the nodes have this problem. I'm guessing this is more likely an
mpirun_rsh, mpispawn, or ssh issue. Here's all I get in terms of info:
thiago at node5 $ mpirun_rsh -np 1 node5 mpi_trivial
node5 works
Probably because it's all local
thiago at node5 $ mpirun_rsh -np 1 node6 mpi_trivial
*** stack smashing detected ***: /usr/bin/ssh terminated
Error in init phase...wait for cleanup! (0/1 mpispawn connections)
When it has to connect to another node, it fails.
thiago at node5 $ mpirun_rsh -np 1 FAKENODE mpi_trivial
*** stack smashing detected ***: /usr/bin/ssh terminated
Error in init phase...wait for cleanup! (0/1 mpispawn connections)
FAKENODE is not a real node, so clearly the problem is not with the
actual mpi_trivial, which was never run.
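To separate ssh from mpirun_rsh, one could run ssh by hand from the
failing node and look at how it exits. Below is a minimal sketch, not
something taken from the runs above: classify_status is a hypothetical
helper, and it assumes the usual shell convention that a process killed
by signal N is reported as status 128+N (glibc's stack protector normally
aborts the process, which would show up as 134), while ssh itself uses
255 for its own errors such as an unresolvable hostname.

```shell
# Hypothetical helper: turn an exit status from "ssh <node> true"
# into a short label. Assumes the 128+N convention for signals.
classify_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -eq 255 ]; then
        # ssh reserves 255 for its own failures (must be tested before the >128 case)
        echo "ssh-error"
    elif [ "$status" -gt 128 ]; then
        # local ssh process was killed by a signal, e.g. 6 = SIGABRT
        echo "killed-by-signal-$((status - 128))"
    else
        # ssh worked; the remote command returned a non-zero status
        echo "remote-exit-$status"
    fi
}

# Possible usage on a failing node (node name taken from the runs above):
#   ssh node6 true; classify_status $?
```

If plain ssh also reports a signal here, the MPI launcher is probably only
the messenger; if plain ssh is fine, the way the launcher invokes ssh
becomes the more interesting suspect.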
Another data point is that unlike mvapich, OpenMPI doesn't appear to
have this issue.
thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node5 mpi_trivial
node5 works
thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node6 mpi_trivial
node6 works
thiago at node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host FAKENODE mpi_trivial
ssh: Could not resolve hostname FAKENODE: Name or service not known
<snip>
It could be that both OpenMPI and mvapich are correct and mvapich is
merely exposing a bug in our system, or it could be that mvapich has
some subtle bug in the loader.
Thiago
Sayantan Sur wrote:
> Hi Thiago,
>
> I just tried it out, and it seems to work:
>
> head$ ./bin/mpirun_rsh -np 1 node133 ./examples/cpi
> Process 0 of 1 is on node133.cluster
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000320
>
> Are there any specific mpi programs with which you see this more
> often? Also, do you have any error output of when this happens?
>
> Thanks.
>
> On Tue, Nov 9, 2010 at 4:40 PM, Thiago Ize <thiago at sci.utah.edu> wrote:
>
>> I'm the one who found that problem. I've had this happen with several
>> versions: the mvapich that comes with the system and the mvapich2-1.5.1
>> that I downloaded and compiled myself.
>>
>> The 1.5.1 version works if I run locally, probably because it's not using
>> ssh? But if I try to run on a remote node I get the same error. For
>> example:
>> node1 $ mpirun_rsh -np 1 node1 mpiProgram -> works
>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> fails
>>
>> Also, if I go on a node where this still works, I can still run on the "bad"
>> remote nodes. For example:
>> nodeGood $ mpirun_rsh -np 1 node2 mpiProgram -> works
>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> does not
>>
>> Thiago
>>
>> Sayantan Sur wrote:
>>
>> Hi Nick,
>>
>> On Tue, Nov 9, 2010 at 2:13 PM, Nick Rathke <nick at sci.utah.edu> wrote:
>>
>>
>> Hi,
>>
>> We have a small 64-node cluster running RHEL 5.4 and mvapich 1.2.0, and we
>> have started getting the error " *** stack smashing detected ***:
>> /usr/bin/ssh terminated " on some of our nodes but not others, even though
>> all of the nodes are identical.
>>
>> I have been searching the web for this error but haven't found anything that
>> would help me debug this or even tell whether this is an mvapich or an ssh error.
>>
>> Any thoughts would be greatly appreciated.
>>
>>
>>
>> Just wondering if you saw this with any older MVAPICH version (say,
>> MVAPICH-1.1) or some of the newer MVAPICH2 releases?
>>
>> If you could try MVAPICH2-1.5.1 and see if this error persists, it
>> would be great.
>>
>> Thanks.
>>
>>
>>
>> Nick Rathke
>> Scientific Computing and Imaging Institute
>> IT Manager and Sr. Systems Administrator
>> nick at sci.utah.edu
>> www.sci.utah.edu
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss