[mvapich-discuss] stack smashing detected

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Dec 13 22:16:53 EST 2010


Thanks for the update here on this issue.

DK

On Mon, 13 Dec 2010, Thiago Ize wrote:

> In case anyone ever comes across this issue, I just wanted to point out
> that the problem was that some of the nodes had a different version of
> libssl, which is what caused the stack smashing.  mvapich had nothing to
> do with it.
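>
> For anyone who wants to double-check the same thing, the commands below
> are one way to compare the OpenSSL libraries that ssh picks up on each
> node (run them locally on every node; the exact library paths and
> versions will of course vary by install):
>
>   $ ldd /usr/bin/ssh | grep -iE 'libssl|libcrypto'
>   $ rpm -q openssl
>
> A node whose libssl/libcrypto differs from the rest is the likely
> culprit.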
>
> Thiago
>
> Thiago Ize wrote:
> > So, this isn't really an issue with the mpi program, since the program
> > never gets executed; it appears to fail before the program is even
> > launched.  A few weeks ago this used to work on all our nodes, but
> > something happened and now about half the nodes have this problem.  My
> > guess is that this is an mpirun_rsh, mpispawn, or ssh issue.  Here's
> > all the information I get:
> >
> >   thiago@node5 $ mpirun_rsh -np 1 node5 mpi_trivial
> >   node5 works
> > Probably because it's all local
> >
> >   thiago@node5 $ mpirun_rsh -np 1 node6 mpi_trivial
> >   *** stack smashing detected ***: /usr/bin/ssh terminated
> >   Error in init phase...wait for cleanup! (0/1 mpispawn connections)
> > When it has to connect to another node, it fails
> >
> >   thiago@node5 $ mpirun_rsh -np 1 FAKENODE mpi_trivial
> >   *** stack smashing detected ***: /usr/bin/ssh terminated
> >   Error in init phase...wait for cleanup! (0/1 mpispawn connections)
> > FAKENODE is not a real node, so clearly the problem is not with the
> > actual mpi_trivial, which was never run.
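> >
> > One further isolation step, which I'm only sketching here rather than
> > pasting output from, is to take mpirun_rsh out of the picture entirely
> > and invoke the same /usr/bin/ssh by hand.  If ssh aborts with the same
> > stack-smashing message on its own, the problem is in ssh or in one of
> > the libraries it loads rather than in anything mvapich does:
> >
> >   node5 $ /usr/bin/ssh node6 /bin/true
> >   node5 $ /usr/bin/ssh -v node6 /bin/true   # verbose, to see how far it gets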
> >
> > Another data point is that unlike mvapich, OpenMPI doesn't appear to
> > have this issue.
> >   thiago@node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node5 mpi_trivial
> >   node5 works
> >   thiago@node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host node6 mpi_trivial
> >   node6 works
> >   thiago@node5 $ /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -host FAKENODE mpi_trivial
> >   ssh: Could not resolve hostname FAKENODE: Name or service not known
> >   <snip>
> >
> > It could be that both OpenMPI and mvapich are correct and mvapich is
> > simply exposing a bug in our system, or mvapich could have some subtle
> > bug in its loader.
> >
> > Thiago
> >
> > Sayantan Sur wrote:
> >> Hi Thiago,
> >>
> >> I just tried it out, and it seems to work:
> >>
> >> head$ ./bin/mpirun_rsh -np 1 node133 ./examples/cpi
> >> Process 0 of 1 is on node133.cluster
> >> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> >> wall clock time = 0.000320
> >>
> >> Are there any specific mpi programs with which you see this more
> >> often?  Also, do you have any error output from when this happens?
> >>
> >> Thanks.
> >>
> >> On Tue, Nov 9, 2010 at 4:40 PM, Thiago Ize <thiago at sci.utah.edu> wrote:
> >>
> >>> I'm the one who found that problem.  I've had this happen with several
> >>> versions: the mvapich that comes with the system and the mvapich2-1.5.1
> >>> that I downloaded and compiled myself.
> >>>
> >>> The 1.5.1 version works if I run locally, probably because it's not using
> >>> ssh?  But if I try to run on a remote node I get the same error.  For
> >>> example:
> >>> node1 $ mpirun_rsh -np 1 node1 mpiProgram -> works
> >>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> fails
> >>>
> >>> Also, if I go on a node where this still works, I can still run on the "bad"
> >>> remote nodes.  For example:
> >>> nodeGood $ mpirun_rsh -np 1 node2 mpiProgram -> works
> >>> node1 $ mpirun_rsh -np 1 node2 mpiProgram -> does not
> >>>
> >>> Thiago
> >>>
> >>> Sayantan Sur wrote:
> >>>
> >>> Hi Nick,
> >>>
> >>> On Tue, Nov 9, 2010 at 2:13 PM, Nick Rathke <nick at sci.utah.edu> wrote:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> We have a small 64-node cluster running RHEL 5.4 and mvapich 1.2.0, and we
> >>> have started getting the error " *** stack smashing detected ***:
> >>> /usr/bin/ssh terminated " on some of our nodes but not others, even though
> >>> all of the nodes are identical.
> >>>
> >>> I have been searching the web for this error but haven't found anything
> >>> that would help me debug it or even tell whether this is an mvapich or an
> >>> ssh error.
> >>>
> >>> Any thoughts would be greatly appreciated.
> >>>
> >>>
> >>>
> >>> Just wondering if you saw this with any older MVAPICH version (say,
> >>> MVAPICH-1.1) or some of the newer MVAPICH2 releases?
> >>>
> >>> If you could try MVAPICH2-1.5.1 and see if this error persists, that
> >>> would be great.
> >>>
> >>> Thanks.
> >>>
> >>>
> >>>
> >>> Nick Rathke
> >>> Scientific Computing and Imaging Institute
> >>> IT Manager and Sr. Systems Administrator
> >>> nick at sci.utah.edu
> >>> www.sci.utah.edu
> >>>
> >
>


