[mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11

Scott Shaw sshaw at sgi.com
Mon Jan 14 13:18:48 EST 2008


DK, I submitted a user account request to our support team and should
have an account created later this afternoon.  We have two ICE clusters
available from the internet and I am not sure which one will be used, so
I will provide a hostname in a bit.

The user account requested: 

         Userid: osu_support
    Temp Passwd: sgisgi4u

Thank you again for your continued support. 

Scott

> -----Original Message-----
> From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu]
> Sent: Monday, January 14, 2008 10:26 AM
> To: Scott Shaw
> Cc: mvapich-discuss at cse.ohio-state.edu; Dhabaleswar Panda
> Subject: RE: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11
> 
> Hi Scott,
> 
> Sorry to hear that you are still encountering the problem on some
> systems. Thanks for offering to provide us access to a test cluster.
> This will be very helpful. Please send me the remote access information
> and one of my team members will work closely with you to resolve this
> problem.
> 
> Thanks,
> 
> DK
> 
> On Mon, 14 Jan 2008, Scott Shaw wrote:
> 
> > Hi DK,
> > I installed and tested the NMCAC mvapich patches. Rerunning simple
> > MPI tests is still a problem. What seems interesting is that the
> > "termination failed" message does not happen on cluster nodes with
> > drives, only on our diskless clusters. Another interesting data point
> > is that this error can occur when using just "-np 2", two cores, on
> > the same node, so this might rule out networking issues?
> >
> > Following is an email I sent to Michel and Kevin regarding this
> > issue. Would it help if I provided you access to a cluster for
> > testing purposes?
> >
> >
> > Thanks,
> > Scott
> >
> > Thursday, January 10, 2008 3:07 PM
> > Hi Michel, Kevin -
> > I have downloaded the rpms from the location Michel provided. I
> > extracted the rpms in my home directory instead of messing with
> > what's currently installed on orbit6.americas. I recompiled the
> > application and linked against the new/revised mvapich libs and I
> > still get the termination failed message.  Several applications like
> > NEMO which are built against mvapich showed the same failure, which
> > prompted me to post the question to the mvapich mail alias. A
> > customer reviewing the result files will be suspicious of this error
> > message and wonder _if_ the analysis completed successfully.  So this
> > could be a potential issue for customers reviewing benchmark results.
> > Any ideas on how to proceed?
> >
> > service0 /store/sshaw> pwd
> > /nas/store/sshaw
> >
> > rpm2cpio mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd
> > rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd
> >
> >
> > service0 /store/sshaw> setenv LD_LIBRARY_PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib
> > service0 /store/sshaw> module load intel-compilers-9
> > service0 /store/sshaw> module list
> > Currently Loaded Modulefiles:
> >   1) intel-cc-9/9.1.052   2) intel-fc-9/9.1.052   3) intel-compilers-9
> >
> > service0 /store/sshaw> setenv PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH}
> > service0 /store/sshaw> which mpirun_rsh
> > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh
> >
> > service0 /store/sshaw> mpicc mpi_test.c -o mpi_test -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib
> >
> > service0 /store/sshaw> /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
> > Rank=0 present and calling MPI_Finalize
> > Rank=0 present and calling MPI_Finalize
> > Rank=0 bailing, nicely
> > Termination socket read failed: Bad file descriptor
> > Rank=0 bailing, nicely
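> >
> > For reference, mpi_test.c is essentially the following (a rough
> > sketch from memory, so the exact source may differ slightly):
> >
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char *argv[])
> > {
> >     int rank;
> >
> >     /* initialize MPI and find out this process's rank */
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     printf("Rank=%d present and calling MPI_Finalize\n", rank);
> >     MPI_Finalize();
> >
> >     /* this line prints only after MPI_Finalize returns */
> >     printf("Rank=%d bailing, nicely\n", rank);
> >     return 0;
> > }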
> >
> >
> > Thanks,
> > Scott
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu]
> > > Sent: Saturday, January 12, 2008 9:25 AM
> > > To: Scott Shaw
> > > Cc: mvapich-discuss at cse.ohio-state.edu
> > > Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11
> > >
> > > Hi Scott,
> > >
> > > As we discussed off-line, you have access to a solution to this
> > > problem. Let us know how it works.  This solution is also available
> > > with the enhanced and strengthened mpirun_rsh of the mvapich 1.0
> > > version.
> > >
> > > Thanks,
> > >
> > > DK
> > >
> > >
> > > On Wed, 9 Jan 2008, Scott Shaw wrote:
> > >
> > > > Hi,
> > > > On several clusters we are experiencing the same issue originally
> > > > posted on Oct 11, 2007, regarding "error closing socket at end of
> > > > mpirun_rsh". Running the mpi test with one core works and no
> > > > error is generated, but with more than one core the error is
> > > > generated.
> > > >
> > > > Is there a patch available which addresses the "Termination
> > > > socket read failed" error message?  I have tested three different
> > > > clusters and each cluster exhibits the same error.  I also checked
> > > > the "mvapich-discuss" archives and still did not see a resolution.
> > > >
> > > > I am currently running mvapich v0.9.9, which is bundled with OFED
> > > > v1.2.
> > > >
> > > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test
> > > > Rank=0 present and calling MPI_Finalize
> > > > Rank=0 bailing, nicely
> > > >
> > > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
> > > > Rank=1 present and calling MPI_Finalize
> > > > Rank=0 present and calling MPI_Finalize
> > > > Rank=0 bailing, nicely
> > > > Termination socket read failed: Bad file descriptor
> > > > Rank=1 bailing, nicely
> > > >
> > > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test
> > > > Rank=1 present and calling MPI_Finalize
> > > > Rank=3 present and calling MPI_Finalize
> > > > Rank=0 present and calling MPI_Finalize
> > > > Rank=2 present and calling MPI_Finalize
> > > > Rank=0 bailing, nicely
> > > > Termination socket read failed: Bad file descriptor
> > > > Rank=3 bailing, nicely
> > > > Rank=1 bailing, nicely
> > > > Rank=2 bailing, nicely
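> > > >
> > > > For what it's worth, "Bad file descriptor" suggests the read is
> > > > happening on a termination socket that has already been closed.
> > > > The standalone snippet below is not mvapich code, just an
> > > > illustration, but it produces the identical message via perror():
> > > >
> > > > #include <stdio.h>
> > > > #include <unistd.h>
> > > > #include <sys/socket.h>
> > > >
> > > > int main(void)
> > > > {
> > > >     int sv[2];
> > > >     char buf[8];
> > > >
> > > >     /* create a connected socket pair, then close the end we
> > > >      * are about to read from */
> > > >     if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
> > > >         perror("socketpair");
> > > >         return 1;
> > > >     }
> > > >     close(sv[0]);
> > > >
> > > >     /* read() on the closed descriptor fails with errno EBADF */
> > > >     if (read(sv[0], buf, sizeof buf) < 0)
> > > >         perror("Termination socket read failed");
> > > >     return 0;
> > > > }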
> > > >
> > > > Thanks,
> > > > Scott
> > > >
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss at cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > >
> >
> >



