[mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11

Scott Shaw sshaw at sgi.com
Mon Jan 14 10:11:21 EST 2008


Hi DK,
I installed and tested the NMCAC mvapich patches. Rerunning simple MPI
tests still shows the problem. What seems interesting is that the
"termination failed" message does not happen on cluster nodes with drives,
only on our diskless clusters. Another interesting data point is that this
error can occur when just using "-np 2", two cores, on the same node, so
this might rule out networking issues?

Following is an email I sent to Michel and Kevin regarding this issue.
Would it help if I provide you access to a cluster for testing purposes?


Thanks,
Scott

Thursday, January 10, 2008 3:07 PM
Hi Michel, Kevin -
I have downloaded the rpms from the location Michel provided. I
extracted the rpms in my home directory instead of messing with what's
currently installed on orbit6.americas. I recompiled the application and
linked against the new/revised mvapich libs and I still get the
termination failed message.  Several applications like NEMO which are
built against mvapich showed the same failure which prompted me to post
the question to the mvapich mail alias. A customer reviewing the result
files will be suspicious of this error message and will question _if_ the
analysis completed successfully.  So this could be a potential issue for
customers reviewing benchmark results. Any ideas how to proceed?

service0 /store/sshaw> pwd
/nas/store/sshaw

rpm2cpio mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd
rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd


service0 /store/sshaw> setenv LD_LIBRARY_PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib
service0 /store/sshaw> module load intel-compilers-9
service0 /store/sshaw> module list
Currently Loaded Modulefiles:
  1) intel-cc-9/9.1.052   2) intel-fc-9/9.1.052   3) intel-compilers-9

service0 /store/sshaw> setenv PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH}
service0 /store/sshaw> which mpirun_rsh
/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh

service0 /store/sshaw> mpicc mpi_test.c -o mpi_test -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib

service0 /store/sshaw> /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
Rank=0 present and calling MPI_Finalize
Rank=0 present and calling MPI_Finalize
Rank=0 bailing, nicely
Termination socket read failed: Bad file descriptor
Rank=0 bailing, nicely
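
To tell a run that only printed the termination message apart from one
where a rank never finished, the output can be checked mechanically. The
following is only a sketch in Python, assuming the test program prints
exactly the "Rank=N present and calling MPI_Finalize" and "Rank=N bailing,
nicely" lines shown in this thread. The helper name summarize_run is
hypothetical and is not part of mvapich or mpirun_rsh.

```python
import re

# Patterns copied from the mpirun_rsh output quoted in this thread.
FINALIZE_RE = re.compile(r"Rank=(\d+) present and calling MPI_Finalize")
EXIT_RE = re.compile(r"Rank=(\d+) bailing, nicely")
TERMINATION_MSG = "Termination socket read failed"


def summarize_run(output, nprocs):
    """Summarize one run's captured output (hypothetical helper).

    Reports whether every rank 0..nprocs-1 announced MPI_Finalize,
    whether every rank printed its exit line, and whether the
    termination message appeared anywhere in the output.
    """
    finalized = {int(r) for r in FINALIZE_RE.findall(output)}
    exited = {int(r) for r in EXIT_RE.findall(output)}
    expected = set(range(nprocs))
    return {
        "all_ranks_finalized": finalized == expected,
        "all_ranks_exited": exited == expected,
        "termination_error": TERMINATION_MSG in output,
    }


if __name__ == "__main__":
    # Sample text taken from the "-np 2" run reported on the list.
    sample = (
        "Rank=1 present and calling MPI_Finalize\n"
        "Rank=0 present and calling MPI_Finalize\n"
        "Rank=0 bailing, nicely\n"
        "Termination socket read failed: Bad file descriptor\n"
        "Rank=1 bailing, nicely\n"
    )
    print(summarize_run(sample, 2))
```

A result with both rank checks true and termination_error also true would
suggest the message came from the launcher's shutdown path rather than from
a rank failing, but this only inspects the printed lines and does not prove
the application results are correct.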


Thanks,
Scott




> -----Original Message-----
> From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu]
> Sent: Saturday, January 12, 2008 9:25 AM
> To: Scott Shaw
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh
> original posted Oct 11
> 
> Hi Scott,
> 
> As we discussed off-line, you have access to a solution to this problem.
> Let us know how it works.  This solution is also being made available
> with the enhanced and strengthened mpirun_rsh of mvapich 1.0 version.
> 
> Thanks,
> 
> DK
> 
> 
> On Wed, 9 Jan 2008, Scott Shaw wrote:
> 
> > Hi,
> > On several clusters we are experiencing the same issues originally
> > posted on Oct 11, 2007 regarding "error closing socket at end of
> > mpirun_rsh" job. Running the mpi test with one core works and no error
> > is generated, but with more than one core the error is generated.
> >
> > Is there a patch available which addresses the "Termination socket read
> > failed" error message?  I have tested three different clusters and each
> > cluster exhibits the same error.  I also checked the "mvapich-discuss"
> > archives and still did not see a resolution.
> >
> > I am currently running mvapich v0.9.9 which is bundled with ofed v1.2.
> >
> > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test
> > Rank=0 present and calling MPI_Finalize
> > Rank=0 bailing, nicely
> >
> > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
> > Rank=1 present and calling MPI_Finalize
> > Rank=0 present and calling MPI_Finalize
> > Rank=0 bailing, nicely
> > Termination socket read failed: Bad file descriptor
> > Rank=1 bailing, nicely
> >
> > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test
> > Rank=1 present and calling MPI_Finalize
> > Rank=3 present and calling MPI_Finalize
> > Rank=0 present and calling MPI_Finalize
> > Rank=2 present and calling MPI_Finalize
> > Rank=0 bailing, nicely
> > Termination socket read failed: Bad file descriptor
> > Rank=3 bailing, nicely
> > Rank=1 bailing, nicely
> > Rank=2 bailing, nicely
> >
> > Thanks,
> > Scott
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> 



