[mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Jan 14 10:26:22 EST 2008


Hi Scott,

Sorry to hear that you are still encountering this problem on some systems.
Thanks for offering to provide us access to a test cluster. This will be
very helpful. Please send me the remote access information and one of my
team members will work closely with you to resolve this problem.

Thanks,

DK

On Mon, 14 Jan 2008, Scott Shaw wrote:

> Hi DK,
> I installed and tested the NMCAC mvapich patches. Rerunning simple MPI
> tests still shows the problem. What seems interesting is that the
> "termination failed" message does not happen on cluster nodes with local
> drives, only on our diskless clusters. Another interesting data point is
> that this error can occur when using just "-np 2", two cores, on the
> same node, so this might rule out networking issues?
>
> Following is an email I sent to Michel and Kevin regarding this issue.
> Would it help if I provide you access to a cluster for testing purposes?
>
>
> Thanks,
> Scott
>
> Thursday, January 10, 2008 3:07 PM
> Hi Michel, Kevin -
> I have downloaded the rpms from the location Michel provided. I
> extracted the rpms in my home directory instead of messing with what's
> currently installed on orbit6.americas. I recompiled the application and
> linked against the new/revised mvapich libs and I still get the
> termination failed message. Several applications like NEMO that are
> built against mvapich showed the same failure, which prompted me to post
> the question to the mvapich mail alias. A customer reviewing the result
> files will be suspicious of this error message and of whether the
> analysis completed successfully, so this could be a potential issue when
> customers review benchmark results. Any ideas on how to proceed?
>
> service0 /store/sshaw> pwd
> /nas/store/sshaw
>
> rpm2cpio mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd
> rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd
>
>
> service0 /store/sshaw> setenv LD_LIBRARY_PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib
> service0 /store/sshaw> module load intel-compilers-9
> service0 /store/sshaw> module list
> Currently Loaded Modulefiles:
>   1) intel-cc-9/9.1.052   2) intel-fc-9/9.1.052   3) intel-compilers-9
>
> service0 /store/sshaw> setenv PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH}
> service0 /store/sshaw> which mpirun_rsh
> /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh
>
> service0 /store/sshaw> mpicc mpi_test.c -o mpi_test -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib
>
> service0 /store/sshaw> /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
> Rank=0 present and calling MPI_Finalize
> Rank=0 present and calling MPI_Finalize
> Rank=0 bailing, nicely
> Termination socket read failed: Bad file descriptor
> Rank=0 bailing, nicely
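>
> For reference, mpi_test.c is just a minimal init/finalize test along
> these lines (a sketch; the exact source may differ slightly):
>
>     /* mpi_test.c - minimal MPI init/finalize test (sketch) */
>     #include <stdio.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         printf("Rank=%d present and calling MPI_Finalize\n", rank);
>         MPI_Finalize();
>         printf("Rank=%d bailing, nicely\n", rank);
>         return 0;
>     }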
>
>
> Thanks,
> Scott
>
>
>
>
> > -----Original Message-----
> > From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu]
> > Sent: Saturday, January 12, 2008 9:25 AM
> > To: Scott Shaw
> > Cc: mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11
> >
> > Hi Scott,
> >
> > As we discussed off-line, you have access to a solution to this problem.
> > Let us know how it works. This solution is also available with the
> > enhanced and strengthened mpirun_rsh of the mvapich 1.0 release.
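> >
> > For context, the "Termination socket read failed: Bad file descriptor"
> > message appears when mpirun_rsh's end-of-job read() on a rank's
> > termination socket fails because the descriptor has already been closed.
> > One way to make that teardown path tolerant looks roughly like this (an
> > illustrative sketch only, not the actual mpirun_rsh source):
> >
> >     /* Illustrative sketch only -- not MVAPICH source. Tolerant read of
> >        a per-rank termination socket during job teardown. */
> >     #include <errno.h>
> >     #include <stdio.h>
> >     #include <unistd.h>
> >
> >     /* Returns bytes read, 0 if the rank is already gone, -1 on error. */
> >     ssize_t read_termination_msg(int sock, void *buf, size_t len)
> >     {
> >         ssize_t n;
> >
> >         do {
> >             n = read(sock, buf, len);
> >         } while (n < 0 && errno == EINTR);   /* retry interrupted reads */
> >
> >         if (n < 0 && (errno == EBADF || errno == ECONNRESET)) {
> >             /* Descriptor already closed or peer already exited: the
> >                rank has finished, so treat this as a normal shutdown. */
> >             return 0;
> >         }
> >         if (n < 0)
> >             perror("Termination socket read failed");
> >         return n;
> >     }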
> >
> > Thanks,
> >
> > DK
> >
> >
> > On Wed, 9 Jan 2008, Scott Shaw wrote:
> >
> > > Hi,
> > > On several clusters we are experiencing the same issue originally
> > > posted on Oct 11, 2007, regarding "error closing socket at end of
> > > mpirun_rsh". Running the mpi test with one core works and no error
> > > is generated, but with n+1 cores the error is generated.
> > >
> > > Is there a patch available which addresses the "Termination socket
> > > read failed" error message?  I have tested three different clusters
> > > and each cluster exhibits the same error.  I also checked the
> > > "mvapich-discuss" archives and still did not see a resolution.
> > >
> > > I am currently running mvapich v0.9.9, which is bundled with ofed v1.2.
> > >
> > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test
> > > Rank=0 present and calling MPI_Finalize
> > > Rank=0 bailing, nicely
> > >
> > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test
> > > Rank=1 present and calling MPI_Finalize
> > > Rank=0 present and calling MPI_Finalize
> > > Rank=0 bailing, nicely
> > > Termination socket read failed: Bad file descriptor
> > > Rank=1 bailing, nicely
> > >
> > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test
> > > Rank=1 present and calling MPI_Finalize
> > > Rank=3 present and calling MPI_Finalize
> > > Rank=0 present and calling MPI_Finalize
> > > Rank=2 present and calling MPI_Finalize
> > > Rank=0 bailing, nicely
> > > Termination socket read failed: Bad file descriptor
> > > Rank=3 bailing, nicely
> > > Rank=1 bailing, nicely
> > > Rank=2 bailing, nicely
> > >
> > > Thanks,
> > > Scott
> > >
> > >
