[Fwd: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh]

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Oct 12 17:49:42 EDT 2007


Mark Potts wrote:
> Re-send to include mvapich-discuss.
>          regards,
> 
> -------- Original Message --------
> Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh
> Date: Fri, 12 Oct 2007 16:20:25 -0400
> From: Mark Potts <potts at hpcapplications.com>
> Reply-To: potts at hpcapplications.com
> Organization: HPC Applications Inc.
> To: Jonathan L. Perkins <perkinjo at cse.ohio-state.edu>
> References: <470D3936.1050403 at hpcapplications.com> 
> <470E4CFB.8000102 at cse.ohio-state.edu>
> 
> Jonathan,
>     I've checked, and the "Termination socket read failed: Bad file
>     descriptor" message is also emitted under the previous OS version.
>     So it appears our patched mpirun_rsh has been generating this
>     intermittent message (if not an actual error) for some time
>     now.
> 
>     To close the loop on whether we have the correct patches correctly
>     installed I've attached the patched mpirun_rsh.c used to build our
>     installation RPM.  This patched file is used together with the
>     MVAPICH 0.9.9 source contained in OFED 1.2 build(?) 1326 -- in case
>     there are any cross conflicts with other OFED MVAPICH bits.
> 
>     You asked about example code to generate a failure... below is a
>     really simple program that fairly consistently produces a single
>     error message when run with, say, -np 10, but much less often
>     with -np 2.
> 
> #include <stdio.h>
> #include <mpi.h>
> int main (int argc, char **argv)
> {
>     int rank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf ("Rank=%d present and calling MPI_Finalize\n", rank);
>     MPI_Finalize();
>     printf ("Rank=%d bailing, nicely\n", rank);
>     return (0);
> }
> 
>             regards,
> 
> Jonathan L. Perkins wrote:
>> Mark Potts wrote:
>>> Hi,
>>>    Can you explain the functioning of the wait_for_errors() function
>>>    in .../mpid/ch_gen2/processes/mpirun_rsh.c in MVAPICH 0.9.9 and
>>>    what might be happening to cause even small (2-process) jobs to
>>>    frequently fail with the message
>>>       "Termination socket read failed: Bad file descriptor" .
>>>    I'm not clear what the socket s/s1 does and therefore how we
>>>    could be getting the above error message upon reading either
>>>    "flag" or "local_id" in this code.
>>>
>>>    This error occurs frequently but not for every job and is
>>>    emitted following full, proper termination of the processes on
>>>    the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
>>>
>>>    Thanks.
>>>             regards,
>>
>> This function provides information about which host an abort 
>> originated from.  You shouldn't get this error unless one of the 
>> clients (MPI processes) tried to open up a connection to tell 
>> mpirun_rsh about an abort.
>>
>> We haven't seen this issue during internal testing.  Is there a 
>> particular base case program that you could send us that should 
>> reproduce the problem?
>>
>> Also, when did you first start experiencing this problem?  Was it 
>> after applying one of the mpirun_rsh patches that we sent you?
>>
> 

Thanks for sending these files.  We'll take a look at them and get back to 
you with our findings.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


