[mvapich-discuss] error closing socket at end of mpirun_rsh

Jonathan L. Perkins perkinjo at cse.ohio-state.edu
Thu Oct 11 12:19:07 EDT 2007


Mark Potts wrote:
> Hi,
>    Can you explain the functioning of the wait_for_errors() function
>    in .../mpid/ch_gen2/processes/mpirun_rsh.c in MVAPICH 0.9.9 and
>    what might be happening to cause even small (2 process jobs) to
>    frequently fail with the message
>       "Termination socket read failed: Bad file descriptor" .
>    I'm not clear what the socket s/s1 does and therefore how we
>    could be getting the above error message upon reading either
>    "flag" or "local_id" in this code.
> 
>    This error occurs frequently but not for every job and is
>    emitted following full, proper termination of the processes on
>    the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
> 
>    Thanks.
>             regards,

This function provides information about which host an abort originated 
from.  You shouldn't get this error unless one of the clients (MPI 
processes) tried to open up a connection to tell mpirun_rsh about an abort.
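
For illustration only (this is not the mpirun_rsh.c source): the "Bad file
descriptor" suffix is the standard text for errno EBADF, which read() returns
when the descriptor it is given is not open. The minimal sketch below
reproduces that message by reading from a socket that has already been
closed. The names read_int and errno_after_close are hypothetical, and a
socketpair stands in for the real termination socket.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read one int from fd.
 * Returns 0 on success, the errno value if read() fails,
 * or -1 on a short read. */
static int read_int(int fd, int *out)
{
    ssize_t n = read(fd, out, sizeof(*out));
    if (n < 0) {
        int err = errno;   /* save errno before calling anything else */
        perror("Termination socket read failed");
        return err;
    }
    return (n == (ssize_t)sizeof(*out)) ? 0 : -1;
}

/* Hypothetical demo: read once from an open socket, then again after
 * the same descriptor has been closed. Returns the result of the
 * second read (expected to be EBADF), or -1 if the setup fails. */
static int errno_after_close(void)
{
    int sv[2];
    int flag = 1, got = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Normal case: data is available and the descriptor is open. */
    if (write(sv[1], &flag, sizeof(flag)) != (ssize_t)sizeof(flag) ||
        read_int(sv[0], &got) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Failure case: the descriptor is closed before the read. */
    close(sv[0]);
    int rc = read_int(sv[0], &got);

    close(sv[1]);
    return rc;
}
```

Calling errno_after_close() from a test program should print the same
message quoted in the report to stderr and return EBADF. This only shows
one way the message can arise; it does not establish that a premature
close is what happens in mpirun_rsh.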

We haven't seen this issue during internal testing.  Is there a 
particular base case program that you could send us that should 
reproduce the problem?

Also, when did you first start experiencing this problem?  Was it after 
applying one of the mpirun_rsh patches that we sent you?

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

