[mvapich-discuss] error closing socket at end of mpirun_rsh
Jonathan L. Perkins
perkinjo at cse.ohio-state.edu
Thu Oct 11 12:19:07 EDT 2007
Mark Potts wrote:
> Hi,
> Can you explain the functioning of the wait_for_errors() function
> in .../mpid/ch_gen2/processes/mpirun_rsh.c in MVAPICH 0.9.9 and
> what might be happening to cause even small (2-process) jobs to
> frequently fail with the message
> "Termination socket read failed: Bad file descriptor"?
> I'm not clear what the socket s/s1 does and therefore how we
> could be getting the above error message upon reading either
> "flag" or "local_id" in this code.
>
> This error occurs frequently but not for every job and is
> emitted following full, proper termination of the processes on
> the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
>
> Thanks.
> regards,
This function provides information about which host an abort originated
from. You shouldn't get this error unless one of the clients (MPI
processes) tried to open up a connection to tell mpirun_rsh about an abort.
We haven't seen this issue during internal testing. Is there a
particular base case program that you could send us that should
reproduce the problem?
Also, when did you first start experiencing this problem? Was it after
applying one of the mpirun_rsh patches that we sent you?
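For reference, here is a minimal sketch of the general pattern described
above: a launcher accepts a connection on a listening socket and reads two
integers from it. This is not the MVAPICH source. The helper names
(read_full, read_abort_notice, accept_abort_notice) are hypothetical, and
the wire format (two native-endian ints, "flag" then "local_id") is an
assumption based only on the names mentioned in the question. The point it
illustrates is that read() reports "Bad file descriptor" (EBADF) whenever
the descriptor it is given is not a valid open descriptor, for example
after an unchecked accept() failure or after the socket has already been
closed.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes from fd, retrying on EINTR.
 * Returns 0 on success, -1 on error (errno set by read())
 * or on early EOF (errno set to 0). */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {           /* peer closed before sending everything */
            errno = 0;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical: read the two ints from an already connected socket s.
 * Returns 0 on success, -1 on failure with errno preserved. */
int read_abort_notice(int s, int *flag, int *local_id)
{
    if (read_full(s, flag, sizeof *flag) != 0 ||
        read_full(s, local_id, sizeof *local_id) != 0) {
        int saved = errno;      /* fprintf may clobber errno */
        fprintf(stderr, "Termination socket read failed: %s\n",
                saved ? strerror(saved) : "unexpected end of stream");
        errno = saved;
        return -1;
    }
    return 0;
}

/* Hypothetical: accept one connection on listen_fd, checking the result
 * before reading so that an invalid descriptor is never passed to read(). */
int accept_abort_notice(int listen_fd, int *flag, int *local_id)
{
    int s = accept(listen_fd, NULL, NULL);
    if (s < 0) {
        perror("accept");
        return -1;
    }

    int rc = read_abort_notice(s, flag, local_id);
    close(s);
    return rc;
}
```

Calling read_abort_notice() with a descriptor that is closed or was never
valid returns -1 with errno set to EBADF, whereas a peer that disconnects
without sending anything shows up as an end-of-stream condition instead.
Keeping those two cases apart can help narrow down whether a failure comes
from the descriptor itself or from the remote side.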
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo