[mvapich-discuss] error closing socket at end of mpirun_rsh

Mark Potts potts at hpcapplications.com
Wed Oct 10 16:42:30 EDT 2007


Hi,
    Can you explain how the wait_for_errors() function in
    .../mpid/ch_gen2/processes/mpirun_rsh.c works in MVAPICH 0.9.9, and
    what might cause even small (2-process) jobs to frequently fail
    with the message
       "Termination socket read failed: Bad file descriptor"?
    I'm not clear on what the sockets s and s1 do, and therefore how we
    could be getting the above error message when reading either
    "flag" or "local_id" in this code.

    This error occurs frequently but not for every job and is
    emitted following full, proper termination of the processes on
    the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
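
    For reference, here is a minimal standalone sketch (not taken from
    mpirun_rsh.c) of two generic ways a read() can end up reporting
    "Bad file descriptor" (EBADF): the descriptor was already closed, or
    it was never valid, for example -1 left over from a failed accept().
    The names read_flag, demo_read_after_close and demo_read_invalid are
    made up for illustration, and I have not confirmed that either case
    is what actually happens in wait_for_errors().

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not from mpirun_rsh.c): read one int from fd.
 * Returns 0 if read() did not fail, otherwise the errno it set. */
static int read_flag(int fd, int *flag)
{
    ssize_t nread = read(fd, flag, sizeof(*flag));
    if (nread == -1) {
        int saved = errno;   /* save errno before perror() can change it */
        perror("Termination socket read failed");
        return saved;
    }
    return 0;
}

/* Case 1: the descriptor was valid once but has already been closed.
 * A pipe stands in for the socket here; the errno is the same. */
static int demo_read_after_close(void)
{
    int fds[2];
    int flag = 0;

    if (pipe(fds) != 0)
        return -1;           /* could not set up the demo */
    close(fds[0]);
    close(fds[1]);
    return read_flag(fds[0], &flag);   /* read on a closed descriptor */
}

/* Case 2: the descriptor was never valid, e.g. the -1 returned by a
 * failed accept() is passed straight to read(). */
static int demo_read_invalid(void)
{
    int flag = 0;
    return read_flag(-1, &flag);
}
```

    Calling either demo function from a small main() prints the same
    "Termination socket read failed: Bad file descriptor" text on stderr
    and returns EBADF, so the message by itself does not say which of
    the two situations occurred.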

    Thanks.
             regards,
-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

