[mvapich-discuss] error closing socket at end of mpirun_rsh

Mark Potts potts at hpcapplications.com
Thu Oct 11 20:39:05 EDT 2007


Jonathan,
     Thanks for the quick response.

     The error message is from a MVAPICH build with the patches.

     Without extensive testing, it appears that the nature of the
     application is not a factor.  The error message can appear even
     for very simple hello-world-type programs with no message
     exchanges at all.
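
     A minimal reproducer of that sort is just the standard MPI hello
     world, along these lines (illustrative only, not necessarily the
     exact source we ran):

     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char **argv)
     {
         int rank, size;

         /* No point-to-point or collective traffic; just init, print,
          * and finalize -- yet the termination-socket message can
          * still appear after the job completes. */
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         printf("Hello from rank %d of %d\n", rank, size);
         MPI_Finalize();
         return 0;
     }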

     Tests run at the time each patch was added did not show this
     error, though it is possible the error was infrequent enough that
     we simply did not catch it after a given patch was applied.  We
     also did OS upgrades at about the same time, which clouds the
     question of what actually induced the error.  I'll try to find a
     system still running the previous OS to determine whether the
     problem emerged before or after the upgrade.
        regards,

Jonathan L. Perkins wrote:
> Mark Potts wrote:
>> Hi,
>>    Can you explain the functioning of the wait_for_errors() function
>>    in .../mpid/ch_gen2/processes/mpirun_rsh.c in MVAPICH 0.9.9 and
>>    what might be happening to cause even small (2 process jobs) to
>>    frequently fail with the message
>>       "Termination socket read failed: Bad file descriptor" .
>>    I'm not clear what the socket s/s1 does and therefore how we
>>    could be getting the above error message upon reading either
>>    "flag" or "local_id" in this code.
>>
>>    This error occurs frequently but not for every job and is
>>    emitted following full, proper termination of the processes on
>>    the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
>>
>>    Thanks.
>>             regards,
> 
> This function provides information about which host an abort originated 
> from.  You shouldn't get this error unless one of the clients (MPI 
> processes) tried to open up a connection to tell mpirun_rsh about an abort.
> 
> We haven't seen this issue during internal testing.  Is there a 
> particular base case program that you could send us that should 
> reproduce the problem?
> 
> Also, when did you first start experiencing this problem?  Was it after 
> applying one of the mpirun_rsh patches that we sent you?
> 
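
For reference, the relevant part of wait_for_errors() looks roughly like
this (paraphrased, not a verbatim copy of the 0.9.9 source; the names
wait_for_errors, s, s1, flag, and local_id are the ones from
mpirun_rsh.c mentioned above):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

void wait_for_errors(int s /* listening termination socket */)
{
    int s1, flag, local_id;

    /* Block until an MPI process connects to report an abort. */
    s1 = accept(s, NULL, NULL);
    if (s1 < 0)
        return;  /* a failed accept() would surface differently */

    /* Read the abort flag, then the local id of the aborting process,
     * so mpirun_rsh can report which host the abort came from.  If the
     * descriptor has already been closed by the time the read happens,
     * read() fails with EBADF, which would match the
     * "Termination socket read failed: Bad file descriptor" text we
     * are seeing. */
    if (read(s1, &flag, sizeof(flag)) != sizeof(flag) ||
        read(s1, &local_id, sizeof(local_id)) != sizeof(local_id)) {
        perror("Termination socket read failed");
        close(s1);
        return;
    }

    fprintf(stderr, "Abort reported by MPI process with local id %d\n",
            local_id);
    close(s1);
}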

-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

