[Fwd: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh]
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Fri Oct 12 17:49:42 EDT 2007
Mark Potts wrote:
> Re-send to include mvapich-discuss.
> regards,
>
> -------- Original Message --------
> Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh
> Date: Fri, 12 Oct 2007 16:20:25 -0400
> From: Mark Potts <potts at hpcapplications.com>
> Reply-To: potts at hpcapplications.com
> Organization: HPC Applications Inc.
> To: Jonathan L. Perkins <perkinjo at cse.ohio-state.edu>
> References: <470D3936.1050403 at hpcapplications.com>
> <470E4CFB.8000102 at cse.ohio-state.edu>
>
> Jonathan,
> I've checked and the "Termination socket read failed: Bad file
> descriptor" message is emitted even in the previous OS version.
> So it appears there was an intermittent message (if not an actual
> error) that our patched mpirun_rsh was generating for some time
> now.
>
> To close the loop on whether we have the correct patches correctly
> installed I've attached the patched mpirun_rsh.c used to build our
> installation RPM. This patched file is used together with the
> MVAPICH 0.9.9 source contained in OFED 1.2 build(?) 1326 -- in case
> there are any cross conflicts with other OFED MVAPICH bits.
>
> You asked about example code to generate a failure... below is a
> really simple program that fairly consistently produces a single
> error message when run with, say, -np 10, but much less often
> with -np 2.
>
> #include <stdio.h>
> #include <mpi.h>
> int main (int argc, char **argv)
> {
> int rank;
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> printf ("Rank=%d present and calling MPI_Finalize\n", rank);
> MPI_Finalize();
> printf ("Rank=%d bailing, nicely\n", rank);
> return (0);
> }
>
> regards,
>
> Jonathan L. Perkins wrote:
>> Mark Potts wrote:
>>> Hi,
>>> Can you explain the functioning of the wait_for_errors() function
>>> in .../mpid/ch_gen2/processes/mpirun_rsh.c in MVAPICH 0.9.9 and
>>> what might be happening to cause even small (2-process) jobs to
>>> frequently fail with the message
>>> "Termination socket read failed: Bad file descriptor"?
>>> I'm not clear what the socket s/s1 does and therefore how we
>>> could be getting the above error message upon reading either
>>> "flag" or "local_id" in this code.
>>>
>>> This error occurs frequently but not for every job and is
>>> emitted following full, proper termination of the processes on
>>> the client nodes. We are using MVAPICH 0.9.9 ch_gen2.
>>>
>>> Thanks.
>>> regards,
>>
>> This function provides information about which host an abort
>> originated from. You shouldn't get this error unless one of the
>> clients (MPI processes) tried to open up a connection to tell
>> mpirun_rsh about an abort.
>>
>> We haven't seen this issue during internal testing. Is there a
>> particular base case program that you could send us that should
>> reproduce the problem?
>>
>> Also, when did you first start experiencing this problem? Was it
>> after applying one of the mpirun_rsh patches that we sent you?
>>
>
Thanks for sending these files. We'll take a look at them and get back to
you with our findings.
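
For reference, "Bad file descriptor" is the standard text for errno
EBADF, which read() reports when it is handed a descriptor that is not
open. The sketch below is not the mpirun_rsh code; it is a minimal
illustration, assuming the message is printed after a read() on a
socket that has already been closed. The helper name
read_after_close_errno() is made up for this example, and whether this
is what actually happens in wait_for_errors() has not been confirmed.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of MVAPICH: read from a socket that has
 * already been closed and return the errno that read() reports. */
int read_after_close_errno(void)
{
    int sv[2];
    int flag = 0;

    /* Create a connected pair of local sockets to stand in for the
     * termination socket. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return errno;

    /* Close both ends; sv[0] is no longer a valid descriptor. */
    close(sv[0]);
    close(sv[1]);

    /* Reading from the closed descriptor fails with EBADF. */
    errno = 0;
    if (read(sv[0], &flag, sizeof flag) < 0) {
        int saved = errno;
        fprintf(stderr, "Termination socket read failed: %s\n",
                strerror(saved));
        return saved;
    }
    return 0;
}
```

Calling read_after_close_errno() from a small main() should return
EBADF. If something similar occurs in mpirun_rsh, it would point to an
ordering problem between closing the socket and the final read rather
than to a failure in the MPI processes themselves.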
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo