[mvapich-discuss] Re: Connection timed out

Jonathan Perkins perkinjo at cse.ohio-state.edu
Sun Oct 14 19:31:24 EDT 2007


Mark Potts wrote:
> Jonathan,
>    Probably completely unrelated to the current mpirun_rsh.c
>    "Termination socket read error", I have a user that is attempting
>    some fairly large MVAPICH jobs and is encountering a slew of
>    "connect: Connection timed out" error messages when attempting to
>    run with ~4000 processes or more.  He can successfully run two
>    simultaneous 2048 MVAPICH process jobs of the same code, but has
>    encountered those error messages when attempting to run them as
>    a single mpirun_rsh job.  No ~4000+ process MVAPICH job has actually
>    started execution.  The user also reports that he can ssh into all
>    512 nodes of his target 4096 core cluster.
>
>    Do you know if this is an mpirun_rsh client message or an ssh
>    message and is there a known way around this timeout issue?
>          regards,
>
This is an mpirun_rsh client message.  It appears because mpirun_rsh is
taking too long to accept the client's connect request.  I'm not sure of
a quick workaround, but I'll find out if there is anything that can be
done.
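
To illustrate the general failure mode (this is not mpirun_rsh's actual
code, which is written in C): when thousands of clients connect to one
listener at nearly the same time, the ones that are not accepted quickly
enough can see "Connection timed out".  The sketch below is in Python,
and the helper names make_listener and connect_with_retry, the backlog
value, and the retry settings are all made up for the example.  It shows
two common mitigations: asking for a larger listen backlog on the server
side, and retrying with backoff on the client side.

```python
import socket
import time


def make_listener(host="127.0.0.1", port=0, backlog=1024):
    """Open a listening TCP socket with a large requested backlog.

    Hypothetical helper. The kernel may cap the backlog
    (on Linux, at net.core.somaxconn).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))  # port=0 lets the OS pick a free port
    srv.listen(backlog)
    return srv


def connect_with_retry(host, port, attempts=5, base_delay=0.5, timeout=5.0):
    """Connect to (host, port), retrying with exponential backoff.

    Hypothetical helper. Raises ConnectionError if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # covers timeouts and refused connections
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Whether either mitigation helps in a given case depends on where the
time is actually being lost, so treat this as a starting point for
experiments rather than a fix.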

Just as a note: we're working on making mpirun_rsh a bit more scalable
and may have something that you could try out sometime in the near future.


