[mvapich-discuss] Re: Connection timed out
Jonathan Perkins
perkinjo at cse.ohio-state.edu
Sun Oct 14 19:31:24 EDT 2007
Mark Potts wrote:
> Jonathan,
> Probably completely unrelated to the current mpirun_rsh.c
> "Termination socket read error", I have a user that is attempting
> some fairly large MVAPICH jobs and is encountering a slew of
> "connect: Connection timed out" error messages when attempting to
> run with ~4000 processes or more. He can successfully run two
> simultaneous 2048 MVAPICH process jobs of the same code, but has
> encountered those error messages when attempting to run them as
> a single mpirun_rsh job. No ~4000+ process MVAPICH job has actually
> started execution. The user also reports that he can ssh into all
> 512 nodes of his target 4096 core cluster.
>
> Do you know if this is an mpirun_rsh client message or an ssh
> message and is there a known way around this timeout issue?
> regards,
>
This is an mpirun_rsh client message. It appears because mpirun_rsh is
taking too long to accept the client's connect request. I'm not sure of
a quick workaround, but I'll find out if there is anything that can be
done.
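
As a rough illustration of why this would show up only at larger job
sizes, here is a back-of-the-envelope sketch. It assumes the launcher
accepts client connections one at a time, so the last client's wait grows
linearly with the process count. The per-connection cost and the client
connect timeout below are hypothetical figures chosen for illustration,
not values measured from or documented for mpirun_rsh.

```python
def last_client_wait_ms(num_procs, accept_cost_ms):
    """Wait (ms) seen by the last client if connections are accepted serially."""
    return num_procs * accept_cost_ms


def max_procs_before_timeout(connect_timeout_ms, accept_cost_ms):
    """Largest job size whose last client is still accepted within the timeout."""
    return connect_timeout_ms // accept_cost_ms


# Hypothetical figures, for illustration only (not taken from mpirun_rsh).
ACCEPT_COST_MS = 50           # assumed time to handle one client connection
CONNECT_TIMEOUT_MS = 180_000  # assumed client-side connect timeout

for procs in (2048, 4096):
    wait = last_client_wait_ms(procs, ACCEPT_COST_MS)
    status = "ok" if wait <= CONNECT_TIMEOUT_MS else "connect would time out"
    print(f"{procs} processes: last client waits {wait} ms ({status})")

print("largest job that fits:",
      max_procs_before_timeout(CONNECT_TIMEOUT_MS, ACCEPT_COST_MS))
```

Under these assumed numbers a 2048-process job fits inside the timeout
while a 4096-process job does not, which matches the shape of the
reported behavior. The real threshold depends on the actual accept cost
and timeout on the cluster.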
Just as a note, we're working on making mpirun_rsh a bit more scalable
and may have something that you could try out sometime in the near future.