[mvapich-discuss] Connection timed out
Mark Potts
potts at hpcapplications.com
Sun Oct 14 19:05:47 EDT 2007
Jonathan,
This is probably unrelated to the current mpirun_rsh.c
"Termination socket read error" issue, but I have a user who is
attempting some fairly large MVAPICH jobs and is seeing a slew of
"connect: Connection timed out" error messages when running with
~4000 processes or more. He can successfully run two simultaneous
2048-process MVAPICH jobs of the same code, but gets those error
messages when he tries to run them as a single mpirun_rsh job. No
MVAPICH job of ~4000 or more processes has actually started
execution. The user also reports that he can ssh into all 512 nodes
of his target 4096-core cluster.
Do you know whether this is an mpirun_rsh client message or an ssh
message, and is there a known way around this timeout?
regards,
--
***********************************
>> Mark J. Potts, PhD
>>
>> HPC Applications Inc.
>> phone: 410-992-8360 Bus
>> 410-313-9318 Home
>> 443-418-4375 Cell
>> email: potts at hpcapplications.com
>> potts at excray.com
***********************************