[mvapich-discuss] Connection timed out

Mark Potts potts at hpcapplications.com
Sun Oct 14 19:05:47 EDT 2007


Jonathan,
    This is probably completely unrelated to the current mpirun_rsh.c
    "Termination socket read error" issue, but I have a user who is
    attempting some fairly large MVAPICH jobs and is encountering a slew
    of "connect: Connection timed out" error messages when attempting to
    run with ~4000 processes or more.  He can successfully run two
    simultaneous 2048-process MVAPICH jobs of the same code, but gets
    those error messages when attempting to run them as a single
    mpirun_rsh job.  No ~4000+ process MVAPICH job has actually
    started execution.  The user also reports that he can ssh into all
    512 nodes of his target 4096-core cluster.
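
    One way to narrow this down might be to check whether plain TCP
    connections to the ssh port on all nodes still succeed when they are
    opened concurrently rather than one at a time.  The following is only
    a rough sketch under stated assumptions: probe() and
    unreachable_hosts() are hypothetical helper names, port 22 and the
    5-second timeout are guesses, and it does not reproduce what
    mpirun_rsh itself does when launching a job.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def probe(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout seconds."""
    try:
        # create_connection raises OSError (including timeouts and
        # name-resolution failures) when the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_hosts(hosts, port=22, timeout=5.0, workers=64):
    """Probe all hosts concurrently and return the ones that could not be reached."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda h: probe(h, port, timeout), hosts))
    return [h for h, ok in zip(hosts, results) if not ok]


def read_hostfile(path):
    """Read one hostname per line (assumed format), skipping blank lines."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]
```

    Calling unreachable_hosts(read_hostfile("hosts")) with a hypothetical
    file named "hosts" would list the nodes that refused or timed out.
    An empty list would only show that the nodes accept concurrent
    connections at this level, not that the job launch itself is healthy.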

    Do you know whether this is an mpirun_rsh client message or an ssh
    message, and is there a known way around this timeout issue?
          regards,

-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

