[mvapich-discuss] MVAPICH2-1.0.1 crashes with TCP/IP

Devesh Sharma devesh28 at gmail.com
Wed Oct 15 00:55:16 EDT 2008


Hello all,
I am trying to run linpack_10.0.4 over TCP/IP with MVAPICH2-1.0.1, using about
80% of memory, i.e. N=331776.
We have a 16-node cluster with 64 GB of physical memory in each node; every node
has a quad-socket, quad-core configuration.
When I run 256 processes on this cluster over Ethernet, I receive the
following error after approximately one hour of running.
Please help me figure out the issue.

[cli_239]: aborting job:
Fatal error in MPI_Send:
Other MPI error, error stack:
MPI_Send(190).............................: MPI_Send(buf=0x2b6cb1e040,
count=1, dtype=USER<struct>, dest=16, tag=2857, comm=0x84000001) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(651):
MPIDU_Socki_handle_pollhup(402)...........: connection closed by peer
(set=0,sock=7)

rank 239 in job 2  tf00_50378   caused collective abort of all ranks
  exit status of rank 239: killed by signal 9
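
For reference, the sizing figures quoted above can be sanity-checked with a short
calculation. This is only a sketch under stated assumptions: that "64GB" means
64 GiB per node, that the N x N matrix of 8-byte doubles dominates Linpack's
memory use, and that processes are spread evenly across nodes. The function and
constant names are illustrative and are not taken from the benchmark or from MVAPICH2.

```python
import math

# Figures taken from the post.
NODES = 16
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4
N = 331776

# Assumptions (not stated in the post): 64 GB is read as 64 GiB,
# and the matrix holds 8-byte double-precision values.
MEM_PER_NODE_BYTES = 64 * 2**30
BYTES_PER_DOUBLE = 8


def matrix_bytes(n):
    """Bytes needed for an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE


def memory_fraction(n, nodes, mem_per_node_bytes):
    """Fraction of total cluster memory taken by the n x n matrix."""
    return matrix_bytes(n) / (nodes * mem_per_node_bytes)


def largest_n(fraction, nodes, mem_per_node_bytes):
    """Largest n whose matrix fits within the given memory fraction."""
    budget_doubles = int(fraction * nodes * mem_per_node_bytes / BYTES_PER_DOUBLE)
    return math.isqrt(budget_doubles)


total_procs = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
fraction = memory_fraction(N, NODES, MEM_PER_NODE_BYTES)
per_node_gib = matrix_bytes(N) / NODES / 2**30
per_proc_gib = matrix_bytes(N) / total_procs / 2**30

print(f"processes:            {total_procs}")
print(f"memory fraction:      {fraction:.4f}")
print(f"matrix per node:      {per_node_gib:.2f} GiB")
print(f"matrix per process:   {per_proc_gib:.2f} GiB")
print(f"largest N at 80%:     {largest_n(0.80, NODES, MEM_PER_NODE_BYTES)}")
```

Under these assumptions N=331776 corresponds to slightly more than 80% of total
memory, and the count of 256 processes matches one process per core. The matrix
alone takes roughly 51 GiB of each node's 64 GiB, so the remaining headroom has
to cover the operating system, MPI buffers and TCP socket buffers.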