[mvapich-discuss] caused collective abort of all ranks + signal 9

Sangamesh B forum.san at gmail.com
Tue May 6 00:27:49 EDT 2008


Hi all,


I have run into a problem and would appreciate some help with it.

The scenario: we have a Rocks (4.2) cluster with 12 nodes. We recently
installed InfiniBand cards in 5 of the nodes (the master node does not have
an IB card). The OFED installation succeeded and IP addresses were assigned.

I installed MVAPICH2 on those nodes and set up passwordless login among
compute-0-8 through compute-0-12 (the nodes that have IB cards). So far
everything is fine, and MPD boots up as well.

I compiled a sample MPI program and tried to run it, with the following
results:

Scenario 1: Running Hellow.o (compiled with the MVAPICH2 mpicc) as root

[root@compute-0-8 test]# /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
Hello world from process 0 of 2
Hello world from process 1 of 2
rank 1 in job 8  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
rank 0 in job 8  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

Scenario 2: Running the same file as a regular user (srinu)

[srinu@compute-0-8 test]$ /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)....: Initialization failed
MPID_Init(102)...........: channel initialization failed
MPIDI_CH3_Init(178)......:
MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
rdma_iba_hca_init(645)...: cannot create cq
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)....: Initialization failed
MPID_Init(102)...........: channel initialization failed
MPIDI_CH3_Init(178)......:
MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
rdma_iba_hca_init(645)...: cannot create cq
rank 1 in job 9  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 1: return code 1

For the second scenario, I found a suggested fix online: ulimit -l unlimited.
With that applied, however, the run fails with the same error as in
Scenario 1 (killed by signal 9).
Can someone help me resolve this error?
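
For reference, the limit named in the libibverbs warning can be inspected from the shell. The following is only a minimal sketch, assuming a bash-like shell on the compute node; the helper name bytes_to_kb is hypothetical and exists only to relate the byte count in the warning to the kilobyte unit that ulimit -l prints.

```shell
#!/bin/sh
# Report the locked-memory (RLIMIT_MEMLOCK) limits of the current shell.
# ulimit -l prints the value in kilobytes, or the word "unlimited".
soft=$(ulimit -S -l)
hard=$(ulimit -H -l)
echo "memlock soft limit: $soft"
echo "memlock hard limit: $hard"

# Hypothetical helper: convert a byte count (as printed in the libibverbs
# warning) into the kilobyte unit used by ulimit -l.
bytes_to_kb() {
    echo $(( $1 / 1024 ))
}

# The warning above reports 32768 bytes, which corresponds to 32 KB.
bytes_to_kb 32768
```

Resource limits are per-process and inherited by child processes, so the values printed here describe only the shell in which the check is run.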

Thanks in advance,

Sangamesh