[mvapich-discuss] Execution problem

Srikanth Gumma srikanth.gumma at Locuz.com
Tue May 6 22:49:25 EDT 2008


Hello,
I have run into a problem and would appreciate some help with it.
The scenario: we have a Rocks (4.2) cluster with 12 nodes. We recently installed
PCI-Express InfiniBand cards in 5 of the nodes (the master node does not have an
IB card). The OFED installation was successful and IP addresses were assigned.
I installed MVAPICH2 and set up a password-free environment from compute-0-8 to
compute-0-12 (the nodes that have IB cards). So far everything is fine, and the
MPD ring boots up as well.
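For reference, the setup was checked along these lines, using the standard OFED
and MPD tools (the host-file name below is only a placeholder; the install
prefix is the one used for mpiexec below):

  ibv_devinfo                                       # lists the HCA and its port state
  /opt/mvapich2_ps/bin/mpdboot -n 5 -f mpd.hosts    # start the MPD ring on the 5 IB nodes
  /opt/mvapich2_ps/bin/mpdtrace                     # should list all 5 IB nodes in the ring
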
I compiled a sample MPI program and tried to execute it, and I got the
following results:
Scenario 1: Using root to execute Hellow.o (compiled with MVAPICH2's mpicc)

[root at compute-0-8 test]# /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
Hello world from process 0 of 2
Hello world from process 1 of 2
rank 1 in job 8  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
rank 0 in job 8  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

Scenario 2: Using the user ID (srinu) to execute the same file.

[srinu at compute-0-8 test]$ /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)....: Initialization failed
MPID_Init(102)...........: channel initialization failed
MPIDI_CH3_Init(178)......:
MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
rdma_iba_hca_init(645)...: cannot create cq
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)....: Initialization failed
MPID_Init(102)...........: channel initialization failed
MPIDI_CH3_Init(178)......:
MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
rdma_iba_hca_init(645)...: cannot create cq
rank 1 in job 9  compute-0-8.local_34399   caused collective abort of all ranks
  exit status of rank 1: return code 1

I believe some permissions or configuration need to be changed for this problem
to be resolved, but so far I have been unable to fix it. If any of you know of a
fix or workaround for the above situation, please suggest it.
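In case it is relevant: the libibverbs warning above suggests the locked-memory
limit (RLIMIT_MEMLOCK) is too low for the user srinu (root usually has a higher
limit, which might explain why the run as root gets further). Is raising that
limit on the IB nodes the kind of change that is needed? For example, a sketch
of what I have seen suggested elsewhere (the exact values may need tuning):

  # /etc/security/limits.conf on each IB node
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited

  # after logging in again, check the limit; it should no longer be
  # the 32 kB reported in the warning above
  ulimit -l
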
Any help is appreciated.


Srikanth Gumma
Locuz Enterprise Solutions Pvt Ltd
#20, Alfa Centere, VS Layout
Intermediate Ring Road
Bangalore 560047
Ph: +91-80-41314747




