[mvapich-discuss] one node misbehaving

Frank Leers Frank.Leers at Sun.COM
Thu Aug 30 18:52:01 EDT 2007


I've just built and started testing MVAPICH2-1.0 on top of OFED 1.2.5
and ConnectX HCAs.  I've run through some of the benchmarks in the
osu_benchmarks dir.  One node is giving me trouble; could someone please
lend advice?

On the misbehaving node, I can run something like the ib_read_bw test that
comes with OFED just fine between this node and another, and IPoIB works too.

I can also run through the standalone mpd tests - 

$ export MVAPICH2_HOME=/usr/mvapich2
$ export MPD_BIN=$MVAPICH2_HOME/bin
$ export PATH=$MVAPICH2_HOME/bin:$PATH
$ which mpd
$ mpd &
[1] 7643
$ mpdtrace
$ mpdringtest
time for 1 loops = 6.79492950439e-05 seconds
$ mpiexec -l -n 1 hostname
0: dingus-c2.local

so far so good...

However, any mpiexec of an actual MPI program fails, either standalone:

$ mpiexec -l -n 1 ./cpi   
0: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
0:     This will severely limit memory registrations.
0: Fatal error in MPI_Init:
0: Other MPI error, error stack:
0: MPIR_Init_thread(259)....: Initialization failed
0: MPID_Init(102)...........: channel initialization failed
0: MPIDI_CH3_Init(178)......: 
0: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
0: rdma_iba_hca_init(637)...: cannot create cq
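
The RLIMIT_MEMLOCK warning above suggests comparing the locked-memory limit
on this node with that of a working one. A minimal sketch, assuming a
bash-like shell where `ulimit -l` prints the limit in kilobytes (or the word
"unlimited"). The helper name memlock_status and the 65536 KB threshold are
illustrative assumptions, not values taken from MVAPICH2 or OFED:

```shell
# memlock_status LIMIT_KB [MIN_KB]
# Classify a value as printed by `ulimit -l` (kilobytes, or "unlimited").
# MIN_KB is an arbitrary illustrative threshold (assumption), defaulting
# to 65536 KB.
memlock_status() {
    limit="$1"
    min_kb="${2:-65536}"
    if [ "$limit" = "unlimited" ]; then
        echo "ok: unlimited"
    elif [ "$limit" -lt "$min_kb" ]; then
        echo "low: ${limit} KB"
    else
        echo "ok: ${limit} KB"
    fi
}

# Report the limit seen by the current shell.
memlock_status "$(ulimit -l)"
```

A limit of 32768 bytes, as in the warning, would show up here as 32 KB.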

...or as part of a larger group:

login-1 $ mpdboot -n 15 -f ~/mpd_hosts

...this is OK
login-1 $ mpiexec -l -n 14  hostname
1: dingus-c2.local
2: dingus-c14.local
3: dingus-c13.local
5: dingus-c4.local
4: dingus-c1.local
6: dingus-c12.local
10: dingus-c3.local
11: dingus-c6.local
8: dingus-c11.local
12: dingus-c8.local
9: dingus-c9.local
13: dingus-c7.local
7: dingus-c10.local
0: dingus-login1.local

...this fails
login-1 $ mpiexec -l -n 14  ./cpi
1: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
1:     This will severely limit memory registrations.
1: Fatal error in MPI_Init:
1: Other MPI error, error stack:
1: MPIR_Init_thread(259)....: Initialization failed
1: MPID_Init(102)...........: channel initialization failed
1: MPIDI_CH3_Init(178)......: 
1: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
1: rdma_iba_hca_init(637)...: cannot create cq
rank 1 in job 4  dingus-login1.local_37010   caused collective abort of
all ranks
  exit status of rank 1: return code 1 
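
Since only rank 1 prints the RLIMIT_MEMLOCK warning, it may help to compare
the limit that each launched process actually sees, which is likely whatever
the mpd on that node was started with. A hedged sketch: the filter below
reads `mpiexec -l` style lines ("RANK: VALUE") and prints the ranks whose
value is below a threshold. The function name low_memlock_ranks and the
threshold are assumptions for illustration, and the sample input is made up
rather than captured from this cluster:

```shell
# low_memlock_ranks [MIN_KB]
# Reads lines of the form "RANK: VALUE" on stdin, where VALUE is the
# output of `ulimit -l` (kilobytes or "unlimited"), and prints the ranks
# whose limit is below MIN_KB (illustrative default: 65536).
low_memlock_ranks() {
    awk -v min="${1:-65536}" -F': *' '
        $2 != "unlimited" && ($2 + 0) < min { print "rank " $1 ": " $2 " KB" }
    '
}

# Hypothetical usage on the cluster (not run here):
#   mpiexec -l -n 14 sh -c 'ulimit -l' | low_memlock_ranks
# Demonstration with made-up input:
printf '0: unlimited\n1: 32\n2: unlimited\n' | low_memlock_ranks
```

The rank numbers can be mapped back to hosts with the same
`mpiexec -l -n 14 hostname` run shown earlier.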

If I comment out the offending node from ~/mpd_hosts, things work OK:

login-1 $ vi ~/mpd_hosts
login-1 $ mpdallexit
login-1 $ mpdboot -n 14 -f ~/mpd_hosts
login-1 $ mpiexec -l -n 14  ./cpi
0: pi is approximately 3.1416009869231249, Error is 0.0000083333333318
0: wall clock time = 0.000401
0: Process 0 on dingus-login1.local
1: Process 1 on dingus-c5.local
5: Process 5 on dingus-c12.local
3: Process 3 on dingus-c14.local
2: Process 2 on dingus-c1.local
9: Process 9 on dingus-c3.local
12: Process 12 on dingus-c6.local
8: Process 8 on dingus-c11.local
13: Process 13 on dingus-c9.local
4: Process 4 on dingus-c4.local
11: Process 11 on dingus-c8.local
7: Process 7 on dingus-c10.local
6: Process 6 on dingus-c13.local
10: Process 10 on dingus-c7.local
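
The manual edit of ~/mpd_hosts above could also be scripted. A minimal
sketch, assuming the hosts file has one bare hostname per line and that
dingus-c2.local is the node being excluded (inferred from the outputs above,
not stated outright). The helper name comment_out_host is hypothetical:

```shell
# comment_out_host HOSTFILE HOST
# Print HOSTFILE with '#' prefixed to any line whose first field is
# exactly HOST. Assumes one bare hostname per line (assumption).
comment_out_host() {
    awk -v host="$2" '$1 == host { print "#" $0; next } { print }' "$1"
}

# Demonstration on a throwaway file rather than the real ~/mpd_hosts.
hosts="$(mktemp)"
printf 'dingus-c1.local\ndingus-c2.local\ndingus-c3.local\n' > "$hosts"
comment_out_host "$hosts" dingus-c2.local
rm -f "$hosts"
```

The result would be written to a new file and passed to mpdboot -f after
mpdallexit, as in the session above.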


