[mvapich-discuss] one node misbehaving

LEI CHAI chai.15 at osu.edu
Thu Aug 30 20:31:11 EDT 2007


Hi Frank,

Thanks for trying mvapich2-1.0. It looks like the limit on the amount of memory that can be locked (RLIMIT_MEMLOCK) is not set correctly. Could you try adding "ulimit -l unlimited" to your shell startup file (e.g. .bashrc)?
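
To confirm what limit a shell on that node actually sees, something like the sketch below may help. The warning in your output reports 32768 bytes, which bash's "ulimit -l" would print as 32, since it reports kbytes. The memlock_status function is a hypothetical helper written for this illustration, not part of MVAPICH2 or OFED.

```shell
# Hypothetical helper: classify a value as printed by `ulimit -l`.
# $1 is either the word "unlimited" or a number of kbytes.
memlock_status() {
    case "$1" in
        unlimited) echo "ok" ;;
        *)         echo "low: $1 kB" ;;
    esac
}

# Report the limit in effect for the current shell.
memlock_status "$(ulimit -l)"
```

Run it from the same kind of shell that starts mpd on the node, since a limit set in an interactive login may not apply to processes started another way.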

Lei


----- Original Message -----
From: Frank Leers <Frank.Leers at Sun.COM>
Date: Thursday, August 30, 2007 3:52 pm
Subject: [mvapich-discuss] one node misbehaving

> Hi,
> 
> I've just built and started to test MVAPICH2-1.0 on top of OFED 1.2.5
> and ConnectX HCAs.  I've run through some of the benchmarks in the
> osu_benchmarks dir.  One node is giving me trouble; could someone 
> please lend advice?
> 
> On the misbehaving node, I can run something like the ib_read_bw that
> comes with OFED just fine between this node and another, and IPoIB is 
> fine.
> I can also run through the standalone mpd tests:
> 
> $ export MVAPICH2_HOME=/usr/mvapich2
> $ export MPD_BIN=$MVAPICH2_HOME/bin
> $ export PATH=$MVAPICH2_HOME/bin:$PATH
> $ which mpd
> /usr/mvapich2/bin/mpd
> $ mpd &
> [1] 7643
> $ mpdtrace
> dingus-c2
> $ mpdringtest
> time for 1 loops = 6.79492950439e-05 seconds
> $ mpiexec -l -n 1 hostname
> 0: dingus-c2.local
> 
> 
> so far so good...
> 
> 
> Any other mpiexec run fails, either standalone:
> 
> $ mpiexec -l -n 1 ./cpi   
> 0: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> 0:     This will severely limit memory registrations.
> 0: Fatal error in MPI_Init:
> 0: Other MPI error, error stack:
> 0: MPIR_Init_thread(259)....: Initialization failed
> 0: MPID_Init(102)...........: channel initialization failed
> 0: MPIDI_CH3_Init(178)......: 
> 0: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
> 0: rdma_iba_hca_init(637)...: cannot create cq
> 
> 
> ...or as part of a larger group:
> 
> login-1 $ mpdboot -n 15 -f ~/mpd_hosts
> 
> ...this is OK
> login-1 $ mpiexec -l -n 14  hostname
> 1: dingus-c2.local
> 2: dingus-c14.local
> 3: dingus-c13.local
> 5: dingus-c4.local
> 4: dingus-c1.local
> 6: dingus-c12.local
> 10: dingus-c3.local
> 11: dingus-c6.local
> 8: dingus-c11.local
> 12: dingus-c8.local
> 9: dingus-c9.local
> 13: dingus-c7.local
> 7: dingus-c10.local
> 0: dingus-login1.local
> 
> 
> ...this fails
> login-1 $ mpiexec -l -n 14  ./cpi
> 1: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> 1:     This will severely limit memory registrations.
> 1: Fatal error in MPI_Init:
> 1: Other MPI error, error stack:
> 1: MPIR_Init_thread(259)....: Initialization failed
> 1: MPID_Init(102)...........: channel initialization failed
> 1: MPIDI_CH3_Init(178)......: 
> 1: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
> 1: rdma_iba_hca_init(637)...: cannot create cq
> rank 1 in job 4  dingus-login1.local_37010   caused collective abort of all ranks
>  exit status of rank 1: return code 1 
> 
> 
> If I comment out the offending node from ~/mpd_hosts, things work OK:
> 
> login-1 $ vi ~/mpd_hosts
> login-1 $ mpdallexit
> login-1 $ mpdboot -n 14 -f ~/mpd_hosts
> login-1 $ mpiexec -l -n 14  ./cpi
> 0: pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> 0: wall clock time = 0.000401
> 0: Process 0 on dingus-login1.local
> 1: Process 1 on dingus-c5.local
> 5: Process 5 on dingus-c12.local
> 3: Process 3 on dingus-c14.local
> 2: Process 2 on dingus-c1.local
> 9: Process 9 on dingus-c3.local
> 12: Process 12 on dingus-c6.local
> 8: Process 8 on dingus-c11.local
> 13: Process 13 on dingus-c9.local
> 4: Process 4 on dingus-c4.local
> 11: Process 11 on dingus-c8.local
> 7: Process 7 on dingus-c10.local
> 6: Process 6 on dingus-c13.local
> 10: Process 10 on dingus-c7.local
> 
> 
> thanks,
> 
> -frank
> 
> 
> 


