[mvapich-discuss] one node misbehaving

Frank Leers Frank.Leers at Sun.COM
Thu Aug 30 21:36:29 EDT 2007


Hi Lei,

Thanks, that was it.  I am setting these in /etc/security/limits.conf
and this node apparently didn't get the change when we updated.
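
For anyone comparing nodes, here is a minimal sketch of how one might check whether a limits.conf-style file carries an unlimited memlock entry, and what the current shell's limit is. The function name check_memlock_conf and its messages are made up for illustration, and the sample entries in the comments are an assumption about what such a file typically contains, not a copy of the file on these nodes.

```shell
#!/bin/sh
# Sketch only: check_memlock_conf is a hypothetical helper, not part of
# MVAPICH2 or OFED. It looks for a non-comment line of the form
#   <domain> <soft|hard|-> memlock unlimited
# for example (assumed typical entries):
#   * soft memlock unlimited
#   * hard memlock unlimited
check_memlock_conf() {
    conf="$1"
    if awk '$1 !~ /^#/ && $3 == "memlock" && $4 == "unlimited" { found = 1 }
            END { exit !found }' "$conf"; then
        echo "memlock unlimited configured in $conf"
    else
        echo "no unlimited memlock entry in $conf"
        return 1
    fi
}

# Limit in effect for this shell: "unlimited", or a number (bash reports it
# in kbytes, so the 32768 bytes in the libibverbs warning would show as 32).
echo "current max locked memory: $(ulimit -l)"
```

Running check_memlock_conf /etc/security/limits.conf on a good node and on the misbehaving one should show whether the file differs. Note that the file is only applied to new login sessions, so daemons such as mpd have to be restarted from a fresh login before the new limit takes effect.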

-frank

On Thu, 2007-08-30 at 17:31 -0700, LEI CHAI wrote:
> Hi Frank,
> 
> Thanks for trying mvapich2-1.0. It seems the limit on the amount of memory that can be locked is not set correctly. Can you try adding "ulimit -l unlimited" to your shell startup file (e.g. .bashrc)?
> 
> Lei
> 
> 
> ----- Original Message -----
> From: Frank Leers <Frank.Leers at Sun.COM>
> Date: Thursday, August 30, 2007 3:52 pm
> Subject: [mvapich-discuss] one node misbehaving
> 
> > Hi,
> > 
> > I've just built and started to test MVAPICH2-1.0 on top of OFED 1.2.5
> > and ConnectX HCAs.  I've run through some of the benchmarks in the
> > osu_benchmarks dir.  One node is giving me trouble; could someone
> > please advise?
> > 
> > On the misbehaving node, I can run something like the ib_read_bw that
> > comes with OFED just fine between this node and another, and ipoib is
> > fine.  I can also run through the standalone mpd tests:
> > 
> > $ export MVAPICH2_HOME=/usr/mvapich2
> > $ export MPD_BIN=$MVAPICH2_HOME/bin
> > $ export PATH=$MVAPICH2_HOME/bin:$PATH
> > $ which mpd
> > /usr/mvapich2/bin/mpd
> > $ mpd &
> > [1] 7643
> > $ mpdtrace
> > dingus-c2
> > $ mpdringtest
> > time for 1 loops = 6.79492950439e-05 seconds
> > $ mpiexec -l -n 1 hostname
> > 0: dingus-c2.local
> > 
> > 
> > so far so good...
> > 
> > 
> > Any other mpiexec run fails, either standalone:
> > 
> > $ mpiexec -l -n 1 ./cpi   
> > 0: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> > 0:     This will severely limit memory registrations.
> > 0: Fatal error in MPI_Init:
> > 0: Other MPI error, error stack:
> > 0: MPIR_Init_thread(259)....: Initialization failed
> > 0: MPID_Init(102)...........: channel initialization failed
> > 0: MPIDI_CH3_Init(178)......: 
> > 0: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
> > 0: rdma_iba_hca_init(637)...: cannot create cq
> > 
> > 
> > ...or as part of a larger group:
> > 
> > login-1 $ mpdboot -n 15 -f ~/mpd_hosts
> > 
> > ...this is OK
> > login-1 $ mpiexec -l -n 14  hostname
> > 1: dingus-c2.local
> > 2: dingus-c14.local
> > 3: dingus-c13.local
> > 5: dingus-c4.local
> > 4: dingus-c1.local
> > 6: dingus-c12.local
> > 10: dingus-c3.local
> > 11: dingus-c6.local
> > 8: dingus-c11.local
> > 12: dingus-c8.local
> > 9: dingus-c9.local
> > 13: dingus-c7.local
> > 7: dingus-c10.local
> > 0: dingus-login1.local
> > 
> > 
> > ...this fails
> > login-1 $ mpiexec -l -n 14  ./cpi
> > 1: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> > 1:     This will severely limit memory registrations.
> > 1: Fatal error in MPI_Init:
> > 1: Other MPI error, error stack:
> > 1: MPIR_Init_thread(259)....: Initialization failed
> > 1: MPID_Init(102)...........: channel initialization failed
> > 1: MPIDI_CH3_Init(178)......: 
> > 1: MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
> > 1: rdma_iba_hca_init(637)...: cannot create cq
> > rank 1 in job 4  dingus-login1.local_37010   caused collective abort of all ranks
> >  exit status of rank 1: return code 1 
> > 
> > 
> > If I comment out the offending node from ~/mpd_hosts, things work OK:
> > 
> > login-1 $ vi ~/mpd_hosts
> > login-1 $ mpdallexit
> > login-1 $ mpdboot -n 14 -f ~/mpd_hosts
> > login-1 $ mpiexec -l -n 14  ./cpi
> > 0: pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> > 0: wall clock time = 0.000401
> > 0: Process 0 on dingus-login1.local
> > 1: Process 1 on dingus-c5.local
> > 5: Process 5 on dingus-c12.local
> > 3: Process 3 on dingus-c14.local
> > 2: Process 2 on dingus-c1.local
> > 9: Process 9 on dingus-c3.local
> > 12: Process 12 on dingus-c6.local
> > 8: Process 8 on dingus-c11.local
> > 13: Process 13 on dingus-c9.local
> > 4: Process 4 on dingus-c4.local
> > 11: Process 11 on dingus-c8.local
> > 7: Process 7 on dingus-c10.local
> > 6: Process 6 on dingus-c13.local
> > 10: Process 10 on dingus-c7.local
> > 
> > 
> > thanks,
> > 
> > -frank
> > 
> > 
> > 
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > 
> 


