[mvapich-discuss] slurm+mvapich2 -> cannot create cq ?

Evren Yurtesen IB eyurtese at abo.fi
Thu Sep 6 10:56:05 EDT 2012


Hello,
I am trying to run a simple hello.c program with slurm+mvapich2.
I have searched a lot but could not find a solution to my problem. The
closest matches were:
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-December/002075.html
and
http://www.mail-archive.com/ewg@lists.openfabrics.org/msg11087.html
but I already have the memlock limits set to unlimited...

Any ideas on what might be going wrong?

Running 2 processes on the same node works fine, but running on 2 nodes fails:


-bash-4.1$ mpich2version
MVAPICH2 Version:     	1.8
MVAPICH2 Release date:	Mon Apr 30 14:50:19 EDT 2012
MVAPICH2 Device:      	ch3:mrail
MVAPICH2 configure:   	--prefix=/export/modules/apps/mvapich2/1.8/gnu 
--with-valgrind=/export/modules/tools/valgrind/3.8.0/include/valgrind 
--enable-fast=O3 --enable-shared --enable-mpe --with-pmi=slurm 
--with-pm=no
MVAPICH2 CC:  	gcc -O3 -march=corei7 -mtune=corei7   -O3
MVAPICH2 CXX: 	c++ -O3 -march=corei7 -mtune=corei7  -O3
MVAPICH2 F77: 	gfortran -O3 -march=corei7 -mtune=corei7  -O3
MVAPICH2 FC:  	gfortran -O3 -march=corei7 -mtune=corei7  -O3

-bash-4.1$ mpicc hello.c -lpmi -lslurm

-bash-4.1$ srun -n 2  ./a.out
srun: job 23458 queued and waiting for resources
srun: job 23458 has been allocated resources
Hello World from process 0 running on asg1
Hello World from process 1 running on asg1
Ready

-bash-4.1$ srun -N 2  ./a.out
srun: job 23454 queued and waiting for resources
srun: job 23454 has been allocated resources
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
srun: error: asg1: task 0: Exited with exit code 1
srun: Terminating job step 23454.0
srun: error: asg2: task 1: Exited with exit code 1
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH 
SIGNAL 9 ***

-bash-4.1$ ulimit -l
unlimited
-bash-4.1$ ssh asg1 -C 'ulimit -l'
unlimited
-bash-4.1$ ssh asg2 -C 'ulimit -l'
unlimited
-bash-4.1$
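
Note that the ssh checks above report the limit of an ssh login shell.
Processes started by srun are spawned by slurmd and may see a different
locked-memory limit, so it may be worth checking from inside a job step.
A minimal sketch, assuming a POSIX shell on the compute nodes; the function
name report_memlock and the file name memlock.sh are made up for
illustration:

```shell
#!/bin/sh
# Print the locked-memory limit seen by the current process.
# Hypothetical helper: the name report_memlock is not part of slurm or mvapich2.
report_memlock() {
    # ulimit -l prints the max locked memory (in kbytes) or "unlimited"
    limit=$(ulimit -l)
    # uname -n prints the node name, so output from several nodes can be told apart
    host=$(uname -n)
    echo "$host: max locked memory = $limit"
}

report_memlock
```

Saved as memlock.sh in a directory visible on both nodes, it could be run
with something like `srun -N 2 sh ./memlock.sh`, which should print one line
per node. If a node reports something other than "unlimited" there, the limit
inherited by the job step would be the first thing to look at.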

