[mvapich-discuss] slurm+mvapich2 -> cannot create cq ?
Evren Yurtesen IB
eyurtese at abo.fi
Thu Sep 6 10:56:05 EDT 2012
Hello,
I am trying to run a simple hello.c program with slurm+mvapich2.
I have searched a lot but could not find a solution to my problem. The
closest matches were:
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-December/002075.html
and
http://www.mail-archive.com/ewg@lists.openfabrics.org/msg11087.html
but I already have the memlock limits set to unlimited.
Any ideas on what might be going wrong?
Running 2 processes on the same node works fine, but running across 2
nodes fails in MPI_Init with "cannot create cq":
-bash-4.1$ mpich2version
MVAPICH2 Version: 1.8
MVAPICH2 Release date: Mon Apr 30 14:50:19 EDT 2012
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --prefix=/export/modules/apps/mvapich2/1.8/gnu
--with-valgrind=/export/modules/tools/valgrind/3.8.0/include/valgrind
--enable-fast=O3 --enable-shared --enable-mpe --with-pmi=slurm
--with-pm=no
MVAPICH2 CC: gcc -O3 -march=corei7 -mtune=corei7 -O3
MVAPICH2 CXX: c++ -O3 -march=corei7 -mtune=corei7 -O3
MVAPICH2 F77: gfortran -O3 -march=corei7 -mtune=corei7 -O3
MVAPICH2 FC: gfortran -O3 -march=corei7 -mtune=corei7 -O3
-bash-4.1$ mpicc hello.c -lpmi -lslurm
-bash-4.1$ srun -n 2 ./a.out
srun: job 23458 queued and waiting for resources
srun: job 23458 has been allocated resources
Hello World from process 0 running on asg1
Hello World from process 1 running on asg1
Ready
-bash-4.1$ srun -N 2 ./a.out
srun: job 23454 queued and waiting for resources
srun: job 23454 has been allocated resources
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
srun: error: asg1: task 0: Exited with exit code 1
srun: Terminating job step 23454.0
srun: error: asg2: task 1: Exited with exit code 1
slurmd[asg1]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
slurmd[asg2]: error: *** STEP 23454.0 KILLED AT 2012-09-06T17:52:06 WITH
SIGNAL 9 ***
-bash-4.1$ ulimit -l
unlimited
-bash-4.1$ ssh asg1 -C 'ulimit -l'
unlimited
-bash-4.1$ ssh asg2 -C 'ulimit -l'
unlimited
-bash-4.1$
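One caveat about the ulimit checks above: ssh starts a login session,
while tasks started by srun are launched by slurmd and may see a
different limit. That is an assumption about this setup, not something
verified here. A minimal sketch for checking the limit from inside a
job step follows; the script name check_memlock.sh and the function
name report_memlock are hypothetical.

```shell
#!/bin/bash
# check_memlock.sh (hypothetical name): report the locked-memory limit
# seen by the current process, tagged with the node it runs on.
report_memlock() {
    # "ulimit -l" prints the soft limit in kbytes, or "unlimited".
    # "uname -n" prints the node name.
    printf '%s: max locked memory = %s\n' "$(uname -n)" "$(ulimit -l)"
}

report_memlock
```

Running it as "srun -N 2 ./check_memlock.sh" should print one line per
node. If either node reports a number instead of "unlimited", the limit
inherited by slurmd would be the next thing to look at.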