[mvapich-discuss] cannot allocate CQ

Amit H Kumar AHKumar at odu.edu
Tue Jun 24 13:51:09 EDT 2008


Hi,

MVAPICH2 version: 1.0.3
HCA: Mellanox InfiniHost III Lx
IB stack: QLogic (SilverStorm)
MVAPICH2-1.0.3 was compiled using the Verbs API interface.

Following the user guide, I have modified /etc/security/limits.conf by adding:

    * soft memlock unlimited

and I have added the following line to /etc/init.d/sshd on all compute nodes, then restarted sshd on all of them:

    ulimit -l unlimited
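
For what it's worth, I assume a quick way to check whether the new limit actually propagates through a non-interactive ssh session (which is roughly how the mpd daemons get launched) would be something like the following; zorka-0-8 is just one of our compute nodes:

    # run from the head node as the job user
    ssh zorka-0-8 'ulimit -l'
    # expect "unlimited" if the limits.conf/sshd changes took effect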

I can run a simple hello world program and the OSU benchmarks, but only when I run them locally on the compute nodes as a regular user or as root. When I submit the same programs as an SGE job, they fail with the error messages attached below.

Also appended is the output of ulimit -a on the compute node.

This seems to have been discussed on the list previously, but I don't understand what is different about running the programs as an SGE job as opposed to running them locally on the compute node. Could it be the shell the job runs under? Can anyone help me dig into this issue?
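
In case it helps to see the shape of what I am doing, below is a minimal sketch of an SGE submit script along the lines of what I run, with an extra line to print the locked-memory limit the batch job actually sees (the parallel environment name and benchmark binary are placeholders, not our exact setup):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe mpich 2               # placeholder parallel environment name

    # print the locked-memory limit as seen inside the SGE job,
    # to compare against the interactive value on the node
    ulimit -l

    mpdtrace                     # check that the mpd ring is up
    mpiexec -n 2 ./osu_latency   # placeholder benchmark binary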

Thank you,
«Amit»

<<<<<<<snip>>>>>>>>>>
Tracing mpd's ... (this is a check to see that the mpd's have started as expected)
zorka-0-8
zorka-0-8
Now executing my program ...

ALL TO ALL
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 1 in job 1  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 1: return code 1
rank 0 in job 1  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: return code 1

Bcast
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 0 in job 2  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: return code 1

Bi directional BW
1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 1 in job 3  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 1: return code 1
rank 0 in job 3  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: return code 1

BW
1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 1 in job 4  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 1: return code 1
rank 0 in job 4  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: return code 1

Latency
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 0 in job 5  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: return code 1

MBW MR
1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
rank 1 in job 6  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 1: return code 1
rank 0 in job 6  zorka-0-8.local_35062   caused collective abort of all
ranks
  exit status of rank 0: killed by signal 9
<<<<<<</snip>>>>>>>>>>


[ahkumar at zorka-0-8 ~]$ sh
sh-3.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 71680
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 71680
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



