[mvapich-discuss] cannot allocate CQ

Christian Guggenberger christian.guggenberger at rzg.mpg.de
Tue Jun 24 15:49:42 EDT 2008


On Tue, Jun 24, 2008 at 01:51:09PM -0400, Amit H Kumar wrote:
> 
> Hi MVAPICH2-1.0.3,
> 
> HCA: Mellanox InfiniHost III Lx HCA
> IB Stack: Qlogic(SilverStorm)
> Compiled MVAPICH2-1.0.3 using the Verbs API interface.
> 
> Following the user guide, I added this line to /etc/security/limits.conf:
> * soft memlock unlimited
> I also added the following line to /etc/init.d/sshd on all compute nodes,
> then restarted sshd on all of them:
> ulimit -l unlimited
> 
> I can run a simple hello world and the OSU benchmarks, but only when I run
> them locally on the compute nodes as a regular user or as root. When I run
> the same programs as a user SGE job, they fail with the error messages below:

You'll have to add
ulimit -l unlimited

to your sge_execd startup script as well, and restart that daemon. SGE jobs
are started by sge_execd and inherit its limits, not those of sshd or of your
login shell.
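
To see which locked-memory limit a job actually gets, one option is to print it from inside the job and compare it with an interactive session. The following is only a rough sketch, assuming a POSIX-style shell on the compute node; the function name `check_memlock` and its messages are made up for illustration and are not part of MVAPICH2 or SGE.

```shell
#!/bin/sh
# Hypothetical helper: judge a locked-memory limit as printed by `ulimit -l`
# (a number of kbytes, or the word "unlimited"). Requiring "unlimited"
# mirrors the setting quoted in this thread; it is an assumption, not a
# documented MVAPICH2 threshold.
check_memlock() {
    case "$1" in
        unlimited) echo "ok: max locked memory is unlimited" ;;
        *)         echo "too low: max locked memory is $1 kbytes"; return 1 ;;
    esac
}

# Report the limit of the current shell, i.e. whatever its parent process
# (login shell, sshd, or sge_execd) passed down to it.
check_memlock "$(ulimit -l)" || true
```

Saved under some name of your choice and submitted with qsub, this should report the limit seen by the job; run by hand in an ssh session, it should report the limit from limits.conf or sshd. If the two differ, the execd startup script is the likely place to look.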

cheers.
 - Christian
> 
> Also appended is the output of ulimit -a on the compute node.
> 
> This seems to have been discussed on the list before, but I don't
> understand why running as an SGE job behaves differently from running
> locally on the compute node. Could it be the shell?
> Can anyone please help me dig into this issue?
> 
> Thank you,
> «Amit»
> 
> <<<<<<<snip>>>>>>>>>>
> Tracing mpd's ... (this is a check to see mpd's have strated as expected)
> zorka-0-8
> zorka-0-8
> Now Executing the my program ...
> 
> ALL TO ALL
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> 1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 1 in job 1  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 1: return code 1
> rank 0 in job 1  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: return code 1
> 
> Bcast
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> 1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 0 in job 2  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: return code 1
> 
> Bi directional BW
> 1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 1 in job 3  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 1: return code 1
> rank 0 in job 3  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: return code 1
> 
> BW
> 1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 1 in job 4  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 1: return code 1
> rank 0 in job 4  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: return code 1
> 
> Latency
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 0 in job 5  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: return code 1
> 
> MBW MR
> 1: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> 0: [rdma_iba_priv.c:624] error(-253): cannot allocate CQ
> rank 1 in job 6  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 1: return code 1
> rank 0 in job 6  zorka-0-8.local_35062   caused collective abort of all
> ranks
>   exit status of rank 0: killed by signal 9
> <<<<<<</snip>>>>>>>>>>
> 
> 
> [ahkumar@zorka-0-8 ~]$ sh
> sh-3.1$ ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> max nice                        (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 71680
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> max rt priority                 (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 71680
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> 


