[mvapich-discuss] MVAPICH2 "cannot create cq" error

Marc Noguera marc at klingon.uab.es
Wed Oct 31 11:06:59 EDT 2007


Dear all,
thanks again for the suggestions.
I rebooted both test nodes after modifying the settings, and the "cannot 
create cq" error disappeared.
Now I can compile test applications such as hellow.c and obtain the 
hello world output from all the processes.
However, I want to use the NWChem application, which I can now compile 
without much trouble. When I try to run this application on two nodes, 
I obtain the following:

borg70.uab.es:/tmp/T3>~/mvapich2/bin/mpiexec -n 2 
~/soft/nwchem-5.0/bin/LINUX64/nwchem
ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs 
API'.
1:create qp returned NULL: 0
1:create qp returned NULL: 0
Last System Error Message from Task 1:: Invalid argument
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 10:create qp 
returned NULL: 0
0:create qp returned NULL: 0
Last System Error Message from Task 0:: Invalid argument
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0rank 1 in job 
4  borg70.uab.es_50642   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
borg70.uab.es:/tmp/T3>

Of course, the mpd ring is up. I understand from the user guide that this 
error is also related to the memlock issue, similarly to the "cannot 
create cq" one. Is that correct?
"ulimit -l" gives "unlimited" on both nodes, so I am stuck here.
Any clue?
Thanks in advance
Marc
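
A minimal sketch, not taken from the user guide, for checking the 
locked-memory limit that a launched process actually inherits, as opposed 
to the one "ulimit -l" reports in an interactive shell. It assumes Python 
with the standard resource module is available on both nodes; the file 
name memlock_check.py and the helper names are hypothetical.

```python
import resource
import socket


def describe_limit(value):
    """Render an rlimit value the way `ulimit -l` does: kB or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports bytes; `ulimit -l` reports units of 1024 bytes.
    return "%d kB" % (value // 1024)


def memlock_report():
    """Return the soft and hard RLIMIT_MEMLOCK of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return "memlock soft=%s hard=%s" % (describe_limit(soft),
                                        describe_limit(hard))


if __name__ == "__main__":
    # Prefix with the host name so output from several nodes can be told apart.
    print("%s: %s" % (socket.gethostname(), memlock_report()))
```

Launching it through the same mpiexec used for the application (for 
example "~/mvapich2/bin/mpiexec -n 2 python memlock_check.py") should show 
the limit seen by processes started from the mpd ring, which may differ 
from the login shell's value if the ring was started before the limits 
were changed.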

Matthew Koop wrote:
> Marc,
>
> Did you perhaps update the lockable memory settings after starting the MPD
> ring? If so, try exiting the ring using mpdallexit and then booting it
> again with mpdboot so that mpd gets the new ulimit settings.
>
> Also, have you tried the ibv_rc_pingpong test that comes with the OFED
> distribution? It will allow you to verify that your IB installation is
> correct.
>
> Let us know if restarting the ring helps at all.
>
> Matt
>
>
> On Tue, 30 Oct 2007, Marc Noguera wrote:
>
>   
>> Dear list,
>> I am trying to use mvapich2 on our cluster. I am running some tests on
>> two dual-Opteron nodes running Fedora Core 6, using mvapich2 and the
>> Portland compilers.
>> I have successfully compiled mvapich2 using these compilers, or at least
>> I think so. I used the make.mvapich.ofa script, as I have the OFED 1.2.5
>> software stack installed on InfiniBand hardware.
>> Environment at mvapich2 compile time was:
>> CC=pgcc
>> CXX=pgCC
>> F77=pgf77
>> F90=pgf90
>> OPEN_IB_HOME=/usr/local/ofed
>> PREFIX=~/mvapich2
>> RDMA_CM_SUPPORT="no"
>>
>> After that, I have compiled the pi3f90.f test program (mpif90 pi3f90)
>> and I am trying to execute the a.out binary using mpdboot and mpiexec.
>>
>> I have followed the user guide and have the .mpd.conf file (with 600
>> permissions) in $HOME. I have also created an mpd.hosts file in my
>> working directory, containing these two lines:
>>
>> 10.10.1.170 ifhn=10.10.1.170
>> 10.10.1.171 ifhn=10.10.1.171
>>
>> Moreover, I have modified /etc/security/limits.conf and /etc/init.d/sshd
>> to ensure unlimited memlock values, as also described in the user guide.
>> That is, the "ulimit -l" command gives "unlimited" on both test
>> machines.
>> Finally, when trying to run the a.out test application, I obtain:
>>
>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdboot -n 2
>> --ifhn=10.10.1.170
>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdtrace -l
>> borg70.uab.es_43715 (10.10.1.170)
>> borg71.uab.es_37091 (10.10.1.171)
>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpiexec -n 2 ./a.out
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(259)....: Initialization failed
>> MPID_Init(102)...........: channel initialization failed
>> MPIDI_CH3_Init(178)......:
>> MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
>> rdma_iba_hca_init(639)...: cannot create cq
>> rank 1 in job 1  borg70.uab.es_43715   caused collective abort of all ranks
>>   exit status of rank 1: killed by signal 9
>> borg70.uab.es:/users/sysuser/test/T3>
>>
>>
>>
>> In the troubleshooting section of the user guide I find that "cannot
>> create cq" errors are possibly due to memlock limits, but I think I have
>> fixed these.
>> I am really stuck at this point.
>> Can you give me any hint on what I am doing wrong?
>>
>> Thanks in advance
>> Marc
>>
>
>
>
>   


