[mvapich-discuss] MVAPICH2 "cannot create cq" error

Marc Noguera marc at klingon.uab.es
Tue Nov 6 06:01:38 EST 2007


Matt, list
sorry for the delayed answer; I have been away these past few days.
I have just run the osu benchmarks. I attach the output for all of them.

At first sight everything is OK except for the Multi-Threaded Latency 
Test, which gives no output. Could you confirm this?

If that is the case, then I guess the problem is coming from the 
application's ARMCI libs, so I'll try to get the newest version of Global 
Arrays and recompile.
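
For reference, this is roughly the build environment I intend to use for the
rebuild (assuming the usual NWChem/ARMCI build variables; the exact names may
differ for the Global Arrays version I end up with):

export NWCHEM_TOP=~/soft/nwchem-5.0         # NWChem source tree (our path)
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB                 # build ARMCI over the OpenIB verbs API
export IB_INCLUDE=/usr/local/ofed/include   # OFED headers
export IB_LIB=/usr/local/ofed/lib64         # OFED libraries
export USE_MPI=y
export MPI_LOC=~/mvapich2                   # our MVAPICH2 install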

Thank you
Marc
Matthew Koop wrote:
> Marc,
>
> Good to hear simple programs are now working. Can you try the included
> osu_benchmarks and verify those are running as well? That will confirm that
> the memlock issues are solved.
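>
> For instance, something along these lines should do (assuming the benchmarks
> were built alongside your MVAPICH2 install; adjust the path to wherever they
> ended up):
>
>   ~/mvapich2/bin/mpiexec -n 2 ./osu_latency
>   ~/mvapich2/bin/mpiexec -n 2 ./osu_bw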
>
> It appears that the "create qp returned NULL" is coming from the ARMCI
> library rather than MVAPICH2, which suggests that the problem may be there
> instead.
>
> Matt
>
> On Wed, 31 Oct 2007, Marc Noguera wrote:
>
>   
>> Dear all,
>> thanks again for the suggestions,
>> I did reboot both test nodes after modifying settings, and the "cannot
>> create cq" error disappeared.
>> Now I can compile test applications, like hellow.c, and obtain the hello
>> world output from all the processes.
>> However, I want to use the NWChem application, which I can now compile
>> without much trouble. When I try to run it across two nodes I obtain the
>> following:
>>
>> borg70.uab.es:/tmp/T3>~/mvapich2/bin/mpiexec -n 2
>> ~/soft/nwchem-5.0/bin/LINUX64/nwchem
>> ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs
>> API'.
>> 1:create qp returned NULL: 0
>> 1:create qp returned NULL: 0
>> Last System Error Message from Task 1:: Invalid argument
>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 10:create qp
>> returned NULL: 0
>> 0:create qp returned NULL: 0
>> Last System Error Message from Task 0:: Invalid argument
>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0rank 1 in job
>> 4  borg70.uab.es_50642   caused collective abort of all ranks
>>   exit status of rank 1: killed by signal 9
>> borg70.uab.es:/tmp/T3>
>>
>> Of course, the mpd ring is up. I understand from the userguide that this
>> error is also related to the memlock issue, similarly to the "cannot
>> create cq" one. Is that correct?
>> "ulimit -l" reports "unlimited" on both nodes, so I am stuck here.
>> Any clue?
>> Thanks in advance
>> Marc
>>
>> Matthew Koop wrote:
>>     
>>> Marc,
>>>
>>> Did you perhaps update the lockable memory settings after starting the MPD
>>> ring? If so, try exiting the ring using mpdallexit and then booting it
>>> again with mpdboot so that mpd gets the new ulimit settings.
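>>>
>>> For example, with the paths from your setup:
>>>
>>>   ~/mvapich2/bin/mpdallexit
>>>   ~/mvapich2/bin/mpdboot -n 2 -f mpd.hosts
>>>   ~/mvapich2/bin/mpdtrace -l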
>>>
>>> Also, have you tried the ibv_rc_pingpong test that comes with the OFED
>>> distribution? It will allow you to verify that your IB installation is
>>> correct.
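>>>
>>> For instance, between your two test nodes (start the server side first,
>>> then point the client at it):
>>>
>>>   # on borg70:
>>>   ibv_rc_pingpong
>>>   # on borg71:
>>>   ibv_rc_pingpong borg70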
>>>
>>> Let us know if restarting the ring helps at all.
>>>
>>> Matt
>>>
>>>
>>> On Tue, 30 Oct 2007, Marc Noguera wrote:
>>>
>>>
>>>       
>>>> Dear list,
>>>> I am trying to use MVAPICH2 on our cluster. I am running some tests on
>>>> two dual-Opteron nodes with Fedora Core 6, using MVAPICH2 and the
>>>> Portland compilers.
>>>> I have compiled MVAPICH2 with these compilers, successfully as far as I
>>>> can tell. I used the make.mvapich.ofa script, since I have the OFED 1.2.5
>>>> software stack installed on the InfiniBand hardware.
>>>> The environment at MVAPICH2 compile time was:
>>>> CC=pgcc
>>>> CXX=pgCC
>>>> F77=pgf77
>>>> F90=pgf90
>>>> OPEN_IB_HOME=/usr/local/ofed
>>>> PREFIX=~/mvapich2
>>>> RDMA_CM_SUPPORT="no"
>>>>
>>>> After that, I have compiled the pi3f90.f test program (mpif90 pi3f90)
>>>> and I am trying to execute the a.out binary using mpdboot and mpiexec.
>>>>
>>>> I have done as the userguide says, and have the .mpd.conf file (with 600
>>>> permissions) in $HOME. I have also created an mpd.hosts file in my
>>>> workdir, containing these two lines:
>>>>
>>>> 10.10.1.170 ifhn=10.10.1.170
>>>> 10.10.1.171 ifhn=10.10.1.171
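>>>>
>>>> (For completeness, .mpd.conf just holds the mpd secret word, i.e. a single
>>>> line of the form secretword=<something>, as described in the userguide.)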
>>>>
>>>> Moreover, I have modified /etc/security/limits.conf and /etc/init.d/sshd
>>>> to ensure unlimited mem_lock values, as the userguide also suggests.
>>>> That is, the "ulimit -l" command reports "unlimited" on both test
>>>> machines.
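>>>>
>>>> Concretely, the limits.conf entries are the usual ones from the userguide:
>>>>
>>>>   * soft memlock unlimited
>>>>   * hard memlock unlimited
>>>>
>>>> and /etc/init.d/sshd got a "ulimit -l unlimited" before the daemon is
>>>> started, so that remotely launched processes inherit the new limit.
>>>>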
>>>> Finally, when trying to run the a.out test application, I obtain:
>>>>
>>>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdboot -n 2
>>>> --ifhn=10.10.1.170
>>>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdtrace -l
>>>> borg70.uab.es_43715 (10.10.1.170)
>>>> borg71.uab.es_37091 (10.10.1.171)
>>>> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpiexec -n 2 ./a.out
>>>> Fatal error in MPI_Init:
>>>> Other MPI error, error stack:
>>>> MPIR_Init_thread(259)....: Initialization failed
>>>> MPID_Init(102)...........: channel initialization failed
>>>> MPIDI_CH3_Init(178)......:
>>>> MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
>>>> rdma_iba_hca_init(639)...: cannot create cq
>>>> rank 1 in job 1  borg70.uab.es_43715   caused collective abort of all ranks
>>>>   exit status of rank 1: killed by signal 9
>>>> borg70.uab.es:/users/sysuser/test/T3>
>>>>
>>>>
>>>>
>>>> In the troubleshooting section of the userguide I find that "cannot
>>>> create cq" errors are possibly due to mem_lock limits, but I believe I
>>>> have fixed those.
>>>> I am really stuck at this point.
>>>> Can you give me any hint on what I am doing wrong?
>>>>
>>>> Thanks in advance
>>>> Marc
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>>>         
>>>
>>>
>>>       
>
>
>
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: osu_output.tgz
Type: application/octet-stream
Size: 2358 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20071106/c62fa0aa/osu_output.obj

