[mvapich-discuss] MVAPICH2 "cannot create cq" error

Matthew Koop koop at cse.ohio-state.edu
Wed Oct 31 13:10:05 EDT 2007


Marc,

Good to hear that simple programs are now working. Can you try the included
OSU benchmarks (osu_benchmarks) and make sure those run as well? That will
confirm that the memlock issues are resolved.
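
For example, assuming MVAPICH2 is installed in ~/mvapich2 and the benchmarks
were built in the osu_benchmarks directory of the build tree (adjust the
paths if yours differ), something along these lines should do:

  ~/mvapich2/bin/mpiexec -n 2 ./osu_latency
  ~/mvapich2/bin/mpiexec -n 2 ./osu_bw

If both complete across the two nodes without memory registration errors,
the locked-memory limits are fine.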

It appears that the "create qp returned NULL" message is coming from the
ARMCI library rather than from MVAPICH2, which suggests that the problem
may lie there instead.
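
One other thing worth double-checking is the locked-memory limit that the
processes actually inherit when launched through the MPD ring, since it can
differ from what a login shell reports. A quick test (just a sketch, but it
should work with your setup) is:

  ~/mvapich2/bin/mpiexec -n 2 sh -c 'hostname; ulimit -l'

Both ranks should print "unlimited"; if one does not, restart the ring on
that node so mpd picks up the new limits.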

Matt

On Wed, 31 Oct 2007, Marc Noguera wrote:

> Dear all,
> thanks again for the suggestions,
> I did reboot both test nodes after modifying settings, and the "cannot
> create cq" error disappeared.
> Now I can compile test applications such as hellow.c and obtain a
> hello-world output from all the processes.
> However, I want to use the NWChem application, which I can now compile
> without much trouble. When I try to run this application on two
> nodes I obtain the following:
>
> borg70.uab.es:/tmp/T3>~/mvapich2/bin/mpiexec -n 2
> ~/soft/nwchem-5.0/bin/LINUX64/nwchem
> ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs
> API'.
> 1:create qp returned NULL: 0
> 1:create qp returned NULL: 0
> Last System Error Message from Task 1:: Invalid argument
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 10:create qp
> returned NULL: 0
> 0:create qp returned NULL: 0
> Last System Error Message from Task 0:: Invalid argument
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0rank 1 in job
> 4  borg70.uab.es_50642   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> borg70.uab.es:/tmp/T3>
>
> Of course, the mpd ring is up. I understand from the user guide that this
> error is also related to the memlock issue, similarly to the "cannot
> create cq" one. Is that correct?
> "ulimit -l" reports "unlimited" on both nodes, so I am stuck here.
> Any clue?
> Thanks in advance
> Marc
>
> En/na Matthew Koop ha escrit:
> > Marc,
> >
> > Did you perhaps update the lockable memory settings after starting the MPD
> > ring? If so, try exiting the ring using mpdallexit and then booting it
> > again with mpdboot so that mpd gets the new ulimit settings.
> >
> > Also, have you tried the ibv_rc_pingpong test that comes with the OFED
> > distribution? It will allow you to verify that your IB installation is
> > correct.
> >
> > Let us know if restarting the ring helps at all.
> >
> > Matt
> >
> >
> > On Tue, 30 Oct 2007, Marc Noguera wrote:
> >
> >
> >> Dear list,
> >> I am trying to use MVAPICH2 on our cluster. I am running some tests on
> >> two dual-Opteron nodes running Fedora Core 6, using MVAPICH2 and the
> >> Portland (PGI) compilers.
> >> I have compiled MVAPICH2 with these compilers, or at least I think so.
> >> I used the make.mvapich.ofa script, since I have the OFED 1.2.5 software
> >> stack installed on our InfiniBand hardware.
> >> Environment at mvapich2 compile time was:
> >> CC=pgcc
> >> CXX=pgCC
> >> F77=pgf77
> >> F90=pgf90
> >> OPEN_IB_HOME=/usr/local/ofed
> >> PREFIX=~/mvapich2
> >> RDMA_CM_SUPPORT="no"
> >>
> >> After that, I have compiled the pi3f90.f test program (mpif90 pi3f90)
> >> and I am trying to execute the a.out binary using mpdboot and mpiexec.
> >>
> >> I have done as the user guide says, and have the .mpd.conf file (with
> >> 600 permissions) in $HOME. I have also created an mpd.hosts file in my
> >> working directory, containing these two lines:
> >>
> >> 10.10.1.170 ifhn=10.10.1.170
> >> 10.10.1.171 ifhn=10.10.1.171
> >>
> >> Moreover, I have modified /etc/security/limits.conf and /etc/init.d/sshd
> >> to ensure unlimited memlock values, as also mentioned in the user guide.
> >> That is, the "ulimit -l" command reports "unlimited" on both test
> >> machines.
> >> Finally, when trying to run the a.out test application, I obtain:
> >>
> >> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdboot -n 2
> >> --ifhn=10.10.1.170
> >> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpdtrace -l
> >> borg70.uab.es_43715 (10.10.1.170)
> >> borg71.uab.es_37091 (10.10.1.171)
> >> borg70.uab.es:/users/sysuser/test/T3>~/mvapich2/bin/mpiexec -n 2 ./a.out
> >> Fatal error in MPI_Init:
> >> Other MPI error, error stack:
> >> MPIR_Init_thread(259)....: Initialization failed
> >> MPID_Init(102)...........: channel initialization failed
> >> MPIDI_CH3_Init(178)......:
> >> MPIDI_CH3I_RMDA_init(203): Failed to Initialize HCA type
> >> rdma_iba_hca_init(639)...: cannot create cq
> >> rank 1 in job 1  borg70.uab.es_43715   caused collective abort of all ranks
> >>   exit status of rank 1: killed by signal 9
> >> borg70.uab.es:/users/sysuser/test/T3>
> >>
> >>
> >>
> >> In the troubleshooting section of the user guide I find that "cannot
> >> create cq" errors are possibly due to memlock limits, but I think I have
> >> fixed these, or at least I believe so.
> >> I am really stuck at this point.
> >> Can you give me any hint on what I am doing wrong?
> >>
> >> Thanks in advance
> >> Marc
> >>
> >>
> >>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
> >
> >
> >
>




More information about the mvapich-discuss mailing list