[mvapich-discuss] RDMA CM fails to boot in multiple HCA settings (and some minor problems)

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu Mar 28 15:41:29 EDT 2013


Hi Akihiro

On Thu, Mar 28, 2013 at 6:16 AM, Akihiro Nomura
<nomura.a.ac at m.titech.ac.jp> wrote:

> Hello.
>
> I'm trying to use RDMA CM on machines that have two HCAs connected to
> different networks (i.e. a dual-rail configuration).
> While doing so, I encountered three problems.
>
> 1) The path to /etc/mv2.conf is hardcoded.
> As I don't have root access to this machine, I had to modify this part.
> It would be great if I could give this path via an environment variable.
>
> 2) Possible memory corruption
> On line 1106 of src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c,
> the program reads all lines from mv2.conf and stores them in
> rdma_cm_local_ips[].
> If this file has more lines than rdma_num_hcas*rdma_num_ports, it writes
> data past the end of the allocated area.
>

Thanks for the reports. We will fix the above two issues in upcoming releases.
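
For illustration only, here is a minimal sketch of how both points could
be addressed. It is not the MVAPICH2 code: the environment variable name
EXAMPLE_MV2_CONF, the helper names conf_path() and read_local_ips(), and
the fixed entry length IP_STR_LEN are all hypothetical. It assumes one
address per line in the file and a caller that passes the array capacity
(for example rdma_num_hcas*rdma_num_ports).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_CONF_PATH "/etc/mv2.conf"    /* path named in the report */
#define CONF_PATH_ENV     "EXAMPLE_MV2_CONF" /* hypothetical variable name */
#define IP_STR_LEN        64                 /* hypothetical entry size */

/* Return the path from the environment if it is set and non-empty,
 * otherwise fall back to the default path. */
static const char *conf_path(void)
{
    const char *p = getenv(CONF_PATH_ENV);
    return (p != NULL && p[0] != '\0') ? p : DEFAULT_CONF_PATH;
}

/* Read at most max_ips non-empty lines from path into ips[].
 * Returns the number of entries stored, or -1 if the file cannot be
 * opened. Extra lines are ignored rather than written past the array. */
static int read_local_ips(const char *path, char ips[][IP_STR_LEN],
                          int max_ips)
{
    char line[256];
    int count = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (count < max_ips && fgets(line, sizeof line, fp) != NULL) {
        size_t n;

        line[strcspn(line, "\r\n")] = '\0';  /* strip the line ending */
        if (line[0] == '\0')
            continue;                        /* skip blank lines */

        n = strlen(line);
        if (n >= IP_STR_LEN)
            n = IP_STR_LEN - 1;              /* truncate overlong entries */
        memcpy(ips[count], line, n);
        ips[count][n] = '\0';
        count++;
    }

    fclose(fp);
    return count;
}
```

A caller would use it as read_local_ips(conf_path(), ips, capacity) and
treat a return value smaller than the number of lines in the file as
"extra entries ignored". Capping the loop on the array capacity is the
point of the sketch; whether extra lines should be ignored or reported
as an error is a design choice left open here.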



>
> 3) The MPI program fails to start in a dual-rail configuration with RDMA CM.
> When I try to start my MPI program, it fails with the following message:
> > nomura-a-ac at t2a006161:~/mpibench/mpibench> MV2_NUM_HCAS=2 MV2_USE_RDMA_CM=1 ~/mpi-inst/mvapich2-1.9b-gcc/bin/mpirun -np 2 -machinefile ~/machines.local ./mpibench.mvapich
> > [t2a006169:mpi_rank_1][rdma_cm_create_qp] src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:867: Error creating qp on hca 1 using rdma_cm. -1 [cmid: 0x679b00, pd: 0x6152f0, send_cq: (nil), recv_cq: (nil)]
> > : Invalid argument (22)
>


There are known issues with RDMA_CM support in multi-rail configurations.
Do you have a specific reason for using RDMA_CM here? If not, you can use
the default IB CM, which is better optimized.

Thanks
Devendar


> My configure options are as follows:
> > ./configure CC=gcc CXX=g++ FC=gfortran --prefix=/home/usr1/nomura-a-ac/mpi-inst/mvapich2-1.9b-gcc --with-rdma=gen2 --enable-rdma-cm --enable-romio --enable-static --enable-shared --enable-sharedlibs=gcc --with-pm=hydra,mpirun,mpd
>
> Best regards,
> --
> Akihiro Nomura
> nomura.a.ac at m.titech.ac.jp
>



-- 
Devendar