[mvapich-discuss] RDMA CM fails to boot in multiple HCA settings (and some minor problems)

Akihiro Nomura nomura.a.ac at m.titech.ac.jp
Thu Mar 28 06:16:06 EDT 2013


Hello.

I'm trying to use RDMA CM on machines that have two HCAs connected to different networks (i.e., a dual-rail configuration).
In the process, I encountered three problems.

1) The path to /etc/mv2.conf is hardcoded.
Since I don't have root access to these machines, I had to modify this part of the source.
It would be great if this path could be supplied via an environment variable, as sketched below.
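For illustration, here is a minimal sketch of what I have in mind; the variable name MV2_CM_CONF_FILE is just a placeholder I made up, not an existing MVAPICH2 parameter:

    /* Sketch only: fall back to the hardcoded default when the
     * (hypothetical) MV2_CM_CONF_FILE variable is not set. */
    #include <stdlib.h>

    static const char *get_mv2_conf_path(void)
    {
        const char *path = getenv("MV2_CM_CONF_FILE");   /* hypothetical name */
        return (path != NULL && *path != '\0') ? path : "/etc/mv2.conf";
    }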

2) Possible memory corruption
At line 1106 of src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c,
the program reads every line of mv2.conf and stores them in rdma_cm_local_ips[].
If the file contains more lines than rdma_num_hcas*rdma_num_ports, it writes past the end of the allocated array.
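To make the concern concrete, here is a minimal sketch of the kind of bound I would expect there; it is not the actual rdma_cm.c code, and the array type and line length are my own assumptions:

    /* Sketch, not the real code: stop reading once rdma_num_hcas *
     * rdma_num_ports entries have been stored, so extra lines in
     * mv2.conf are ignored instead of overflowing the array. */
    #include <stdio.h>

    #define MAX_LINE 128

    static int read_local_ips(FILE *fp, char ips[][MAX_LINE],
                              int num_hcas, int num_ports)
    {
        int max_entries = num_hcas * num_ports;   /* capacity of ips[] */
        int count = 0;
        char line[MAX_LINE];

        while (count < max_entries && fgets(line, sizeof(line), fp) != NULL) {
            snprintf(ips[count], MAX_LINE, "%s", line);   /* bounded copy */
            count++;
        }
        return count;
    }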

3) The MPI program fails to boot in a dual-rail configuration with RDMA CM.
When I try to start my MPI program, it fails with the following message:
> nomura-a-ac at t2a006161:~/mpibench/mpibench> MV2_NUM_HCAS=2 MV2_USE_RDMA_CM=1 ~/mpi-inst/mvapich2-1.9b-gcc/bin/mpirun -np 2 -machinefile ~/machines.local ./mpibench.mvapich
> [t2a006169:mpi_rank_1][rdma_cm_create_qp] src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:867: Error creating qp on hca 1 using rdma_cm. -1 [cmid: 0x679b00, pd: 0x6152f0, send_cq: (nil), recv_cq: (nil)] 
> : Invalid argument (22)
> 
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 253
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
How can I produce more verbose output to find out why it fails?
When I omit either MV2_NUM_HCAS or MV2_USE_RDMA_CM, it works fine.

My configure options are as follows:
> ./configure CC=gcc CXX=g++ FC=gfortran --prefix=/home/usr1/nomura-a-ac/mpi-inst/mvapich2-1.9b-gcc --with-rdma=gen2 --enable-rdma-cm --enable-romio --enable-static --enable-shared --enable-sharedlibs=gcc --with-pm=hydra,mpirun,mpd

Best regards,
-- 
Akihiro Nomura
nomura.a.ac at m.titech.ac.jp

