[mvapich-discuss] RDMA CM fails to boot in multiple HCA settings
(and some minor problems)
Akihiro Nomura
nomura.a.ac at m.titech.ac.jp
Thu Mar 28 06:16:06 EDT 2013
Hello.
I'm trying to use RDMA CM on machines that have two HCAs connected to different networks (i.e., a dual-rail configuration).
During this procedure, I encountered three problems.
1) The path to /etc/mv2.conf is hardcoded.
As I don't have root access to this machine, I had to patch this part of the source.
It would be great if I could specify this path via an environment variable.
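To illustrate the kind of override I have in mind, here is a minimal sketch. It is not the actual MVAPICH2 code; the variable name EXAMPLE_MV2_CONF_PATH and the helper example_conf_path() are hypothetical names chosen only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical environment variable name, for illustration only. */
#define EXAMPLE_CONF_ENV "EXAMPLE_MV2_CONF_PATH"

/* Return the path given in the environment if it is set and non-empty,
 * otherwise return the caller-supplied fallback (e.g. "/etc/mv2.conf"). */
static const char *example_conf_path(const char *fallback)
{
    const char *p;

    assert(fallback != NULL);
    p = getenv(EXAMPLE_CONF_ENV);
    return (p != NULL && p[0] != '\0') ? p : fallback;
}
```

With something like this, a user without root access could point the library at a file in their home directory, while the current behaviour stays the default when the variable is unset.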
2) Possible memory corruption
On line 1106 of src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c,
the program reads all lines from mv2.conf and stores them in rdma_cm_local_ips[].
If this file has more lines than rdma_num_hcas*rdma_num_ports, the program writes past the end of the allocated area.
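A bounded read loop would avoid this. The following is only a sketch of the idea, not a patch against rdma_cm.c: the function example_read_ips(), the buffer size EXAMPLE_IP_LEN, and the string-array layout are assumptions made for the example, and max_entries stands in for rdma_num_hcas*rdma_num_ports.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EXAMPLE_IP_LEN 64   /* assumed per-entry buffer size */

/* Read at most max_entries non-empty lines from fp into ips[].
 * Returns the number of entries stored. *truncated is set to 1 if
 * the file contained more entries than there was room for.
 * Lines longer than EXAMPLE_IP_LEN-1 characters are not handled
 * specially in this sketch (fgets would split them). */
static int example_read_ips(FILE *fp, char ips[][EXAMPLE_IP_LEN],
                            int max_entries, int *truncated)
{
    char line[EXAMPLE_IP_LEN];
    int count = 0;

    assert(fp != NULL && ips != NULL && truncated != NULL);
    *truncated = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
        if (line[0] == '\0')
            continue;                         /* skip blank lines */
        if (count >= max_entries) {
            *truncated = 1;                   /* stop instead of overflowing */
            break;
        }
        memcpy(ips[count], line, strlen(line) + 1);
        count++;
    }
    return count;
}
```

The caller can then warn about the ignored lines when *truncated is set, rather than silently corrupting memory.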
3) The MPI program fails to start in a dual-rail configuration with RDMA CM.
When I try to launch my MPI program, it fails with the following message:
> nomura-a-ac at t2a006161:~/mpibench/mpibench> MV2_NUM_HCAS=2 MV2_USE_RDMA_CM=1 ~/mpi-inst/mvapich2-1.9b-gcc/bin/mpirun -np 2 -machinefile ~/machines.local ./mpibench.mvapich
> [t2a006169:mpi_rank_1][rdma_cm_create_qp] src/mpid/ch3/channels/common/src/rdma_cm/rdma_cm.c:867: Error creating qp on hca 1 using rdma_cm. -1 [cmid: 0x679b00, pd: 0x6152f0, send_cq: (nil), recv_cq: (nil)]
> : Invalid argument (22)
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 253
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
How can I get more verbose output to find out why it fails?
When I omit either MV2_NUM_HCAS or MV2_USE_RDMA_CM, it works fine.
My configure options are as follows:
> ./configure CC=gcc CXX=g++ FC=gfortran --prefix=/home/usr1/nomura-a-ac/mpi-inst/mvapich2-1.9b-gcc --with-rdma=gen2 --enable-rdma-cm --enable-romio --enable-static --enable-shared --enable-sharedlibs=gcc --with-pm=hydra,mpirun,mpd
Best regards,
--
Akihiro Nomura
nomura.a.ac at m.titech.ac.jp