[mvapich-discuss] MVAPICH2-1.2, MVAPICH2-1.4 do not work with specified PKEYs. Proposed patch included

Mike Heinz michael.heinz at qlogic.com
Fri Aug 14 12:06:57 EDT 2009


My testers are reporting further problems with mvapich2. On a fabric where the use of pkeys is required, mvapich2 is failing. This has two causes:

1) The MV2_DEFAULT_PKEY parameter does not appear to be supported when using mpirun_rsh. In fact, mpirun_rsh does not appear to pick up any MV2_* parameters from the environment.

2) When using mpd and mpiexec, the MV2_DEFAULT_PKEY parameter gets passed, but then fails. For example:

[root@homer mpi_apps]# export MV2_DEFAULT_PKEY=0xffff
[root@homer mpi_apps]# /usr/mpi/gcc/mvapich2-1.2p1/bin/mpiexec -machinefile /opt/iba/src/mpi_apps/mpi_hosts -n 2 osu2/osu_bw
 [0] Abort: Can't find PKEY INDEX according to given PKEY
 at line 1190 in file rdma_iba_priv.c
rank 0 in job 6  homer.dev.silverstorm.com_33133   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

(Note that 0xffff is actually the default PKEY).

A quick saquery reveals that the pkey is, in fact, in the table:

[root@homer mpi_apps]# iba_saquery -o pkey -l 1
LID: 0x0001 PortNum:  1 BlockNum:  0
      0-   7:  0x9001  0xffff  0x9002  0x0000  0x0000  0x0000  0x0000  0x0000
      8-  15:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
     16-  23:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000
     24-  31:  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000  0x0000

When I examined ibv_param.c to see what was going on, here is what I found:

    if ((value = getenv("MV2_DEFAULT_PKEY")) != NULL) {
        rdma_default_pkey = (uint16_t)strtol(value, (char **) NULL,0) & PKEY_MASK;
    }

And:

    #define PKEY_MASK 0x7fff /* the last bit is reserved */

This makes it clear that mvapich2 is doing bad things to the pkey when run under mpiexec. If nothing else, the high bit must be set in order for the connection to have full membership in an InfiniBand partition. Without this bit, a node has only "limited membership", and limited members are not permitted to talk to each other.
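
To illustrate the membership bit, here is a minimal sketch in C. The helper names (pkey_is_full_member, pkey_base) and the macro names are hypothetical and not part of MVAPICH2; only the 0x7fff and 0x8000 values come from the discussion in this message.

```c
#include <stdint.h>

/* Illustrative names; the values match the mask and high bit discussed above. */
#define PKEY_BASE_MASK       0x7fff  /* low 15 bits: the partition number */
#define PKEY_MEMBERSHIP_BIT  0x8000  /* high bit: set means full membership */

/* Hypothetical helper: nonzero if the pkey carries full membership. */
static int pkey_is_full_member(uint16_t pkey)
{
    return (pkey & PKEY_MEMBERSHIP_BIT) != 0;
}

/* Hypothetical helper: the partition number with the membership bit removed. */
static uint16_t pkey_base(uint16_t pkey)
{
    return (uint16_t)(pkey & PKEY_BASE_MASK);
}
```

With these helpers, pkey_is_full_member(0xffff) is nonzero, while pkey_is_full_member(0xffff & PKEY_BASE_MASK) is zero: masking the user-supplied value with 0x7fff turns a full-membership pkey into a limited one.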

The following patch fixes the errors in masking and comparing pkeys in mvapich2-1.2p1. The patch also works for mvapich2-1.4rc1, but with considerable fuzz.

################################################################################################
diff -rwud mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.c mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.c
--- mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.c       2008-11-02 14:44:32.000000000 -0500
+++ mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.c     2009-08-14 09:35:07.000000000 -0400
@@ -984,7 +984,7 @@
     }

     if ((value = getenv("MV2_DEFAULT_PKEY")) != NULL) {
-        rdma_default_pkey = (uint16_t)strtol(value, (char **) NULL,0) & PKEY_MASK;
+        rdma_default_pkey = (uint16_t)strtol(value, (char **) NULL,0) | PKEY_FULL_MEMBERSHIP;
     }

     if ((value = getenv("MV2_DEFAULT_MIN_RNR_TIMER")) != NULL) {
diff -rwud mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h
--- mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h       2008-10-29 12:55:43.000000000 -0400
+++ mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/ibv_param.h     2009-08-12 12:24:12.000000000 -0400
@@ -99,7 +99,8 @@
 extern unsigned long        rdma_spin_count;
 extern int                  USE_SMP;

-#define PKEY_MASK 0x7fff /* the last bit is reserved */
+#define PKEY_MASK 0x7fff /* don't use the high bit when looking up pkeys. */
+#define PKEY_FULL_MEMBERSHIP 0x8000 /* MPI apps must be full members. */
 #define RDMA_PIN_POOL_SIZE              (2*1024*1024)
 #define RDMA_DEFAULT_MAX_CQ_SIZE        (40000)
 #define RDMA_DEFAULT_PORT               (-1)
diff -rwud mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
--- mvapich2-1.2p1.orig/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c   2008-10-29 12:55:43.000000000 -0400
+++ mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c 2009-08-14 09:35:25.000000000 -0400
@@ -1161,7 +1161,7 @@
         uint16_t curr_pkey;
         ibv_query_pkey(MPIDI_CH3I_RDMA_Process.nic_context[hca_num],
                 (uint8_t)port_num, (int)i ,&curr_pkey);
-        if (pkey == ntohs(curr_pkey) & PKEY_MASK) {
+        if ((pkey & PKEY_MASK) == (ntohs(curr_pkey) & PKEY_MASK)) {
             *index = i;
             return 1;
         }
###################################################################
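
The rdma_iba_priv.c change hinges on C operator precedence: == binds more tightly than &, so the original condition is parsed as (pkey == ntohs(curr_pkey)) & PKEY_MASK. The following is a minimal sketch of the difference, not the MVAPICH2 code itself. The function names are hypothetical, and ntohs is left out on the assumption that both values are already in host byte order.

```c
#include <stdint.h>

#define PKEY_MASK 0x7fff

/* How the compiler parses `pkey == table_entry & PKEY_MASK`:
 * the comparison happens first, then its 0/1 result is ANDed with the mask. */
static int match_as_originally_parsed(uint16_t pkey, uint16_t table_entry)
{
    return (pkey == table_entry) & PKEY_MASK;
}

/* Patched form: compare only the low 15 bits of both values. */
static int match_masked(uint16_t pkey, uint16_t table_entry)
{
    return (pkey & PKEY_MASK) == (table_entry & PKEY_MASK);
}
```

With a requested pkey of 0x7fff (0xffff after the old masking) and a table entry of 0xffff, match_as_originally_parsed returns 0 and match_masked returns 1, which is consistent with the "Can't find PKEY INDEX" abort shown earlier.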

On the subject of mpirun_rsh, it would be easy enough to patch it so that it respects MV2_* variables the way it currently respects VIADEV_* variables, but I'd like to understand why it doesn't already do that. Is there a reason mpirun_rsh requires you to specify MV2_* variables on the command line instead of in the environment or the parameter file?

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania


