[mvapich-discuss] mvapich2 (1.8) infiniband programs do not
communicate between some nodes....
Dark Charlot
jcldc13 at gmail.com
Mon Jun 25 18:14:59 EDT 2012
2012/6/25 Devendar Bureddy <bureddy at cse.ohio-state.edu>
> Hi Jean-Charles
>
Hi
>
> - Did you check whether ib verb level tests (ib_send_lat, ib_send_bw,
> etc.) work fine between these two groups (z800_04 and amos)
>
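For reference, the verb-level tests mentioned above run server-first, then client (a sketch, assuming the standard perftest tools are installed on both hosts; the host names are taken from this thread):

```shell
# On the receiver (e.g. amos): start the latency test in server mode
ib_send_lat

# On the sender (e.g. z800_04): point the client at the server host
ib_send_lat amos

# Same pattern for the bandwidth test:
ib_send_bw        # on amos (server)
ib_send_bw amos   # on z800_04 (client)
```

If these verb-level tests also hang between the two groups, the problem is below MPI (fabric/routing); if they pass, it points at the MVAPICH2 layer.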
> - Can you please get the output of the following two commands:
> $mpirun -np 2 -hosts z800_04,amos -env MV2_SHOW_ENV_INFO 1 ./osu_latency
>
mpirun -np 2 -hosts z800_04,amos -env MV2_SHOW_ENV_INFO 1 ./osu_latency
MVAPICH2-1.8 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_XEON_X5650_12
HCA NAME : MV2_HCA_MLX_CX_QDR
HETEROGENEOUS HCA : YES
MV2_VBUF_TOTAL_SIZE : 8192
MV2_IBA_EAGER_THRESHOLD : 8192
MV2_RDMA_FAST_PATH_BUF_SIZE : 8192
MV2_EAGERSIZE_1SC : 8192
MV2_PUT_FALLBACK_THRESHOLD : 4096
MV2_GET_FALLBACK_THRESHOLD : 196608
MV2_SMP_EAGERSIZE : 16385
MV2_SMPI_LENGTH_QUEUE : 65536
MV2_SMP_NUM_SEND_BUFFER : 128
MV2_SMP_BATCH_SIZE : 8
---------------------------------------------------------------------
---------------------------------------------------------------------
# OSU MPI Latency Test v3.6
# Size Latency (us)
(program hangs)
> $mpirun -np 2 -hosts amos,z800_04 -env MV2_SHOW_ENV_INFO 1 ./osu_latency
>
mpirun -np 2 -hosts amos,z800_04 -env MV2_SHOW_ENV_INFO 1 ./osu_latency
MVAPICH2-1.8 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_GENERIC
HCA NAME : MV2_HCA_MLX_CX_QDR
HETEROGENEOUS HCA : YES
MV2_VBUF_TOTAL_SIZE : 8192
MV2_IBA_EAGER_THRESHOLD : 8192
MV2_RDMA_FAST_PATH_BUF_SIZE : 8192
MV2_EAGERSIZE_1SC : 4096
MV2_PUT_FALLBACK_THRESHOLD : 4096
MV2_GET_FALLBACK_THRESHOLD : 196608
MV2_SMP_EAGERSIZE : 65537
MV2_SMPI_LENGTH_QUEUE : 262144
MV2_SMP_NUM_SEND_BUFFER : 256
MV2_SMP_BATCH_SIZE : 8
---------------------------------------------------------------------
---------------------------------------------------------------------
(program hangs)
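Note that the two outputs above disagree on several auto-tuned values (detected arch MV2_ARCH_INTEL_XEON_X5650_12 vs MV2_ARCH_INTEL_GENERIC, and different MV2_EAGERSIZE_1SC / SMP settings). One experiment worth trying (a sketch, not verified): pin the values that differ so both processes use identical settings regardless of which architecture is detected:

```shell
# Force the auto-tuned parameters that differ between the two runs
# to a single common value on both hosts (values chosen from the
# first output above; any matching pair should do for the test).
mpirun -np 2 -hosts z800_04,amos \
    -env MV2_VBUF_TOTAL_SIZE 8192 \
    -env MV2_IBA_EAGER_THRESHOLD 8192 \
    -env MV2_SMP_EAGERSIZE 65536 \
    ./osu_latency
```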
>
> This might help us to see if there is anything wrong in the parameter
> settings.
>
> -Devendar
>
Thanks for your help,
Jean-Charles
>
> On Mon, Jun 25, 2012 at 9:55 AM, Dark Charlot <jcldc13 at gmail.com> wrote:
> >
> > Dear experts,
> >
> > I built a diskless infiniband cluster composed of 16 computers. All the
> > infiniband cards are set up correctly.
> >
> > Here is the output of the command "ibnodes":
> >
> > Ca : 0x0002c903000b5fac ports 1 "atlas01 HCA-1"
> > Ca : 0x0002c903000b5634 ports 1 "atlas05 HCA-1"
> > Ca : 0x0002c903000b60e0 ports 1 "atlas04 HCA-1"
> > Ca : 0x0002c903000b5684 ports 1 "z800_07 HCA-1"
> > Ca : 0x0002c903000b56a0 ports 1 "z800_02 HCA-1"
> > Ca : 0x0002c9030009d1b2 ports 1 "kerkira HCA-1"
> > Ca : 0x0002c903000bb098 ports 1 "dodoni HCA-1"
> > Ca : 0x0002c903000b5fc8 ports 1 "atlas02 HCA-1"
> > Ca : 0x0002c903000b5fc4 ports 1 "z800_03 HCA-1"
> > Ca : 0x0002c903000b60e4 ports 1 "atlas03 HCA-1"
> > Ca : 0x0002c903000b56b4 ports 1 "z800_05 HCA-1"
> > Ca : 0x0002c903000b3a82 ports 1 "z800_06 HCA-1"
> > Ca : 0x0002c903000b5690 ports 1 "z800_01 HCA-1"
> > Ca : 0x0002c903000b3a92 ports 1 "z800_04 HCA-1"
> > Ca : 0x0002c903000b5688 ports 1 "zagori HCA-1"
> > Ca : 0x0002c903000b3a52 ports 1 "amos HCA-1"
> >
> > I installed mvapich2 1.8 with the following configuration options:
> >
> > ./mpich2version
> > MVAPICH2 Version: 1.8
> > MVAPICH2 Release date: Mon Apr 30 14:50:19 EDT 2012
> > MVAPICH2 Device: ch3:mrail
> > MVAPICH2 configure: --with-device=ch3:mrail --with-rdma=gen2
> > --prefix=/rsdata/local/SHARED/Linux64/mvapich2-1.8-IB
> > MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
> > MVAPICH2 CXX: c++ -DNDEBUG -DNVALGRIND -O2
> > MVAPICH2 F77: gfortran -O2
> > MVAPICH2 FC: gfortran -O2
> >
> > Now the crazy stuff:
> >
> > It seems that my infiniband network is "automatically" separated into
> > two groups of computers: one composed of 12 computers, the second
> > composed of 4 computers.
> >
> > Computers inside the same group can communicate using MPI programs,
> > but computers in different groups can't. MPI programs hang (each MPI
> > program starts on every node but does not communicate...).
> >
> > I rebooted the switch and the entire cluster several times, and I
> > always get the same result...
> >
> > All 16 computers have the same kind of Mellanox card, connected to
> > the same Infiniband switch.
> >
> > The only difference is in the computers' architecture.
> >
> > a) The first group of 12 computers is made of:
> > - 4 computers with quad-core Intel(R) Core(TM)2 Extreme CPU Q6850 @
> > 3.00GHz (amos, dodoni, kerkira and zagori)
> > - 5 computers with eight-core Intel(R) Xeon(R) CPU E5472 @
> > 3.00GHz (atlas01-02-03-04-05)
> > - 3 computers with eight-core Intel(R) Xeon(R) CPU E5540 @
> > 2.53GHz (z800_01, z800_02, z800_03)
> >
> > b) The second group of 4 computers is made of:
> > - 4 computers with twelve-core Intel(R) Xeon(R) CPU X5650 @
> > 2.67GHz (z800_04, z800_05, z800_06, z800_07)
> >
> > If I run MPI programs between machines of group a), it works.
> > Example:
> >
> > mpirun -np 2 -hosts amos,atlas02 ./osu_get_bw
> > # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> > # Size Bandwidth (MB/s)
> > 1 0.85
> > 2 1.81
> > 4 3.61
> > 8 7.10
> > 16 14.05
> > 32 28.37
> > 64 56.37
> > 128 106.09
> > 256 202.58
> > 512 366.86
> > 1024 669.81
> > 2048 1088.80
> > 4096 1603.16
> > 8192 2099.51
> > 16384 2172.16
> > 32768 2395.28
> > 65536 2514.46
> > 131072 2529.64
> > 262144 2556.51
> > 524288 2488.96
> > 1048576 2488.18
> > 2097152 2488.77
> > 4194304 2489.10
> >
> > Running MPI programs between machines of group b) also works:
> >
> > mpirun -np 2 -hosts z800_04,z800_07 ./osu_get_bw
> > # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> > # Size Bandwidth (MB/s)
> > 1 0.95
> > 2 1.91
> > 4 3.81
> > 8 7.67
> > 16 15.00
> > 32 28.29
> > 64 53.52
> > 128 106.83
> > 256 209.41
> > 512 410.54
> > 1024 753.46
> > 2048 1347.97
> > 4096 2151.18
> > 8192 2777.53
> > 16384 2749.81
> > 32768 3132.88
> > 65536 3289.49
> > 131072 3334.38
> > 262144 3300.31
> > 524288 3118.11
> > 1048576 3112.29
> > 2097152 3111.10
> > 4194304 3112.28
> >
> > BUT running MPI programs between machines of the two groups hangs:
> >
> > mpirun -np 2 -hosts z800_04,amos ./osu_get_bw
> > # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> > # Size Bandwidth (MB/s)
> > (program hangs)
> >
> >
> > Whatever MPI program I run (from the OSU benchmarks or others) hangs...
> >
> > Any ideas? I am lost...
> >
> > Thanks in advance.
> >
> > Jean-Charles
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>
>
> --
> Devendar
>