[mvapich-discuss] mvapich2 (1.8) infiniband programs do not
communicate between some nodes....
Devendar Bureddy
bureddy at cse.ohio-state.edu
Mon Jun 25 12:32:42 EDT 2012
Hi Jean-Charles
- Did you check whether IB verbs-level tests (ib_send_lat, ib_send_bw,
etc.) work fine between these two groups (z800_04 and amos)?
- Can you please get the output of the following two commands:
$mpirun -np 2 -hosts z800_04,amos -env MV2_SHOW_ENV_INFO 1 ./osu_latency
$mpirun -np 2 -hosts amos,z800_04 -env MV2_SHOW_ENV_INFO 1 ./osu_latency
This might help us see whether anything is wrong in the parameter
settings.
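A minimal sketch of those verbs-level checks, assuming the standard OFED perftest binaries are installed on both nodes (hostnames taken from this thread):

```shell
# On one node (e.g. amos), start the server side; it waits for a peer:
ib_send_lat

# On the other node (e.g. z800_04), run the client, pointing at the server:
ib_send_lat amos

# Repeat the same server/client pattern with ib_send_bw to check bandwidth.
```

If these hang the same way the MPI runs do, the problem is below MPI (fabric or subnet-manager configuration); if they pass, it points at the MVAPICH2 settings instead.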
-Devendar
On Mon, Jun 25, 2012 at 9:55 AM, Dark Charlot <jcldc13 at gmail.com> wrote:
>
> Dear experts,
>
> I built a diskless infiniband cluster composed of 16 computers. All the
> infiniband cards are set up correctly.
>
> Here is the report of the command "ibnodes":
>
> Ca : 0x0002c903000b5fac ports 1 "atlas01 HCA-1"
> Ca : 0x0002c903000b5634 ports 1 "atlas05 HCA-1"
> Ca : 0x0002c903000b60e0 ports 1 "atlas04 HCA-1"
> Ca : 0x0002c903000b5684 ports 1 "z800_07 HCA-1"
> Ca : 0x0002c903000b56a0 ports 1 "z800_02 HCA-1"
> Ca : 0x0002c9030009d1b2 ports 1 "kerkira HCA-1"
> Ca : 0x0002c903000bb098 ports 1 "dodoni HCA-1"
> Ca : 0x0002c903000b5fc8 ports 1 "atlas02 HCA-1"
> Ca : 0x0002c903000b5fc4 ports 1 "z800_03 HCA-1"
> Ca : 0x0002c903000b60e4 ports 1 "atlas03 HCA-1"
> Ca : 0x0002c903000b56b4 ports 1 "z800_05 HCA-1"
> Ca : 0x0002c903000b3a82 ports 1 "z800_06 HCA-1"
> Ca : 0x0002c903000b5690 ports 1 "z800_01 HCA-1"
> Ca : 0x0002c903000b3a92 ports 1 "z800_04 HCA-1"
> Ca : 0x0002c903000b5688 ports 1 "zagori HCA-1"
> Ca : 0x0002c903000b3a52 ports 1 "amos HCA-1"
>
> I installed mvapich2 1.8 with the following compilation options:
>
> ./mpich2version
> MVAPICH2 Version: 1.8
> MVAPICH2 Release date: Mon Apr 30 14:50:19 EDT 2012
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: --with-device=ch3:mrail --with-rdma=gen2
> --prefix=/rsdata/local/SHARED/Linux64/mvapich2-1.8-IB
> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX: c++ -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 F77: gfortran -O2
> MVAPICH2 FC: gfortran -O2
>
> Now the crazy part:
>
> It seems like my InfiniBand network is "automatically" separated into
> two groups of computers, one composed of 12 computers and the other
> composed of 4.
>
> Computers inside the same group can communicate using MPI programs, but
> computers in different groups can't: MPI programs hang (each MPI
> process starts on every node, but they never communicate...)
>
> I rebooted the switch and the entire cluster several times, and I
> always get the same result...
>
> All 16 computers have the same kind of Mellanox card and are connected
> to the same InfiniBand switch.
>
> The only difference is the computers' architecture.
>
> a) The first group of 12 computers is made of:
> - 4 computers with quad-core Intel(R) Core(TM)2 Extreme CPU Q6850 @
> 3.00GHz (amos, dodoni, kerkira and zagori)
> - 5 computers with eight-core Intel(R) Xeon(R) CPU E5472 @
> 3.00GHz (atlas01-02-03-04-05)
> - 3 computers with eight-core Intel(R) Xeon(R) CPU E5540 @
> 2.53GHz (z800_01, z800_02, z800_03)
>
> b) The second group of 4 computers is made of:
> - 4 computers with twelve-core Intel(R) Xeon(R) CPU X5650 @
> 2.67GHz (z800_04, z800_05, z800_06, z800_07)
>
> If I run MPI programs between machines of group a), it works; for
> example:
>
> mpirun -np 2 -hosts amos,atlas02 ./osu_get_bw
> # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> # Size Bandwidth (MB/s)
> 1 0.85
> 2 1.81
> 4 3.61
> 8 7.10
> 16 14.05
> 32 28.37
> 64 56.37
> 128 106.09
> 256 202.58
> 512 366.86
> 1024 669.81
> 2048 1088.80
> 4096 1603.16
> 8192 2099.51
> 16384 2172.16
> 32768 2395.28
> 65536 2514.46
> 131072 2529.64
> 262144 2556.51
> 524288 2488.96
> 1048576 2488.18
> 2097152 2488.77
> 4194304 2489.10
>
> Running MPI programs between machines of group b) also works:
>
> mpirun -np 2 -hosts z800_04,z800_07 ./osu_get_bw
> # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> # Size Bandwidth (MB/s)
> 1 0.95
> 2 1.91
> 4 3.81
> 8 7.67
> 16 15.00
> 32 28.29
> 64 53.52
> 128 106.83
> 256 209.41
> 512 410.54
> 1024 753.46
> 2048 1347.97
> 4096 2151.18
> 8192 2777.53
> 16384 2749.81
> 32768 3132.88
> 65536 3289.49
> 131072 3334.38
> 262144 3300.31
> 524288 3118.11
> 1048576 3112.29
> 2097152 3111.10
> 4194304 3112.28
>
> BUT running MPI programs between machines of the two groups hangs:
>
> mpirun -np 2 -hosts z800_04,amos ./osu_get_bw
> # OSU MPI One Sided MPI_Get Bandwidth Test v3.6
> # Size Bandwidth (MB/s)
> (program hangs)
>
>
> Whatever MPI program I run (from the OSU benchmarks or others) hangs...
>
> Any ideas? I am lost...
>
> Thanks in advance.
>
> Jean-Charles
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
--
Devendar