[mvapich-discuss] Problem with linpack/mvapich2/BLCR (fwd)

Wed Oct 3 16:53:04 EDT 2007

Hi Patrice,

I forgot to mention in my last email that we are using BLCR-0.6.1 (the
latest version) and OFED-1.2.5.1.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

---------- Forwarded message ----------
Date: Wed, 3 Oct 2007 16:31:11 -0400 (EDT)
From: wei huang <huanwei at cse.ohio-state.edu>
To: Patrice Martinez <patrice.martinez at bull.net>
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Problem with linpack/mvapich2/BLCR

Hi Patrice,

We tried running HPL with BLCR support and it looks fine. We ran the test
on two sets of machines. The first is dual-processor Intel Xeon nodes,
where we ran 4 processes with 2 processes on each node. We also ran the
test on an Opteron 880 machine (four dual-core processors), hosting all 4
processes on one node.

We tried to use HPL input as similar to yours as possible; see below:

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2L4        5000   112     4     1               6.64          1.255e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0355903 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0234950 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0045862 ...... PASSED

One difference is that we don't have Intel MKL installed, so we are using
HPL with the Goto BLAS library. Could you let us know whether you can
reproduce the problem with Goto?

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

On Mon, 1 Oct 2007, Patrice Martinez wrote:

>
> Hello,
>
> I am encountering a problem running the Linpack benchmark with mvapich2 configured for BLCR support: the computed results are sometimes right and sometimes wrong.
> Let me describe the context:
>
>
>             Hardware used:
>
>  1. Bull Novascale R422, 2x Xeon Core 2 Duo 5150 @ 2.66 GHz, 8 GB of RAM
>  2. IB HCA Mellanox MT25208, dual-port
>
>             Software used
>
>  1. RHEL4 U4, kernel 2.6.9.42-ELSmp
>  2. gcc-3.4.6
>  3. intel mkl 9.1
>  4. blcr-0.6.0
>  5. mvapich2-1.0
>  6. OFED-1.2.5.1
>  7. linpack-9.1
>
>
>
>             Tests
>
>
> - For this test, the two ports of the IB HCA are connected together.
>
> - I created the following symlink to avoid problems forwarding environment variables:
>
> #l /lib64/libcr.so.0
> lrwxrwxrwx  1 root root 23 Sep 21 11:19 /lib64/libcr.so.0 -> /usr/local/lib/libcr.so
>
> - The blcr modules are loaded:
>
> service blcr start
>
> - The mpd daemon is started:
>
> mpdboot --ncpus=4
>
> - Finally, linpack is configured to solve a small linear system (N=5000) and is executed:
>
> mpiexec -n 4 ./xhpl
>
>
> Analysis
>
>
> Depending on the values of P and Q given in the HPL.dat file, the results are either always right or always wrong.
> With P=4, Q=1:
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> W00C2L4         5000   112     4     1               4.28          1.948e+01
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 25110713646301407346688.0000000 ...... FAILED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 155458419119.8088379 ...... FAILED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 17288875125.5442734 ...... FAILED
> ||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 17973740643825.015625
> ||A||_oo . . . . . . . . . . . . . . . . . . . =        1283.266028
> ||A||_1  . . . . . . . . . . . . . . . . . . . =        1289.434188
> ||x||_oo . . . . . . . . . . . . . . . . . . . = 1459401545070.356201
> ||x||_1  . . . . . . . . . . . . . . . . . . . = 807634407595160.750000
> ============================================================================
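
The three scaled residuals in the failing run above can be re-derived
from the norms it prints. The sketch below is only a cross-check, not
HPL's own code. It assumes eps is the double-precision unit roundoff
2^-53, and the extra factor N in the third ratio is inferred from the
numbers in this post (the printed label does not show it). The function
name is made up for illustration.

```python
# Cross-check of the scaled residuals printed in the failing P=4, Q=1 run.
# Assumption: eps is the double-precision unit roundoff, 2**-53.
EPS = 2.0 ** -53


def scaled_residuals(resid, a_norm1, a_norm_inf, x_norm1, x_norm_inf, n):
    """Return the three ratios in the order the HPL output lists them."""
    r1 = resid / (EPS * a_norm1 * n)
    r2 = resid / (EPS * a_norm1 * x_norm1)
    # Assumption inferred from the posted numbers: the third ratio only
    # matches the printed value when it is also divided by N.
    r3 = resid / (EPS * a_norm_inf * x_norm_inf * n)
    return r1, r2, r3


# Norms copied from the failing run quoted above (N = 5000).
r1, r2, r3 = scaled_residuals(
    resid=17973740643825.015625,
    a_norm1=1289.434188,
    a_norm_inf=1283.266028,
    x_norm1=807634407595160.750000,
    x_norm_inf=1459401545070.356201,
    n=5000,
)
print(f"{r1:.7e}  {r2:.7e}  {r3:.7e}")
```

With ||A|| near 1.3e3 in both norms, the ratios are large only because
||Ax-b||_oo and ||x|| are themselves enormous, which suggests the solution
vector is corrupted rather than slightly inaccurate.
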
>
> With P=2, Q=2:
>
> ============================================================================
> T/V                N    NB     P     Q               Time             Gflops
> ----------------------------------------------------------------------------
> W00C2L4         5000   112     2     2               3.39          2.459e+01
> ----------------------------------------------------------------------------
> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0420265 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0277438 ...... PASSED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0054156 ...... PASSED
> ============================================================================
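
To make the difference between the two process grids above concrete, here
is a small sketch of how 4 ranks land on a P x Q grid. It assumes
row-major process mapping; the PMAP line of the HPL.dat used is not shown
in the post, so this is an assumption, and the helper name is
hypothetical.

```python
def grid_coords(p, q):
    """Map ranks 0..p*q-1 to (row, col) coordinates on a p x q grid.

    Assumption: row-major ordering, i.e. consecutive ranks fill a
    process row before moving on to the next row.
    """
    return [(rank // q, rank % q) for rank in range(p * q)]


for p, q in [(4, 1), (2, 2)]:
    print(f"P={p}, Q={q}:")
    for rank, (row, col) in enumerate(grid_coords(p, q)):
        print(f"  rank {rank} -> row {row}, col {col}")
```

Under that assumption, P=4, Q=1 puts all four ranks in a single process
column, while P=2, Q=2 gives each rank one row peer and one column peer,
so the two runs exercise different communication patterns.
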
>
>
> It is interesting to see that computations are faster when they're right...
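
For reference, the Gflops column can be reproduced from N and the time
alone. The sketch below assumes the usual LU operation count of
2/3*N^3 + 3/2*N^2; the function name is made up for illustration.

```python
def hpl_gflops(n, seconds):
    """Gflop/s from the LU operation count 2/3*N^3 + 3/2*N^2 and wall time."""
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1e9


# N and times copied from the two runs quoted above.
for label, seconds in [("P=4, Q=1 (FAILED)", 4.28), ("P=2, Q=2 (PASSED)", 3.39)]:
    print(f"{label}: {hpl_gflops(5000, seconds):.3e} Gflops")
```

Since the figure depends only on N and the elapsed time, it says nothing
about whether the result is correct; the residual checks are the only
indicator of that.
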
>
> When using mvapich2 compiled without BLCR support, computations are always right, of course.
> Any idea?
>
>  --
>
> Cordialement/Best regards
>
> Patrice Martinez
>
> Linux Kernel Architect.
>
> OFFICE : B1-405
> PHONE  : +33 (0)4 76 29 74 69
> EMAIL  : Patrice.martinez at bull.net
> ADDR   : BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
>
>
