[mvapich-discuss] Running code hanged using MVAPICH

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Oct 22 11:20:41 EDT 2008


Soon-Heum,

Thanks for your note. It is very strange that your MPI jobs are not able
to communicate across two nodes. Most probably this is an
installation/systems problem. Many organizations with large-scale IB
clusters built on Barcelona-based AMD nodes (like yours) are running
MVAPICH 1.0.1 successfully.

Are you able to run the standard OSU benchmarks (latency, bandwidth,
bi-directional bandwidth, broadcast, etc.) across two nodes? You should
check these first before running your MPI applications.
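For example, assuming MVAPICH is installed under /usr/local/mvapich and the OSU benchmark binaries have been built, a two-node latency test can be launched with mpirun_rsh roughly like this (the hostnames below are placeholders for two of your compute nodes):

```shell
# Run the OSU latency test with one process on each of two nodes.
# "node01" and "node02" are placeholder hostnames; substitute your own.
/usr/local/mvapich/bin/mpirun_rsh -np 2 node01 node02 ./osu_latency
```

If this hangs the same way your applications do, the problem is almost certainly in the installation or the IB fabric setup rather than in the application codes.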

You can refer to MVAPICH 1.0.1 user guide available from the following URL
to verify that your build and installation process is fine.

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-100004.4.1
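A minimal two-rank program can also help isolate the hang. The sketch below simply sends one integer from rank 0 to rank 1, which matches the 4-byte transfer you describe; if even this hangs when the two ranks are placed on different nodes, the application codes are not at fault:

```c
/* Minimal cross-node sanity test: rank 0 sends one integer to rank 1.
 * A sketch only; compile with MVAPICH's mpicc and launch one rank on
 * each of two nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```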

DK

On Wed, 22 Oct 2008, Soon-Heum Ko wrote:

> Hi,
>
>
> I'm working at the KISTI Supercomputing Center, Korea, as a member of the SUN support team.
>
> I have some trouble using MVAPICH, so I'd like to ask you about the options for installing it.
>
> Recently, we built a supercomputer consisting of Sun Blade 6048 nodes (four quad-core Barcelona CPUs per node) and a Voltaire ISR 2012 switch for the InfiniBand network. The admins installed MVAPICH on this system, and I tested the operation of the MVAPICH library.
>
> While running various codes, I found that some of them don't work with MVAPICH even though they cause no trouble with other MPI libraries. In particular, these codes work well with MVAPICH when I use several processors within the same node, but they fail as soon as two or more nodes cooperate. Specifically, the code hangs at the first MPI communication call. (For example, with 16 processors on a single node it works fine, but with 16 processors across 2 nodes it hangs in the MPI communication routine.)
>
> The mysterious thing is that, in some simple codes, the MPI communication routines cause no trouble in inter-node communication. On the other hand, some complex codes that use a large amount of memory show this communication trouble even when they transfer only 4 bytes of data (one integer).
>
> Just in my imagination, this happened for one of these reasons:
>  - MVAPICH trouble: already fixed in the latest version. (Note that we currently use MVAPICH version 1.0.1.)
>  - MVAPICH trouble: not yet reported, or reported but not yet fixed.
>  - Installation options: our configure options are as follows.
>     ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=/usr/local/mvapich \
>     --enable-shared --enable-static --enable-debug --enable-sharedlib \
>     --enable-cxx --enable-f77 --enable-f90 --enable-f90modules \
>     --with-romio --without-mpe
>  - Bugs in users' codes: which cannot happen.
>
> Do you have any idea or comments? If you know the reason, please let me know.
>
> Thank you in advance.
>
>
> Best regards,
> Jeff
>
>
>
> Soon-Heum Ko,
> Ph.D, Computational Fluid Dynamics
> Parallel Optimization Analyst,
> SUN Support Team (InnoGrid) at KISTI
