[mvapich-discuss] Running code hangs using MVAPICH

Soon-Heum Ko floydfan at innogrid.com
Wed Oct 22 22:46:12 EDT 2008


Dear DK,

Thank you for your comments. Let me first re-install MVAPICH and then 
report back with the results.

Jeff (Soon-Heum)


----- Original Message ----- 
From: "Dhabaleswar Panda" <panda at cse.ohio-state.edu>
To: "Soon-Heum Ko" <floydfan at innogrid.com>
Cc: <mvapich-discuss at cse.ohio-state.edu>; "Seung-Woo Lee (Deputy General 
Manager)" <swlee at moasys.com>; "Min-Ju Lee (Manager)" <onunix at digitalhenge.com>; 
"Jung-Doo Yoon (Manager)" <Jungdoo.Yoon at Sun.COM>
Sent: Thursday, October 23, 2008 12:20 AM
Subject: Re: [mvapich-discuss] Running code hangs using MVAPICH


> Soon-Heum,
>
> Thanks for your note. It is very strange that your MPI jobs are not able
> to communicate across two nodes. Most probably this is an
> installation/systems problem. Many organizations with large-scale IB
> clusters built from Barcelona-based AMD nodes (like yours) are running
> MVAPICH 1.0.1 successfully.
>
> Are you able to run the standard OSU benchmarks (latency, bandwidth,
> bi-directional bandwidth, broadcast, etc.) across two nodes? You should
> check these first before running your MPI applications.
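>
> For example (assuming the benchmarks were built along with your MVAPICH
> installation, and with placeholder hostnames), a two-node latency test can
> be launched as "mpirun_rsh -np 2 node001 node002 ./osu_latency", and
> similarly for the bandwidth and broadcast tests.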
>
> You can refer to the MVAPICH 1.0.1 user guide, available at the following URL,
> to verify that your build and installation process is fine.
>
> http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-100004.4.1
>
> DK
>
> On Wed, 22 Oct 2008, Soon-Heum Ko wrote:
>
>> Hi,
>>
>>
>> I'm working at the KISTI Supercomputing Center in Korea, as a member of 
>> the SUN support team.
>>
>> I am having some trouble using MVAPICH, so I would like to ask about the 
>> options for installing it.
>>
>> Recently, we built a supercomputer consisting of Sun Blade 6048 nodes 
>> (4 quad-core Barcelona CPUs per node) connected by a Voltaire ISR 2012 
>> switch for the InfiniBand network. The admins installed MVAPICH on this 
>> system and I tested the operation of the MVAPICH library.
>>
>> While running various codes, I found that some codes do not work with 
>> MVAPICH even though they cause no trouble with other MPI libraries. 
>> In particular, these codes work well with MVAPICH when I use several 
>> processors within the same node, but they fail as soon as two or more 
>> nodes cooperate. Specifically, the code hangs at the first MPI 
>> communication call. (i.e., when I use 16 processors in the same node it 
>> works well, but if I use 16 processors across 2 nodes, it hangs in the 
>> MPI communication routine; a minimal sketch of this pattern follows.)
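>>
>> To illustrate, here is a minimal sketch (not our actual code; the comment 
>> about the hang describes what we observe in our larger applications) of 
>> the kind of first communication at which the hang appears:
>>
>>     #include <stdio.h>
>>     #include <mpi.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank, value = 0;
>>         MPI_Status status;
>>
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>         if (rank == 0) {
>>             value = 42;
>>             /* First point-to-point call: in our large codes, execution
>>                hangs here when the two ranks are on different nodes. */
>>             MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>         } else if (rank == 1) {
>>             MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
>>             printf("rank 1 received %d\n", value);
>>         }
>>
>>         MPI_Finalize();
>>         return 0;
>>     }
>>
>> Compiled with mpicc and launched across two nodes (e.g. "mpirun_rsh -np 2 
>> nodeA nodeB ./a.out", where nodeA and nodeB are placeholder hostnames), 
>> this send/receive of a single integer is exactly the pattern that stalls 
>> in our larger codes.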
>>
>> The mysterious thing is that in some simple codes, the MPI communication 
>> routines cause no trouble in inter-node communication. On the other hand, 
>> some complex codes that use a vast amount of memory show communication 
>> trouble even when they transfer only 4 bytes of data (one integer).
>>
>> My guess is that this happens for one of the following reasons:
>>  - MVAPICH trouble: a bug that has already been fixed in the latest 
>> version. (Note that we currently use MVAPICH version 1.0.1.)
>>  - MVAPICH trouble: a bug that has not yet been reported or fixed.
>>  - Installation options: our configure invocation is as follows.
>>     ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=/usr/local/mvapich \
>>     --enable-shared --enable-static --enable-debug --enable-sharedlib \
>>     --enable-cxx --enable-f77 --enable-f90 --enable-f90modules \
>>     --with-romio --without-mpe
>>  - Bugs in the users' codes: which cannot be the case, since these codes 
>> run fine with other MPI libraries.
>>
>> Do you have any ideas or comments? If you know the reason, please let me 
>> know.
>>
>> Thank you in advance.
>>
>>
>> Best regards,
>> Jeff
>>
>>
>>
>> Soon-Heum Ko,
>> Ph.D, Computational Fluid Dynamics
>> Parallel Optimization Analyst,
>> SUN Support Team (InnoGrid) at KISTI
> 


