[mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems

Andrey Slepuhin andrey.slepuhin at t-platforms.ru
Fri Jun 22 11:16:01 EDT 2007


Sayantan,

I'm running the class B benchmarks, and I see the problem most often with 
the MG test. Could you explain what you mean by "block" and "cyclic" 
distribution? It's not clear to me.
I can also provide remote access to the test system if needed.

Thanks,
Andrey

Sayantan Sur wrote:
> Hi Andrey,
>
> Andrey Slepuhin wrote:
>> The problem was seen even with 4 processes. BTW, my firmware is 
>> 2.0.158, not 2.0.156.
>
> Which benchmark do you see hanging most often? Also, if you could let 
> us know the class of the test, that would be great. Are you running in 
> block distribution or cyclic?
>
> Thanks,
> Sayantan.
>
>>
>> Thanks,
>> Andrey
>>
>> Sayantan Sur wrote:
>>> Hello Andrey,
>>>
>>> Thanks for your email. A couple of months back we had seen some 
>>> erratic behavior, but of late using the 2.0.156 firmware I haven't 
>>> noticed any hangs for 30-40 runs. How many processes are you running?
>>>
>>> The platform description is given in:
>>>
>>> http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml 
>>>
>>>
>>> Thanks,
>>> Sayantan.
>>>
>>> Andrey Slepuhin wrote:
>>>> Dear folks,
>>>>
>>>> I have a 4-node test cluster using Intel Clovertown CPUs and new 
>>>> Mellanox ConnectX cards. While running the NAS NPB benchmarks I see 
>>>> that several tests occasionally hang silently. I tried different 
>>>> compilers (GCC and Intel) and different optimization options to 
>>>> rule out miscompilation of MVAPICH itself and the benchmarks, but 
>>>> the situation didn't change. I do not see such problems with 
>>>> OpenMPI, however. The cluster node configuration is:
>>>> Intel S5000PSL motherboard
>>>> 2 x Intel Xeon 5345 2.33 GHz
>>>> 8GB RAM
>>>> Mellanox MHGH-XTC card, firmware revision 2.0.158
>>>> Stock SLES10 without updates (but I also had the same problems with 
>>>> the 2.6.22-rc4 kernel from Roland Dreier's git tree)
>>>> OFED-1.2.c-6 distribution from Mellanox
>>>>
>>>> Do you have any idea what could cause the programs to freeze? I would 
>>>> greatly appreciate any help.
>>>>
>>>> Thanks in advance,
>>>> Andrey
>>>>
>>>> P.S. BTW, what was the system configuration where the 1.39 usec 
>>>> latency for ConnectX was achieved? At the moment my best result with 
>>>> MVAPICH is 1.67 usec using one Mellanox switch...
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>
>

