[mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems

Andrey Slepuhin andrey.slepuhin at t-platforms.ru
Fri Jun 22 12:58:59 EDT 2007


Increasing VIADEV_SRQ_SIZE from the default 512 to 2048 also helps;
decreasing VIADEV_SRQ_SIZE causes the application to hang more often.
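For readers of the archive, a minimal sketch of the two workarounds discussed in this thread, as settings in mvapich.conf (the parameter names VIADEV_USE_SRQ and VIADEV_SRQ_SIZE are assumed from MVAPICH 0.9.x conventions; verify them against your installation's documentation):

```
# mvapich.conf -- assumed MVAPICH 0.9.x parameter names, verify locally

# Workaround 1: disable shared-receive-queue (SRQ) usage entirely
VIADEV_USE_SRQ=0

# Workaround 2: keep SRQ enabled but enlarge it (default is 512)
VIADEV_SRQ_SIZE=2048
```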

Best regards,
Andrey

Andrey Slepuhin wrote:
> Sayantan,
>
> I noticed that the problem seems to disappear after I turned off SRQ 
> usage in mvapich.conf. Maybe this can help to track down the problem.
>
> Best regards,
> Andrey
>
> Sayantan Sur wrote:
>> Hi Andrey,
>>
>> Andrey Slepuhin wrote:
>>> Sayantan,
>>>
>>> I'm running class B benchmarks, and most often I see the problem 
>>> with the MG test. Please tell me what you mean by "block" and 
>>> "cyclic" distribution; it's not clear to me.
>>
>> Thanks for the info. I will run MG repeatedly and see if the 
>> problem reproduces on our end.
>>
>> Sorry, I should've been more clear -- suppose you have two machines 
>> n0, n1, then:
>>
>> Block: mpirun_rsh -np 4 n0 n0 n1 n1 ./mg.B.4
>> Cyclic: mpirun_rsh -np 4 n0 n1 n0 n1 ./mg.B.4
>>
>>> Also I can provide a remote access to the test system if needed.
>>
>> Thanks for the offer. I'll run the test several times and see if we 
>> can reproduce it.
>>
>> Thanks,
>> Sayantan.
>>
>>>
>>> Thanks,
>>> Andrey
>>>
>>> Sayantan Sur wrote:
>>>> Hi Andrey,
>>>>
>>>> Andrey Slepuhin wrote:
>>>>> The problem was seen even with 4 processes. BTW, my firmware is 
>>>>> 2.0.158, not 2.0.156.
>>>>
>>>> Which benchmark do you see hanging most often? Also, if you could 
>>>> let us know the class of the test, it would be great. Are you 
>>>> running in block distribution or cyclic?
>>>>
>>>> Thanks,
>>>> Sayantan.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Andrey
>>>>>
>>>>> Sayantan Sur wrote:
>>>>>> Hello Andrey,
>>>>>>
>>>>>> Thanks for your email. A couple of months back we had seen some 
>>>>>> erratic behavior, but of late using the 2.0.156 firmware I 
>>>>>> haven't noticed any hangs for 30-40 runs. How many processes are 
>>>>>> you running?
>>>>>>
>>>>>> The platform description is given in:
>>>>>>
>>>>>> http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml 
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Sayantan.
>>>>>>
>>>>>> Andrey Slepuhin wrote:
>>>>>>> Dear folks,
>>>>>>>
>>>>>>> I have a test 4-node cluster using Intel Clovertown CPUs and new 
>>>>>>> Mellanox ConnectX cards. While running the NAS NPB benchmarks I 
>>>>>>> see that several tests occasionally hang silently. I tried 
>>>>>>> different compilers (GCC and Intel) and different optimization 
>>>>>>> options to rule out miscompilation of MVAPICH itself and of the 
>>>>>>> benchmarks, but the situation didn't change. I do not see such 
>>>>>>> problems with OpenMPI, however. The cluster node configuration is:
>>>>>>> Intel S5000PSL motherboard
>>>>>>> 2 x Intel Xeon 5345 2.33 GHz
>>>>>>> 8 GB RAM
>>>>>>> Mellanox MHGH-XTC card, firmware revision 2.0.158
>>>>>>> Stock SLES10 without updates (but I also had the same problems 
>>>>>>> with the 2.6.22-rc4 kernel from Roland Dreier's git tree)
>>>>>>> OFED-1.2.c-6 distribution from Mellanox
>>>>>>>
>>>>>>> Do you have any ideas what can cause the program to freeze? I 
>>>>>>> will greatly appreciate any help.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Andrey
>>>>>>>
>>>>>>> P.S. BTW, what was the system configuration where 1.39usec 
>>>>>>> latency for ConnectX was achieved? At the moment my best result 
>>>>>>> with MVAPICH is 1.67usec using one Mellanox switch...
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>
>>>>
>>>>
>>
>>
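As an aside for readers of the archive, the block vs. cyclic process placement Sayantan describes above (`mpirun_rsh -np 4 n0 n0 n1 n1` vs. `mpirun_rsh -np 4 n0 n1 n0 n1`) can be sketched with a small, hypothetical Python helper. This is not part of MVAPICH; the function name and logic are illustrative only:

```python
def place_ranks(hosts, nprocs, mode):
    """Return the host assigned to each MPI rank.

    mode="block":  consecutive ranks fill one host before moving on,
                   like the hostfile ordering n0 n0 n1 n1.
    mode="cyclic": ranks are dealt round-robin across hosts,
                   like the hostfile ordering n0 n1 n0 n1.
    """
    if mode == "block":
        per_host = nprocs // len(hosts)  # assumes an even split
        return [hosts[rank // per_host] for rank in range(nprocs)]
    elif mode == "cyclic":
        return [hosts[rank % len(hosts)] for rank in range(nprocs)]
    raise ValueError("mode must be 'block' or 'cyclic'")

# The two orderings from the thread, for 4 ranks on n0 and n1:
print(place_ranks(["n0", "n1"], 4, "block"))   # ['n0', 'n0', 'n1', 'n1']
print(place_ranks(["n0", "n1"], 4, "cyclic"))  # ['n0', 'n1', 'n0', 'n1']
```

With block placement, neighboring ranks tend to share a node (intra-node communication); with cyclic placement, they tend to land on different nodes, which stresses the InfiniBand path differently and can change how readily a hang reproduces.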

