[mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems
Sayantan Sur
surs at cse.ohio-state.edu
Fri Jun 22 11:22:58 EDT 2007
Hi Andrey,
Andrey Slepuhin wrote:
> Sayantan,
>
> I'm running class B benchmarks and most often I see the problem with
> MG test. Please tell me what you mean by the "block" and "cyclic"
> distribution; it's not clear enough to me.
Thanks for the info. I will run MG repeatedly and see if the
problem is reproduced on our end.
Sorry, I should've been more clear -- suppose you have two machines n0,
n1, then:
Block: mpirun_rsh -np 4 n0 n0 n1 n1 ./mg.B.4
Cyclic: mpirun_rsh -np 4 n0 n1 n0 n1 ./mg.B.4
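For larger node counts, the same two distributions can also be written out as hostfiles instead of listing hosts inline. This is a hypothetical sketch: it assumes your mpirun_rsh build accepts a -hostfile option (present in recent MVAPICH releases; if yours does not, keep listing the hosts on the command line as above).

```shell
# Block distribution: consecutive ranks fill n0 before moving to n1
# (assumes -hostfile support in this mpirun_rsh; hostnames n0/n1 are examples)
cat > hosts.block <<'EOF'
n0
n0
n1
n1
EOF

# Cyclic distribution: ranks alternate round-robin across the nodes
cat > hosts.cyclic <<'EOF'
n0
n1
n0
n1
EOF

mpirun_rsh -np 4 -hostfile hosts.block  ./mg.B.4
mpirun_rsh -np 4 -hostfile hosts.cyclic ./mg.B.4
```

The two orderings matter for reproducing the hang because they change which rank pairs communicate over shared memory versus the ConnectX HCA.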
> Also I can provide a remote access to the test system if needed.
Thanks for the offer. I'll run the test several times and see if we can
reproduce it.
Thanks,
Sayantan.
>
> Thanks,
> Andrey
>
> Sayantan Sur wrote:
>> Hi Andrey,
>>
>> Andrey Slepuhin wrote:
>>> The problem was seen even with 4 processes. BTW, my firmware is
>>> 2.0.158, not 2.0.156.
>>
>> Which benchmark do you see hanging most often? Also if you could let
>> us know the class of the test, it will be great. Are you running in
>> block distribution or cyclic?
>>
>> Thanks,
>> Sayantan.
>>
>>>
>>> Thanks,
>>> Andrey
>>>
>>> Sayantan Sur wrote:
>>>> Hello Andrey,
>>>>
>>>> Thanks for your email. A couple of months back we had seen some
>>>> erratic behavior, but of late using the 2.0.156 firmware I haven't
>>>> noticed any hangs for 30-40 runs. How many processes are you running?
>>>>
>>>> The platform description is given in:
>>>>
>>>> http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml
>>>>
>>>>
>>>> Thanks,
>>>> Sayantan.
>>>>
>>>> Andrey Slepuhin wrote:
>>>>> Dear folks,
>>>>>
>>>>> I have a test 4-node cluster using Intel Clovertown CPUs and new
>>>>> Mellanox ConnectX cards. While running NAS NPB benchmarks I see
>>>>> that several tests silently hang occasionally. I tried to use
>>>>> different compilers (GCC and Intel) and different optimization
>>>>> options to avoid miscompiling MVAPICH itself and the benchmarks,
>>>>> but the situation didn't change. I do not see such problems with
>>>>> OpenMPI however. The cluster nodes configuration is:
>>>>> Intel S5000PSL motherboard
>>>>> 2 x Intel Xeon 5345 2.33 GHz
>>>>> 8GB RAM
>>>>> Mellanox MHGH-XTC card, firmware revision 2.0.158
>>>>> Stock SLES10 without updates (but I also had the same problems
>>>>> with 2.6.22-rc4 kernel from Roland Dreier's git tree)
>>>>> OFED-1.2.c-6 distribution from Mellanox
>>>>>
>>>>> Do you have any ideas about what could cause the programs to
>>>>> freeze? I would much appreciate any help.
>>>>>
>>>>> Thanks in advance,
>>>>> Andrey
>>>>>
>>>>> P.S. BTW, what was the system configuration where 1.39usec latency
>>>>> for ConnectX was achieved? At the moment my best result with
>>>>> MVAPICH is 1.67usec using one Mellanox switch...
>>>>> _______________________________________________
>>>>> mvapich-discuss mailing list
>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>
>>
--
http://www.cse.ohio-state.edu/~surs