[mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems
Andrey Slepuhin
andrey.slepuhin at t-platforms.ru
Fri Jun 22 11:36:27 EDT 2007
Ok, I understood. I'm using block distribution.
BTW, I tried downgrading the firmware to 2.0.156 with the appropriate
libraries, but the problem remains in this case too. GDB shows that
all the processes spend their time in the viutil_spinandwaitcq() routine.
Best regards,
Andrey
Sayantan Sur wrote:
> Hi Andrey,
>
> Andrey Slepuhin wrote:
>> Sayantan,
>>
>> I'm running class B benchmarks and most often I see the problem with
>> the MG test. Please tell me what you mean by the "block" and "cyclic"
>> distribution; it's not clear enough to me.
>
> Thanks for the info. I will try running MG more often and see if the
> problem is reproduced on our end.
>
> Sorry, I should've been more clear -- suppose you have two machines
> n0, n1, then:
>
> Block: mpirun_rsh -np 4 n0 n0 n1 n1 ./mg.B.4
> Cyclic: mpirun_rsh -np 4 n0 n1 n0 n1 ./mg.B.4
>
>> Also, I can provide remote access to the test system if needed.
>
> Thanks for the offer. I'll run the test several times and see if we
> can reproduce it.
>
> Thanks,
> Sayantan.
>
>>
>> Thanks,
>> Andrey
>>
>> Sayantan Sur wrote:
>>> Hi Andrey,
>>>
>>> Andrey Slepuhin wrote:
>>>> The problem was seen even with 4 processes. BTW, my firmware is
>>>> 2.0.158, not 2.0.156.
>>>
>>> Which benchmark do you see hanging most often? Also if you could let
>>> us know the class of the test, it will be great. Are you running in
>>> block distribution or cyclic?
>>>
>>> Thanks,
>>> Sayantan.
>>>
>>>>
>>>> Thanks,
>>>> Andrey
>>>>
>>>> Sayantan Sur wrote:
>>>>> Hello Andrey,
>>>>>
>>>>> Thanks for your email. A couple of months back we had seen some
>>>>> erratic behavior, but of late using the 2.0.156 firmware I haven't
>>>>> noticed any hangs for 30-40 runs. How many processes are you running?
>>>>>
>>>>> The platform description is given in:
>>>>>
>>>>> http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Sayantan.
>>>>>
>>>>> Andrey Slepuhin wrote:
>>>>>> Dear folks,
>>>>>>
>>>>>> I have a test 4-node cluster using Intel Clovertown CPUs and new
>>>>>> Mellanox ConnectX cards. While running NAS NPB benchmarks I see
>>>>>> that several tests silently hang occasionally. I tried to use
>>>>>> different compilers (GCC and Intel) and different optimization
>>>>>> options to avoid miscompiling MVAPICH itself and the benchmarks,
>>>>>> but the situation didn't change. I do not see such problems with
>>>>>> OpenMPI, however. The cluster node configuration is:
>>>>>> Intel S5000PSL motherboard
>>>>>> 2 x Intel Xeon 5345 2.33 GHz
>>>>>> 8GB RAM
>>>>>> Mellanox MHGH-XTC card, firmware revision 2.0.158
>>>>>> Stock SLES10 without updates (but I also had the same problems
>>>>>> with the 2.6.22-rc4 kernel from Roland Dreier's git tree)
>>>>>> OFED-1.2.c-6 distribution from Mellanox
>>>>>>
>>>>>> Do you have any idea what could cause the program to freeze? I
>>>>>> would much appreciate any help.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Andrey
>>>>>>
>>>>>> P.S. BTW, what was the system configuration where 1.39usec
>>>>>> latency for ConnectX was achieved? At the moment my best result
>>>>>> with MVAPICH is 1.67usec using one Mellanox switch...
>>>>>> _______________________________________________
>>>>>> mvapich-discuss mailing list
>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>
>>>
>>>
>
>
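The "block" and "cyclic" distributions discussed in the thread differ only in the order of the host names passed to mpirun_rsh. A minimal Python sketch of how the two host lists could be generated is given here. The helper names (block_hosts, cyclic_hosts, mpirun_rsh_command) are hypothetical and not part of MVAPICH, and the sketch assumes that ranks are assigned to hosts in the order listed, as the two example commands in the thread suggest.

```python
def block_hosts(nodes, nprocs):
    """Block distribution: fill each node with consecutive ranks
    before moving on to the next node."""
    per_node, extra = divmod(nprocs, len(nodes))
    hosts = []
    for i, node in enumerate(nodes):
        # The first `extra` nodes take one additional rank when
        # nprocs is not a multiple of the node count.
        count = per_node + (1 if i < extra else 0)
        hosts.extend([node] * count)
    return hosts


def cyclic_hosts(nodes, nprocs):
    """Cyclic distribution: assign ranks to nodes round-robin."""
    return [nodes[rank % len(nodes)] for rank in range(nprocs)]


def mpirun_rsh_command(hosts, program):
    """Build a command line of the form used in the thread:
    mpirun_rsh -np <N> <host> ... <program>"""
    return " ".join(["mpirun_rsh", "-np", str(len(hosts)), *hosts, program])


if __name__ == "__main__":
    nodes = ["n0", "n1"]
    print("Block: ", mpirun_rsh_command(block_hosts(nodes, 4), "./mg.B.4"))
    print("Cyclic:", mpirun_rsh_command(cyclic_hosts(nodes, 4), "./mg.B.4"))
```

For the two machines n0 and n1 and four processes, these helpers reproduce the two command lines quoted in the thread.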