[mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems

Pavel Shamis (Pasha) pasha at dev.mellanox.co.il
Sun Jun 24 04:29:03 EDT 2007


In the current version of the ConnectX software package, SRQ is not very
stable, and we recommend disabling the SRQ feature for MVAPICH (in Open MPI
it is disabled by default). Copy/paste from the OFED-1.2.c-8 release notes:

Limitations and known issues:
1. SRQ is not supported. One must use VIADEV_USE_SRQ=0 when launching an
   MPI job, otherwise the MPI job will hang (see also the sketch after
   this list).
    Examples:
        mpirun -np $NP -rsh -hostfile $HOSTFILE VIADEV_USE_SRQ=0 $TEST_BIN_PATH
        or
        Add the line VIADEV_USE_SRQ=0 to the mvapich.conf file
2. IPoIB works in UD mode only; openibd.conf was changed to set the
   default to UD.
3. Query QP is not supported
4. Fork is not supported
5. Resize CQ is not supported
6. FMRs are not supported
7. ibstat does not present all entries. Use ibv_devinfo instead.
8. To work with RHEL5 on PPC, one needs to add the following line to the
   ini file, under the [HCA] section:
       log2_uar_bar_megabytes = 5
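
For an MVAPICH job launched with mpirun_rsh, the same SRQ setting can be
passed either on the command line or through mvapich.conf. A minimal
sketch (host names, process count, and binary path are placeholders):

    # disable SRQ for a single run
    mpirun_rsh -np 4 n0 n0 n1 n1 VIADEV_USE_SRQ=0 ./mg.B.4

    # or persistently: add this line to mvapich.conf
    VIADEV_USE_SRQ=0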


Pasha

Gilad Shainer wrote:
> Mellanox will check it and will get back to Andrey.
>
> Gilad.
>  
>
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of
> Sayantan Sur
> Sent: Friday, June 22, 2007 1:14 PM
> To: Andrey Slepuhin
> Cc: Erez Cohen; mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] MVAPICH-0.9.9 & ConnectX problems
>
> Hi Andrey,
>
> Thanks for the info. Using the block distribution, I'm able to see the
> hang too. The hang seems to be reproducible only when using SRQs.
>
> I'm not totally sure what the source of the bug is (MPI
> library/drivers/firmware). MVAPICH-0.9.9 was rigorously tested for
> OFED-1.1 on Mellanox Arbel SDR/DDR and Tavor (PCI-X).
>
> Maybe a comment from Mellanox folks about this issue would be
> appropriate.
>
> Thanks,
> Sayantan.
>
>
> Andrey Slepuhin wrote:
>   
>> Sayantan,
>>
>> I noticed that the problem seems to disappear after I turned off SRQ
>> usage in mvapich.conf. Maybe this can help to track down the problem.
>>
>> Best regards,
>> Andrey
>>
>> Sayantan Sur wrote:
>>     
>>> Hi Andrey,
>>>
>>> Andrey Slepuhin wrote:
>>>       
>>>> Sayantan,
>>>>
>>>> I'm running class B benchmarks, and most often I see the problem with
>>>> the MG test. Please tell me what you mean by the "block" and "cyclic"
>>>> distribution; it's not clear to me.
>>>>         
>>> Thanks for the info. I will try to run MG most often and see if the 
>>> problem is reproduced on our end.
>>>
>>> Sorry, I should've been more clear -- suppose you have two machines 
>>> n0, n1, then:
>>>
>>> Block: mpirun_rsh -np 4 n0 n0 n1 n1 ./mg.B.4
>>> Cyclic: mpirun_rsh -np 4 n0 n1 n0 n1 ./mg.B.4
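>>>
>>> The same distributions can also be given through a hostfile instead of
>>> listing hosts on the command line. A minimal sketch (hosts.block is an
>>> illustrative file name, and the -hostfile option is assumed to be
>>> accepted by mpirun_rsh, as in the release-note example above):
>>>
>>>     # hosts.block -- one line per process, block order (n0 n0 n1 n1)
>>>     n0
>>>     n0
>>>     n1
>>>     n1
>>>
>>>     mpirun_rsh -np 4 -hostfile hosts.block ./mg.B.4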
>>>
>>>       
>>>> Also I can provide a remote access to the test system if needed.
>>>>         
>>> Thanks for the offer. I'll run the test several times and see if we 
>>> can reproduce it.
>>>
>>> Thanks,
>>> Sayantan.
>>>
>>>       
>>>> Thanks,
>>>> Andrey
>>>>
>>>> Sayantan Sur wrote:
>>>>         
>>>>> Hi Andrey,
>>>>>
>>>>> Andrey Slepuhin wrote:
>>>>>           
>>>>>> The problem was seen even with 4 processes. BTW, my firmware is 
>>>>>> 2.0.158, not 2.0.156.
>>>>>>             
>>>>> Which benchmark do you see hanging most often? Also, if you could
>>>>> let us know the class of the test, that would be great. Are you
>>>>> running in block distribution or cyclic?
>>>>>
>>>>> Thanks,
>>>>> Sayantan.
>>>>>
>>>>>           
>>>>>> Thanks,
>>>>>> Andrey
>>>>>>
>>>>>> Sayantan Sur wrote:
>>>>>>             
>>>>>>> Hello Andrey,
>>>>>>>
>>>>>>> Thanks for your email. A couple of months back we had seen some 
>>>>>>> erratic behavior, but of late using the 2.0.156 firmware I 
>>>>>>> haven't noticed any hangs for 30-40 runs. How many processes are 
>>>>>>> you running?
>>>>>>>
>>>>>>> The platform description is given in:
>>>>>>>
>>>>>>> http://mvapich.cse.ohio-state.edu/performance/mvapich/em64t/MVAPICH-em64t-gen2-ConnectX.shtml
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sayantan.
>>>>>>>
>>>>>>> Andrey Slepuhin wrote:
>>>>>>>               
>>>>>>>> Dear folks,
>>>>>>>>
>>>>>>>> I have a test 4-node cluster using Intel Clovertown CPUs and new
>>>>>>>> Mellanox ConnectX cards. While running NAS NPB benchmarks I see
>>>>>>>> that several tests silently hang occasionally. I tried to use
>>>>>>>> different compilers (GCC and Intel) and different optimization
>>>>>>>> options to avoid miscompiling MVAPICH itself and the benchmarks,
>>>>>>>> but the situation didn't change. I do not see such problems with
>>>>>>>> OpenMPI, however. The cluster node configuration is:
>>>>>>>> Intel S5000PSL motherboard
>>>>>>>> 2 x Intel Xeon 5345 2.33 GHz
>>>>>>>> 8GB RAM
>>>>>>>> Mellanox MHGH-XTC card, firmware revision 2.0.158
>>>>>>>> Stock SLES10 without updates (but I also had the same problems
>>>>>>>> with the 2.6.22-rc4 kernel from Roland Dreier's git tree)
>>>>>>>> OFED-1.2.c-6 distribution from Mellanox
>>>>>>>>
>>>>>>>> Do you have any ideas about what could cause these freezes? I
>>>>>>>> would much appreciate any help.
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> P.S. BTW, what was the system configuration where 1.39usec 
>>>>>>>> latency for ConnectX was achieved? At the moment my best result 
>>>>>>>> with MVAPICH is 1.67usec using one Mellanox switch...
>>>>>>>> _______________________________________________
>>>>>>>> mvapich-discuss mailing list
>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>                 
>>>>>           
>>>       
>
>
> --
> http://www.cse.ohio-state.edu/~surs
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>   


