[mvapich-discuss] mvapich2 IB problems with transfers over ~10KB.

Choudhury, Durga Durga.Choudhury at drs-ss.com
Mon Apr 2 18:51:07 EDT 2007


This might be an irrelevant comment, but let me say this anyway.

Some GigE device drivers have bugs that prevent them from handling jumbo
frames above 8k correctly. Since you say your messages work below 8k, this
rang a bell. You say you are running an IB interconnect and not GigE, so
this should be irrelevant to you. However, if you are running IPoIB,
please check your routing table to make sure you are actually going over
IB and not over GigE (in fact, another user recently raised precisely the
same concern). If you are going over GigE, check whether your card is set
to handle jumbo frames and try reducing the MTU (1500 bytes is guaranteed
to work with all drivers).
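To make the 8k threshold concrete: when a message is split into link-layer frames, the largest frame is the smaller of the message size and the MTU. With a jumbo MTU, a message just over 8k therefore produces a frame over 8k, while an MTU of 1500 never does. The sketch below is a simplified illustration that ignores IP/TCP header overhead; the function names and MTU values are hypothetical examples, not taken from any driver.

```c
#include <stddef.h>

/* Number of frames needed to carry msg_len bytes when each frame carries
 * at most frame_payload bytes. Assumption: header overhead is ignored. */
size_t frames_needed(size_t msg_len, size_t frame_payload)
{
    if (frame_payload == 0)
        return 0; /* invalid MTU: no frame can carry data */
    /* Ceiling division: a partial final frame still counts as a frame. */
    return (msg_len + frame_payload - 1) / frame_payload;
}

/* Payload size of the largest frame produced for a msg_len-byte message.
 * A driver that mishandles frames above 8k is only triggered when this
 * value exceeds 8192, which cannot happen with a 1500-byte MTU. */
size_t largest_frame(size_t msg_len, size_t frame_payload)
{
    return msg_len < frame_payload ? msg_len : frame_payload;
}
```

For example, with a 9000-byte MTU a 10240-byte message yields a 9000-byte first frame, which a driver with the bug described above could mishandle; with a 1500-byte MTU the same message becomes seven frames of at most 1500 bytes.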

Best regards
Durga

-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu
[mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Mike Houston
Sent: Monday, April 02, 2007 3:04 PM
To: wei huang
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mvapich2 IB problems with transfers over ~10KB.



wei huang wrote:
> Hi Mike,
>
> Could you please run the IMB tests with the -DCHECK option (compile IMB
> with this flag)? This checks all collective operations with data
> verification.
>   
Will do.  I need to finish some large runs on the cluster with the 
workarounds before I can run this.
> Also, what exactly do you mean by ``cannot reliably'' send messages? Do
> you see data corruption, or are there other error symptoms?
>   
The messages either get corrupted themselves or corrupt data in other
windows (windows as in MPI_Win); the latter seems to be what we are
generally seeing. We don't see this behavior with mpich2 over GigE or
IPoIB. This may well be a bug on our side, but we don't have issues with
other MPI implementations, though we could be getting lucky.
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Mon, 2 Apr 2007, Mike Houston wrote:
>
>   
>> wei huang wrote:
>>     
>>> Hi Mike,
>>>
>>> Thanks for letting us know about the problem. To help us understand
>>> more about what is going on, would you please tell us the following?
>>>
>>> 1) Which version of mvapich2 are you using? The latest release
>>> version now should be mvapich2-0.9.8.
>>>
>>>       
>> Yes, 0.9.8
>>     
>>> 2) Could you try running osu_benchmarks and see if they all pass
>>> on your system? The benchmarks are distributed with the package and
>>> are in the `osu_benchmarks' directory. You should not experience
>>> problems with them if your systems are set up correctly.
>>>
>>>       
>> I'll give these a go, but they don't seem to verify results.
>>     
>>> Thanks.
>>>
>>> Regards,
>>> Wei Huang
>>>
>>>
>>>
>>>
>>>       
>>>> ---------- Forwarded message ----------
>>>> Date: Sun, 01 Apr 2007 03:03:31 -0700
>>>> From: Mike Houston <mhouston at graphics.stanford.edu>
>>>> To: mvapich-discuss at cse.ohio-state.edu
>>>> Subject: [mvapich-discuss] mvapich2 IB problems with transfers over ~10KB.
>>>>
>>>> We've hit an odd snag using mvapich2. We can't seem to reliably
>>>> send messages > 10KB. If we break up all large messages into 8KB
>>>> blocks and send them, things work just fine, but, as expected,
>>>> performance is awful. Under mpich2 with GigE and IPoIB, large
>>>> messages seem to work just fine. Both MPI_Send and MPI_Put seem to
>>>> exhibit the same behavior. I should note that the one oddity of our
>>>> system implementation is that we have a posted MPI_Irecv waiting
>>>> while doing the large transfers. Open-MPI flips out when we do this,
>>>> even in tcp mode.
>>>>
>>>> We have PCI-X SDR 4X boards, running the latest IB Gold release
>>>> (1.8.3) on top of the latest RHEL4 SMP x86 kernel (32-bit). The
>>>> boards have slightly older firmware, 3.3.3, but I'm hesitant to flash
>>>> newer firmware unless there are known issues with that version... We
>>>> built using the defaults in make.mvapich2.vapi. Any suggestions on
>>>> where to look or what to update? It seems *very* odd that large
>>>> transfers aren't working for us...
>>>>
>>>> Thanks!
>>>>
>>>> -Mike
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>>>         
>>>
>>>       
>
>
>   
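The 8KB-block workaround described in the original report can be sketched as a generic helper. This is a hypothetical illustration, not part of MPI or mvapich2: send_fn stands in for a blocking send such as MPI_Send (real code would wrap MPI_Send with a fixed destination, tag, and communicator in the callback), and the receiver must post a matching receive for each chunk, in order. Because every chunk is a separate message, each one pays the per-message overhead, which is consistent with the poor performance reported.

```c
#include <stddef.h>

/* Hypothetical stand-in for a blocking send such as MPI_Send: takes one
 * chunk, its length, and caller-supplied context; returns 0 on success. */
typedef int (*send_fn)(const void *chunk, size_t len, void *ctx);

/* Send len bytes from buf as consecutive chunks of at most chunk_size
 * bytes. Returns 0 on success, -1 for a zero chunk size, or the first
 * non-zero error code returned by send. */
int send_chunked(const void *buf, size_t len, size_t chunk_size,
                 send_fn send, void *ctx)
{
    const char *p = (const char *)buf;
    if (chunk_size == 0)
        return -1; /* cannot make progress with empty chunks */
    for (size_t off = 0; off < len; off += chunk_size) {
        /* The final chunk may be shorter than chunk_size. */
        size_t n = len - off < chunk_size ? len - off : chunk_size;
        int rc = send(p + off, n, ctx);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

With chunk_size set to 8192, a 20000-byte message is sent as two 8192-byte chunks followed by one 3616-byte chunk.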