[mvapich-discuss] Error - vbuf not correct

Liu Jianyu jerry_leo at msn.com
Fri Mar 20 15:10:36 EDT 2015


Hi Hari,

Thanks for your reply.

Here is the output of mpiname -a:

MVAPICH2 2.0b Fri Nov  8 11:17:40 EST 2013 ch3:nemesis

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran   -O2
FC: gfortran   -O2

Configuration
--prefix=/nuist/p/data/app/mvapich2/2.0b/gnu/4.7.2 --with-ib-libpath=/usr/lib64 --with-ib-include=/usr/include --with-ibverbs-lib=/usr/lib64 --with-ibverbs-include=/usr/include --enable-f77 --enable-fc --with-device=ch3:nemesis:ib,tcp


Until a couple of days ago, WRF ran on OFA without any problems when launched like this:

    mpirun -np  64  -hostfile n064  ./wrf.exe


To make sure this is not an input data issue, I tried running over TCP/IP only.

I also tried running WRF on a single node, testing the nodes one at a time, but could not identify the bad node.

Could you give more detailed instructions on how to diagnose this further?
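
One possible way to narrow this down, offered only as a sketch: a single-node run usually does not exercise the InfiniBand path, because ranks on the same node typically communicate through shared memory, so testing nodes in pairs may be more telling. The function below is a dry run that splits a hostfile into two-node hostfiles and prints one mpirun command per adjacent pair, without launching anything. The function name make_pair_tests, the pair_* file names, the one-hostname-per-line hostfile format, and the choice of 16 ranks (8 per node on 2 nodes) are assumptions for illustration, not something confirmed in this thread.

```shell
#!/bin/sh
# Dry run: build two-node hostfiles from a hostfile and print one mpirun
# command per adjacent pair of nodes. Nothing is launched here.
# Assumptions (hypothetical, not from the thread): the hostfile has one
# hostname per line, and each node runs 8 ranks, so a pair uses 16 ranks.

make_pair_tests() {
    hostfile=$1   # e.g. n064
    outdir=$2     # directory that receives the per-pair hostfiles
    mkdir -p "$outdir" || return 1
    prev=""
    # Skip blank lines and comment lines; keep only the first column.
    for host in $(awk 'NF && $1 !~ /^#/ { print $1 }' "$hostfile"); do
        if [ -n "$prev" ]; then
            pairfile="$outdir/pair_${prev}_${host}"
            printf '%s\n%s\n' "$prev" "$host" > "$pairfile"
            echo "mpirun -np 16 -ppn 8 -hostfile $pairfile ./wrf.exe"
        fi
        prev=$host
    done
}
```

Calling make_pair_tests n064 pairs would print the commands for review; they can then be run one at a time on OFA. A node that shows up in every failing pair would be the first one to inspect, though a failing pair could also point to a cable or switch port rather than the node itself.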

Thanks for your time

Jianyu

From: Hari Subramoni
Sent: Saturday, March 21, 2015 1:36 AM
To: Liu Jianyu 
Cc: mvapich-discuss at cse.ohio-state.edu 
Subject: Re: [mvapich-discuss] Error - vbuf not correct

Hello,


Could you please clarify which version of MVAPICH you are using and which build options were used? The output of mpiname -a will help.


On a different note, I see that you are using nemesis. For best performance, we recommend that you use the support for OpenFabrics (OFA) IB/iWARP/RoCE available with the CH3 channel.

Please refer to the following section of the user guide for more information on how to configure MVAPICH2 to use the CH3 channel.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1rc2-userguide.html#x1-110004.4


Thx,

Hari.


On Fri, Mar 20, 2015 at 1:29 PM, Liu Jianyu <jerry_leo at msn.com> wrote:

  Hi,

  Recently WRF V3.6.1 aborted on OFA with these error messages:

     recv desc error, 10934
     recv desc error, 10934
     [5] Abort: vbuf not correct.
     at line 410 in file src/mpid/ch3/channels/nemesis/netmod/ib/ib_vbuf.c

  I tried running WRF over TCP/IP on the same nodes like this, without any problems:

    MPICH_NEMESIS_NETMOD=tcp  mpirun -np 64 -ppn 8 -hostfile n064 ./wrf.exe

  I wonder whether it may be an IB hardware issue, but I have no idea how to identify the problem node.

  Any comments?

  I would appreciate your help.

  Regards

  Jianyu

