[mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre

Terrence.LIAO at total.com Terrence.LIAO at total.com
Tue Dec 23 07:26:12 EST 2008


Professor Panda,

We do NOT have the same problem on our newer cluster, which has a Mellanox
PCIe IB card.  Your MVAPICH works very well on this cluster.
On the old cluster, we have finally been able to use IB with Lustre using
the OFED driver; however, this error code=12 has become a big problem.  I
also see the MPI pingpong run hang with np 36 from time to time; I guess
this is also linked to the flow control issue you mentioned.
I recall you mentioned that you have a cluster with an HTX card running
the InfiniPath driver with MVAPICH.  Would it be better for me to try this?
Also, is there an IB parameter I can set to avoid this kind of flow control
problem?

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002 
Tel: 713.647.3498  Fax: 713.647.3638
Email: terrence.liao at total.com

Houston HPC site:  http://us-hou-spt01/sites/rt/hpc/default.aspx
Pau HPC site:  http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx




Dhabaleswar Panda <panda at cse.ohio-state.edu>
12/22/2008 04:22 PM

To: Terrence.LIAO at total.com
cc: mvapich-discuss at cse.ohio-state.edu, Jing WEN <jing.wen at total.com>, Brian Stevens <brian at stevens.com>, <John.WANG at total.com>, Craig VERSHON <craig.vershon at total.com>
Subject: Re: [mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre

Terrence,

This error code signifies issues related to flow control in the IB
network. This could be coming from the OFED implementation + InfiniPath
SDR HTX. This particular adapter is an older one. Under high I/O load
(when using Lustre), the flow control issues might be becoming critical,
which is why you are getting this error code.  You may check with the
QLogic people on this. Do you see the same error with any other recent IB
adapters from QLogic or Mellanox?
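
For reference, the numeric code in the error message can be mapped back to a
libibverbs work-completion status name. The sketch below is only an
illustration: the table is a partial transcription of `enum ibv_wc_status`
from the libibverbs `verbs.h` header, and the helper name `wc_status_name`
is hypothetical, not part of MVAPICH or OFED. Code 12 corresponds to
IBV_WC_RETRY_EXC_ERR, i.e. the transport retry counter was exceeded.

```python
# Partial table of work-completion status codes, transcribed from
# `enum ibv_wc_status` in libibverbs (verbs.h). Illustrative subset only.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # the code reported in this thread
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def wc_status_name(code: int) -> str:
    """Return the status name for a numeric code (hypothetical helper)."""
    return IBV_WC_STATUS.get(code, f"unknown status ({code})")


if __name__ == "__main__":
    print(wc_status_name(12))
```

Code 13 (IBV_WC_RNR_RETRY_EXC_ERR) is a distinct status, so it is worth
confirming which of the two the MPI library actually reports.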

Thanks,

DK

> I have encountered a very strange IBV_WC_RETRY_EXC_ERR code=12 problem
> and need your advice.
> This problem only happens when using MPI-IO calls such as
> mpi_file_write_all() on Lustre.
> We are using OFED 1.4rc3 on CentOS 5.2.  The IB adapter is an InfiniPath
> SDR HTX. Lustre is version 1.6.5.1 and is mounted with the rw,_netdev flags.
> The same code runs fine on standard Ethernet-attached storage, such as
> NetApp (i.e. no IB to storage).  Also, the code without MPI-IO has
> no problem writing to Lustre.
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498  Fax: 713.647.3638
> Email: terrence.liao at total.com
>
>
