[mvapich-discuss] Need advice on Error code =12 problem only when
running with MPIIO on lustre
Terrence.LIAO at total.com
Tue Dec 23 07:26:12 EST 2008
Professor Panda,
We do NOT have the same problem on our newer cluster, which has Mellanox
PCIe IB cards. Your MVAPICH works very well on this cluster.
On the old cluster, we have finally been able to use IB on Lustre with the
OFED driver; however, this error code=12 has become a big problem. I also
see the MPI pingpong run hang with np 36 from time to time. I guess this is
also linked to the flow control issue you mentioned.
I recall you mentioned that you have a cluster with HTX cards running
InfiniPath's driver with MVAPICH. Would it be better for me to try this?
Also, is there an IB parameter I can set to avoid this kind of flow control
problem?
Thank you very much.
-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao at total.com
Houston HPC site: http://us-hou-spt01/sites/rt/hpc/default.aspx
Pau HPC site: http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx
Dhabaleswar Panda <panda at cse.ohio-state.edu>
12/22/2008 04:22 PM
To
Terrence.LIAO at total.com
cc
mvapich-discuss at cse.ohio-state.edu, Jing WEN <jing.wen at total.com>, Brian
Stevens <brian at stevens.com>, <John.WANG at total.com>, Craig VERSHON
<craig.vershon at total.com>
Subject
Re: [mvapich-discuss] Need advice on Error code =12 problem only when
running with MPIIO on lustre
Terrence,
This error code signifies issues related to flow control in the IB
network. This could be coming from the OFED implementation + InfiniPath
SDR HTX. This particular adapter is an older one. Under high I/O load
(when using Lustre), the flow control issues might be becoming critical
and you are getting this error code. You may check with the QLogic people
on this. Do you see the same error with any other recent IB adapters from
QLogic or Mellanox?
Thanks,
DK
> I have encountered a very strange IBV_WC_RETRY_EXC_ERR code=12 problem
> and need your advice.
> This problem only happens when using MPI-IO calls such as
> mpi_file_write_all() on Lustre.
> We are using ofed1.4rc3 on CentOS 5.2. The IB is InfiniPath SDR HTX.
> Lustre is running version 1.6.5.1 and is mounted with the rw,_netdev flags.
> The same code runs fine on standard Ethernet-attached storage, such as
> NetApp (i.e. no IB to storage). Also, the code without MPI-IO has
> no problem writing to Lustre.
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao at total.com
>
>
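For readers decoding the numeric status discussed in this thread: the value 12 reported alongside IBV_WC_RETRY_EXC_ERR is a position in the libibverbs `enum ibv_wc_status`. The sketch below is only an illustration, not part of MVAPICH or OFED. It copies the enum names into a plain list by hand (so it runs without the verbs library installed), and the helper name `wc_status_name` is hypothetical.

```python
# Names of enum ibv_wc_status in declaration order (libibverbs,
# infiniband/verbs.h), copied by hand so no verbs library is needed.
# The list index is the numeric code printed in the error message.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",             # 0
    "IBV_WC_LOC_LEN_ERR",         # 1
    "IBV_WC_LOC_QP_OP_ERR",       # 2
    "IBV_WC_LOC_EEC_OP_ERR",      # 3
    "IBV_WC_LOC_PROT_ERR",        # 4
    "IBV_WC_WR_FLUSH_ERR",        # 5
    "IBV_WC_MW_BIND_ERR",         # 6
    "IBV_WC_BAD_RESP_ERR",        # 7
    "IBV_WC_LOC_ACCESS_ERR",      # 8
    "IBV_WC_REM_INV_REQ_ERR",     # 9
    "IBV_WC_REM_ACCESS_ERR",      # 10
    "IBV_WC_REM_OP_ERR",          # 11
    "IBV_WC_RETRY_EXC_ERR",       # 12: transport retry counter exceeded
    "IBV_WC_RNR_RETRY_EXC_ERR",   # 13: receiver-not-ready retry counter exceeded
    "IBV_WC_LOC_RDD_VIOL_ERR",    # 14
    "IBV_WC_REM_INV_RD_REQ_ERR",  # 15
    "IBV_WC_REM_ABORT_ERR",       # 16
    "IBV_WC_INV_EECN_ERR",        # 17
    "IBV_WC_INV_EEC_STATE_ERR",   # 18
    "IBV_WC_FATAL_ERR",           # 19
    "IBV_WC_RESP_TIMEOUT_ERR",    # 20
    "IBV_WC_GENERAL_ERR",         # 21
]


def wc_status_name(code: int) -> str:
    """Return the enum name for a numeric work-completion status code."""
    if 0 <= code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return f"unknown status {code}"


if __name__ == "__main__":
    # The code reported in this thread.
    print(wc_status_name(12))
```

Code 12 means the sender gave up after its transport retries went unacknowledged, which fits a peer or fabric that stops responding under heavy I/O load. It differs from code 13, which reports that the receiver had no receive buffer posted.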