[mvapich-discuss] MVAPICH2 not connecting at 64 nodes

Sayantan Sur surs at cse.ohio-state.edu
Fri Jun 4 10:17:34 EDT 2010


Hi Chris,

Thanks for reporting this issue. We have not seen it before. May we ask
which MVAPICH2 device you are using: Gen2 or uDAPL?

If you are willing, could you try MVAPICH2 1.5rc1? It also supports
Hydra, an alternative process launcher.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc1.html#x1-200005.2.2

We will be making the final 1.5 release soon. If 1.5rc1 works for you, you
can proceed with using MVAPICH2 on BG. In the meantime, we can work on
figuring out this issue.

Let us know.

Thanks.

On Fri, Jun 4, 2010 at 9:50 AM, TJC Ward <tjcw at us.ibm.com> wrote:

> Running 'strace' on the 'mpispawn' processes on rank 0 and rank 1 shows a
> pattern like
>
> select(1024, [5 6 7 8 9 10], NULL, NULL, NULL) = 1 (in [6])
> read(6, "\0\0\0\35\0\0\0=\377\377\377\377", 12) = 12
> read(6, "cmd=get kvsname=kvs_340_gx0y0z0_2797_0 key=00000002-00000029\n", 61) = 61
> write(6, "\0\0\0\35\0\0\0=\377\377\377\377", 12) = 12
> write(6, "cmd=get kvsname=kvs_340_gx0y0z0_2797_0 key=00000002-00000029\n", 61) = 61
> select(1024, [5 6 7 8 9 10], NULL, NULL, NULL) = 1 (in [6])
> read(6, "\0\0\0*\0\0\0=\377\377\377\377", 12) = 12
> read(6, "cmd=get kvsname=kvs_340_gx0y0z0_2797_0 key=00000003-00000042\n", 61) = 61
> write(6, "\0\0\0*\0\0\0=\377\377\377\377", 12) = 12
> write(6, "cmd=get kvsname=kvs_340_gx0y0z0_2797_0 key=00000003-00000042\n", 61) = 61
>
> repeated with minor variations. This makes me wonder whether the queries
> have been routed such that node 0 thinks node 1 has the answer and node 1
> thinks node 0 has the answer, so that the queries circulate forever.
>
> Does anyone know whether this is what is going on, and if so, how to fix it
> so that the answer is eventually found?
> T J (Chris) Ward, IBM Research
> Scalable Data-Centric Computing - Active Storage Fabrics - IBM System BlueGene
> IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
> 011-44-1962-818679
>
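
A note on the 12-byte records in the trace quoted above: they unpack
cleanly as three big-endian signed 32-bit integers. The Python sketch
below uses bytes copied from the strace output. The middle field equals
the length of the text line that follows (61), and the first field equals
the decimal number at the end of the key (29, then 42). What these fields
mean inside mpispawn is an assumption here, not something confirmed from
the source code, and the helper names decode_header and parse_payload are
made up for this example.

```python
import struct


def decode_header(header: bytes):
    """Unpack a 12-byte record as three big-endian signed 32-bit integers.

    The meaning of the three fields is an assumption; only the byte
    layout is taken from the trace.
    """
    if len(header) != 12:
        raise ValueError("expected exactly 12 bytes")
    return struct.unpack(">iii", header)


def parse_payload(payload: bytes):
    """Split a 'name=value name=value\\n' text line into a dict."""
    fields = payload.decode("ascii").split()
    return dict(field.split("=", 1) for field in fields)


# Bytes copied from the strace output (strace prints octal escapes,
# which Python bytes literals also accept).
header_1 = b"\0\0\0\35\0\0\0=\377\377\377\377"
payload_1 = b"cmd=get kvsname=kvs_340_gx0y0z0_2797_0 key=00000002-00000029\n"
header_2 = b"\0\0\0*\0\0\0=\377\377\377\377"

print(decode_header(header_1))   # (29, 61, -1)
print(decode_header(header_2))   # (42, 61, -1)
print(len(payload_1))            # 61, same as the middle header field
print(parse_payload(payload_1)["key"])
```

If the same (first field, key) pair keeps recurring in a longer trace,
that would be consistent with a request being passed back and forth. If
the pairs keep changing, the processes are more likely still working
through distinct lookups.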


-- 
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.