[mvapich-discuss] MVAPICH2 issue with CH3 gen2 channel

sreeram potluri potluri at cse.ohio-state.edu
Wed May 4 15:42:52 EDT 2011


Dear Juan,

I observed that in (2) you are sending out a very large (~2 GB)
message. The one-sided implementation of MVAPICH2 currently does not
handle such large messages correctly. We will fix this soon and send
you a patch. By the way, is there a real-world use case for this?

There also seems to be a bug in the application from (3): the size of
the data in the communication calls is greater than the window size.
The window creation function takes its size in bytes, while the
communication operations take a count of datatype elements. So, in the
code snippet below from p2p.c:p2p_onesided, the size in the
MPI_Win_create call should be bufsize*sizeof(long) rather than
bufsize.

Another problem I see is the use of sizeof(MPI_LONG) to get the size
of the MPI datatype. This is erroneous: MPI_LONG is an opaque datatype
handle (possibly just a macro), so its size says nothing about the
size of a long. Use the MPI_Type_size function instead.
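
For example, the element size can be obtained portably as follows (a
minimal sketch):

       int elem_size;
       MPI_Type_size(MPI_LONG, &elem_size); /* size of one MPI_LONG element in bytes */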

With these two changes I was able to run the active one-sided cases
without any problem.

       err |= MPI_Win_create(
               info->b->mat,
               bufsize,            /* bug: size must be in bytes, i.e. bufsize*sizeof(long) */
               sizeof(MPI_LONG),   /* bug: MPI_LONG is a handle; use MPI_Type_size */
               MPI_INFO_NULL,
               MPI_COMM_WORLD,
               &win);

             - - -

       err |= MPI_Get(
               buf[bufindex]->mat, bufsize, MPI_LONG,
               serverpid, 0, bufsize, MPI_LONG, win);
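
Putting both fixes together, window creation would look like the
sketch below (variable names as in your snippet; the MPI_Get call can
stay as it is, since it already passes a count of MPI_LONG elements):

       int disp_unit;
       MPI_Type_size(MPI_LONG, &disp_unit);    /* element size in bytes */
       err |= MPI_Win_create(
               info->b->mat,
               (MPI_Aint) bufsize * disp_unit, /* window size in bytes */
               disp_unit,
               MPI_INFO_NULL,
               MPI_COMM_WORLD,
               &win);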

Thank you
Sreeram Potluri

On Tue, May 3, 2011 at 1:09 PM, Juan Vercellone <juanjov at gmail.com> wrote:

> Dear Sreeram,
> Thank you very much for your quick response. Here is some of the
> information requested:
>
> 1) Which information would be relevant here (e.g., which Linux
> commands to issue)?
> The hostfile, on the other hand, is set depending on the tests to be
> run. I use 3 hostfiles. Here is some system information (network):
>
>  - /etc/hosts: http://pastebin.com/Ti3b7NDw
>  - hostfile 1: http://pastebin.com/nUsajf45
>  - hostfile 2: http://pastebin.com/z2QRqwTa
>  - hostfile 3: http://pastebin.com/5HKHWX84
>  - /sbin/ifconfig -a output: http://pastebin.com/RsTmTyKd
>  - /proc/version: http://pastebin.com/jgcGNys2
>
> As you can see, when using the 3rd hostfile, there are 2 processes per
> node (I mostly launch my MPI application with 4 processes).
>
> 2) The application does use multiple threads, but none of them makes
> MPI calls. They are created just to perform some calculations, and are
> terminated before the final result is obtained.
>
> I could run the following application using MVAPICH2 1.5rc1, but not
> with MVAPICH2 1.6: http://pastebin.com/FsGjdfyj
> MVAPICH2 1.5rc1 output: http://pastebin.com/ryAtKYwu
> MVAPICH2 1.6 output: http://pastebin.com/0w53cHLY
>
>
> 3) The application I am running is a full MPI application that I
> built myself. To check out the repository:
> svn checkout http://mat-prod.googlecode.com/svn/trunk/ mat-prod-read-only
>
> Makefiles are included:
>  - To compile: make clean && make
>  - To run:
>      - make erun (run with hosts as hostfile)
>      - make irun (run with hostsib as hostfile)
>      - make orun (run with hostso as hostfile)
>
>
> Thanks again for your prompt response!
> Regards,
>
> On Tue, May 3, 2011 at 9:12 AM, sreeram potluri
> <potluri at cse.ohio-state.edu> wrote:
> > Dear Juan,
> > Thanks for reporting your problem. We will need information about your
> > system, build, and application to debug this further:
> > 1) Configuration of the system (node and network) and organization of
> > the host file (1 process/node?)
> > 2) Regarding failures with v1.6, I see that you have built it with
> > --enable-threads=multiple. Does your application use multiple threads
> > that make MPI calls?
> > Were you able to run other tests (a hello world or the OSU benchmarks)
> > with this build? If not, there could be a more basic issue with it.
> > 3) Regarding failures with 1.5rc1, is the test you are running an
> > application benchmark or a full application? Would it be possible for
> > you to send us your code? It will be easiest to debug if we can
> > reproduce your issue locally.
> > If not, excerpts from the code where it uses the one-sided calls that
> > are causing the issue (window creation, communication, and
> > synchronization calls) can give some insight.
> > Thank you
> > Sreeram Potluri
> > On Tue, May 3, 2011 at 7:10 AM, Juan Vercellone <juanjov at gmail.com> wrote:
> >>
> >> Hello, list.
> >> I am having some trouble executing my MPI applications with MVAPICH2
> >> v1.6 (latest stable release).
> >>
> >> This is what I get when launching my programs:
> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 3
> >> MPI process (rank: 1) terminated unexpectedly on compute-0-4.local
> >> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
> >> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
> >> connections)
> >> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
> >> connections)
> >> Failed in initilization phase, cleaned up all the mpispawn!
> >>
> >> The application doesn't even start.
> >>
> >> The MVAPICH2 instance was compiled using the following configuration:
> >> ./configure --enable-threads=multiple --disable-f90
> >>
> >> Using MVAPICH2 v1.5rc1 with the same configuration options, I can get
> >> my application to work with some communication schemes, but get error
> >> messages when attempting to use one-sided active synchronization
> >> calls.
> >> Here are these errors:
> >>
> >> (FOR ACTIVE SYNC WITH POST/WAIT/START/COMPLETE)
> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 4
> >> send desc error
> >> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
> >>  at line 580 in file ibv_channel_manager.c
> >> send desc error
> >> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
> >>  at line 580 in file ibv_channel_manager.c
> >> [3] Abort: Got FATAL event 3
> >>  at line 935 in file ibv_channel_manager.c
> >> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
> >> Exit code -5 signaled from compute-0-3
> >> [1] Abort: Got FATAL event 3
> >>  at line 935 in file ibv_channel_manager.c
> >> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
> >> make: *** [orun] Error 1
> >>
> >>
> >> (FOR ACTIVE SYNC WITH FENCE)
> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 6
> >> send desc error
> >> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
> >>  at line 580 in file ibv_channel_manager.c
> >> send desc error
> >> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
> >>  at line 580 in file ibv_channel_manager.c
> >> MPI process (rank: 2) terminated unexpectedly on compute-0-3.local
> >> Exit code -5 signaled from compute-0-3
> >> [3] Abort: Got FATAL event 3
> >>  at line 935 in file ibv_channel_manager.c
> >> [1] Abort: Got FATAL event 3
> >>  at line 935 in file ibv_channel_manager.c
> >> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
> >> make: *** [orun] Error 1
> >>
> >> To proceed, please let me know exactly what information you need
> >> me to provide so we can continue tracking down this issue.
> >>
> >> Thank you very much.
> >> Regards,
> >>
> >> P.S.: Everything works fine with (and only with) the Nemesis
> >> channel for InfiniBand (--with-device=ch3:nemesis:ib).
> >>
> >> --
> >> ---------- .-
> >> VERCELLONE, Juan.
> >> (also known as 1010ad1c97efb4734854b6ffd0899401)
> >
> >
>
>
>
> --
> ---------- .-
> VERCELLONE, Juan.
> (also known as 1010ad1c97efb4734854b6ffd0899401)
>