[mvapich-discuss] MVAPICH2 issue with CH3 gen2 channel

Juan Vercellone juanjov at gmail.com
Thu May 5 10:15:12 EDT 2011


Like a charm :)

Thank you very much, Sreeram.
I really appreciate your help.

My best regards,

On Wed, May 4, 2011 at 2:42 PM, sreeram potluri
<potluri at cse.ohio-state.edu> wrote:
> Dear Juan
> I observed in (2) that you are sending out a very large (~2GB) message.
> Currently, the one-sided implementation of MVAPICH2 does not handle
> very large messages properly. We will fix it soon and send you a
> patch. By the way, is there a real-world use case for this?
>
> There also seems to be a bug in the application (3): the size of the
> data in the communication calls is greater than the window size. The
> window creation function takes its size in bytes, while the
> communication operations take counts in datatype elements. So in the
> code snippet below from p2p.c:p2p_onesided, the size in the
> MPI_Win_create call should be bufsize*sizeof(long) rather than bufsize.
>
> Another problem I see here is that you have used sizeof(MPI_LONG) to
> get the size of the MPI datatype. This is erroneous: MPI_LONG is a
> datatype handle, so sizeof(MPI_LONG) gives the size of the handle, not
> of the data it describes. One should use the MPI_Type_size function
> instead.
>
> With these two changes I was able to run the active one-sided cases
> without any problem.
>
>        err |= MPI_Win_create(
>                info->b->mat,
>                bufsize,
>                sizeof(MPI_LONG),
>                MPI_INFO_NULL,
>                MPI_COMM_WORLD,
>                &win);
>
>              - - -
>
>        err |= MPI_Get(
>                buf[bufindex]->mat, bufsize, MPI_LONG,
>                serverpid, 0, bufsize, MPI_LONG, win);
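Applying both fixes, the corrected calls might look like the sketch below. This is not a tested patch; `info`, `buf`, `bufindex`, `bufsize`, `serverpid`, `win`, and `err` are assumed to be declared as in the original p2p.c.

```c
/* Query the datatype size at runtime instead of using sizeof(MPI_LONG),
 * which would give the size of the datatype handle, not of the data. */
int type_size;
err |= MPI_Type_size(MPI_LONG, &type_size);

/* MPI_Win_create takes the window size in BYTES... */
err |= MPI_Win_create(
        info->b->mat,
        (MPI_Aint)bufsize * type_size,  /* bytes, not element count */
        type_size,                      /* displacement unit in bytes */
        MPI_INFO_NULL,
        MPI_COMM_WORLD,
        &win);

/* ...while the communication calls take counts in datatype ELEMENTS,
 * so the MPI_Get arguments stay as they were. */
err |= MPI_Get(
        buf[bufindex]->mat, bufsize, MPI_LONG,
        serverpid, 0, bufsize, MPI_LONG, win);
```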
>
> Thank you
> Sreeram Potluri
> On Tue, May 3, 2011 at 1:09 PM, Juan Vercellone <juanjov at gmail.com> wrote:
>>
>> Dear Sreeram,
>> Thank you very much for your quick response. Here is some of the
>> information requested:
>>
>> 1) Which information would be of relevance here? (i.e., which Linux
>> commands should I issue?)
>> The hostfile, on the other hand, is set depending on the tests to be
>> run. I use 3 hostfiles. Here is some system information
>> (network):
>>
>>  - /etc/hosts: http://pastebin.com/Ti3b7NDw
>>  - hostfile 1: http://pastebin.com/nUsajf45
>>  - hostfile 2: http://pastebin.com/z2QRqwTa
>>  - hostfile 3: http://pastebin.com/5HKHWX84
>>  - /sbin/ifconfig -a output: http://pastebin.com/RsTmTyKd
>>  - /proc/version: http://pastebin.com/jgcGNys2
>>
>> As you can see, when using the 3rd hostfile, there are 2 processes per
>> node (I mostly run my MPI application with 4 processes).
>>
>> 2) The application does use multiple threads, but none of them
>> makes MPI calls. They are created just to perform some calculations,
>> and are terminated before the final result is obtained.
>>
>> I could run the following application using MVAPICH2 1.5rc1, but not
>> with MVAPICH2 1.6: http://pastebin.com/FsGjdfyj
>> MVAPICH2 1.5rc1 output: http://pastebin.com/ryAtKYwu
>> MVAPICH2 1.6 output: http://pastebin.com/0w53cHLY
>>
>>
>> 3) The application I am running is a full MPI application that I built
>> myself. To check out the repository:
>> svn checkout http://mat-prod.googlecode.com/svn/trunk/ mat-prod-read-only
>>
>> Makefiles are included:
>>  - To compile: make clean && make
>>  - To run:
>>      - make erun (run with hosts as hostfile)
>>      - make irun (run with hostsib as hostfile)
>>      - make orun (run with hostso as hostfile)
>>
>>
>> Thanks again for your prompt response!
>> Regards,
>>
>> On Tue, May 3, 2011 at 9:12 AM, sreeram potluri
>> <potluri at cse.ohio-state.edu> wrote:
>> > Dear Juan,
>> > Thanks for reporting your problem. We will need information about your
>> > system, build, and application to debug this further:
>> > 1) Configuration of the system (node and network) and organization of
>> > the host file (1 process per node?)
>> > 2) Regarding the failures with v1.6, I see that you have built it
>> > with --enable-threads=multiple. Does your application use multiple
>> > threads that make MPI calls?
>> > Were you able to run other tests (a hello world or the OSU benchmarks)
>> > with this build? If not, there could be a more basic issue with it.
>> > 3) Regarding the failures with 1.5rc1, is the test you are running an
>> > application benchmark or a full application? Would it be possible for
>> > you to send us your code? It will be easiest to debug if we can
>> > reproduce your issue locally.
>> > If not, excerpts from the code around the one-sided calls that are
>> > causing the issue (window creation, communication, and
>> > synchronization calls) can give us some insight.
>> > Thank you
>> > Sreeram Potluri
>> > On Tue, May 3, 2011 at 7:10 AM, Juan Vercellone <juanjov at gmail.com>
>> > wrote:
>> >>
>> >> Hello, list.
>> >> I am having some trouble executing my MPI applications with MVAPICH2
>> >> v1.6 (latest stable release).
>> >>
>> >> This is what I get when launching my programs:
>> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 3
>> >> MPI process (rank: 1) terminated unexpectedly on compute-0-4.local
>> >> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
>> >> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
>> >> connections)
>> >> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
>> >> connections)
>> >> Failed in initilization phase, cleaned up all the mpispawn!
>> >>
>> >> The application doesn't even start.
>> >>
>> >> The MVAPICH2 instance was compiled using the following configuration:
>> >> ./configure --enable-threads=multiple --disable-f90
>> >>
>> >> Using MVAPICH2 v1.5rc1 with the same configuration options, I can get
>> >> my application to work with some communication schemes, but I get
>> >> error messages when attempting to use one-sided active
>> >> synchronization calls. Here are the errors:
>> >>
>> >> (FOR ACTIVE SYNC WITH POST/WAIT/START/COMPLETE)
>> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 4
>> >> send desc error
>> >> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
>> >>  at line 580 in file ibv_channel_manager.c
>> >> send desc error
>> >> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
>> >>  at line 580 in file ibv_channel_manager.c
>> >> [3] Abort: Got FATAL event 3
>> >>  at line 935 in file ibv_channel_manager.c
>> >> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
>> >> Exit code -5 signaled from compute-0-3
>> >> [1] Abort: Got FATAL event 3
>> >>  at line 935 in file ibv_channel_manager.c
>> >> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
>> >> make: *** [orun] Error 1
>> >>
>> >>
>> >> (FOR ACTIVE SYNC WITH FENCE)
>> >> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 6
>> >> send desc error
>> >> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
>> >>  at line 580 in file ibv_channel_manager.c
>> >> send desc error
>> >> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
>> >>  at line 580 in file ibv_channel_manager.c
>> >> MPI process (rank: 2) terminated unexpectedly on compute-0-3.local
>> >> Exit code -5 signaled from compute-0-3
>> >> [3] Abort: Got FATAL event 3
>> >>  at line 935 in file ibv_channel_manager.c
>> >> [1] Abort: Got FATAL event 3
>> >>  at line 935 in file ibv_channel_manager.c
>> >> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
>> >> make: *** [orun] Error 1
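For context, the fence-based active synchronization being exercised above generally follows the shape sketched below. This is a generic, minimal MPI-2 one-sided example, not the actual mat_prod code; buffer names and sizes are illustrative.

```c
/* Minimal sketch of one-sided communication with fence synchronization. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;
    long *base = malloc(n * sizeof(long));
    for (int i = 0; i < n; i++)
        base[i] = (long)rank * n + i;

    /* Window size is given in bytes; the displacement unit here is
     * sizeof(long) so target displacements count long elements. */
    MPI_Win win;
    MPI_Win_create(base, (MPI_Aint)n * sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    long *recv = malloc(n * sizeof(long));

    /* The first fence opens the access/exposure epoch on all ranks... */
    MPI_Win_fence(0, win);
    /* ...communication counts are in datatype elements, not bytes... */
    MPI_Get(recv, n, MPI_LONG, (rank + 1) % nprocs, 0, n, MPI_LONG, win);
    /* ...and the closing fence completes all pending operations. */
    MPI_Win_fence(0, win);

    printf("rank %d read first element %ld from its neighbor\n",
           rank, recv[0]);

    MPI_Win_free(&win);
    free(base);
    free(recv);
    MPI_Finalize();
    return 0;
}
```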
>> >>
>> >> To proceed, please let me know exactly what information you need me
>> >> to provide so we can continue tracking down this issue.
>> >>
>> >> Thank you very much.
>> >> Regards,
>> >>
>> >> P.S.: Everything is working fine with (and just with) Nemesis for
>> >> InfiniBand (--with-device=ch3:nemesis:ib).
>> >>
>> >> --
>> >> ---------- .-
>> >> VERCELLONE, Juan.
>> >> (also known as 1010ad1c97efb4734854b6ffd0899401)
>> >> _______________________________________________
>> >> mvapich-discuss mailing list
>> >> mvapich-discuss at cse.ohio-state.edu
>> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> >
>> >
>>
>>
>>
>> --
>> ---------- .-
>> VERCELLONE, Juan.
>> (also known as 1010ad1c97efb4734854b6ffd0899401)
>
>



-- 
---------- .-
VERCELLONE, Juan.
(also known as 1010ad1c97efb4734854b6ffd0899401)


