[mvapich-discuss] MVAPICH2 issue with CH3 gen2 channel

Juan Vercellone juanjov at gmail.com
Tue May 3 13:09:21 EDT 2011


Dear Sreeram,
Thank you very much for your quick response. Here is some of the
information requested:

1) Which information would be relevant here (i.e., which Linux commands
should I run)? The hostfile, on the other hand, depends on the test being
run; I use 3 different hostfiles. Here is some system (network) information:

  - /etc/hosts: http://pastebin.com/Ti3b7NDw
  - hostfile 1: http://pastebin.com/nUsajf45
  - hostfile 2: http://pastebin.com/z2QRqwTa
  - hostfile 3: http://pastebin.com/5HKHWX84
  - /sbin/ifconfig -a output: http://pastebin.com/RsTmTyKd
  - /proc/version: http://pastebin.com/jgcGNys2

As you can see, the 3rd hostfile places 2 processes per node (I mostly
launch my MPI application with 4 processes).
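
For illustration only (the actual files are in the pastebin links above),
the 3rd hostfile follows the usual mpirun_rsh convention of repeating a
hostname once per process to be placed on that node, e.g.:

    compute-0-3
    compute-0-3
    compute-0-4
    compute-0-4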

2) The application does use multiple threads, but none of them make MPI
calls. They are created only to perform some local calculations and are
terminated before the final result is obtained.
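
To make the threading pattern concrete, here is a simplified sketch (not
the actual code from the repository): the worker threads are plain pthreads
that do local computation only and are joined before any further MPI
communication, so in principle MPI_THREAD_FUNNELED would be enough even
though the library was built with --enable-threads=multiple:

    #include <mpi.h>
    #include <pthread.h>

    /* Worker threads do local computation only -- no MPI calls. */
    static void *worker(void *arg)
    {
        double *partial = (double *) arg;
        *partial += 1.0;              /* stand-in for the real computation */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        /* Only the main thread calls MPI, so FUNNELED is sufficient. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        pthread_t t;
        double partial = 0.0, total = 0.0;
        pthread_create(&t, NULL, worker, &partial);
        pthread_join(t, NULL);        /* threads end before further MPI calls */

        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }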

I was able to run the following application with MVAPICH2 1.5rc1, but not
with MVAPICH2 1.6: http://pastebin.com/FsGjdfyj
MVAPICH2 1.5rc1 output: http://pastebin.com/ryAtKYwu
MVAPICH2 1.6 output: http://pastebin.com/0w53cHLY


3) The application I am running is a full MPI application that I wrote
myself. To check out the repository:
svn checkout http://mat-prod.googlecode.com/svn/trunk/ mat-prod-read-only

Makefiles are included:
  - To compile: make clean && make
  - To run:
      - make erun (run with hosts as hostfile)
      - make irun (run with hostsib as hostfile)
      - make orun (run with hostso as hostfile)
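
Since you asked for excerpts of the one-sided calls: the failing code paths
use window creation plus active synchronization, roughly along the lines of
the sketch below. This is a simplified illustration (the buffer size, group
construction and direction of the MPI_Put are made up here; it is not the
literal repository code). The run with argument 4 uses post/start/complete/wait
and the run with argument 6 uses fence:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Window creation over a local buffer. */
        const int n = 1024;
        double *buf = malloc(n * sizeof(double));
        for (i = 0; i < n; i++) buf[i] = rank;
        MPI_Win win;
        MPI_Win_create(buf, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Active synchronization with fence. */
        MPI_Win_fence(0, win);
        if (rank == 1)
            MPI_Put(buf, n, MPI_DOUBLE, 0, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        /* Active synchronization with post/start/complete/wait: each rank
           exposes its window to the previous rank and accesses the window
           of the next rank. */
        MPI_Group world_grp, origin_grp, target_grp;
        int target = (rank + 1) % size;
        int origin = (rank - 1 + size) % size;
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        MPI_Group_incl(world_grp, 1, &origin, &origin_grp);
        MPI_Group_incl(world_grp, 1, &target, &target_grp);

        MPI_Win_post(origin_grp, 0, win);    /* begin exposure epoch */
        MPI_Win_start(target_grp, 0, win);   /* begin access epoch   */
        MPI_Put(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
        MPI_Win_complete(win);               /* end access epoch     */
        MPI_Win_wait(win);                   /* end exposure epoch   */

        MPI_Group_free(&origin_grp);
        MPI_Group_free(&target_grp);
        MPI_Group_free(&world_grp);
        MPI_Win_free(&win);
        free(buf);
        MPI_Finalize();
        return 0;
    }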


Thanks again for your prompt response!
Regards,

On Tue, May 3, 2011 at 9:12 AM, sreeram potluri
<potluri at cse.ohio-state.edu> wrote:
> Dear Juan,
> Thanks for reporting your problem. We will need information about your
> system, build, and application to debug this further.
> 1) Configuration of the system (node and network) and organization of
> the host file (1 process/node?)
> 2) Regarding failures with v1.6, I see that you have built it
> with --enable-threads=multiple. Does your application use multiple threads
> that make MPI calls?
> Were you able to run other tests (a hello world or the OSU benchmarks) with
> this build? If not, there could be a more basic issue with it.
> 3) Regarding failures with 1.5rc1, is the test you are running an
> application benchmark or a full application? Will it be possible for you to
> send us your code? It will be easiest to debug if we can reproduce your
> issue locally.
> If not, excerpts from the code where it uses one-sided calls that are
> causing the issue can give some insights (window creation, communication and
> synchronization calls).
> Thank you
> Sreeram Potluri
> On Tue, May 3, 2011 at 7:10 AM, Juan Vercellone <juanjov at gmail.com> wrote:
>>
>> Hello, list.
>> I am having some trouble executing my MPI applications with MVAPICH2
>> v1.6 (latest stable release).
>>
>> This is what I get when launching my programs:
>> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 3
>> MPI process (rank: 1) terminated unexpectedly on compute-0-4.local
>> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
>> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
>> connections)
>> child_handler: Error in init phase...wait for cleanup! (1/2mpispawn
>> connections)
>> Failed in initilization phase, cleaned up all the mpispawn!
>>
>> The application doesn't even start.
>>
>> The MVAPICH2 instance was compiled using the following configuration:
>> ./configure --enable-threads=multiple --disable-f90
>>
>> Using MVAPICH2 v1.5rc1 with the same configuration options, I can get
>> my application to work with some communication schemes, but I get error
>> messages when attempting to use one-sided active synchronization calls.
>> Here are the errors:
>>
>> (FOR ACTIVE SYNC WITH POST/WAIT/START/COMPLETE)
>> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 4
>> send desc error
>> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
>>  at line 580 in file ibv_channel_manager.c
>> send desc error
>> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
>>  at line 580 in file ibv_channel_manager.c
>> [3] Abort: Got FATAL event 3
>>  at line 935 in file ibv_channel_manager.c
>> MPI process (rank: 0) terminated unexpectedly on compute-0-3.local
>> Exit code -5 signaled from compute-0-3
>> [1] Abort: Got FATAL event 3
>>  at line 935 in file ibv_channel_manager.c
>> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
>> make: *** [orun] Error 1
>>
>>
>> (FOR ACTIVE SYNC WITH FENCE)
>> mpirun_rsh -np 4 -hostfile ./hostso mat_prod 6
>> send desc error
>> [2] Abort: [] Got completion with error 10, vendor code=88, dest rank=3
>>  at line 580 in file ibv_channel_manager.c
>> send desc error
>> [0] Abort: [] Got completion with error 10, vendor code=88, dest rank=1
>>  at line 580 in file ibv_channel_manager.c
>> MPI process (rank: 2) terminated unexpectedly on compute-0-3.local
>> Exit code -5 signaled from compute-0-3
>> [3] Abort: Got FATAL event 3
>>  at line 935 in file ibv_channel_manager.c
>> [1] Abort: Got FATAL event 3
>>  at line 935 in file ibv_channel_manager.c
>> MPI process (rank: 3) terminated unexpectedly on compute-0-4.local
>> make: *** [orun] Error 1
>>
>> To move forward, please let me know exactly what information you need me
>> to provide so we can continue working on this issue.
>>
>> Thank you very much.
>> Regards,
>>
>> P.S.: Everything is working fine with (and only with) Nemesis for
>> InfiniBand (--with-device=ch3:nemesis:ib).
>>
>> --
>> ---------- .-
>> VERCELLONE, Juan.
>> (also known as 1010ad1c97efb4734854b6ffd0899401)
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
---------- .-
VERCELLONE, Juan.
(also known as 1010ad1c97efb4734854b6ffd0899401)


