[mvapich-discuss] mvapich2-0.9.8 blacs problems

Bas van der Vlies basv at sara.nl
Tue Mar 20 03:42:45 EDT 2007


amith rajith mamidala wrote:
> Hi Bas,
> 
Hi Amith,

> Can you please apply the one-line patches below and let us know the
> outcome? I have tried a couple of cases and the patch is working fine.
> Also, can you let us know the nature of this application (scal.f)?
> It seems to be using several hundred MPI_Comm_split operations. Is this
> the typical application pattern?
> 
We have some users who do a lot of matrix calculations in programs that
run for a long time. These programs hang. So we made a small example
ScaLAPACK program that behaves the same way; MPI_Comm_split is called by
the BLACS library.
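
For reference, here is a minimal MPI-only sketch (my own, not taken from
scal.f or the BLACS sources) of the kind of loop that runs into the
"Too many communicators" limit when the communicators created in each
iteration are never freed:

{{{
/* split_leak.c -- hypothetical sketch, not part of scal.f.
 * A BLACS-style grid set-up splits the communicator on every
 * iteration; if the matching MPI_Comm_free is missing, the MPI
 * library eventually runs out of context ids and reports
 * "Too many communicators".                                    */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    const int iterations = 1000;   /* like the loop count read by scal.f */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < iterations; i++) {
        MPI_Comm row_comm;
        /* one split per iteration, e.g. the rows of a 2-D process grid */
        MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &row_comm);

        /* ... ScaLAPACK-style work on row_comm would go here ... */

        MPI_Comm_free(&row_comm);  /* leave this out and the run aborts
                                      after some number of iterations   */
    }

    if (rank == 0)
        printf("finished %d iterations\n", iterations);

    MPI_Finalize();
    return 0;
}
}}}

With the free in place this loops as often as you like; without it the
abort looks very much like the traces below.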

I have applied the patch for mvapich2-0.9.8 and the program no longer
hangs, but it still gives errors. When the error occurs, and which error
it is, depends on the size of the matrix:

The outcome on 4 nodes with 2 CPUs each (number of processes : iteration
at which the error occurs):
  1 : 63
  2 : 63
  3 : 63
  4 : 16
  5 : 63
  6 : 16
  7 : 63
  8 : 16

after 16 runs (8 procs):
{{{

loop, n, mb, nprocs, nprow, npcol: 16 100 16 8 2 4
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec, 
new_comm=0xbf9b2d64) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec, 
new_comm=0xbfd450f4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c, 
new_comm=0xbfdf39a4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c, 
new_comm=0xbfc38ff4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec, 
new_comm=0xbfcfd0b4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c, 
new_comm=0xbfcd6084) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec, 
new_comm=0xbff62b14) failed
MPI_Comm_create(143): Too many communicatorsFatal error in 
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c, 
new_comm=0xbfb2fee4) failed
MPI_Comm_create(143): Too many communicatorsrank 7 in job 8 
ib-r6n20.irc.sara.nl_11382   caused collective abort of all ranks
   exit status of rank 7: killed by signal 9
rank 6 in job 8  ib-r6n20.irc.sara.nl_11382   caused collective abort of 
all ranks
   exit status of rank 6: killed by signal 9
rank 5 in job 8  ib-r6n20.irc.sara.nl_11382   caused collective abort of 
all ranks
   exit status of rank 5: killed by signal 9
rank 4 in job 8  ib-r6n20.irc.sara.nl_11382   caused collective abort of 
all ranks
   exit status of rank 4: killed by signal 9
end 8
}}}

after 63 runs (7 procs):
{{{
  loop, n, mb, nprocs, nprow, npcol: 63 100 16 7 1 7
Fatal error in MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=0, key=0, 
new_comm=0x13535864) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=2, key=0, 
new_comm=0x13533734) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=2, key=1, 
new_comm=0x134fea44) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=1, key=1, 
new_comm=0x134fe404) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=3, key=0, 
new_comm=0x135f3de4) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=1, key=0, 
new_comm=0x13532184) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in 
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=0, key=1, 
new_comm=0x134fe404) failed
MPIR_Comm_create(90): Too many communicatorsrank 6 in job 7 
ib-r6n20.irc.sara.nl_11382   caused collective abort of all ranks
   exit status of rank 6: killed by signal 9
rank 5 in job 7  ib-r6n20.irc.sara.nl_11382   caused collective abort of 
all ranks
   exit status of rank 5: killed by signal 9
rank 4 in job 7  ib-r6n20.irc.sara.nl_11382   caused collective abort of 
all ranks
   exit status of rank 4: killed by signal 9
end 7
}}}
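
A small probe like the one below (my own sketch, not part of scal.f) can
be used to count how many communicators an MPI build hands out before it
starts failing with the same "Too many communicators" error:

{{{
/* comm_probe.c -- hypothetical probe, not part of scal.f.
 * Counts how many communicators this MPI build hands out before
 * MPI_Comm_dup starts failing.                                    */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, count = 0;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting so the loop can stop cleanly. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    while (MPI_Comm_dup(MPI_COMM_WORLD, &dup) == MPI_SUCCESS)
        count++;                   /* deliberately never freed */

    if (rank == 0)
        printf("MPI_Comm_dup succeeded %d times before failing\n", count);

    MPI_Finalize();
    return 0;
}
}}}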


> For mvapich-0.9.9-beta:
> 
> Index: create_2level_comm.c (In $HOME/src/context)
> ===================================================================
> --- create_2level_comm.c        (revision 1102)
> +++ create_2level_comm.c        (working copy)
> @@ -56,7 +56,6 @@
>      struct MPIR_COMMUNICATOR* comm_world_ptr;
>      comm_world_ptr = MPIR_GET_COMM_PTR(MPI_COMM_WORLD);
> 
> -    if (comm_count > MAX_ALLOWED_COMM) return;
> 
>      int* shmem_group = malloc(sizeof(int) * size);
>      if (NULL == shmem_group){
> 
> 
> 
> For mvapich2-0.9.8:
> 
> Index: create_2level_comm.c (In $HOME/src/mpi/comm)
> ===================================================================
> --- create_2level_comm.c        (revision 1104)
> +++ create_2level_comm.c        (working copy)
> @@ -33,7 +33,6 @@
>      MPID_Comm_get_ptr( comm, comm_ptr );
>      MPID_Comm_get_ptr( MPI_COMM_WORLD, comm_world_ptr );
> 
> -    if (comm_count > MAX_ALLOWED_COMM) return;
> 
>      MPIR_Nest_incr();
> 
> 
> 
> Thanks,
> Amith
> 
> 
> On Mon, 19 Mar 2007, Bas van der Vlies wrote:
> 
>> Dhabaleswar Panda wrote:
>>> Hi Bas,
>>>
>>>>   We have done some further testing with mvapich versions:
>>>>    * 0.9.8: everything works
>>>>
>>>>    * 0.9.9-beta: it is very slow and it also hangs, like mvapich2
>>> Thanks for reporting this. Just to check: are you using the latest
>>> mvapich 0.9.9 from the trunk, or the beta tarball released on 02/09/07?
>>> A lot of fixes and tunings have gone in since the beta version
>>> was released. You can get the latest version of the trunk through an SVN
>>> checkout or by downloading the nightly tarballs of the trunk.
>>>
>> I downloaded the tarball. We will test the latest trunk version.
>>
>> Regards
>>
>>> Best Regards,
>>>
>>> DK
>>>
>>>> Regards
>>>>
>>>>> Thanks.
>>>>>
>>>>> Regards,
>>>>> Wei Huang
>>>>>
>>>>> 774 Dreese Lab, 2015 Neil Ave,
>>>>> Dept. of Computer Science and Engineering
>>>>> Ohio State University
>>>>> OH 43210
>>>>> Tel: (614)292-8501
>>>>>
>>>>>
>>>>> On Mon, 19 Mar 2007, Bas van der Vlies wrote:
>>>>>
>>>>>> wei huang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for letting us know about the problem. We have generated a patch
>>>>>>> to address it, and have applied it to both the trunk and our svn
>>>>>>> 0.9.8 branch.
>>>>>>>
>>>>>>>
>>>>>> We have done some more tests and found another problem using mvapich2
>>>>>> and BLACS. These problems are encountered by user programs: we get
>>>>>> reports from our users that their programs produce wrong answers.
>>>>>>
>>>>>> We have made a small Fortran (g77) program to illustrate the problem.
>>>>>> It calls the same ScaLAPACK routine a number of times. Independent of
>>>>>> the size of the problem, the program hangs after 8 or 31 iterations,
>>>>>> except when the number of processes is a square, e.g. 1x1, 2x2, ...
>>>>>>
>>>>>> How to compile the program:
>>>>>> mpif77 -Wall -g -O0 -o scal scal.f -lscalapack -lfblacs -lcblacs -lblacs
>>>>>> -llapack -latlas
>>>>>>
>>>>>> The program reads the following from standard input:
>>>>>> <size of matrix> <block size> <number of iterations>
>>>>>>
>>>>>> for example:
>>>>>>   echo '100 16 100' | mpiexec -n <np> ./scal
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> PS) This program behaves correctly with the Topspin/Cisco software, which
>>>>>> is based on their InfiniBand stack and on an mvapich1 version.
>>>>>>
>>>>>> We are going to test the program with mvapich1 from OSU.
> 
> 


-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************

