[mvapich-discuss] mvapich2-0.9.8 blacs problems

amith rajith mamidala mamidala at cse.ohio-state.edu
Thu Mar 22 18:09:45 EDT 2007


Hi Bas,

Can you please try out these two patches and see if the problem goes away?
I have tried this out with the scal.f test and things are fine.

Thanks,
Amith

On Thu, 22 Mar 2007, Bas van der Vlies wrote:

> Hello,
>
>   We have written two simpler programs that do not use scalapack/blacs
> and also show the same behavior. See the attachments.
>
> Here are the results:
>   mvapich 0.9.8: No problems
>   mvapich 0.9.9 trunk: see below for errors
>   mvapich2 0.9.8 : see below for errors
>
>
> Regards, and hope this helps with diagnosing the problem.
>
> =====================================================================
>   mvapich 0.9.9 trunk:
> duptest:
> ====================================================
> Running with 8 processes
> will do 100000 dups and frees
> ............................................0 - <NO ERROR MESSAGE> :
> Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> [0] [] Aborting Program!
> 4 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 2 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 6 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> mpirun_rsh: Abort signaled from [0]
> [4] [] Aborting Program!
> [2] [] Aborting Program!
> [6] [] Aborting Program!
> done.
> ====================================================
>
> splittest:
> ====================================================
> bas at ib-r21n1:~/src/applications$ mpirun -np 8 ./a.out
>
> Running with 8 processes
> will do 100000 splits and frees
> ......................................0 - <NO ERROR MESSAGE> : Pointer
> conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> [0] [] Aborting Program!
> 6 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 2 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 4 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> mpirun_rsh: Abort signaled from [0]
> [6] [] Aborting Program!
> [2] [] Aborting Program!
> [4] [] Aborting Program!
> done.
> ====================================================
>
>
> mvapich2 0.9.8:
>
> duptest:
> ====================================================
> bas at ib-r21n1:~/src/applications$ mpiexec -n $nprocs  ./a.out
> Running with 8 processes
> will do 100000 dups and frees
> .Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f4d8a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=1, key=1,
> new_comm=0xb7f7a7bc) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=2, key=0,
> new_comm=0xb7f778a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=2, key=1,
> new_comm=0xb7ecf7bc) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=3, key=0,
> new_comm=0xb7f398a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=3, key=1,
> new_comm=0xb7f447bc) failed
> MPIR_Comm_create(90): Too many communicatorsrank 7 in job 1
> ib-r21n1.irc.sara.nl_8763   caused collective abort of all ranks
>    exit status of rank 7: killed by signal 9
> rank 6 in job 1  ib-r21n1.irc.sara.nl_8763   caused collective abort of
> all ranks
>    exit status of rank 6: killed by signal 9
> Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f708a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=0, key=1,
> new_comm=0xb7f4d7bc) failed
> MPIR_Comm_create(90): Too many communicatorsrank 5 in job 1
> ib-r21n1.irc.sara.nl_8763   caused collective abort of all ranks
>    exit status of rank 5: return code 13
> rank 4 in job 1  ib-r21n1.irc.sara.nl_8763   caused collective abort of
> all ranks
>    exit status of rank 4: killed by signal 9
> ====================================================
>
> splittest:
> ====================================================
> Running with 8 processes
> will do 100000 splits and frees
> .Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f2b8a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f258a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f168a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f328a4) failed
> MPIR_Comm_create(90): Too many communicatorsrank 2 in job 3
> ib-r21n1.irc.sara.nl_8763   caused collective abort of all ranks
>    exit status of rank 2: killed by signal 9
> rank 1 in job 3  ib-r21n1.irc.sara.nl_8763   caused collective abort of
> all ranks
>    exit status of rank 1: killed by signal 9
> rank 0 in job 3  ib-r21n1.irc.sara.nl_8763   caused collective abort of
> all ranks
>    exit status of rank 0: killed by signal 9
> ====================================================
> --
> ********************************************************************
> *                                                                  *
> *  Bas van der Vlies                     e-mail: basv at sara.nl      *
> *  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
> *  Kruislaan 415                         fax:    +31 20 6683167    *
> *  1098 SJ Amsterdam                                               *
> *                                                                  *
> ********************************************************************
>
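
The attachments themselves are not reproduced in the archive. For reference, a
dup/free stress test of the kind described above boils down to a loop like the
following minimal C sketch (the 100000 iteration count and the banner text
follow the output above; the progress-dot interval is an assumption, not taken
from the original program):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Comm c;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        printf("Running with %d processes\n", size);
        printf("will do 100000 dups and frees\n");
    }

    for (i = 0; i < 100000; i++) {
        /* splittest variant: replace the dup with
         *   MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &c);          */
        MPI_Comm_dup(MPI_COMM_WORLD, &c);
        MPI_Comm_free(&c);               /* every communicator is freed again */

        if (rank == 0 && i % 1000 == 0) {   /* progress dots; interval is a guess */
            printf(".");
            fflush(stdout);
        }
    }

    if (rank == 0)
        printf("done.\n");

    MPI_Finalize();
    return 0;
}

With every MPI_Comm_dup or MPI_Comm_split paired with an MPI_Comm_free, a loop
like this should run to completion; the failures above point to communicators
being leaked inside the library, which is what the patches below to
create_2level_comm.c address.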
-------------- next part --------------
Index: create_2level_comm.c
===================================================================
--- create_2level_comm.c	(revision 1112)
+++ create_2level_comm.c	(working copy)
@@ -164,7 +164,9 @@
     }
     else{
         comm_ptr->shmem_coll_ok = 0;
-        free_2level_comm(comm_ptr);
+	free_2level_comm(comm_ptr);
+	if (comm_ptr->leader_comm) { MPI_Comm_free(&(comm_ptr->leader_comm));}
+	if (comm_ptr->shmem_comm)  { MPI_Comm_free(&(comm_ptr->shmem_comm));}
     }
 
     ++comm_count;
-------------- next part --------------
Index: create_2level_comm.c
===================================================================
--- create_2level_comm.c	(revision 1118)
+++ create_2level_comm.c	(working copy)
@@ -26,6 +26,22 @@
 extern shmem_coll_region *shmem_coll;
 static pthread_mutex_t shmem_coll_lock  = PTHREAD_MUTEX_INITIALIZER;
 
+void clear_2level_comm (MPID_Comm* comm_ptr)
+{
+    comm_ptr->shmem_coll_ok = 0;
+    comm_ptr->leader_map  = NULL;
+    comm_ptr->leader_rank = NULL;
+}
+
+void free_2level_comm (MPID_Comm* comm_ptr)
+{
+    if (comm_ptr->leader_map)  { free(comm_ptr->leader_map);  }
+    if (comm_ptr->leader_rank) { free(comm_ptr->leader_rank); }
+    if (comm_ptr->leader_comm) { MPI_Comm_free(&(comm_ptr->leader_comm));}
+    if (comm_ptr->shmem_comm)  { MPI_Comm_free(&(comm_ptr->shmem_comm));}
+    clear_2level_comm(comm_ptr);
+}
+
 void create_2level_comm (MPI_Comm comm, int size, int my_rank){
 
     MPID_Comm* comm_ptr;
@@ -60,7 +76,9 @@
     /* Creating leader group */
     int leader = 0;
     leader = shmem_group[0];
+    free(shmem_group);
 
+
     /* Gives the mapping to any process's leader in comm */
     comm_ptr->leader_map = malloc(sizeof(int) * size);
     if (NULL == comm_ptr->leader_map){
@@ -105,6 +123,8 @@
 
     MPI_Group_incl(comm_group, leader_group_size, leader_group, &subgroup1);
     MPI_Comm_create(comm, subgroup1, &(comm_ptr->leader_comm));
+
+    free(leader_group);
     MPID_Comm *leader_ptr;
     MPID_Comm_get_ptr( comm_ptr->leader_comm, leader_ptr );
     
@@ -142,12 +162,16 @@
     }
     else{
         comm_ptr->shmem_coll_ok = 0;
+	free_2level_comm(comm_ptr);
     }
 
     ++comm_count;
     MPIR_Nest_decr();
 }
 
+
+
+
 int check_comm_registry(MPI_Comm comm)
 {
     MPID_Comm* comm_ptr;
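
Read together with the first patch, the net effect of the r1118 changes is that
free_2level_comm releases everything create_2level_comm allocated, so the else
branch only needs the single free_2level_comm call. Ignoring the diff markers,
the two helpers as patched read roughly as follows (a reconstruction from the
hunks above; the surrounding declarations in create_2level_comm.c are assumed):

void clear_2level_comm (MPID_Comm* comm_ptr)
{
    /* Reset the bookkeeping fields so a later pass starts from a clean state. */
    comm_ptr->shmem_coll_ok = 0;
    comm_ptr->leader_map  = NULL;
    comm_ptr->leader_rank = NULL;
}

void free_2level_comm (MPID_Comm* comm_ptr)
{
    /* Release the leader-mapping arrays and the internal leader/shmem
     * communicators created for the 2-level collectives. */
    if (comm_ptr->leader_map)  { free(comm_ptr->leader_map);  }
    if (comm_ptr->leader_rank) { free(comm_ptr->leader_rank); }
    if (comm_ptr->leader_comm) { MPI_Comm_free(&(comm_ptr->leader_comm)); }
    if (comm_ptr->shmem_comm)  { MPI_Comm_free(&(comm_ptr->shmem_comm)); }
    clear_2level_comm(comm_ptr);
}

Keeping the MPI_Comm_free calls inside free_2level_comm (rather than adding
them at the call site, as the r1112 patch does) keeps the cleanup in one place.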

