[mvapich-discuss] mvapich2-0.9.8 blacs problems
amith rajith mamidala
mamidala at cse.ohio-state.edu
Thu Mar 22 18:09:45 EDT 2007
Hi Bas,
Can you please try out these two patches and see if the problem goes away?
I have tried this out with the scal.f test and things are fine.
Thanks,
Amith
On Thu, 22 Mar 2007, Bas van der Vlies wrote:
> Hello,
>
> We have written two simpler programs that do not use scalapack/blacs
> and that show the same behavior. See attachments.
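> 
> The core of duptest is roughly the following (a minimal sketch of what
> the attached program does, assuming it simply duplicates and frees a
> communicator in a loop; splittest does the same with MPI_Comm_split
> instead of MPI_Comm_dup):
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, size, i;
>     MPI_Comm newcomm;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>     if (rank == 0)
>         printf("Running with %d processes\nwill do 100000 dups and frees\n", size);
> 
>     for (i = 0; i < 100000; i++) {
>         /* each dup also creates the library's internal 2-level
>          * communicators, which must be released again on the free */
>         MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
>         /* splittest variant:
>          * MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm); */
>         MPI_Comm_free(&newcomm);
>         if (rank == 0 && i % 1000 == 0) {
>             printf(".");
>             fflush(stdout);
>         }
>     }
> 
>     if (rank == 0)
>         printf("done.\n");
> 
>     MPI_Finalize();
>     return 0;
> }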
>
> Here are the results:
> mvapich 0.9.8:       no problems
> mvapich 0.9.9 trunk: see below for errors
> mvapich2 0.9.8:      see below for errors
>
>
> Regards, and I hope this helps with diagnosing the problem.
>
> =====================================================================
> mvapich 0.9.9 trunk:
> duptest:
> ====================================================
> Running with 8 processes
> will do 100000 dups and frees
> ............................................0 - <NO ERROR MESSAGE> :
> Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> [0] [] Aborting Program!
> 4 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 2 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 6 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> mpirun_rsh: Abort signaled from [0]
> [4] [] Aborting Program!
> [2] [] Aborting Program!
> [6] [] Aborting Program!
> done.
> ====================================================
>
> splittest:
> ====================================================
> bas@ib-r21n1:~/src/applications$ mpirun -np 8 ./a.out
>
> Running with 8 processes
> will do 100000 splits and frees
> ......................................0 - <NO ERROR MESSAGE> : Pointer
> conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> [0] [] Aborting Program!
> 6 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 2 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> 4 - <NO ERROR MESSAGE> : Pointer conversions exhausted
> Too many MPI objects may have been passed to/from Fortran
> without being freed
> mpirun_rsh: Abort signaled from [0]
> [6] [] Aborting Program!
> [2] [] Aborting Program!
> [4] [] Aborting Program!
> done.
> ====================================================
>
>
> mvapich2 0.9.8:
>
> duptest:
> ====================================================
> bas@ib-r21n1:~/src/applications$ mpiexec -n $nprocs ./a.out
> Running with 8 processes
> will do 100000 dups and frees
> .Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f4d8a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=1, key=1,
> new_comm=0xb7f7a7bc) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=2, key=0,
> new_comm=0xb7f778a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=2, key=1,
> new_comm=0xb7ecf7bc) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=3, key=0,
> new_comm=0xb7f398a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=3, key=1,
> new_comm=0xb7f447bc) failed
> MPIR_Comm_create(90): Too many communicatorsrank 7 in job 1
> ib-r21n1.irc.sara.nl_8763 caused collective abort of all ranks
> exit status of rank 7: killed by signal 9
> rank 6 in job 1 ib-r21n1.irc.sara.nl_8763 caused collective abort of
> all ranks
> exit status of rank 6: killed by signal 9
> Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f708a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000001, color=0, key=1,
> new_comm=0xb7f4d7bc) failed
> MPIR_Comm_create(90): Too many communicatorsrank 5 in job 1
> ib-r21n1.irc.sara.nl_8763 caused collective abort of all ranks
> exit status of rank 5: return code 13
> rank 4 in job 1 ib-r21n1.irc.sara.nl_8763 caused collective abort of
> all ranks
> exit status of rank 4: killed by signal 9
> ====================================================
>
> splittest:
> ====================================================
> Running with 8 processes
> will do 100000 splits and frees
> .Fatal error in MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f2b8a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=0, key=0,
> new_comm=0xb7f258a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f168a4) failed
> MPIR_Comm_create(90): Too many communicatorsFatal error in
> MPI_Comm_split: Other MPI error, error stack:
> MPI_Comm_split(290).: MPI_Comm_split(comm=0x84000002, color=1, key=0,
> new_comm=0xb7f328a4) failed
> MPIR_Comm_create(90): Too many communicatorsrank 2 in job 3
> ib-r21n1.irc.sara.nl_8763 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> rank 1 in job 3 ib-r21n1.irc.sara.nl_8763 caused collective abort of
> all ranks
> exit status of rank 1: killed by signal 9
> rank 0 in job 3 ib-r21n1.irc.sara.nl_8763 caused collective abort of
> all ranks
> exit status of rank 0: killed by signal 9
> ====================================================
> --
> ********************************************************************
> *                                                                  *
> *  Bas van der Vlies                       e-mail: basv at sara.nl *
> *  SARA - Academic Computing Services      phone:  +31 20 592 8012 *
> *  Kruislaan 415                           fax:    +31 20 6683167  *
> *  1098 SJ Amsterdam                                               *
> *                                                                  *
> ********************************************************************
>
-------------- patch 1: create_2level_comm.c (revision 1112) --------------
Index: create_2level_comm.c
===================================================================
--- create_2level_comm.c (revision 1112)
+++ create_2level_comm.c (working copy)
@@ -164,7 +164,9 @@
     }
     else{
         comm_ptr->shmem_coll_ok = 0;
-        free_2level_comm(comm_ptr);
+        free_2level_comm(comm_ptr);
+        if (comm_ptr->leader_comm) { MPI_Comm_free(&(comm_ptr->leader_comm));}
+        if (comm_ptr->shmem_comm) { MPI_Comm_free(&(comm_ptr->shmem_comm));}
     }
 
     ++comm_count;
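
This first patch (against revision 1112) releases the internal leader and
shared-memory communicators on the failure path in addition to calling
free_2level_comm(), which at that revision presumably frees only the
leader_map/leader_rank arrays. Because MPI_Comm_free() resets the freed
handle to MPI_COMM_NULL, the guards also keep a later cleanup from freeing
the same communicator twice.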
-------------- patch 2: create_2level_comm.c (revision 1118) --------------
Index: create_2level_comm.c
===================================================================
--- create_2level_comm.c (revision 1118)
+++ create_2level_comm.c (working copy)
@@ -26,6 +26,22 @@
 extern shmem_coll_region *shmem_coll;
 static pthread_mutex_t shmem_coll_lock = PTHREAD_MUTEX_INITIALIZER;
 
+void clear_2level_comm (MPID_Comm* comm_ptr)
+{
+    comm_ptr->shmem_coll_ok = 0;
+    comm_ptr->leader_map = NULL;
+    comm_ptr->leader_rank = NULL;
+}
+
+void free_2level_comm (MPID_Comm* comm_ptr)
+{
+    if (comm_ptr->leader_map) { free(comm_ptr->leader_map); }
+    if (comm_ptr->leader_rank) { free(comm_ptr->leader_rank); }
+    if (comm_ptr->leader_comm) { MPI_Comm_free(&(comm_ptr->leader_comm));}
+    if (comm_ptr->shmem_comm) { MPI_Comm_free(&(comm_ptr->shmem_comm));}
+    clear_2level_comm(comm_ptr);
+}
+
 void create_2level_comm (MPI_Comm comm, int size, int my_rank){
 
     MPID_Comm* comm_ptr;
@@ -60,7 +76,9 @@
     /* Creating leader group */
     int leader = 0;
     leader = shmem_group[0];
+    free(shmem_group);
+
 
     /* Gives the mapping to any process's leader in comm */
     comm_ptr->leader_map = malloc(sizeof(int) * size);
     if (NULL == comm_ptr->leader_map){
@@ -105,6 +123,8 @@
     MPI_Group_incl(comm_group, leader_group_size, leader_group, &subgroup1);
     MPI_Comm_create(comm, subgroup1, &(comm_ptr->leader_comm));
+
+    free(leader_group);
 
     MPID_Comm *leader_ptr;
     MPID_Comm_get_ptr( comm_ptr->leader_comm, leader_ptr );
 
@@ -142,12 +162,16 @@
     }
     else{
         comm_ptr->shmem_coll_ok = 0;
+        free_2level_comm(comm_ptr);
     }
 
     ++comm_count;
     MPIR_Nest_decr();
 }
 
+
+
+
 int check_comm_registry(MPI_Comm comm)
 {
     MPID_Comm* comm_ptr;
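
The second patch (against the trunk, revision 1118) factors the cleanup into
clear_2level_comm() and free_2level_comm(), frees the temporary shmem_group
and leader_group arrays that were previously leaked on every communicator
creation, and invokes free_2level_comm() when shared-memory collectives
cannot be set up. Note the free-and-null pattern: clear_2level_comm() resets
the pointers after they are freed, so calling free_2level_comm() twice on
the same MPID_Comm is a harmless no-op rather than a double free.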