[mvapich-discuss] mvapich2-0.9.8 blacs problems
Bas van der Vlies
basv at sara.nl
Tue Mar 20 03:42:45 EDT 2007
amith rajith mamidala wrote:
> Hi Bas,
>
Hi Amith,
> Can you please apply the one-line patches below and let us know the
> outcome? I have tried a couple of cases and the patch is working fine.
> Also, can you let us know the nature of this application (scal.f)?
> It seems to be using several hundred MPI_Comm_split operations. Is this
> the typical application pattern?
>
We have some users who do a lot of matrix calculations in jobs that run
for a long time, and these programs hang. So we made a small example
ScaLAPACK program that behaves the same way; MPI_Comm_split is called by
the BLACS library.
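(scal.f itself is not attached to this mail, so as a sketch only: the
loop it runs is roughly the following C analogue, assuming the standard
C BLACS interface (Cblacs_pinfo, Cblacs_get, Cblacs_gridinit,
Cblacs_gridexit). The grid-shape heuristic and the iteration count are
guesses chosen to match the nprow/npcol values and the '100 16 100'
input shown elsewhere in this thread.)
{{{
/* Minimal C analogue of scal.f's structure (the real scal.f was not
 * posted).  Assumes the standard C BLACS interface; each
 * Cblacs_gridinit() creates row/column communicators internally via
 * MPI_Comm_split, which is where the library's context IDs go.
 * Build line is an assumption, mirroring the compile line quoted below:
 *   mpicc -o gridloop gridloop.c -lcblacs -lblacs -lm */
#include <math.h>
#include <stdio.h>
#include <mpi.h>

/* C interface to BLACS, normally provided by the cblacs library */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridexit(int ctxt);

int main(int argc, char **argv)
{
    int iam, nprocs, ctxt, iter, nprow, npcol;

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&iam, &nprocs);

    /* Most-square process grid that fits nprocs: 2x4 for 8 procs,
     * 1x7 for 7, matching the nprow/npcol printed in the logs below. */
    for (nprow = (int) sqrt((double) nprocs); nprocs % nprow; nprow--)
        ;
    npcol = nprocs / nprow;

    /* 100 iterations, as in the "echo '100 16 100'" example below. */
    for (iter = 1; iter <= 100; iter++) {
        Cblacs_get(-1, 0, &ctxt);        /* default system context */
        Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
        /* ... the ScaLAPACK solve would run on this grid here ... */
        Cblacs_gridexit(ctxt);           /* release the grid again */
        if (iam == 0)
            printf("iteration %d done\n", iter);
    }

    MPI_Finalize();
    return 0;
}
}}}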
I have applied the patch for mvapich2-0.9.8 and the program no longer
hangs, but it still gives errors. When the error occurs, and what kind of
error it is, depends on the size of the matrix.
The outcome on 4 nodes with 2 CPUs each (number of processes : run at
which the errors appear):
1 : 63
2 : 63
3 : 63
4 : 16
5 : 63
6 : 16
7 : 63
8 : 16
after 16 runs (8 procs):
{{{
loop, n, mb, nprocs, nprow, npcol: 16 100 16 8 2 4
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec,
new_comm=0xbf9b2d64) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec,
new_comm=0xbfd450f4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c,
new_comm=0xbfdf39a4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c,
new_comm=0xbfc38ff4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec,
new_comm=0xbfcfd0b4) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c,
new_comm=0xbfcd6084) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000012, group=0xc80300ec,
new_comm=0xbff62b14) failed
MPI_Comm_create(143): Too many communicatorsFatal error in
MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(266): MPI_Comm_create(comm=0xc4000002, group=0xc803006c,
new_comm=0xbfb2fee4) failed
MPI_Comm_create(143): Too many communicatorsrank 7 in job 8
ib-r6n20.irc.sara.nl_11382 caused collective abort of all ranks
exit status of rank 7: killed by signal 9
rank 6 in job 8 ib-r6n20.irc.sara.nl_11382 caused collective abort of
all ranks
exit status of rank 6: killed by signal 9
rank 5 in job 8 ib-r6n20.irc.sara.nl_11382 caused collective abort of
all ranks
exit status of rank 5: killed by signal 9
rank 4 in job 8 ib-r6n20.irc.sara.nl_11382 caused collective abort of
all ranks
exit status of rank 4: killed by signal 9
end 8
}}}
after 63 runs (7 procs):
{{{
loop, n, mb, nprocs, nprow, npcol: 63 100 16 7 1 7
Fatal error in MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=0, key=0,
new_comm=0x13535864) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=2, key=0,
new_comm=0x13533734) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=2, key=1,
new_comm=0x134fea44) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=1, key=1,
new_comm=0x134fe404) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=3, key=0,
new_comm=0x135f3de4) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc40300f4, color=1, key=0,
new_comm=0x13532184) failed
MPIR_Comm_create(90): Too many communicatorsFatal error in
MPI_Comm_split: Other MPI error, error stack:
MPI_Comm_split(290).: MPI_Comm_split(comm=0xc401003c, color=0, key=1,
new_comm=0x134fe404) failed
MPIR_Comm_create(90): Too many communicatorsrank 6 in job 7
ib-r6n20.irc.sara.nl_11382 caused collective abort of all ranks
exit status of rank 6: killed by signal 9
rank 5 in job 7 ib-r6n20.irc.sara.nl_11382 caused collective abort of
all ranks
exit status of rank 5: killed by signal 9
rank 4 in job 7 ib-r6n20.irc.sara.nl_11382 caused collective abort of
all ranks
exit status of rank 4: killed by signal 9
end 7
}}}
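For reference: "Too many communicators" means the MPI library has run out
of communicator context IDs, of which MPICH2-derived implementations have
only a fixed number. A hypothetical diagnostic along these lines (not the
program from this report) counts how many communicators a given build
allows before hitting the limit, which makes it easy to compare patched
and unpatched libraries:
{{{
/* Hypothetical diagnostic, not part of the original report: duplicate
 * MPI_COMM_WORLD, never freeing, until the library refuses, to measure
 * how many communicator context IDs a build makes available. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, count = 0;
    MPI_Comm c;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting, so the first failure can be
     * counted rather than killing the job (MPI-2 name; the MPI-1
     * equivalent is MPI_Errhandler_set). */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    while (MPI_Comm_dup(MPI_COMM_WORLD, &c) == MPI_SUCCESS)
        count++;

    if (rank == 0)
        printf("created %d communicators before \"Too many communicators\"\n",
               count);

    MPI_Finalize();
    return 0;
}
}}}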
> For mvapich-0.9.9-beta:
>
> Index: create_2level_comm.c (In $HOME/src/context)
> ===================================================================
> --- create_2level_comm.c (revision 1102)
> +++ create_2level_comm.c (working copy)
> @@ -56,7 +56,6 @@
> struct MPIR_COMMUNICATOR* comm_world_ptr;
> comm_world_ptr = MPIR_GET_COMM_PTR(MPI_COMM_WORLD);
>
> - if (comm_count > MAX_ALLOWED_COMM) return;
>
> int* shmem_group = malloc(sizeof(int) * size);
> if (NULL == shmem_group){
>
>
>
> For mvapich2-0.9.8:
>
> Index: create_2level_comm.c (In $HOME/src/mpi/comm)
> ===================================================================
> --- create_2level_comm.c (revision 1104)
> +++ create_2level_comm.c (working copy)
> @@ -33,7 +33,6 @@
> MPID_Comm_get_ptr( comm, comm_ptr );
> MPID_Comm_get_ptr( MPI_COMM_WORLD, comm_world_ptr );
>
> - if (comm_count > MAX_ALLOWED_COMM) return;
>
> MPIR_Nest_incr();
>
>
>
> Thanks,
> Amith
>
>
> On Mon, 19 Mar 2007, Bas van der Vlies wrote:
>
>> Dhabaleswar Panda wrote:
>>> Hi Bas,
>>>
>>>> We have done some further testing with these mvapich versions:
>>>> * 0.9.8: everything works
>>>>
>>>> * 0.9.9-beta: it is very slow and it also hangs, like mvapich2
>>> Thanks for reporting this. Just to check: are you using the latest
>>> mvapich 0.9.9 from the trunk, or the beta tarball released on 02/09/07?
>>> A lot of fixes and tunings have gone in since the beta version was
>>> released. You can get the latest version of the trunk through an SVN
>>> checkout or by downloading the nightly tarballs of the trunk.
>>>
>> I downloaded the tarball. We will test the latest trunk version.
>>
>> Regards
>>
>>> Best Regards,
>>>
>>> DK
>>>
>>>> Regards
>>>>
>>>>> Thanks.
>>>>>
>>>>> Regards,
>>>>> Wei Huang
>>>>>
>>>>> 774 Dreese Lab, 2015 Neil Ave,
>>>>> Dept. of Computer Science and Engineering
>>>>> Ohio State University
>>>>> OH 43210
>>>>> Tel: (614)292-8501
>>>>>
>>>>>
>>>>> On Mon, 19 Mar 2007, Bas van der Vlies wrote:
>>>>>
>>>>>> wei huang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for letting us know about the problem. We have generated a patch
>>>>>>> to address this problem, and have applied it to both the trunk and our
>>>>>>> svn 0.9.8 branch.
>>>>>>>
>>>>>>>
>>>>>> We have done some more tests and found another problem using mvapich2
>>>>>> and BLACS. These problems are encountered by user programs: we get
>>>>>> reports from our users that they get wrong answers from their programs.
>>>>>>
>>>>>> We have made a small Fortran (g77) program to illustrate the problem.
>>>>>> It calls the same ScaLAPACK routine a number of times. Independent of
>>>>>> the size of the problem, the program hangs after 8 or 31 iterations,
>>>>>> except when the number of processes is a square, e.g. 1x1, 2x2, ...
>>>>>>
>>>>>> How to compile the program:
>>>>>> mpif77 -Wall -g -O0 -o scal scal.f -lscalapack -lfblacs -lcblacs -lblacs
>>>>>> -llapack -latlas
>>>>>>
>>>>>> The program expects the following on standard input:
>>>>>> <size of matrix> <block size> <number of iterations>
>>>>>>
>>>>>> for example:
>>>>>> echo '100 16 100' | mpiexec -n <np> ./scal
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> PS) This program behaves correctly with the Topspin/Cisco software,
>>>>>> which is based on their InfiniBand stack and on an mvapich1 version.
>>>>>>
>>>>>> We are going to test the program with mvapich1 from OSU.
--
********************************************************************
* *
* Bas van der Vlies e-mail: basv at sara.nl *
* SARA - Academic Computing Services phone: +31 20 592 8012 *
* Kruislaan 415 fax: +31 20 6683167 *
* 1098 SJ Amsterdam *
* *
********************************************************************