[mvapich-discuss] Problem with more number of processes
jasjit singh
singh.jasjit at yahoo.co.in
Tue Sep 11 10:39:37 EDT 2007
Hi,
I have been running MVAPICH 1.0 over the OFED 1.2 uDAPL interface on four nodes.
I ran 64 processes, i.e. 16 processes per node, and the job ran fine.
After increasing the number of processes further, I started getting errors. Here are the final lines of the output from running 68 processes on 4 nodes, i.e. 17 processes per node:
[rdma_udapl_init.c:1875] error(-2147024846): Could not reset ep
rank 58 in job 12 in05_33381 caused collective abort of all ranks
exit status of rank 58: return code 1
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
rank 57 in job 12 in05_33381 caused collective abort of all ranks
exit status of rank 57: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
Here is the corresponding output from running 200 processes, i.e. 50 processes per node:
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
rank 158 in job 16 in05_36664 caused collective abort of all ranks
exit status of rank 158: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
rank 86 in job 16 in05_36664 caused collective abort of all ranks
exit status of rank 86: killed by signal 9
rank 46 in job 16 in05_36664 caused collective abort of all ranks
exit status of rank 46: killed by signal 9
rank 6 in job 16 in05_36664 caused collective abort of all ranks
exit status of rank 6: killed by signal 9
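The negative numbers in the "Could not reset ep" lines above are easier to compare in hexadecimal. Below is a minimal sketch, not taken from MVAPICH or DAPL, that reinterprets each value as an unsigned 32-bit word and splits it into fields. The field layout (high bit as an error flag, a type field in the upper half, a subtype in the low 16 bits) is an assumption about how such status words are packed, and `decode_status` is a hypothetical helper name.

```python
def decode_status(code: int) -> dict:
    """Split a signed 32-bit status value into hex fields.

    The masks below are an ASSUMED layout, not confirmed by the log output.
    """
    raw = code & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    return {
        "hex": f"0x{raw:08X}",
        "error_bit": bool(raw & 0x80000000),  # assumed: high bit marks an error
        "type": (raw & 0x3FFF0000) >> 16,     # assumed: type field
        "subtype": raw & 0x0000FFFF,          # assumed: subtype field
    }


# The two distinct codes that appear in the logs above.
for code in (-2147024846, -2147024849):
    print(code, decode_status(code))
```

Under these assumptions the two codes are 0x80070032 and 0x8007002F: they share the same upper half and differ only in the low 16 bits, which can help when looking the values up in the uDAPL headers installed on the system.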
Could anybody please tell me:
Why does increasing the number of processes cause these failures?
Is there a limit affecting this run that needs to be changed?
How can I run a larger number of processes successfully?
Thanks,
Jasjit Singh