[mvapich-discuss] Problem with more number of processes

jasjit singh singh.jasjit at yahoo.co.in
Tue Sep 11 10:39:37 EDT 2007


Hi

I have been running MVAPICH 1.0 over the OFED 1.2 uDAPL interface on four nodes.
I ran 64 processes, which works out to 16 processes per node, and the job ran fine.

However, after increasing the number of processes further, I started getting errors. Here are the final lines of the output I got when I ran 68 processes on 4 nodes, i.e. 17 processes per node:

[rdma_udapl_init.c:1875] error(-2147024846): Could not reset ep
rank 58 in job 12  in05_33381   caused collective abort of all ranks
  exit status of rank 58: return code 1
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
rank 57 in job 12  in05_33381   caused collective abort of all ranks
  exit status of rank 57: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep


And here is the corresponding output when I ran 200 processes, i.e. 50 processes per node:

hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
rank 158 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 158: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
rank 86 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 86: killed by signal 9
rank 46 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 46: killed by signal 9
rank 6 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9
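
The negative numbers in the "Could not reset ep" lines are easier to look up in hexadecimal. The following is a minimal sketch, assuming the values are 32-bit signed status codes printed in decimal; the helper name `to_hex32` is hypothetical and not part of MVAPICH or uDAPL. It only converts the bit pattern. What each hexadecimal value means would still have to be checked against the return-code definitions in the uDAPL headers of the installed OFED release.

```python
def to_hex32(code: int) -> str:
    """Return the 32-bit two's-complement bit pattern of a signed int as hex.

    Assumption: the logged value is a 32-bit signed status code.
    """
    # Masking with 0xFFFFFFFF maps a negative Python int onto its
    # unsigned 32-bit two's-complement representation.
    return f"0x{code & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    # The two distinct error values that appear in the logs above.
    for code in (-2147024846, -2147024849):
        print(code, "->", to_hex32(code))
```

The two codes differ only in their lowest byte, so they may be variants of the same class of error, but that reading is an assumption until it is confirmed against the headers.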


Could anybody please tell me:

Why does increasing the number of processes result in this erratic behaviour?
Is some limit affecting this run that needs to be changed?
How can I get a larger number of processes to run successfully?

Thanks,
Jasjit Singh
       