[mvapich-discuss] Problem with more number of processes

LEI CHAI chai.15 at osu.edu
Tue Sep 11 15:11:23 EDT 2007


Hi Jasjit,

May we ask how many processors you have on each node? 16 or 50 processes per node seems large :-)

If you are running more processes than processors (oversubscription), MVAPICH needs "blocking" support to get good performance. We have that for the OFED-gen2 layer, but not yet for the uDAPL layer, so we suggest you do not run in oversubscription mode.
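As a rough way to check for oversubscription before launching, the sketch below compares ranks per node against cores per node. The function name `is_oversubscribed` is hypothetical, and the figure of 16 cores per node is only an assumption for illustration, since the actual core count has not been reported in this thread.

```shell
#!/bin/sh
# Succeeds (exit 0) when ranks per node exceed cores per node.
# All three arguments are supplied by the caller; nothing is probed here.
is_oversubscribed() {
    total_ranks=$1
    nodes=$2
    cores_per_node=$3
    # Round up so that an uneven split counts the fullest node.
    ranks_per_node=$(( (total_ranks + nodes - 1) / nodes ))
    [ "$ranks_per_node" -gt "$cores_per_node" ]
}

# Hypothetical example: the 68-rank run on 4 nodes, assuming 16 cores each.
if is_oversubscribed 68 4 16; then
    echo "oversubscribed"
else
    echo "not oversubscribed"
fi
```

On Linux, `nproc` run on a compute node reports the number of available processing units, which could be passed as the third argument.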

Finally, could you try increasing the on-demand connection mode threshold:

$ mpiexec -n 64 -env MV2_ON_DEMAND_THRESHOLD 1024 ./a.out
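The command above uses 64 processes, while the failing runs reported in this thread used 68 and 200. A minimal sketch that prints the same command line for those sizes, without launching anything, might look like this. The helper name `build_cmd` is hypothetical; the `mpiexec` options and the value 1024 are copied from the command above.

```shell
#!/bin/sh
# Threshold value taken from the suggestion above.
threshold=1024

# Print (not run) the mpiexec command line for a given process count.
build_cmd() {
    printf 'mpiexec -n %s -env MV2_ON_DEMAND_THRESHOLD %s ./a.out\n' "$1" "$threshold"
}

# Process counts of the two failing runs described in this thread.
for n in 68 200; do
    build_cmd "$n"
done
```

Printing the command first makes it easy to confirm the process count and threshold before starting a large job.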

Thanks,
Lei


Hi

I have been running MVAPICH 1.0 over the OFED 1.2 uDAPL interface on four nodes.
I ran 64 processes, which came out to 16 processes per node, and it ran fine.

But after increasing the number of processes further, I started getting errors. Here are some of the final lines of the output I got when I ran 68 processes on 4 nodes, i.e. 17 processes per node:

[rdma_udapl_init.c:1875] error(-2147024846): Could not reset ep
rank 58 in job 12  in05_33381   caused collective abort of all ranks
  exit status of rank 58: return code 1
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
[rdma_udapl_init.c:1875]  error(-2147024849): Could not reset ep
rank 57 in job 12  in05_33381   caused collective abort of all ranks
  exit status of rank 57: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
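The error numbers in the log above are printed as signed 32-bit decimals. Assuming they are DAT return codes, which the DAT headers usually define in hexadecimal, a small sketch like the following converts them into a form that is easier to look up. The helper name `to_hex32` is hypothetical, and no interpretation of the resulting values is implied here.

```shell
#!/bin/sh
# Convert a signed 32-bit return code to its unsigned hexadecimal form.
to_hex32() {
    # Mask to the low 32 bits so a negative value is not sign-extended.
    printf '0x%08X\n' "$(( $1 & 0xFFFFFFFF ))"
}

to_hex32 -2147024846   # prints 0x80070032
to_hex32 -2147024849   # prints 0x8007002F
```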


And here is the same when I ran 200 processes, i.e. 50 processes per node:

hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.
rank 158 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 158: killed by signal 9
[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep
rank 86 in job 16  in05_36664   caused collective  abort of all ranks
  exit status of rank 86: killed by signal 9
rank 46 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 46: killed by signal 9
rank 6 in job 16  in05_36664   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9


Could anybody please tell me:

Why does increasing the number of processes result in such erratic behaviour?
Is there a limit affecting this run that needs to be changed?
What is the solution for running a larger number of processes successfully?

thanks,
Jasjit Singh
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss