[mvapich-discuss] rdma_iba_priv.c + Error posting recv
wei huang
huanwei at cse.ohio-state.edu
Mon Nov 6 09:25:37 EST 2006
Hi Vishwas,
Since your cluster has 260 cores, I am not sure why you want to start a
job with -np larger than that number. Given the CPU-intensive nature of
high-performance applications, in most cases you do not want to have more
processes running than you have cores. Also, MPI libraries typically use
polling, which may not perform well when there are more processes than
cores.
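As a rough illustration of the oversubscription point, here is a minimal sketch. It assumes that every polling process always wants a full core and that the scheduler divides cores evenly; the function name cpu_share_per_process is made up for this example and is not part of MVAPICH or any MPI library.

```python
def cpu_share_per_process(num_procs: int, num_cores: int) -> float:
    """Estimate the fraction of one core each busy-polling process gets.

    Assumption (not from MVAPICH): processes never block, and the OS
    scheduler shares the available cores evenly among them.
    """
    if num_procs <= 0 or num_cores <= 0:
        raise ValueError("num_procs and num_cores must be positive")
    # A process cannot use more than one full core, hence the cap at 1.0.
    return min(1.0, num_cores / num_procs)


if __name__ == "__main__":
    cores = 260  # total cores in the cluster described in this thread
    for np_requested in (260, 300, 520):
        share = cpu_share_per_process(np_requested, cores)
        print(f"-np {np_requested}: about {share:.2f} of a core per process")
```

Under these assumptions, any -np above 260 leaves each process with less than a full core, and much of that reduced share is spent polling rather than computing.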
However, your issue may not be directly related to running more
processes than cores. Would you please confirm that you are using vapi
instead of gen2? Then we can take a look at it.
Thanks.
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Mon, 6 Nov 2006, Vishwas Vasisht wrote:
> Hi,
>
> I have a 65-node Opteron cluster, with a total of 260 cores (64 nodes + 1 master, each dual-processor, dual-core).
> I was trying to submit a job (cpi, jobfarming..), using -np greater than 260. It worked up to -np 300, but above 300 I get these errors several times.
>
> --------------------------------------------------------------------------
> [rdma_iba_priv.c:406] error(-236): Error posting recv!
> rank 12 in job 7 masternode_33851 caused collective abort of all ranks
> exit status of rank 12: killed by signal 9
> --------------------------------------------------------------------------
>
> Can you please help me sort this out?
>
> Regards
> Vishwas
>