[mvapich-discuss] rdma_iba_priv.c + Error posting recv

wei huang huanwei at cse.ohio-state.edu
Mon Nov 6 09:25:37 EST 2006


Hi Vishwas,

Since your cluster has 260 cores, I am not really sure why you want to
start a job with -np greater than that number. Given the CPU-intensive
nature of high-performance applications, in most cases you don't want to
have more processes running than you have cores. Also, an MPI library
will typically use polling, which may not perform well when there are
more processes than cores.
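
As a rough illustration of the core-count arithmetic, here is a minimal
Python sketch. The function names are hypothetical and not part of
MVAPICH; the node, processor, and core counts are taken from your
description of the cluster.

```python
def total_cores(nodes, processors_per_node, cores_per_processor):
    """Return the number of physical cores in a homogeneous cluster."""
    return nodes * processors_per_node * cores_per_processor


def is_oversubscribed(num_procs, cores):
    """Return True when more MPI processes are requested than cores exist."""
    return num_procs > cores


# 65 nodes (64 compute + 1 master), each dual-processor, dual-core.
cores = total_cores(65, 2, 2)

# Compare a few candidate -np values against the core count.
for num_procs in (260, 300, 320):
    status = "oversubscribed" if is_oversubscribed(num_procs, cores) else "ok"
    print(f"-np {num_procs}: {status} ({cores} cores available)")
```

Any -np value above the core count means at least some cores host more
than one polling process.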

However, your issue may not be directly related to running more
processes than the number of cores. Would you please confirm that you are
using vapi instead of gen2? We would like to take a look at it.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Mon, 6 Nov 2006, Vishwas Vasisht wrote:

> Hi,
>
> I have a 65-node Opteron cluster, with a total of 260 cores (64 nodes + 1 master, each dual-processor, dual-core).
> I was trying to submit a job (cpi, jobfarming..), using -np greater than 260. It worked up to -np 300, but above 300 I get these errors several times.
>
> --------------------------------------------------------------------------
> [rdma_iba_priv.c:406] error(-236): Error posting recv!
> rank 12 in job 7  masternode_33851   caused collective abort of all ranks
>   exit status of rank 12: killed by signal 9
> --------------------------------------------------------------------------
>
> Can you please help me sort this out?
>
> Regards
> Vishwas
>


