[mvapich-discuss] Help with polled desc error

Mike Colonno Mike.Colonno at spacex.com
Wed Jan 30 17:10:50 EST 2008


	We have experienced this as well, and it appears to be related to Intel's compilers. One of our applications, running on just n = 12 processes, produced:

rank 11 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 11: killed by signal 9
rank 6 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9
rank 5 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 5: killed by signal 11
rank 4 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9

	But the same job on n <= 8 processes runs fine. This happens for at least 3 different MPI codes, so it appears to be a function of the MVAPICH2 build and / or of compiling MVAPICH2 applications with Intel's compilers. I ran all of the ibv_* tests, which appear to return nominal output. The job above ran on just 3 machines with 4 processes on each (2x quad-core Xeons, x64, Red Hat Enterprise Linux 4.5). I also built MVAPICH2 1.0.1 using the Intel C++ / Fortran compilers, version 10.1. Is there any way to generate more detailed debug info to see exactly where these processes run into trouble?
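
	As a first triage step, the abort output above can be grouped by signal. The following is only a minimal sketch: the names LOG, STATUS_RE and ranks_by_signal are hypothetical helpers, not part of MVAPICH2 or its process manager, and the signal names assume Linux numbering. The rank that died with signal 11 (a segmentation fault) is probably the original failure, while the ranks that died with signal 9 were most likely killed during the collective abort.

```python
import re
from collections import defaultdict

# Abort output copied from the message above.
LOG = """\
rank 11 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 11: killed by signal 9
rank 6 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9
rank 5 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 5: killed by signal 11
rank 4 in job 3  node1_32811   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9
"""

# Matches the "exit status" lines and captures (rank, signal number).
STATUS_RE = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")

# Assumption: Linux signal numbering (9 = SIGKILL, 11 = SIGSEGV).
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV"}


def ranks_by_signal(text):
    """Map each signal number to the sorted list of ranks killed by it."""
    grouped = defaultdict(list)
    for rank, sig in STATUS_RE.findall(text):
        grouped[int(sig)].append(int(rank))
    return {sig: sorted(ranks) for sig, ranks in grouped.items()}


summary = ranks_by_signal(LOG)
for sig, ranks in sorted(summary.items()):
    name = SIGNAL_NAMES.get(sig, "unknown")
    print(f"signal {sig} ({name}): ranks {ranks}")
```

	Here the only rank reporting signal 11 is rank 5, so that process would be the natural place to start looking, for example with a core dump or a debugger attached to it.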

	Thanks, 

Michael R. Colonno, Ph.D. | Chief Aerodynamic Engineer
Space Exploration Technologies
1 Rocket Road 
Hawthorne, CA 90250
W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com 

-- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --



-----Original Message-----
From: mvapich-discuss-bounces at cse.ohio-state.edu [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of wei huang
Sent: Wednesday, January 30, 2008 11:25 AM
To: Scott A. Friedman
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Help with polled desc error

Hi Scott,

Thanks for letting us know about the problem. From your description, however, it
looks like there is a problem with your system setup. The program has
not yet passed the initialization phase.

Would you please verify that your system setup is correct by running the
IB-level ibv_* benchmarks? Those benchmarks are standard components of an OFED
installation and should already be available on your systems.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Tue, 29 Jan 2008, Scott A. Friedman wrote:

> Hi
>
> We have found applications crashing with the following error:
>
> [113] Abort: Error code in polled desc!
>   at line 1229 in file rdma_iba_priv.c
> rank 113 in job 1  n90_57923   caused collective abort of all ranks
>    exit status of rank 113: killed by signal 9
>
> Have not been able to find anything useful on this on the web. Hopefully
> someone here can shed some light on it.
>
> Using mvapich2-1.0.1
>
> Would have to check on the exact build number but it is from the last
> month or so.
>
> Thanks
> Scott Friedman
> UCLA
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
