[mvapich-discuss] Help with polled desc error

Scott A. Friedman friedman at ats.ucla.edu
Wed Jan 30 17:15:40 EST 2008


It does not appear to be specifically related to the Intel compiler - we 
rebuilt everything using gcc last night and saw the same problem.

This is a new cluster, so I am double-checking that things are set up 
properly - they appear to be so far - but we see this problem on a couple 
of other IB clusters here at UCLA as well.

We are running mvapich2 1.0.1 with OFED 1.2.5.4 on CentOS 5.x, just so we 
are all on the same page.

Scott

Mike Colonno wrote:
> 	We have experienced this as well, and it appears to be a function of Intel's compilers. This application running on just n = 12 produced:
> 
> rank 11 in job 3  node1_32811   caused collective abort of all ranks
>   exit status of rank 11: killed by signal 9
> rank 6 in job 3  node1_32811   caused collective abort of all ranks
>   exit status of rank 6: killed by signal 9
> rank 5 in job 3  node1_32811   caused collective abort of all ranks
>   exit status of rank 5: killed by signal 11
> rank 4 in job 3  node1_32811   caused collective abort of all ranks
>   exit status of rank 4: killed by signal 9
> 
> 	But the same job on n <= 8 nodes runs fine. This happens for at least 3 different MPI codes, so it must be a function of the MVAPICH2 build and/or of compiling MVAPICH2 applications with Intel's compilers. I ran all of the ibv_* tests, which appear to return nominal output. The job above ran on just 3 different machines with 4 processes on each machine (2x quad-core Xeons, x64, Red Hat Enterprise 4.5). I also built MVAPICH2 1.0.1 using Intel C++ / Fortran compilers, version 10.1. Is there any way to generate more detailed debug info to see exactly where these processes run into trouble?
> 
> 	Thanks, 
> 
> Michael R. Colonno, Ph.D. | Chief Aerodynamic Engineer
> Space Exploration Technologies
> 1 Rocket Road 
> Hawthorne, CA 90250
> W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com 
> 
> -- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --
> 
> 
> 
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of wei huang
> Sent: Wednesday, January 30, 2008 11:25 AM
> To: Scott A. Friedman
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help with polled desc error
> 
> Hi Scott,
> 
> Thanks for letting us know about the problem. From your description, however,
> it looks like there is a problem with your system setup. The program has not
> passed the initialization phase yet.
> 
> Would you please verify that your system setup is correct by running the
> IB-level ibv_* benchmarks? Those benchmarks are standard components of the
> OFED installation and should already be available on your systems.
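
(As a complement to the ibv_* command-line benchmarks Wei mentions, the same
layer can also be sanity-checked programmatically through libibverbs. The
sketch below is a hypothetical helper, not part of OFED or MVAPICH2: it simply
lists each HCA and reports whether its ports are ACTIVE, similar in spirit to
ibv_devinfo. Build with something like
"gcc -o check_ib_ports check_ib_ports.c -libverbs".)

/* check_ib_ports.c - hypothetical sanity check (not part of OFED or MVAPICH2):
 * list each HCA and report whether its ports are ACTIVE. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list;
    int num_devices = 0;
    int i, port;

    dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no IB devices found - check the OFED/driver install\n");
        return 1;
    }

    for (i = 0; i < num_devices; ++i) {
        struct ibv_context *ctx = ibv_open_device(dev_list[i]);
        struct ibv_device_attr dev_attr;

        if (!ctx) {
            fprintf(stderr, "could not open %s\n",
                    ibv_get_device_name(dev_list[i]));
            continue;
        }
        if (ibv_query_device(ctx, &dev_attr)) {
            fprintf(stderr, "could not query %s\n",
                    ibv_get_device_name(dev_list[i]));
            ibv_close_device(ctx);
            continue;
        }

        /* HCA ports are numbered starting from 1. */
        for (port = 1; port <= dev_attr.phys_port_cnt; ++port) {
            struct ibv_port_attr port_attr;

            if (ibv_query_port(ctx, port, &port_attr))
                continue;
            printf("%s port %d: state %s, lid %u\n",
                   ibv_get_device_name(dev_list[i]), port,
                   port_attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "NOT ACTIVE",
                   (unsigned) port_attr.lid);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(dev_list);
    return 0;
}

If any port comes back as not ACTIVE, or no devices are found at all, MPI jobs
would be expected to fail before or during initialization, which matches the
symptom described above.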
> 
> Thanks.
> 
> Regards,
> Wei Huang
> 
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
> 
> 
> On Tue, 29 Jan 2008, Scott A. Friedman wrote:
> 
>> Hi
>>
>> We have found applications crashing with the following error:
>>
>> [113] Abort: Error code in polled desc!
>>   at line 1229 in file rdma_iba_priv.c
>> rank 113 in job 1  n90_57923   caused collective abort of all ranks
>>    exit status of rank 113: killed by signal 9
>>
>> We have not been able to find anything useful about this on the web.
>> Hopefully someone here can shed some light on it.
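
(For what the message itself is reporting: the abort at line 1229 of
rdma_iba_priv.c appears to fire when a work completion polled from an
InfiniBand completion queue comes back with a non-success status, i.e. the HCA
reported a failed send/receive/RDMA operation. The fragment below is only an
illustrative libibverbs polling loop under that assumption - it is not
MVAPICH2's actual code, and drain_cq() and its cq argument are hypothetical.
ibv_wc_status_str() turns the numeric status into a readable string, which
helps distinguish, say, a retry-exceeded error from a flush or protection
error.)

/* Illustrative only: a generic libibverbs completion-polling loop, not
 * MVAPICH2's internal code.  An abort of the "Error code in polled desc!"
 * kind corresponds to the branch where wc.status != IBV_WC_SUCCESS.
 * drain_cq() is a hypothetical helper; 'cq' is assumed to have been created
 * elsewhere with ibv_create_cq() and to have work requests posted to it. */
#include <stdio.h>
#include <infiniband/verbs.h>

int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    while ((n = ibv_poll_cq(cq, 1, &wc)) != 0) {
        if (n < 0) {
            fprintf(stderr, "ibv_poll_cq() failed\n");
            return -1;
        }
        if (wc.status != IBV_WC_SUCCESS) {
            /* The HCA reported a failed operation; this is the condition an
             * MPI library typically turns into a fatal "polled desc" abort. */
            fprintf(stderr, "work completion error: %s (wr_id %llu)\n",
                    ibv_wc_status_str(wc.status),
                    (unsigned long long) wc.wr_id);
            return -1;
        }
        /* Success: hand the completed request back to the upper layer and
         * keep polling until the queue is empty. */
    }
    return 0;
}

A retry-exceeded or flush status generally points at the fabric or the remote
peer rather than at the compiler used to build the application.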
>>
>> Using mvapich2-1.0.1
>>
>> I would have to check the exact build number, but it is from the last
>> month or so.
>>
>> Thanks
>> Scott Friedman
>> UCLA
> 

