[mvapich-discuss] Help with polled desc error
Scott A. Friedman
friedman at ats.ucla.edu
Wed Jan 30 17:15:40 EST 2008
It does not appear to be specifically related to the Intel compiler - we
rebuilt everything using gcc last night and saw the same problem.
This is a new cluster, so I am double-checking that things are set up
properly - they appear to be so far - but we see this problem on a couple
of other IB clusters here at UCLA as well.
mvapich2 1.0.1 and OFED 1.2.5.4, CentOS 5.x, just so we are all on the
same page.
Scott
Mike Colonno wrote:
> We have experienced this as well, and it appears to be a function of Intel's compilers. This application running on just n = 12 produced:
>
> rank 11 in job 3 node1_32811 caused collective abort of all ranks
> exit status of rank 11: killed by signal 9
> rank 6 in job 3 node1_32811 caused collective abort of all ranks
> exit status of rank 6: killed by signal 9
> rank 5 in job 3 node1_32811 caused collective abort of all ranks
> exit status of rank 5: killed by signal 11
> rank 4 in job 3 node1_32811 caused collective abort of all ranks
> exit status of rank 4: killed by signal 9
>
> But the same job on n <= 8 nodes runs fine. This happens for at least 3 different MPI codes, so it must be a function of the MVAPICH2 build and / or of compiling MVAPICH2 applications with Intel's compilers. I ran all of the ibv_* tests, which appear to return nominal output. The job above ran on just 3 different machines with 4 processes on each machine (2x quad-core Xeons, x64, Red Hat Enterprise 4.5). I built MVAPICH2 1.0.1 itself using Intel C++ / Fortran compilers, version 10.1. Is there any way to generate more detailed debug info to see exactly where these processes run into trouble?
>
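Re: more detailed debug info - one generic approach (a sketch, not MVAPICH2-specific advice; the application name below is a placeholder) is to enable core dumps before rerunning the failing job, then pull a backtrace from one of the killed ranks. MVAPICH2's MPICH2-derived configure also has debug build options (check ./configure --help for your version).

```shell
# Enable core dumps in the shell that launches the job (bash):
ulimit -c unlimited

# Then rerun, e.g. (hypothetical application name):
#   mpiexec -n 12 ./my_app
# and after the abort, get a backtrace from one of the killed ranks:
#   gdb ./my_app core.<pid>    # then 'bt' at the gdb prompt
```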
> Thanks,
>
> Michael R. Colonno, Ph.D. | Chief Aerodynamic Engineer
> Space Exploration Technologies
> 1 Rocket Road
> Hawthorne, CA 90250
> W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com
>
> -- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --
>
>
>
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of wei huang
> Sent: Wednesday, January 30, 2008 11:25 AM
> To: Scott A. Friedman
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Help with polled desc error
>
> Hi Scott,
>
> Thanks for letting us know about the problem. From your description,
> however, it looks like there is a problem with your system setup. The
> program has not passed the initialization phase yet.
>
> Would you please verify that your system setup is correct by running the
> IB-level ibv_* benchmarks? Those benchmarks are standard components of the
> OFED installation and should already be available on your systems.
>
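For anyone following along, the IB-level verification Wei suggests amounts to something like the following sketch (hostnames are placeholders; the commands need actual IB hardware, so they are shown as comments):

```shell
# Check that the HCA is visible and its ports are ACTIVE:
#   ibv_devinfo
# Point-to-point sanity test between two nodes (hypothetical hostnames);
# start the server side first:
#   node1$ ibv_rc_pingpong
#   node2$ ibv_rc_pingpong node1
```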
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Tue, 29 Jan 2008, Scott A. Friedman wrote:
>
>> Hi
>>
>> We have found applications crashing with the following error:
>>
>> [113] Abort: Error code in polled desc!
>> at line 1229 in file rdma_iba_priv.c
>> rank 113 in job 1 n90_57923 caused collective abort of all ranks
>> exit status of rank 113: killed by signal 9
>>
>> We have not been able to find anything useful about this on the web.
>> Hopefully someone here can shed some light on it.
>>
>> Using mvapich2-1.0.1
>>
>> Would have to check on the exact build number but it is from the last
>> month or so.
>>
>> Thanks
>> Scott Friedman
>> UCLA
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>