[mvapich-discuss] MVAPICH2 Invalid communicator errors

Mark Potts potts at hpcapplications.com
Thu Sep 13 11:12:36 EDT 2007


Shaun,
    You were dead on.  A spurious reference to /usr/include/mpi.h in
    a script originally intended for other MPI builds was the culprit.
    Thanks  -- until the next problem...
          regards,
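
For anyone who hits the same symptom: a minimal test program along
these lines (a generic sketch, not taken from the build in question)
is a quick way to confirm that a build script picks up the intended
mpi.h. When the header matches the MVAPICH2 library, MPI_Comm_size
succeeds; when it does not, you typically get the "Invalid
communicator" abort quoted below.

    #include <stdio.h>
    #include <mpi.h>

    /* Generic sanity check: MPI_COMM_WORLD is defined differently by
     * different MPI implementations, so a header/library mismatch
     * usually surfaces right here as an invalid communicator. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("rank %d of %d: mpi.h and the MPI library agree\n",
               rank, size);
        MPI_Finalize();
        return 0;
    }

Compile it with the MVAPICH2 mpicc (after checking "mpicc -show" for
stray include paths such as -I/usr/include) and run it on a couple of
ranks.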

Shaun Rowland wrote:
> Mark Potts wrote:
>> DK and Tom,
>>    Thanks for your interest.
>>
>>    I'm not certain what version info you wanted.  However, one
>>    designator is "mvapich2-0.9.8-12".  The MVAPICH2 source was
>>    obtained as part of OFED 1.2.  I'll get more explicit
>>    version info (OFED and MVAPICH2) if you tell me what and where
>>    to look.
> 
> That's the information we were looking for. The -12 is the RPM version
> number, which has to be incremented whenever there is any SRPM change.
> That should correspond to the latest MVAPICH2. There's a slightly
> updated one with OFED 1.2.5.
> 
>>    We have built MVAPICH (and lots of other packages) with Intel
>>    compilers and are using them without problem.  However, the
>>    responses received to date indicate that the problem is not
>>    a known issue with MVAPICH2 and Intel compilers and thus must
>>    be a setup issue on our end.
> 
> We have seen a similar error before on one of the clusters we use. That
> cluster had a modules system to set up user environments, and it ended
> up causing a different mpi.h to be included instead of the one belonging
> to the MPI build the user actually intended to use. You should check
> your user environment to make sure nothing like that is happening, and
> that there is no stray mpi.h in /usr/include. Also, run mpicc with the
> -show argument I suggested and verify the paths it reports. The type of
> error we would see was:
> 
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7ffffff2b308) failed
> MPI_Comm_size(69).: Invalid communicator
> rank 0 in job 4  bm1_48690 caused collective abort of all ranks
>  exit status of rank 0: killed by signal 9
> 
> which looks like your error.
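
For what it's worth, the bogus handle in that stack (comm=0x5b) is
exactly what a foreign mpi.h would produce: MVAPICH2 is built on
MPICH2, whose headers treat communicators as plain integer handles
with fixed values, so code compiled against another implementation's
mpi.h hands the library a bit pattern it cannot decode. Roughly (a
simplified sketch of the idea, not copied from any actual header):

    #include <stdio.h>

    /* Simplified MPICH2-style definitions, for illustration only. */
    typedef int MPI_Comm;                  /* handles are plain ints */
    #define MPI_COMM_WORLD ((MPI_Comm)0x44000000)

    int main(void)
    {
        /* An executable built against a different mpi.h embeds that
         * implementation's idea of MPI_COMM_WORLD instead, so the
         * MVAPICH2 library sees an unrecognizable value and reports
         * "Invalid communicator". */
        printf("MPICH2-style MPI_COMM_WORLD handle: 0x%x\n",
               (unsigned)MPI_COMM_WORLD);
        return 0;
    }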

-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

