[mvapich-discuss] mvapich2 error

Lei Chai chai.15 at osu.edu
Mon Sep 29 21:14:06 EDT 2008


Hi Bharat,

Thanks for reporting the problem. Since we don't have a license for 
siesta, we are not able to run it on our cluster. Could you try the 
following and let us know the results?

- Use the option MV2_USE_SHMEM_COLL=0
   e.g. $ mpirun_rsh -np N -hostfile ./hosts MV2_USE_SHMEM_COLL=0 ./prog

- Try to run the program with MPICH2-1.0.7, since mvapich2-1.2rc2 is 
based on MPICH2-1.0.7

This will help us get more insight into the problem.
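
To make the first test easy to repeat, the two runs can be wrapped in a 
small script. This is only a sketch: NP, HOSTFILE, PROG, DRY_RUN and the 
log file names are placeholders introduced here, not MVAPICH2 settings, 
and the baseline run simply omits the variable so the library default 
applies.

```shell
#!/bin/sh
# A/B sketch: run the same job once with the library defaults and once
# with MV2_USE_SHMEM_COLL=0, keeping a separate log for each run.
# NP, HOSTFILE, PROG and DRY_RUN are placeholder names for illustration.

NP="${NP:-8}"                     # number of MPI processes
HOSTFILE="${HOSTFILE:-./hosts}"   # hostfile passed to mpirun_rsh
PROG="${PROG:-./prog}"            # application binary

# Print the mpirun_rsh command line. $1 is an optional VAR=value
# setting that is placed in front of the program name.
build_cmd() {
    printf 'mpirun_rsh -np %s -hostfile %s %s%s\n' \
        "$NP" "$HOSTFILE" "${1:+$1 }" "$PROG"
}

# Run both variants and report the exit status of each.
# With DRY_RUN=1 the command lines are printed but not executed.
run_ab() {
    for setting in "" "MV2_USE_SHMEM_COLL=0"; do
        cmd=$(build_cmd "$setting")
        echo "== $cmd"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            continue
        fi
        if [ -n "$setting" ]; then
            log="run_shmem_coll_off.log"
        else
            log="run_default.log"
        fi
        # Unquoted on purpose so the command line splits into words;
        # this assumes the paths contain no spaces.
        $cmd > "$log" 2>&1
        echo "exit status: $?"
    done
}

# Call run_ab to start both runs, or DRY_RUN=1 run_ab to preview them.
```

If the run with MV2_USE_SHMEM_COLL=0 gets past the point where the 
default run aborts, that points at the shared-memory collectives; if 
both runs fail the same way, the MPICH2-1.0.7 test is the more useful 
one.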

Thanks,
Lei


Bharat wrote:
> Hi All,
>
> After several days of trying various things, I am posting my problem. 
> We have a 16-node cluster (dual-processor, quad-core Intel Xeon, 
> 16 GB RAM per node) interconnected with InfiniBand. I am using 
> mvapich2-1.2RC2 and running an application compiled with ifort 
> 10.1.017 and Intel MKL 10.0.1.014 (ScaLAPACK and BLACS taken from the 
> Intel libraries). The program runs fine for some time and then stops 
> with an error message like
>
> siesta:                 ==============================
>                             Begin CG move =     15
>                         ==============================
>
>
> siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
> siesta:    1  -110464.5442  -110476.9339  -110477.1312  0.1268 -4.4928
> siesta:    2  -110507.6684  -110459.2304  -110459.4392  0.3223 -5.8411
> siesta:    3  -110463.9960  -110472.4056  -110472.5206  0.0867 -4.6470
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1144)...................: MPI_Bcast(buf=0x20c0fe0, count=1, 
> dtype=USER<vector>, root=2, comm=0xc4000006) failed
> MPIR_Bcast(228)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 
> truncated; 31744 bytes received but buffer size is 1600
> rank 5 in job 27  master_39065   caused collective abort of all ranks
>   exit status of rank 5: killed by signal 9
>
> I tried different compiler flags, and also tried gfortran, but the 
> problem is still present, so I think the error is related to 
> mvapich2. I am new to mvapich2, so can someone please help me solve 
> this issue? I did only a default install of mvapich2 (i.e., 
> ./configure CC=... F90=..., make, make install). Do I have to set any 
> environment variables? I compiled with the -heap_arrays option to 
> overcome a stack size issue.
> The output of ibstatus is
>
> Infiniband device 'mthca0' port 1 status:
>     default gid:     fe80:0000:0000:0000:0002:c902:0027:da55
>     base lid:     0x13
>     sm lid:         0x13
>     state:         4: ACTIVE
>     phys state:     5: LinkUp
>     rate:         20 Gb/sec (4X DDR)
>
> The output of ibv_devinfo is
> hca_id:    mthca0
>     fw_ver:                1.2.0
>     node_guid:            0002:c902:0027:da54
>     sys_image_guid:            0002:c902:0027:da57
>     vendor_id:            0x02c9
>     vendor_part_id:            25204
>     hw_ver:                0xA0
>     board_id:            MT_03B0150002
>     phys_port_cnt:            1
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        2048 (4)
>             sm_lid:            19
>             port_lid:        19
>             port_lmc:        0x00
>
>
> Thanks,
> Bharat


