[mvapich-discuss] mvapich2 error

Bharat mbkumar at gmail.com
Sat Sep 27 16:37:43 EDT 2008


Hi All,

After several days of trying various things, I am posting my problem. We
have a 16-node cluster (dual-processor, quad-core Intel Xeon, 16 GB RAM per
node) interconnected with InfiniBand. I am using mvapich2-1.2RC2 and
running an application compiled with ifort 10.1.017 and Intel MKL
10.0.1.014 (ScaLAPACK and BLACS taken from the Intel libraries). The program
runs fine for some time and then stops with an error message like this:

siesta:                 ==============================
                             Begin CG move =     15
                         ==============================


siesta: iscf   Eharris(eV)      E_KS(eV)   FreeEng(eV)   dDmax  Ef(eV)
siesta:    1  -110464.5442  -110476.9339  -110477.1312  0.1268 -4.4928
siesta:    2  -110507.6684  -110459.2304  -110459.4392  0.3223 -5.8411
siesta:    3  -110463.9960  -110472.4056  -110472.5206  0.0867 -4.6470
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1144)...................: MPI_Bcast(buf=0x20c0fe0, count=1, dtype=USER<vector>, root=2, comm=0xc4000006) failed
MPIR_Bcast(228)...................:
MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 31744 bytes received but buffer size is 1600
rank 5 in job 27  master_39065   caused collective abort of all ranks
   exit status of rank 5: killed by signal 9
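
As a quick sanity check on the numbers in the truncation message, here is a
minimal Python sketch (not part of the original run) that pulls the sender
rank and the two sizes out of the last line of the error stack. The helper
name parse_truncation and the 8-byte element size are assumptions made for
illustration only; the log does not say what element type is behind
USER<vector>.

```python
import re

# The truncation line copied from the error stack above.
error_text = (
    "MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 "
    "truncated; 31744 bytes received but buffer size is 1600"
)

# Assumption: elements are 8-byte double precision values.
BYTES_PER_ELEMENT = 8

# \s+ between tokens tolerates the line being wrapped by a mail client.
PATTERN = re.compile(
    r"Message from rank (\d+) and tag (\d+)\s+truncated;\s+"
    r"(\d+) bytes received but buffer size is (\d+)"
)


def parse_truncation(text):
    """Return (sender_rank, tag, bytes_received, buffer_bytes) as integers."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("no truncation message found in text")
    return tuple(int(group) for group in match.groups())


sender, tag, received, buffer_size = parse_truncation(error_text)

print("sender rank:", sender)
print("bytes received:", received, "elements:", received // BYTES_PER_ELEMENT)
print("buffer bytes:", buffer_size, "elements:", buffer_size // BYTES_PER_ELEMENT)
```

Under the 8-byte assumption, the incoming message would hold 3968 elements
while the receive buffer on rank 5 has room for 200. The message comes from
rank 0, whereas the failing MPI_Bcast call names root=2.
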

I tried different compiler flags and also tried gfortran, but the problem
is still present, so I think the error is related to mvapich2. I am new to
mvapich2; can someone please help me solve this issue?
I did only a default install of mvapich2 (i.e., ./configure CC=... F90=...,
make, make install). Do I have to set any environment variables? I compiled
the application with the -heap_arrays option to work around a stack size
issue.
The output of ibstatus is:

Infiniband device 'mthca0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0002:c902:0027:da55
	base lid:	 0x13
	sm lid:		 0x13
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 20 Gb/sec (4X DDR)

The output of ibv_devinfo is:
hca_id:	mthca0
	fw_ver:				1.2.0
	node_guid:			0002:c902:0027:da54
	sys_image_guid:			0002:c902:0027:da57
	vendor_id:			0x02c9
	vendor_part_id:			25204
	hw_ver:				0xA0
	board_id:			MT_03B0150002
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		2048 (4)
			active_mtu:		2048 (4)
			sm_lid:			19
			port_lid:		19
			port_lmc:		0x00


Thanks,
Bharat



