[mvapich-discuss] MPI Job crash on multi-node settings only

Kin Fai Tse kftse20031207 at gmail.com
Mon Nov 17 19:42:32 EST 2014


Dear all,

I am running a small MPI program on a cluster using mpirun_rsh.
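
For reference, the launch is roughly of the form below (the binary name here is
just a placeholder, not my exact command line):

    mpirun_rsh -np 2 z1-0 z1-4 ./a.out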

When the 2 processes are on the same node, there is no problem.
But when I use 2 processes on 2 different nodes, communicating even a small part
of a very large static array (approximately 1562500 elements) immediately
crashes the program during launch.

The error is:

[z1-4:mpispawn_1][child_handler] MPI process (rank: 1, pid: 29421)
terminated with signal 11 -> abort job
[z1-0:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node z1-4
aborted: MPI process error (1)
[z1-0:mpispawn_0][read_size] read() failed on file descriptor 8: Connection
reset by peer (104)
[z1-0:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 8.
MPI process died?
[z1-0:mpispawn_0][error_sighandler] Caught error: Segmentation fault
(signal 11)
[unset]: Error reading initack on 6
Error on readline:: Connection reset by peer
/bin/bash: line 1: 29409 Segmentation fault


and occasionally I get some delayed error messages, up to 30s after running
the program:

[z1-0:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
[z1-0:mpi_rank_0][handle_cqe] Msg from 1: wc.status=12, wc.wr_id=0xc8d1c0,
wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[z1-0:mpi_rank_0][handle_cqe]
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:573: [] Got
completion with error 12, vendor code=0x81, dest rank=1
: Cannot allocate memory (12)


Here is my program for your reference.
I am sure the crash is due to the MPI communication, as the program never
crashes when I comment out both the Send and the Recv calls; however,
commenting out only one of them does not help.

#include "mpi.h"
#include <cstdio>

#define MAXBLOCK 9999999
#define INIT 1000
#define INCR 1000

int main(int argc, char* argv[]){
 int rank, size;
 int i;
 double time;
 double data[MAXBLOCK];
 double data2[2];
 MPI::Status status;
 MPI::Init();
 time=MPI::Wtime();
 rank = MPI::COMM_WORLD.Get_rank();
 size = MPI::COMM_WORLD.Get_size();
 if(rank == 0){
  for(i = INIT; i < MAXBLOCK; i+=INCR){
  data[i]=data[i];
   MPI::COMM_WORLD.Send(data, i, MPI::DOUBLE, 1, 0);
   printf("Size: %d sent.\n", i);
  }
 } else {
  i = INIT;
  while(i < MAXBLOCK){
  data[i]=data[i];
   MPI::COMM_WORLD.Recv(data, i, MPI::DOUBLE, 0, 0, status);
   i+=INCR;
  }
 }
 MPI::Finalize();
 return 0;
}
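
For comparison, here is an untested variant of the same send/receive loop with
the buffer allocated on the heap (std::vector) instead of as a ~76 MB automatic
array. This is only a sketch to rule out stack-size effects on the remote node;
I have not verified whether it changes the behavior:

#include "mpi.h"
#include <cstdio>
#include <vector>

#define MAXBLOCK 9999999
#define INIT 1000
#define INCR 1000

int main(int argc, char* argv[]) {
    MPI::Init();
    int rank = MPI::COMM_WORLD.Get_rank();

    /* heap-allocated buffer instead of a large automatic array */
    std::vector<double> data(MAXBLOCK, 0.0);

    if (rank == 0) {
        /* rank 0 sends messages of increasing size to rank 1 */
        for (int i = INIT; i < MAXBLOCK; i += INCR) {
            MPI::COMM_WORLD.Send(&data[0], i, MPI::DOUBLE, 1, 0);
            printf("Size: %d sent.\n", i);
        }
    } else {
        /* rank 1 receives the matching messages */
        MPI::Status status;
        for (int i = INIT; i < MAXBLOCK; i += INCR) {
            MPI::COMM_WORLD.Recv(&data[0], i, MPI::DOUBLE, 0, 0, status);
        }
    }

    MPI::Finalize();
    return 0;
}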

I am quite puzzled as to why communicating only a fraction of the whole array
crashes in the multi-node setting but not on a single node.

Best regards,
Kin Fai