[mvapich-discuss] MPI Job crash on multi-node settings only

Hari Subramoni subramoni.1 at osu.edu
Fri Nov 21 22:20:30 EST 2014


Hello Kin,

Good to know that the issues got resolved for you. I'm cc'ing this message
to MVAPICH-discuss for everyone's information.

Thx, Hari.

On Friday, November 21, 2014, Kin Fai Tse <kftse20031207 at gmail.com> wrote:

> Hello Hari,
>
> It turns out that the stack size and file descriptor limits are inconsistent
> between "ssh" and "qsub -I" access.
> While "qsub -I" does not impose any particularly low limits, "ssh" actually
> sets the stack size to 10 MB and the descriptor limit to 1024. My program hit
> the stack size limit and always generated a segfault when run by mpirun_rsh.
>
> After increasing the stack size, I still found some programs crashing, but
> that seems to be fixed by increasing the descriptor limit to 4096, the value
> used on the other cluster.
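>
> In case it helps anyone else, here is a minimal sketch (assuming Linux; this
> is not part of my actual job) that each rank can run to print the limits it
> really inherited from the launcher, since that is exactly where "ssh" and
> "qsub -I" differed for me:
>
> #include <sys/resource.h>
> #include <cstdio>
> #include "mpi.h"
>
> int main(int argc, char* argv[]){
>  MPI::Init();
>  int rank = MPI::COMM_WORLD.Get_rank();
>
>  struct rlimit stack_rl, nofile_rl;
>  getrlimit(RLIMIT_STACK, &stack_rl);   // stack size limit, in bytes
>  getrlimit(RLIMIT_NOFILE, &nofile_rl); // max number of open file descriptors
>
>  printf("rank %d: stack soft limit = %llu bytes, descriptor soft limit = %llu\n",
>         rank,
>         (unsigned long long)stack_rl.rlim_cur,
>         (unsigned long long)nofile_rl.rlim_cur);
>
>  MPI::Finalize();
>  return 0;
> }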
>
> Best Regards,
> Kin Fai
>
>
> 2014-11-19 6:33 GMT+08:00 Kin Fai Tse <kftse20031207 at gmail.com>:
>
>> Hello Hari,
>>
>> My MVAPICH2 is configured with only the compilers set to the Intel 11.1
>> compilers, and I launch the job using this line:
>>
>> mpirun_rsh -np 2 -hostfile nf ./a.out
>>
>> where nf contains 2 lines: z1-1 and z1-4.
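>>
>> That is, the hostfile nf is literally just:
>>
>> z1-1
>> z1-4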
>>
>> I also ran the same program on several other clusters that I have access
>> to; only that particular one (z1-x) had the problem. The problem does not
>> appear even on another cluster (z0) that was purchased together with z1 but
>> is connected to a different InfiniBand switch.
>>
>> While investigating this problem, I heard that z1's InfiniBand connection
>> might be different from z0's, so I do suspect an InfiniBand problem;
>> however, I don't know how to interpret the occasional error that is
>> reported: Cannot allocate memory (12).
>>
>> Regards,
>> Kin Fai
>>
>> On Wednesday, November 19, 2014, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>
>>> Hello Kin,
>>>
>>> I ran your program on our local cluster in a multi-node setting multiple
>>> times with the latest MVAPICH2-2.1a and was not able to reproduce the
>>> failure you were talking about.
>>>
>>> From the error message, it looks like there might be a firewall running on
>>> your system that prevents mpirun_rsh from accessing the second node,
>>> leading to the error. Could you please consult your system administrator,
>>> disable any firewalls that might be running, and retry? Could you also let
>>> us know how you're launching the job with mpirun_rsh, what your hostfile
>>> contains, and how you configured MVAPICH2?
>>>
>>>
>>> Regards,
>>> Hari.
>>>
>>> On Mon, Nov 17, 2014 at 6:42 PM, Kin Fai Tse <kftse20031207 at gmail.com>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I am running a small MPI program on cluster using mpirun_rsh.
>>>>
>>>> When the 2 processes are on the same node, there is no problem.
>>>> But when I use 2 processes on 2 different nodes, communicating a small
>>>> part of a very large static array (approximately 1562500 elements)
>>>> immediately crashes the program during launch.
>>>>
>>>> The error is:
>>>>
>>>> [z1-4:mpispawn_1][child_handler] MPI process (rank: 1, pid: 29421)
>>>> terminated with signal 11 -> abort job
>>>> [z1-0:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node
>>>> z1-4 aborted: MPI process error (1)
>>>> [z1-0:mpispawn_0][read_size] read() failed on file descriptor 8:
>>>> Connection reset by peer (104)
>>>> [z1-0:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor
>>>> 8. MPI process died?
>>>> [z1-0:mpispawn_0][error_sighandler] Caught error: Segmentation fault
>>>> (signal 11)
>>>> [unset]: Error reading initack on 6
>>>> Error on readline:: Connection reset by peer
>>>> /bin/bash: line 1: 29409 Segmentation fault
>>>>
>>>>
>>>> and occasionally I got a delayed error message up to 30 s after
>>>> running the program:
>>>>
>>>> [z1-0:mpi_rank_0][handle_cqe] Send desc error in msg to 1, wc_opcode=0
>>>> [z1-0:mpi_rank_0][handle_cqe] Msg from 1: wc.status=12,
>>>> wc.wr_id=0xc8d1c0, wc.opcode=0, vbuf->phead->type=0 =
>>>> MPIDI_CH3_PKT_EAGER_SEND
>>>> [z1-0:mpi_rank_0][handle_cqe]
>>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:573: [] Got
>>>> completion with error 12, vendor code=0x81, dest rank=1
>>>> : Cannot allocate memory (12)
>>>>
>>>>
>>>> Here is my program for your reference.
>>>> I am sure the crash is due to the MPI communication, as the program never
>>>> crashes when I comment out both MPI_SEND and MPI_RECV; however, commenting
>>>> out only one of them does not help.
>>>>
>>>> #include "mpi.h"
>>>> #include <cstdio>
>>>>
>>>> #define MAXBLOCK 9999999
>>>> #define INIT 1000
>>>> #define INCR 1000
>>>>
>>>> int main(int argc, char* argv[]){
>>>>  int rank, size;
>>>>  int i;
>>>>  double time;
>>>>  // MAXBLOCK doubles = roughly 80 MB, allocated as an automatic (stack) array
>>>>  double data[MAXBLOCK];
>>>>  double data2[2];
>>>>  MPI::Status status;
>>>>  MPI::Init();
>>>>  time = MPI::Wtime();
>>>>  rank = MPI::COMM_WORLD.Get_rank();
>>>>  size = MPI::COMM_WORLD.Get_size();
>>>>  if(rank == 0){
>>>>   // rank 0 sends ever larger prefixes of the array to rank 1
>>>>   for(i = INIT; i < MAXBLOCK; i += INCR){
>>>>    data[i] = data[i]; // touch the array
>>>>    MPI::COMM_WORLD.Send(data, i, MPI::DOUBLE, 1, 0);
>>>>    printf("Size: %d sent.\n", i);
>>>>   }
>>>>  } else {
>>>>   // rank 1 receives the matching message sizes
>>>>   i = INIT;
>>>>   while(i < MAXBLOCK){
>>>>    data[i] = data[i]; // touch the array
>>>>    MPI::COMM_WORLD.Recv(data, i, MPI::DOUBLE, 0, 0, status);
>>>>    i += INCR;
>>>>   }
>>>>  }
>>>>  MPI::Finalize();
>>>>  return 0;
>>>> }
>>>>
>>>> I am quite frustrated about why communicating only a fraction of the
>>>> whole array crashes in a multi-node setting.
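>>>>
>>>> For what it is worth, here is a heap-allocated variant of the same test
>>>> (only a sketch using std::vector instead of the large automatic array; I
>>>> have not actually confirmed whether it behaves any differently):
>>>>
>>>> #include "mpi.h"
>>>> #include <cstdio>
>>>> #include <vector>
>>>>
>>>> #define MAXBLOCK 9999999
>>>> #define INIT 1000
>>>> #define INCR 1000
>>>>
>>>> int main(int argc, char* argv[]){
>>>>  MPI::Init();
>>>>  int rank = MPI::COMM_WORLD.Get_rank();
>>>>  // same ~80 MB buffer, but on the heap instead of the stack
>>>>  std::vector<double> data(MAXBLOCK, 0.0);
>>>>  MPI::Status status;
>>>>  if(rank == 0){
>>>>   for(int i = INIT; i < MAXBLOCK; i += INCR){
>>>>    MPI::COMM_WORLD.Send(&data[0], i, MPI::DOUBLE, 1, 0);
>>>>    printf("Size: %d sent.\n", i);
>>>>   }
>>>>  } else {
>>>>   for(int i = INIT; i < MAXBLOCK; i += INCR){
>>>>    MPI::COMM_WORLD.Recv(&data[0], i, MPI::DOUBLE, 0, 0, status);
>>>>   }
>>>>  }
>>>>  MPI::Finalize();
>>>>  return 0;
>>>> }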
>>>>
>>>> Best regards,
>>>> Kin Fai
>>>>