[mvapich-discuss] [SPAM] Re: segment fault from MPI_Send

吴雪 sy1406125 at buaa.edu.cn
Wed Oct 12 23:57:06 EDT 2016


Hi,
Thanks for your reply. I tried launching with mpirun_rsh, but it still does not work; the errors are shown below. I also set MV2_SUPPORT_DPM=1 to enable MPI_Comm_spawn support and got the same result. In addition, I would like to use '-genv' to pass some variables to each process, but mpirun_rsh does not seem to support '-genv'. Is there an alternative?
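From the user-guide examples I have seen, mpirun_rsh appears to take environment variables as NAME=VALUE pairs placed before the executable instead of '-genv'; this is the form I am assuming, with my hostfile 'hf':

mpirun_rsh -np 1 -hostfile hf MV2_SUPPORT_DPM=1 ./father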


Best wishes,
xue 


run at gpu-cluster-2:~/wx-cuda-workplace/mpiSpawn$ mpirun_rsh -hostfile hf -np 1 ./father
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:
Other MPI error, error stack:
MPI_Comm_spawn(144)...........: MPI_Comm_spawn(cmd="./child", argv=(nil), maxprocs=8, info=0x9c000000, root=0, MPI_COMM_WORLD, intercomm=0x7fff46c654cc, errors=(nil)) failed
MPIDI_Comm_spawn_multiple(147): 
MPID_Open_port(70)............: Function not implemented


[gpu-cluster-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu-cluster-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu-cluster-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 25579) exited with status 1




-----Original Message-----
From: "Hari Subramoni" <subramoni.1 at osu.edu>
Sent: Wednesday, October 12, 2016
To: "吴雪" <sy1406125 at buaa.edu.cn>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: [SPAM] Re: [mvapich-discuss] segment fault from MPI_Send


Hello,


Can you try the mpirun_rsh job launcher instead of mpiexec and see if things work?
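For example, something along these lines (assuming the hostfile you mentioned is named 'hf'):

mpirun_rsh -np 1 -hostfile hf ./father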


Regards,
Hari.


On Tue, Oct 11, 2016 at 8:36 AM, 吴雪 <sy1406125 at buaa.edu.cn> wrote:
Hi, all,
I'm using MVAPICH2-2.2rc2. I have a program called 'father', and the father process uses MPI_Comm_spawn to start 8 child processes called 'child'. The source code is as follows.
father:
#include <mpi.h>
#include <stdlib.h>   /* for malloc/free */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Spawn 8 child processes on the hosts listed in the file "hf". */
    MPI_Info info = MPI_INFO_NULL;
    char deviceHosts[10] = "hf";
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", deviceHosts);

    MPI_Comm childComm;
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 8, info, 0, MPI_COMM_WORLD,
                   &childComm, MPI_ERRCODES_IGNORE);

    int size = 64 * 1024;
    int i, j;
    int *a, *b;
    a = (int *)malloc(size * sizeof(int));
    b = (int *)malloc(size * sizeof(int));

    /* 500 rounds of send/receive with each of the 8 children. */
    for (j = 0; j < 500; j++)
    {
        for (i = 0; i < 8; i++)
        {
            MPI_Send(a, size, MPI_BYTE, i, 0, childComm);
            MPI_Recv(b, size, MPI_BYTE, i, 0, childComm, MPI_STATUS_IGNORE);
        }
    }

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}
child:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>   /* for malloc/free */

int main(int argc, char **argv)
{
    int provided = 0;
    // MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm fatherComm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("child %d start\n", rank);

    /* Intercommunicator back to the spawning (parent) process. */
    MPI_Comm_get_parent(&fatherComm);

    int size = 64 * 1024;
    int i;
    int *b;
    b = (int *)malloc(size * sizeof(int));

    /* Echo each message from the parent back, for 500 rounds. */
    for (i = 0; i < 500; i++)
    {
        printf("child %d receive round %d\n", rank, i);
        MPI_Recv(b, size, MPI_BYTE, 0, 0, fatherComm, MPI_STATUS_IGNORE);
        MPI_Send(b, size, MPI_BYTE, 0, 0, fatherComm);
    }

    printf("child %d exit\n", rank);
    free(b);
    MPI_Finalize();
    return 0;
}


The backtrace from the core file is:
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb379bf2a50 in vma_compare_search () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
(gdb) bt
#0  0x00007fb379bf2a50 in vma_compare_search () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#1  0x00007fb379c11342 in avl_find () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#2  0x00007fb379bf311e in dreg_find () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#3  0x00007fb379bf539a in dreg_register () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#4  0x00007fb379c0e669 in MPIDI_CH3I_MRAIL_Prepare_rndv () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#5  0x00007fb379bd63db in MPIDI_CH3_iStartRndvMsg () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#6  0x00007fb379bd0916 in MPID_MRAIL_RndvSend () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#7  0x00007fb379bca91d in MPID_Send () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#8  0x00007fb379b574e5 in PMPI_Send () from /home/run/wx-workplace/mvapich2-2.2rc2/lib/libmpi.so.12
#9  0x0000000000400a5e in main ()


The file 'hf' contains '192.168.2.2:8'. I launch the job with mpiexec: 'mpiexec -genv MV2_SUPPORT_DPM 1 -n 1 ./father'.
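To be concrete, this is roughly how I set things up; the mpicc lines are my assumption of the build step with the MVAPICH2 compiler wrapper, while the hostfile and launch command are exactly what I used:

$ cat hf
192.168.2.2:8
$ mpicc -o father father.c
$ mpicc -o child child.c
$ mpiexec -genv MV2_SUPPORT_DPM 1 -n 1 ./father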
I have not been able to find out what causes the segmentation fault or how to fix it. I would appreciate any advice.
Looking forward to your reply.


xue







_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


