[mvapich-discuss] my application hangs up depending on node number
Michael Li
mli at deform.com
Thu Feb 23 12:18:20 EST 2006
Hi all,
I have an application that runs on an 8-node cluster,
and I am seeing a very strange problem:
if I specify 4 or 8 nodes, the application
hangs at startup; if I specify 2, 3,
5, 6, or 7 nodes, the application runs to completion.
Can anyone point me in a direction for solving this problem?
I am using mvapich-0.9.6-121/Mellanox IB Gold Distribution (IBGD) v1.7.0.
mli@sftc001:/home/mli> uname -a
Linux sftc001 2.6.10-suse92-i4smp #62 SMP Thu Mar 31 12:03:47 EST 2005
i686 i686 i386 GNU/Linux
mli@sftc001:/home/mli> cat /etc/issue
Welcome to SuSE Linux 9.2 (i586) - Kernel \r (\l).
Here is how I start my application:
mli@sftc001:/home/mli/PROBLEM/tmp1>
/home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile
/usr/rels/mvapich/share/machines/machines.LINUX -np 4
/home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
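One way to reproduce the pattern systematically is to launch the same command for each node count under a time limit and record which runs finish. The following is a minimal Python sketch, not part of mvapich. It assumes a normal run finishes well within the chosen limit, and the helper names (run_with_timeout, sweep, mpirun_cmd) are made up for illustration. The paths and flags are the ones from the command above.

```python
import subprocess

# Paths copied from the launch command in this message.
MPIRUN = "/home/deform/3d/v60/image/mvapich/bin/mpirun_rsh"
HOSTFILE = "/usr/rels/mvapich/share/machines/machines.LINUX"
EXE = "/home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE"


def mpirun_cmd(n):
    """Build the same mpirun_rsh command line as above for n processes."""
    return [MPIRUN, "-rsh", "-hostfile", HOSTFILE, "-np", str(n), EXE]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'ok', 'failed' (non-zero exit) or 'hung' (time limit hit)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the local child before raising; processes
        # started on remote nodes may still need to be cleaned up by hand.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def sweep(build_cmd, counts, timeout_s):
    """Map each process count to the outcome of running build_cmd(count)."""
    return {n: run_with_timeout(build_cmd(n), timeout_s) for n in counts}
```

Calling sweep(mpirun_cmd, range(2, 9), 600) would then give one outcome per node count. The 600-second limit is only a placeholder and has to be longer than a normal run, otherwise a slow but healthy run is reported as hung.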
I ran ps/grep for my application on the head node; the number of matching processes for each node count is:
nodes  processes
2 7
3 8
4 9
5 10
6 11
7 12
8 13
The attached file t.txt has more detailed output of ps/grep command.
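The counts in the table can be cross-checked against the attached listings with a short script. The sketch below is illustrative only: the role labels and function names are assumptions, and the "nodes + 5" rule is simply what the four attached listings show on the head node (one mpirun_rsh, one rsh per node, one extra rsh, one tcsh, one application process, and the grep itself).

```python
from collections import Counter

# Role labels are illustrative; the program names come from the ps listings.
ROLE_BY_PROGRAM = {
    "mpirun_rsh": "launcher",
    "rsh": "rsh",
    "tcsh": "remote shell",
    "DEF_SIM_P4P_INFINIBAND.EXE": "application",
    "grep": "grep",
}


def count_roles(ps_text):
    """Count 'ps -ef' lines by the program that starts each command."""
    roles = Counter()
    for line in ps_text.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # blank or separator line
        program = fields[7].split()[0].rsplit("/", 1)[-1]
        roles[ROLE_BY_PROGRAM.get(program, "other")] += 1
    return roles


def expected_matches(nodes):
    """Line count observed in the attached listings for a given node count."""
    return nodes + 5
```

Both the runs that work and the runs that hang follow the same rule in the table, so the process list on the head node alone does not separate the two cases.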
Best regards.
Michael Li
-------------- next part --------------
--------------2-node(start OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli 29976 25212 0 11:53 pts/9 00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 2 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 29977 29976 0 11:53 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 29978 29976 0 11:53 pts/9 00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=1 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 29982 29977 0 11:53 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 29981 29979 0 11:53 ? 00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 29994 29981 99 11:53 ? 00:00:23 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 351 13004 0 11:53 pts/6 00:00:00 grep DEF_SIM_P4P
--------------2-node(end OK)-----------------
--------------4-node(start Not OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli 19368 25212 0 12:02 pts/9 00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 4 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19369 19368 0 12:02 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19370 19368 0 12:02 pts/9 00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19371 19368 0 12:02 pts/9 00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19372 19368 0 12:02 pts/9 00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19376 19373 0 12:02 ? 00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19379 19369 0 12:02 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19391 19376 99 12:02 ? 00:00:23 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 20145 13004 0 12:03 pts/6 00:00:00 grep DEF_SIM_P4P
--------------4-node(end Not OK)-----------------
--------------7-node(start OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli 15643 25212 0 11:56 pts/9 00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 7 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15646 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15647 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=1 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15652 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=2 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15653 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=3 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15662 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc005 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=4 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15663 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc006 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=5 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15664 15643 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc007 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=6 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15667 15646 0 11:56 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15668 15660 0 11:56 ? 00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 15688 15668 92 11:56 ? 00:00:27 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 19096 13004 0 11:56 pts/6 00:00:00 grep DEF_SIM_P4P
--------------7-node(end OK)-----------------
--------------8-node(start Not OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli 8382 25212 0 11:50 pts/9 00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 8 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8383 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8384 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=1 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8385 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=2 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8386 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=3 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8387 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc005 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=4 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8388 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc006 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=5 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8389 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc007 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=6 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8390 8382 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc008 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=7 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8398 8395 0 11:50 ? 00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8400 8383 0 11:50 pts/9 00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 8433 8398 94 11:50 ? 00:00:24 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli 11475 13004 0 11:50 pts/6 00:00:00 grep DEF_SIM_P4P
--------------8-node(end Not OK)-----------------