[mvapich-discuss] my application hangs up depending on node number

Michael Li mli at deform.com
Thu Feb 23 12:18:20 EST 2006


Hi, all

I have an application that runs on an 8-node cluster,
and I have a very strange problem:
if I specify the node number (-np) as 4 or 8, the application
hangs at the beginning; if I specify it as 2, 3,
5, 6, or 7, the application runs to completion.
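
As a quick check of the pattern in those numbers, here is a small sketch. It uses only the node numbers reported above; the helper names are made up for illustration. It says nothing about the cause of the hang, and node numbers other than 2 through 8 were not tried.

```python
# Node numbers (-np values) as reported in this message.
HANGS = {4, 8}
RUNS = {2, 3, 5, 6, 7}

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and n & (n - 1) == 0

def separates(predicate):
    """True if predicate holds for every hanging count and for no working count."""
    return all(predicate(n) for n in HANGS) and not any(predicate(n) for n in RUNS)

# "Power of two" does not explain the split, because -np 2 runs fine.
print(separates(is_power_of_two))        # False
# "Multiple of 4" is consistent with every reported run.
print(separates(lambda n: n % 4 == 0))   # True
```

So the reported data are consistent with "hangs when -np is a multiple of 4", but with only two failing cases this is an observation, not a diagnosis.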

Can anyone point me in the right direction to solve this problem?

I am using mvapich-0.9.6-121/Mellanox IB Gold Distribution (IBGD) v1.7.0.
mli@sftc001:/home/mli> uname -a
Linux sftc001 2.6.10-suse92-i4smp #62 SMP Thu Mar 31 12:03:47 EST 2005 i686 i686 i386 GNU/Linux
mli@sftc001:/home/mli> cat /etc/issue

Welcome to SuSE Linux 9.2 (i586) - Kernel \r (\l).



Here is how I start my application:

mli@sftc001:/home/mli/PROBLEM/tmp1> /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 4 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
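
To repeat the same launch for several -np values, something like the following sketch could be used. The paths and flags are copied from the command above; the function names and the timeout value are assumptions made for illustration. A timeout cannot tell a slow run from a real hang, so it only makes sense with a limit well above the normal run time.

```python
import subprocess

# Paths and flags taken from the command line in this message.
MPIRUN = "/home/deform/3d/v60/image/mvapich/bin/mpirun_rsh"
HOSTFILE = "/usr/rels/mvapich/share/machines/machines.LINUX"
EXE = "/home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE"

def build_command(nprocs):
    """Return the mpirun_rsh argument list for the given -np value."""
    return [MPIRUN, "-rsh", "-hostfile", HOSTFILE, "-np", str(nprocs), EXE]

def run_once(nprocs, timeout_s, cwd=None):
    """Launch one run and return 'finished', 'failed' or 'timeout'.

    On timeout only the local mpirun_rsh process is killed; processes
    started on the other nodes may have to be cleaned up by hand.
    """
    try:
        result = subprocess.run(build_command(nprocs), cwd=cwd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "finished" if result.returncode == 0 else "failed"

def sweep(counts, timeout_s, cwd=None):
    """Run each -np value in turn and collect the outcome per value."""
    return {n: run_once(n, timeout_s, cwd) for n in counts}
```

For example, `sweep(range(2, 9), timeout_s=600, cwd="/home/mli/PROBLEM/tmp1")` would try -np 2 through 8 from the same working directory as above; the 600-second limit is an arbitrary placeholder.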

I ran ps/grep for my application on sftc001 and counted the
processes listed (including the grep itself) for each node number:

node#   process#
2        7
3        8
4        9
5       10
6       11
7       12
8       13

The attached file t.txt has the detailed output of the ps/grep command.
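
The listings in t.txt can be tallied by command with a short script. This is only a sketch: the function name is made up, and the sample below is the 2-node listing from the attachment with each command trimmed to its first word or two (the full lines are in t.txt).

```python
import os
from collections import Counter

def classify_ps_lines(ps_text):
    """Count `ps -ef` lines by the basename of the command being run."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        # Skip prompts, separators and blank lines: real rows have numeric PID/PPID.
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        executable = fields[7].split()[0]
        counts[os.path.basename(executable)] += 1
    return counts

# 2-node listing from t.txt, command arguments trimmed for brevity.
SAMPLE = """\
mli      29976 25212  0 11:53 pts/9    00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh
mli      29977 29976  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc001
mli      29978 29976  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc002
mli      29982 29977  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc001
mli      29981 29979  0 11:53 ?        00:00:00 tcsh -c
mli      29994 29981 99 11:53 ?        00:00:23 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli        351 13004  0 11:53 pts/6    00:00:00 grep DEF_SIM_P4P
"""

counts = classify_ps_lines(SAMPLE)
for name, n in sorted(counts.items()):
    print(name, n)
print("total", sum(counts.values()))
```

For the 2-node listing the total is 7, which matches the table above. Only the processes on sftc001 appear, since ps was run there.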

Best regards.
Michael Li

-------------- next part --------------
--------------2-node(start OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli      29976 25212  0 11:53 pts/9    00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 2 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      29977 29976  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      29978 29976  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=1 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      29982 29977  0 11:53 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      29981 29979  0 11:53 ?        00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33701 MPIRUN_PROCESSES='sftc001:sftc002:' MPIRUN_RANK=0 MPIRUN_NPROCS=2 MPIRUN_ID=29976 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      29994 29981 99 11:53 ?        00:00:23 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli        351 13004  0 11:53 pts/6    00:00:00 grep DEF_SIM_P4P
--------------2-node(end OK)-----------------

--------------4-node(start Not OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli      19368 25212  0 12:02 pts/9    00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 4 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19369 19368  0 12:02 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19370 19368  0 12:02 pts/9    00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=1 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19371 19368  0 12:02 pts/9    00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=2 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19372 19368  0 12:02 pts/9    00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=3 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19376 19373  0 12:02 ?        00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19379 19369  0 12:02 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33763 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:' MPIRUN_RANK=0 MPIRUN_NPROCS=4 MPIRUN_ID=19368 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19391 19376 99 12:02 ?        00:00:23 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      20145 13004  0 12:03 pts/6    00:00:00 grep DEF_SIM_P4P
--------------4-node(end Not OK)-----------------


--------------7-node(start OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli      15643 25212  0 11:56 pts/9    00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 7 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15646 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15647 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=1 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15652 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=2 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15653 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=3 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15662 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc005 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=4 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15663 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc006 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=5 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15664 15643  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc007 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=6 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15667 15646  0 11:56 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15668 15660  0 11:56 ?        00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33721 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:' MPIRUN_RANK=0 MPIRUN_NPROCS=7 MPIRUN_ID=15643 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      15688 15668 92 11:56 ?        00:00:27 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      19096 13004  0 11:56 pts/6    00:00:00 grep DEF_SIM_P4P
--------------7-node(end OK)-----------------



--------------8-node(start Not OK)-----------------
mli@sftc001:/home/mli> ps -ef | grep DEF_SIM_P4P
mli       8382 25212  0 11:50 pts/9    00:00:00 /home/deform/3d/v60/image/mvapich/bin/mpirun_rsh -rsh -hostfile /usr/rels/mvapich/share/machines/machines.LINUX -np 8 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8383  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8384  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc002 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=1 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8385  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc003 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=2 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8386  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc004 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=3 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8387  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc005 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=4 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8388  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc006 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=5 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8389  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc007 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=6 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8390  8382  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc008 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=7 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8398  8395  0 11:50 ?        00:00:00 tcsh -c cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8400  8383  0 11:50 pts/9    00:00:00 /usr/bin/rsh sftc001 cd /home/mli/PROBLEM/tmp1; /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=sftc001 MPIRUN_PORT=33680 MPIRUN_PROCESSES='sftc001:sftc002:sftc003:sftc004:sftc005:sftc006:sftc007:sftc008:' MPIRUN_RANK=0 MPIRUN_NPROCS=8 MPIRUN_ID=8382 DISPLAY=localhost:15.0  /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli       8433  8398 94 11:50 ?        00:00:24 /home/deform/3d/v60/image/EXE/DEF_SIM_P4P_INFINIBAND.EXE
mli      11475 13004  0 11:50 pts/6    00:00:00 grep DEF_SIM_P4P
--------------8-node(end Not OK)-----------------

