[mvapich-discuss] Problem with more MPI jobs on the same node
Krishna Chaitanya Kandalla
kandalla at cse.ohio-state.edu
Sat Aug 29 17:16:04 EDT 2009
Emir,
Can you please try launching the two jobs with explicit CPU mappings, for
example:

  mpirun_rsh -ssh -np 8 -hostfile ./machines VIADEV_CPU_MAPPING=0,1,2,3,4,5,6,7 VIADEV_USE_AFFINITY=1 ./lu.C.8.mvapich

  mpirun_rsh -ssh -np 8 -hostfile ./machines VIADEV_CPU_MAPPING=8,9,10,11,12,13,14,15 VIADEV_USE_AFFINITY=1 ./lu.C.8.mvapich
This should ensure that the processes get mapped to the core IDs that you
specify. It is a little strange that this is not happening on your systems.
You can also configure "top" to show the "last used CPU" column for each
process running on a node. That information will help us confirm whether the
16 processes are indeed all landing on the first 8 cores, or whether
something else is going on.
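As an additional check (just a quick sketch, assuming the util-linux
"taskset" utility is available on your nodes), you can print the kernel's
affinity mask for each MPI process directly; this shows which cores a
process is allowed to run on, independently of where it happens to be
running at the moment:

  # Hypothetical helper loop; adjust the process name to match your binary.
  for pid in $(pgrep -u $USER lu.C.8.mvapich); do taskset -cp $pid; done

If the mapping is taking effect, the first job's processes should report an
affinity list of 0-7 and the second job's processes 8-15.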
To enable the last-used-CPU column in top:
1. In the top interface, press "f" to open the field-selection screen, press
   "j" to enable the "Last used cpu" field, and press Enter to return.
2. Optionally, press "o" to open the field-ordering screen and use Shift+j to
   move the "J" (last used CPU) field next to the "A" (PID) field; this makes
   the two columns easier to compare visually.
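Alternatively, as a non-interactive cross-check (again only a sketch using
standard procps tools, nothing MVAPICH-specific), you can sample the same
last-used-CPU information with ps:

  # PSR is the processor each process last ran on; adjust -C to your binary name.
  ps -C lu.C.8.mvapich -o pid,psr,comm

Running this a few times while both jobs are active should show whether the
16 processes are spread over cores 0-15 or stacked on cores 0-7.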
Thanks,
Krishna
Emir Imamagic wrote:
> Dhabaleswar Panda wrote:
> >> What is the output of top and mpstat when you run a 16-process LU job
> >> on the same 16 cores (0-15)?
>
> Command:
> mpirun_rsh -ssh -np 16 -hostfile ./machines VIADEV_USE_AFFINITY=0 ./lu.C.16
>
> TOP:
> top - 20:45:42 up 56 days, 15:18, 2 users, load average: 8.55, 5.76, 4.46
> Tasks: 484 total, 17 running, 467 sleeping, 0 stopped, 0 zombie
> Cpu(s): 15.2%us, 1.4%sy, 0.0%ni, 83.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 66072240k total, 9708912k used, 56363328k free, 336556k buffers
> Swap: 7999992k total, 0k used, 7999992k free, 7728032k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 32508 eimamagi 25 0 140m 79m 19m R 99.1 0.1 0:42.09 lu.C.16
> 32509 eimamagi 25 0 140m 63m 4176 R 99.1 0.1 0:42.10 lu.C.16
> 32510 eimamagi 25 0 140m 63m 3792 R 99.1 0.1 0:42.08 lu.C.16
> 32511 eimamagi 25 0 140m 63m 3332 R 99.1 0.1 0:42.09 lu.C.16
> 32512 eimamagi 25 0 140m 63m 4228 R 99.1 0.1 0:42.11 lu.C.16
> 32513 eimamagi 25 0 140m 64m 5148 R 99.1 0.1 0:42.11 lu.C.16
> 32514 eimamagi 25 0 140m 64m 4772 R 99.1 0.1 0:42.11 lu.C.16
> 32515 eimamagi 25 0 140m 63m 4232 R 99.1 0.1 0:42.11 lu.C.16
> 32516 eimamagi 25 0 140m 63m 4052 R 99.1 0.1 0:42.11 lu.C.16
> 32517 eimamagi 25 0 140m 64m 4716 R 99.1 0.1 0:42.10 lu.C.16
> 32518 eimamagi 25 0 140m 63m 4544 R 99.1 0.1 0:42.10 lu.C.16
> 32519 eimamagi 25 0 140m 63m 4060 R 99.1 0.1 0:42.11 lu.C.16
> 32520 eimamagi 25 0 140m 62m 3892 R 99.1 0.1 0:42.10 lu.C.16
> 32521 eimamagi 25 0 140m 63m 4428 R 99.1 0.1 0:42.11 lu.C.16
> 32522 eimamagi 25 0 140m 63m 4428 R 99.1 0.1 0:42.11 lu.C.16
> 32523 eimamagi 25 0 140m 62m 3392 R 99.1 0.1 0:42.11 lu.C.16
>
> MPSTAT:
> 20:45:23  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
> 20:45:25  all   50.02    0.00    0.03    0.00    0.00    0.00    0.00   49.95  1005.00
> 20:45:25    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  1005.00
> 20:45:25    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25    9  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   10  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   11  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   12  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   13  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   14  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:45:25   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   23    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   24    0.50    0.00    0.50    0.00    0.00    0.00    0.00   99.00     0.00
> 20:45:25   25    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   26    0.50    0.00    0.50    0.00    0.00    0.00    0.00   99.00     0.00
> 20:45:25   27    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   29    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50     0.00
> 20:45:25   30    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:45:25   31    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
>
>
>
> Just for comparison, here is the output when I run two instances of
> lu.C.16. It is pretty obvious that only the first 16 CPUs are used, no
> matter how many jobs I start.
>
> TOP:
> top - 20:47:06 up 56 days, 15:19, 3 users, load average: 16.74, 8.87, 5.66
> Tasks: 509 total, 33 running, 476 sleeping, 0 stopped, 0 zombie
> Cpu(s): 50.0%us, 0.1%sy, 0.0%ni, 49.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 66072240k total, 10744044k used, 55328196k free, 336564k buffers
> Swap: 7999992k total, 0k used, 7999992k free, 7769652k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 32671 eimamagi 25 0 140m 62m 3464 R 50.5 0.1 0:05.25 lu.C.16
> 32673 eimamagi 25 0 140m 63m 3996 R 50.5 0.1 0:05.26 lu.C.16
> 32510 eimamagi 25 0 140m 63m 3892 R 50.2 0.1 2:01.07 lu.C.16
> 32511 eimamagi 25 0 140m 63m 3380 R 50.2 0.1 2:01.13 lu.C.16
> 32512 eimamagi 25 0 140m 64m 4228 R 50.2 0.1 2:01.15 lu.C.16
> 32513 eimamagi 25 0 140m 64m 5148 R 50.2 0.1 2:01.15 lu.C.16
> 32514 eimamagi 25 0 140m 64m 4860 R 50.2 0.1 2:01.15 lu.C.16
> 32516 eimamagi 25 0 141m 63m 4084 R 50.2 0.1 2:01.13 lu.C.16
> 32519 eimamagi 25 0 140m 63m 4152 R 50.2 0.1 2:01.14 lu.C.16
> 32521 eimamagi 25 0 140m 63m 4468 R 50.2 0.1 2:01.14 lu.C.16
> 32523 eimamagi 25 0 140m 62m 3756 R 50.2 0.1 2:01.13 lu.C.16
> 32659 eimamagi 25 0 140m 79m 19m R 50.2 0.1 0:05.25 lu.C.16
> 32660 eimamagi 25 0 140m 63m 4160 R 50.2 0.1 0:05.26 lu.C.16
> 32662 eimamagi 25 0 140m 63m 3280 R 50.2 0.1 0:05.27 lu.C.16
> 32664 eimamagi 25 0 141m 64m 5140 R 50.2 0.1 0:05.27 lu.C.16
> 32665 eimamagi 25 0 140m 64m 4876 R 50.2 0.1 0:05.27 lu.C.16
> 32666 eimamagi 25 0 140m 64m 4348 R 50.2 0.1 0:05.27 lu.C.16
> 32668 eimamagi 25 0 140m 64m 4688 R 50.2 0.1 0:05.26 lu.C.16
> 32669 eimamagi 25 0 140m 63m 4416 R 50.2 0.1 0:05.26 lu.C.16
> 32672 eimamagi 25 0 140m 63m 4396 R 50.2 0.1 0:05.26 lu.C.16
> 32508 eimamagi 25 0 140m 79m 19m R 49.8 0.1 2:01.14 lu.C.16
> 32509 eimamagi 25 0 140m 63m 4176 R 49.8 0.1 2:01.13 lu.C.16
> 32515 eimamagi 25 0 140m 64m 4404 R 49.8 0.1 2:01.14 lu.C.16
> 32517 eimamagi 25 0 141m 64m 4716 R 49.8 0.1 2:01.14 lu.C.16
> 32518 eimamagi 25 0 140m 64m 4660 R 49.8 0.1 2:01.13 lu.C.16
> 32520 eimamagi 25 0 140m 63m 3960 R 49.8 0.1 2:01.15 lu.C.16
> 32522 eimamagi 25 0 140m 63m 4484 R 49.8 0.1 2:01.14 lu.C.16
> 32661 eimamagi 25 0 140m 63m 3776 R 49.8 0.1 0:05.27 lu.C.16
> 32663 eimamagi 25 0 140m 64m 4216 R 49.8 0.1 0:05.26 lu.C.16
> 32667 eimamagi 25 0 140m 63m 3896 R 49.8 0.1 0:05.27 lu.C.16
> 32670 eimamagi 25 0 140m 63m 3956 R 49.8 0.1 0:05.26 lu.C.16
> 32674 eimamagi 25 0 140m 62m 3408 R 49.8 0.1 0:05.27 lu.C.16
>
> MPSTAT:
> 20:47:35  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
> 20:47:37  all   50.00    0.02    0.08    0.00    0.00    0.00    0.00   49.91  1004.50
> 20:47:37    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  1004.50
> 20:47:37    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37    9  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   10  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   11  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   12  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   13  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   14  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     0.00
> 20:47:37   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   23    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50     0.00
> 20:47:37   24    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   25    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50     0.00
> 20:47:37   26    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   27    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50     0.00
> 20:47:37   28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   29    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   30    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     0.00
> 20:47:37   31    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50     0.00
>
>
> >> You also indicated in your original e-mail that a single node has 32
> >> cores. I am assuming that it has eight sockets of four cores each. Are
> >> these Opterons or some other processor type?
>
> Quad-Core AMD Opteron(tm) Processor 8384.
>
> Thanks,
> emir