[mvapich-discuss] Problem with more MPI jobs on the same node

Krishna Chaitanya Kandalla kandalla at cse.ohio-state.edu
Sat Aug 29 17:16:04 EDT 2009


Emir,

 > mpirun_rsh -ssh -np 8 -hostfile ./machines VIADEV_CPU_MAPPING=0,1,2,3,4,5,6,7 VIADEV_USE_AFFINITY=1 ./lu.C.8.mvapich
 > mpirun_rsh -ssh -np 8 -hostfile ./machines VIADEV_CPU_MAPPING=8,9,10,11,12,13,14,15 VIADEV_USE_AFFINITY=1 ./lu.C.8.mvapich

        This should ensure that the processes are mapped to the core
IDs that you have specified. It is a little strange that this is not
happening on your systems. You can configure "top" to also show the
"last used CPU" for each process running on a node. This will help us
confirm whether the 16 processes are indeed being mapped onto the first
8 cores, and that nothing else is going on.
         To do this:
1. Open the top interface, press the "f" key, press the "j" key, and
then hit Return.
2. Optionally, press the "o" key, then hold down Shift and "j" until
the "J" and "A" fields are juxtaposed; this makes them easier to
compare visually.
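
As a scriptable alternative to the interactive steps above, the same
"last used CPU" value can be read from /proc. The sketch below is not
part of MVAPICH; it assumes a Linux /proc filesystem, where field 39 of
/proc/<pid>/stat is the CPU the task last ran on (see proc(5)). The
function names are hypothetical, chosen only for illustration.

```python
import os
import sys


def last_cpu_from_stat(stat_text):
    """Return the CPU a task last ran on, parsed from /proc/<pid>/stat text."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); "processor" is field 39 in proc(5).
    return int(rest[39 - 3])


def last_cpus_by_name(name, proc_root="/proc"):
    """Map PID -> last-used CPU for every process whose command name is `name`.

    Note: the kernel truncates the command name to 15 characters.
    """
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                text = f.read()
        except OSError:
            continue  # process exited while we were scanning
        comm = text[text.index("(") + 1:text.rindex(")")]
        if comm == name:
            result[int(entry)] = last_cpu_from_stat(text)
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print "PID CPU" pairs, one per line, for the given command name.
    for pid, cpu in sorted(last_cpus_by_name(sys.argv[1]).items()):
        print(pid, cpu)
```

Saved under a name of your choice and run on the compute node with the
benchmark's command name (for example lu.C.8.mvapich) as its only
argument, it prints one "PID CPU" pair per rank. If both 8-process jobs
report CPUs only in the 0-7 range, the second mapping is not being
applied.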

Thanks,
Krishna



  

Emir Imamagic wrote:
> Dhabaleswar Panda wrote:
>> What is the output of top and mpstat when you run a 16-process LU job on
>> the same 16-cores (0-15)?
>
> Command:
>  mpirun_rsh -ssh -np 16 -hostfile ./machines VIADEV_USE_AFFINITY=0 ./lu.C.16
>
> TOP:
> top - 20:45:42 up 56 days, 15:18,  2 users,  load average: 8.55, 5.76, 4.46
> Tasks: 484 total,  17 running, 467 sleeping,   0 stopped,   0 zombie
> Cpu(s): 15.2%us,  1.4%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  66072240k total,  9708912k used, 56363328k free,   336556k buffers
> Swap:  7999992k total,        0k used,  7999992k free,  7728032k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 32508 eimamagi  25   0  140m  79m  19m R 99.1  0.1   0:42.09 lu.C.16
> 32509 eimamagi  25   0  140m  63m 4176 R 99.1  0.1   0:42.10 lu.C.16
> 32510 eimamagi  25   0  140m  63m 3792 R 99.1  0.1   0:42.08 lu.C.16
> 32511 eimamagi  25   0  140m  63m 3332 R 99.1  0.1   0:42.09 lu.C.16
> 32512 eimamagi  25   0  140m  63m 4228 R 99.1  0.1   0:42.11 lu.C.16
> 32513 eimamagi  25   0  140m  64m 5148 R 99.1  0.1   0:42.11 lu.C.16
> 32514 eimamagi  25   0  140m  64m 4772 R 99.1  0.1   0:42.11 lu.C.16
> 32515 eimamagi  25   0  140m  63m 4232 R 99.1  0.1   0:42.11 lu.C.16
> 32516 eimamagi  25   0  140m  63m 4052 R 99.1  0.1   0:42.11 lu.C.16
> 32517 eimamagi  25   0  140m  64m 4716 R 99.1  0.1   0:42.10 lu.C.16
> 32518 eimamagi  25   0  140m  63m 4544 R 99.1  0.1   0:42.10 lu.C.16
> 32519 eimamagi  25   0  140m  63m 4060 R 99.1  0.1   0:42.11 lu.C.16
> 32520 eimamagi  25   0  140m  62m 3892 R 99.1  0.1   0:42.10 lu.C.16
> 32521 eimamagi  25   0  140m  63m 4428 R 99.1  0.1   0:42.11 lu.C.16
> 32522 eimamagi  25   0  140m  63m 4428 R 99.1  0.1   0:42.11 lu.C.16
> 32523 eimamagi  25   0  140m  62m 3392 R 99.1  0.1   0:42.11 lu.C.16
>
> MPSTAT:
> 20:45:23     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 20:45:25     all   50.02    0.00    0.03    0.00    0.00    0.00    0.00   49.95   1005.00
> 20:45:25       0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1005.00
> 20:45:25       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25       9  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      10  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      11  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      12  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      13  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      14  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:45:25      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      19    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      20    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      23    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      24    0.50    0.00    0.50    0.00    0.00    0.00    0.00   99.00      0.00
> 20:45:25      25    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      26    0.50    0.00    0.50    0.00    0.00    0.00    0.00   99.00      0.00
> 20:45:25      27    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      29    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      0.00
> 20:45:25      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:45:25      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
>
>
>
> Just for comparison, here's the output when I run 2 instances of
> lu.C.16. It is pretty obvious that only the first 16 CPUs are used,
> no matter how many jobs I start.
>
> TOP:
> top - 20:47:06 up 56 days, 15:19,  3 users,  load average: 16.74, 8.87, 5.66
> Tasks: 509 total,  33 running, 476 sleeping,   0 stopped,   0 zombie
> Cpu(s): 50.0%us,  0.1%sy,  0.0%ni, 49.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  66072240k total, 10744044k used, 55328196k free,   336564k buffers
> Swap:  7999992k total,        0k used,  7999992k free,  7769652k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 32671 eimamagi  25   0  140m  62m 3464 R 50.5  0.1   0:05.25 lu.C.16
> 32673 eimamagi  25   0  140m  63m 3996 R 50.5  0.1   0:05.26 lu.C.16
> 32510 eimamagi  25   0  140m  63m 3892 R 50.2  0.1   2:01.07 lu.C.16
> 32511 eimamagi  25   0  140m  63m 3380 R 50.2  0.1   2:01.13 lu.C.16
> 32512 eimamagi  25   0  140m  64m 4228 R 50.2  0.1   2:01.15 lu.C.16
> 32513 eimamagi  25   0  140m  64m 5148 R 50.2  0.1   2:01.15 lu.C.16
> 32514 eimamagi  25   0  140m  64m 4860 R 50.2  0.1   2:01.15 lu.C.16
> 32516 eimamagi  25   0  141m  63m 4084 R 50.2  0.1   2:01.13 lu.C.16
> 32519 eimamagi  25   0  140m  63m 4152 R 50.2  0.1   2:01.14 lu.C.16
> 32521 eimamagi  25   0  140m  63m 4468 R 50.2  0.1   2:01.14 lu.C.16
> 32523 eimamagi  25   0  140m  62m 3756 R 50.2  0.1   2:01.13 lu.C.16
> 32659 eimamagi  25   0  140m  79m  19m R 50.2  0.1   0:05.25 lu.C.16
> 32660 eimamagi  25   0  140m  63m 4160 R 50.2  0.1   0:05.26 lu.C.16
> 32662 eimamagi  25   0  140m  63m 3280 R 50.2  0.1   0:05.27 lu.C.16
> 32664 eimamagi  25   0  141m  64m 5140 R 50.2  0.1   0:05.27 lu.C.16
> 32665 eimamagi  25   0  140m  64m 4876 R 50.2  0.1   0:05.27 lu.C.16
> 32666 eimamagi  25   0  140m  64m 4348 R 50.2  0.1   0:05.27 lu.C.16
> 32668 eimamagi  25   0  140m  64m 4688 R 50.2  0.1   0:05.26 lu.C.16
> 32669 eimamagi  25   0  140m  63m 4416 R 50.2  0.1   0:05.26 lu.C.16
> 32672 eimamagi  25   0  140m  63m 4396 R 50.2  0.1   0:05.26 lu.C.16
> 32508 eimamagi  25   0  140m  79m  19m R 49.8  0.1   2:01.14 lu.C.16
> 32509 eimamagi  25   0  140m  63m 4176 R 49.8  0.1   2:01.13 lu.C.16
> 32515 eimamagi  25   0  140m  64m 4404 R 49.8  0.1   2:01.14 lu.C.16
> 32517 eimamagi  25   0  141m  64m 4716 R 49.8  0.1   2:01.14 lu.C.16
> 32518 eimamagi  25   0  140m  64m 4660 R 49.8  0.1   2:01.13 lu.C.16
> 32520 eimamagi  25   0  140m  63m 3960 R 49.8  0.1   2:01.15 lu.C.16
> 32522 eimamagi  25   0  140m  63m 4484 R 49.8  0.1   2:01.14 lu.C.16
> 32661 eimamagi  25   0  140m  63m 3776 R 49.8  0.1   0:05.27 lu.C.16
> 32663 eimamagi  25   0  140m  64m 4216 R 49.8  0.1   0:05.26 lu.C.16
> 32667 eimamagi  25   0  140m  63m 3896 R 49.8  0.1   0:05.27 lu.C.16
> 32670 eimamagi  25   0  140m  63m 3956 R 49.8  0.1   0:05.26 lu.C.16
> 32674 eimamagi  25   0  140m  62m 3408 R 49.8  0.1   0:05.27 lu.C.16
>
> MPSTAT:
> 20:47:35     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 20:47:37     all   50.00    0.02    0.08    0.00    0.00    0.00    0.00   49.91   1004.50
> 20:47:37       0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1004.50
> 20:47:37       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37       9  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      10  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      11  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      12  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      13  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      14  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 20:47:37      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      19    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      20    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      23    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      0.00
> 20:47:37      24    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      25    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      0.00
> 20:47:37      26    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      27    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      0.00
> 20:47:37      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
> 20:47:37      31    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      0.00
>
>
>> You also indicated in your original e-mail that a single node has 32
>> cores. I am assuming that it has eight sockets of four cores each. Are
>> these Opterons or any other processor type?
>
> Quad-Core AMD Opteron(tm) Processor 8384.
>
> Thanks,
> emir

