[mvapich-discuss] Profiling timer expired

Jim Galarowicz jeg at krellinst.org
Wed May 11 11:23:32 EDT 2011


Hi Sayantan,

I'm using mpi/mvapich2-1.4.1_oobpr_intel-11.1-f064-c064 at Sandia and a 
similar version at LLNL, and I am seeing the same messages about the 
profiling timer.
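
For reference, the kind of sampling described in my original report (quoted 
below) can be sketched roughly as follows. This is only a minimal Python 
illustration of SIGPROF used together with setitimer(ITIMER_PROF, ...), not 
Open|SpeedShop code; the names on_sigprof and run_sampled are made up for 
the example. The default action for SIGPROF is to terminate the process, and 
on Linux/glibc the description of that signal is "Profiling timer expired", 
so the srun message would be consistent with a SIGPROF arriving in a process 
that has no handler installed for it.

```python
import signal

samples = 0  # number of SIGPROF deliveries seen so far


def on_sigprof(signum, frame):
    """Count one sample; a real profiler would record the program counter here."""
    global samples
    samples += 1


def run_sampled(target_samples=5, interval_s=0.01):
    """Burn CPU under an ITIMER_PROF timer until target_samples signals arrive."""
    global samples
    samples = 0
    # Install the handler BEFORE arming the timer: with the default
    # disposition, the first SIGPROF would terminate the process.
    signal.signal(signal.SIGPROF, on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)
    try:
        x = 0
        while samples < target_samples:
            x += 1  # ITIMER_PROF counts CPU time, so keep the CPU busy
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
    return samples


if __name__ == "__main__":
    print("samples taken:", run_sampled())
```

If anything in the launch path resets the signal disposition to the default 
while the timer is still armed, the next timer expiry would kill the task 
instead of producing a sample. That is an assumption about what might be 
happening here, not something I have confirmed.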

I'm having trouble getting mpirun_rsh to work. I must be doing 
something wrong; I haven't run in this mode before.
This is the command I'm using:

/apps/x86_64/mpi/mvapich2/intel-11.1-f064-c064/mvapich2-1.4.1_oobpr/bin/mpirun_rsh \
    -np 32 -hostfile hostfile ./aleph-redsky-opt ./test1.in

where hostfile contains (cat hostfile):

rs264.sandia.gov
rs265.sandia.gov
rs266.sandia.gov
rs267.sandia.gov

At first I was asked for my password for each node, and when I entered it, 
it was rejected.

Now, when I enter the mpirun_rsh command, I just get the prompt back.


I used salloc to allocate 4 nodes with 8 processes each.

Sorry for not knowing how to run mpirun_rsh. Any suggestions? I 
haven't used hydra either.

Thanks,
Jim G



On 05/11/2011 09:36 AM, Sayantan Sur wrote:
> Hi Jim,
>
> Thanks for reporting this issue. May I ask which exact version of
> MVAPICH2 you have been using? I see that you launched your job using
> SLURM. Did you try launching with mpirun_rsh or hydra? If you did, did
> you see the same problem with mpirun_rsh? Knowing that might help us
> narrow down the problem. You can 'salloc' the nodes and then use
> mpirun_rsh directly.
>
> Thanks for your help.
>
> On Wed, May 11, 2011 at 9:09 AM, Jim Galarowicz <jeg at krellinst.org> wrote:
>> Hi,
>>
>> I'm a developer on the Open|SpeedShop performance tool project
>> (www.openspeedshop.org).  Recently (possibly tied to recent versions of
>> mvapich2) I've started seeing the error message:
>>       srun: error: rs265: tasks 0-1: Profiling timer expired
>> when running any Open|SpeedShop experiment that uses the SIGPROF signal in
>> conjunction with setitimer(ITIMER_PROF,...) to periodically interrupt the
>> application to take program counter and call tree samples and to read PAPI
>> event counters.  I'm trying to figure out whether mvapich2 also uses these
>> mechanisms and the two are in conflict.
>>
>> Does mvapich2 use these timer mechanisms?  If not, could slurm be using
>> them?
>>
>> Thanks,
>> Jim G
>>
>>
>> Here is an example of what I'm seeing when running our pcsamp (program
>> counter) experiment:
>>
>>
>> osspcsamp "srun -n 8 --ntasks-per-node 8 /usr/bin/numa_wrapper -ppn 8
>> ./application-redsky-opt ./test1.in"
>> [openss]: pcsamp experiment using the pcsamp experiment default sampling
>> rate: "100".
>> [openss]: Using OPENSS_PREFIX installed in
>> /projects/OSS/openspeedshop-2.0.1_beta2
>> [openss]: Setting up offline raw data directory in ./offline-oss
>> [openss]: Running offline pcsamp experiment using the command:
>> "/apps/slurm/wrapper/srun -n 8 --ntasks-per-node 8 /usr/bin/numa_wrapper
>> -ppn 8 /projects/OSS/openspeedshop-2.0.1_beta2/bin/ossrun -c pcsamp
>> "./application-redsky-opt ./test1.in""
>>
>> srun: error: rs265: tasks 0-1: Profiling timer expired
>> srun: First task exited 30s ago
>> srun: tasks 2-7: running
>> srun: tasks 0-1: exited abnormally
>> srun: Terminating job step 4249909.15
>> slurmd[rs265]: *** STEP 4249909.15 KILLED AT 2011-05-10T22:22:42 WITH SIGNAL
>> 9 ***
>>
>> [openss]: Converting raw data from ./offline-oss into temp file X.0.openss
>>
>> Processing raw data for application
>> Processing processes and threads ...
>> Processing performance data ...
>> Processing functions and statements ...
>>
>> [openss]: Restoring and displaying default view for:
>>     /home/jgalaro/demos/application/application-redsky-opt-pcsamp-10.openss
>> [openss]: The restored experiment identifier is:  -x 1
>> No performance measurements were made for the experiment.
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
>

