[mvapich-discuss] MV2_USE_BLOCKING

khaled hamidouche khaledhamidouche at gmail.com
Tue Dec 17 12:49:44 EST 2013


Hi Alex,

I tried a simple program that broadcasts an integer on our local cluster
with MV2-1.9 and PGI 13.2, and I am not able to reproduce your error. I
enabled blocking mode and also oversubscribed the nodes, running twice as
many processes as there are cores.
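
For reference, the test was essentially the following (a minimal sketch,
not the exact source I ran; launch it oversubscribed and with
MV2_USE_BLOCKING=1 set in the environment):

  #include <mpi.h>
  #include <iostream>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Rank 0 supplies the value; every other rank receives it.
    int value = (rank == 0) ? 42 : -1;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::cout << "rank " << rank << " received " << value << "\n";

    MPI_Finalize();
    return 0;
  }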

Could you please give us more information:
1) How many nodes, cores, and processes did you use for your run?
2) If possible, could you share a reproducer?

Thanks.


On Mon, Dec 16, 2013 at 5:44 PM, Alex Breslow <abreslow at cs.ucsd.edu> wrote:

> Hi there,
>
> I find that setting MV2_USE_BLOCKING=1 causes an MPI program that I am
> writing to behave nondeterministically.  When run without this flag, the
> program runs to completion every time.  However, when the flag is set, the
> program finishes correctly only about 10% of the time.
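>
> For completeness, I enable the flag through the launch environment,
> along these lines (the exact launcher and process counts on Gordon may
> differ; this is just the shape of the invocation):
>
>   MV2_USE_BLOCKING=1 mpiexec -np <nprocs> ./my_program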
>
> Specifically, I am designing a distributed system that is built on top of
> MPI.  The program has a single controller process (rank = 0) and a number
> of worker processes.  The controller process cyclically broadcasts
> instructions to the worker processes at regular intervals.
>
> I am using MVAPICH2 version 1.9 and PGI compiler version 13.2 on the
> Gordon Supercomputer.
>
> Currently, I use the code below.
>
> Controller:
>
> int broadcastMessage(int action, int jobID, char* executableName,
>                      char* jobName){
>   MSG_Payload msg;
>   msg.action = action; // We always need an action.
>   switch(action){
>   /* Some more stuff that I have omitted for clarity */
>   }
>   int root = 0;
>   int msgCount = 1;
>   // Rank 0 broadcasts the payload to every rank in the communicator.
>   MPI_Bcast(&msg, msgCount, PayloadType, root, MPI_COMM_WORLD);
>   return 0; // success
> }
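>
> (MSG_Payload and PayloadType are defined elsewhere in my code. Roughly,
> they look like the sketch below; the field list here is illustrative,
> and the datatype is just the struct's raw bytes:)
>
> typedef struct {
>   int action;               // command for the workers
>   int jobID;
>   char executableName[256];
>   char jobName[256];
> } MSG_Payload;
>
> MPI_Datatype PayloadType;
> MPI_Type_contiguous(sizeof(MSG_Payload), MPI_BYTE, &PayloadType);
> MPI_Type_commit(&PayloadType);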
>
> Worker:
>
> int postReceiveBroadcast(){
>   // TODO: Encapsulate code up to variable `ret_code' in MPI-specific
>   //       implementation
>   int root = 0;
>   int msgCount = 1;
>   MSG_Payload msg;
>   std::cout << "Posting broadcast receive\n";
>   // Worker ranks receive the payload broadcast by the controller.
>   MPI_Bcast(&msg, msgCount, PayloadType, root, MPI_COMM_WORLD);
>
>   /* Some more stuff that I have omitted for clarity */
>
>   return ret_code;
> }
>
> The controller manages to reach MPI_Finalize without incident, but the
> workers never wake up after their final MPI_Bcast, so they never terminate.
> The workers call MPI_Bcast about 2 minutes before the controller does.
>
> Let me know if I am missing anything or if you need more information.
>
> Many thanks,
> Alex
>
> --
> Alex Breslow
> PhD student in computer science at UC San Diego
> Email: abreslow at cs.ucsd.edu
> Website: cseweb.ucsd.edu/~abreslow
>


-- 
 K.H