[mvapich-discuss] dat_evd_dequeue erroneous condition is not handled

nilesh awate nilesh_awate at yahoo.com
Wed Jul 9 09:05:55 EDT 2008


Hi Lei,

I have created a small patch that handles transport errors: it aborts the MPI
application and exits.
I have tried it on mvapich2-1.0.1 and mvapich2-1.0.3.

Here is the patch:

--- orig_mvapich2-1.0.1/src/mpid/osu_ch3/channels/mrail/src/udapl/udapl_channel_manager.c       2007-09-06 02:14:15.000000000 +0530
+++ mvapich2-1.0.1_patched/src/mpid/osu_ch3/channels/mrail/src/udapl/udapl_channel_manager.c    2008-07-02 15:30:45.000000000 +0530
@@ -455,6 +455,8 @@
     int i, j, needed;
     static int last_poll = 0;
     int type = T_CHANNEL_NO_ARRIVE;
+    int rank;
+    PMI_Get_rank(&rank);

     *vbuf_handle = NULL;
     for (i = last_poll, j = 0;
@@ -467,6 +469,16 @@
             {
                 DEBUG_PRINT ("[poll cq]: get complete queue entry\n");
                 assert (event.event_number == DAT_DTO_COMPLETION_EVENT);
+
+                /* Abort on a fatal DTO error such as DAT_DTO_ERR_TRANSPORT,
+                   which occurs when the network malfunctions. */
+
+                if (event.event_data.dto_completion_event_data.status != DAT_DTO_SUCCESS)
+                {
+                    udapl_error_abort(UDAPL_STATUS_ERR, "[%d] DAT_EVD_ERROR in Consume_signals %x\n", rank,
+                                      event.event_data.dto_completion_event_data.status);
+                }
+
                 sc = ((struct vbuf *) event.event_data.
                       dto_completion_event_data.user_cookie.as_ptr)->desc;
                 v = (vbuf *) ((aint_t) sc.cookie.as_ptr);
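
For anyone who wants to try the status check without a uDAPL stack, here is a minimal standalone sketch of the same idea. All the mock_* types and check_completion() are hypothetical stand-ins I made up for illustration; they are not the DAT definitions, and the real code calls udapl_error_abort() rather than returning an error code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the uDAPL types. The real definitions come
   from the DAT headers and are not reproduced here. */
typedef enum {
    MOCK_DTO_SUCCESS = 0,
    MOCK_DTO_ERR_TRANSPORT      /* stand-in for DAT_DTO_ERR_TRANSPORT */
} mock_dto_status;

typedef enum {
    MOCK_DTO_COMPLETION_EVENT = 1
} mock_event_number;

typedef struct {
    mock_event_number event_number;
    mock_dto_status   status;
} mock_event;

/* Returns 0 when the completion is usable and -1 after reporting a failed
   one. In the patch, the failure branch aborts the job instead. */
int check_completion(int rank, const mock_event *ev)
{
    assert(ev->event_number == MOCK_DTO_COMPLETION_EVENT);

    if (ev->status != MOCK_DTO_SUCCESS) {
        fprintf(stderr, "[%d] DTO completion failed, status 0x%x\n",
                rank, (unsigned) ev->status);
        return -1;
    }
    return 0;
}
```

Returning an error code here only keeps the sketch testable; for a transport failure the caller would still be expected to stop the job.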


regards

Nilesh



----- Original Message ----
From: LEI CHAI <chai.15 at osu.edu>
To: nilesh awate <nilesh_awate at yahoo.com>
Cc: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
Sent: Wednesday, 18 June, 2008 2:27:32 AM
Subject: Re: [mvapich-discuss] dat_evd_dequeue erroneous condition is not handled

Hi,
 
We have never seen the DAT_DTO_ERR_TRANSPORT error before. It usually means the network has a problem and is not functioning properly. I think the proper way to handle it is to report the error and abort the MPI program, since it is effectively a fatal error.
 
Lei  


----- Original Message -----
From: nilesh awate <nilesh_awate at yahoo.com>
Date: Tuesday, June 17, 2008 10:58 am
Subject: [mvapich-discuss] dat_evd_dequeue erroneous condition is not handled
To: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>



> Hi All,

> I am using mvapich2-1.0.1 over the uDAPL stack.

> I am getting a DAT_DTO_ERR_TRANSPORT error at the uDAPL level, but the MPI application does not terminate with an error.

> Browsing through the code, I observed the following:

> ret1 = dat_evd_dequeue (MPIDI_CH3I_RDMA_Process.cq_hndl[i], &event);
> if (ret1 == DAT_SUCCESS)
> {
>     assert (event.event_number == DAT_DTO_COMPLETION_EVENT);
>     /* but there is no check for event.event_data.dto_completion_event_data.status */
>     . . . .
>     . . . .
> }

> However, this condition is handled in the rdma_udapl_1sc.c file when dequeuing.

> What is the expected behavior of MPI when uDAPL raises an error like DAT_DTO_ERR_TRANSPORT?

> How is this kind of error going to be handled at the MPI level?
> OR
> How are underlying uDAPL errors reflected by MPI?

> I am using Pallas as the test application.

> Waiting for a reply.
> Thanks,
> Nilesh






