[mvapich-discuss] dat_evd_dequeue erroneous condition is not handled
nilesh awate
nilesh_awate at yahoo.com
Wed Jul 9 09:05:55 EDT 2008
Hi Lei,
I have created a small patch that handles the transport error: it aborts the MPI application and exits.
I have tried it on mvapich2-1.0.1 and mvapich2-1.0.3.
Here is the patch:
--- orig_mvapich2-1.0.1/src/mpid/osu_ch3/channels/mrail/src/udapl/udapl_channel_manager.c 2007-09-06 02:14:15.000000000 +0530
+++ mvapich2-1.0.1_patched/src/mpid/osu_ch3/channels/mrail/src/udapl/udapl_channel_manager.c 2008-07-02 15:30:45.000000000 +0530
@@ -455,6 +455,8 @@
int i, j, needed;
static int last_poll = 0;
int type = T_CHANNEL_NO_ARRIVE;
+ int rank;
+ PMI_Get_rank(&rank);
*vbuf_handle = NULL;
for (i = last_poll, j = 0;
@@ -467,6 +469,16 @@
{
DEBUG_PRINT ("[poll cq]: get complete queue entry\n");
assert (event.event_number == DAT_DTO_COMPLETION_EVENT);
+
+ /* Following is the patch to bail out on a fatal error such as
+ DAT_DTO_ERR_TRANSPORT (occurs when the network malfunctions) */
+
+ if (event.event_data.dto_completion_event_data.status != DAT_DTO_SUCCESS)
+ {
+ udapl_error_abort(UDAPL_STATUS_ERR,"[%d]DAT_EVD_ERROR in Consume_signals %x \n",rank,
+ event.event_data.dto_completion_event_data.status);
+ }
+
sc = ((struct vbuf *) event.event_data.
dto_completion_event_data.user_cookie.as_ptr)->desc;
v = (vbuf *) ((aint_t) sc.cookie.as_ptr);
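
The check added by the patch can be sketched in isolation. The following is a minimal, self-contained illustration, not the MVAPICH2 code: the `mock_` types and constants and the function `check_dto_completion` are hypothetical stand-ins for the uDAPL event structures, chosen so that the logic compiles without the DAT headers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the uDAPL types and constants.
 * Real code would use the DAT headers; the values here are illustrative only. */
typedef enum {
    MOCK_DTO_SUCCESS = 0,
    MOCK_DTO_ERR_TRANSPORT = 1
} mock_dto_status;

#define MOCK_DTO_COMPLETION_EVENT 1

typedef struct {
    int event_number;        /* kind of event that was dequeued */
    mock_dto_status status;  /* completion status of the data transfer */
} mock_event;

/* Returns 0 when the completion is usable, -1 when the caller should abort.
 * The point is that the status is inspected before the completion's
 * cookie/descriptor is dereferenced. */
int check_dto_completion(int rank, const mock_event *ev)
{
    assert(ev != NULL);

    if (ev->event_number != MOCK_DTO_COMPLETION_EVENT) {
        fprintf(stderr, "[%d] unexpected event number %d\n",
                rank, ev->event_number);
        return -1;
    }
    if (ev->status != MOCK_DTO_SUCCESS) {
        fprintf(stderr, "[%d] DTO completion failed, status 0x%x\n",
                rank, (unsigned) ev->status);
        return -1;
    }
    return 0;
}
```

A caller would only go on to read the completion's user cookie when this returns 0.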
Regards,
Nilesh
----- Original Message ----
From: LEI CHAI <chai.15 at osu.edu>
To: nilesh awate <nilesh_awate at yahoo.com>
Cc: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
Sent: Wednesday, 18 June, 2008 2:27:32 AM
Subject: Re: [mvapich-discuss] dat_evd_dequeue erroneous condition is not handled
Hi,
We have never seen the DAT_DTO_ERR_TRANSPORT error before. It usually means the network has a problem and is not functioning properly. I think the proper way to handle it is to report the error and abort the MPI program, since it is a fatal error.
Lei
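
One way to picture the "report the error and abort" handling is a small helper that formats a per-rank message and then terminates the process. This is only a sketch under assumptions: `format_fatal` and `report_and_abort` are hypothetical names, and plain `exit` stands in for whatever abort path the MPI library actually uses.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: builds the message a rank would print before aborting.
 * Returns the number of characters snprintf would have written. */
int format_fatal(char *buf, size_t len, int rank, unsigned status)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "[%d] fatal DTO completion status 0x%x",
                    rank, status);
}

/* Hypothetical helper: report the error on stderr, then terminate.
 * exit() is a placeholder for the library's own abort routine. */
void report_and_abort(int rank, unsigned status)
{
    char msg[128];

    format_fatal(msg, sizeof msg, rank, status);
    fprintf(stderr, "%s\n", msg);
    exit(EXIT_FAILURE);
}
```

In an MPI program the final step would more likely be `MPI_Abort(MPI_COMM_WORLD, errorcode)` than `exit`, so that the other ranks are terminated as well.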
----- Original Message -----
From: nilesh awate <nilesh_awate at yahoo.com>
Date: Tuesday, June 17, 2008 10:58 am
Subject: [mvapich-discuss] dat_evd_dequeue erroneous condition is not handled
To: MVAPICH2 <mvapich-discuss at cse.ohio-state.edu>
> Hi All,
> I am using mvapich2-1.0.1 over the uDAPL stack.
> I am getting a DAT_DTO_ERR_TRANSPORT error at the uDAPL level, but the MPI application does not terminate with an error.
> Browsing through the code, I observed the following:
> ret1 = dat_evd_dequeue (MPIDI_CH3I_RDMA_Process.cq_hndl[i], &event);
> if (ret1 == DAT_SUCCESS)
> {
> assert (event.event_number == DAT_DTO_COMPLETION_EVENT);
> /* but there is no check for event.event_data.dto_completion_event_data.status */
> . . . .
> . . . .
> }
> However, this condition is handled in the rdma_udapl_1sc.c file when dequeuing.
> What is the expected behavior of MPI when uDAPL reports an error like DAT_DTO_ERR_TRANSPORT?
> How is this kind of error handled at the MPI level?
> Or:
> How are underlying uDAPL errors reflected by MPI?
> I am using Pallas as the test application.
> Waiting for your reply.
> Thanks,
> Nilesh