[mvapich-discuss] Errors with MVAPICH2-0.9.3

Amit H Kumar AHKumar at odu.edu
Tue Sep 5 11:56:42 EDT 2006


Hi Abhinav,

I believe, as you mentioned, it only happens with specific nodes.

My perf_main test fails as follows.

SENDER:
========
11:16] node-0-0:bin] perf_main --send -trc -mbw -s1280 -n1000

********************************************
*********  perf_main version 8.0   *********
*********  CPU is: 1.00 MHz     *********
********************************************



************* RC BW Test started  *********************

Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR , opcode=0=VAPI_CQE_SQ_SEND_DATA)
PERF_poll_cq: completion with error (12) detected



RECEIVER:
==========
 perf_main -a172.25.23.254

********************************************
*********  perf_main version 8.0   *********
*********  CPU is: 1.00 MHz     *********
********************************************



************* RC BW Test started  *********************

..... Nothing shows up; it hangs ...
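
For anyone triaging similar output, here is a minimal sketch for pulling the syndrome and opcode out of the sender's failure line. It is not part of perf_main or VAPI. The regular expression and the function name parse_cq_error are assumptions made for illustration, and the field layout (syndrome=<hex>=<name>, opcode=<int>=<name>) is inferred only from the single line shown above.

```python
import re

# Field layout inferred from the one perf_main failure line in this post;
# other perf_main versions may print it differently (assumption).
CQE_ERROR = re.compile(
    r"syndrome=(0x[0-9a-fA-F]+)=(\w+)\s*,\s*opcode=(\d+)=(\w+)"
)


def parse_cq_error(log_text):
    """Return (syndrome_code, syndrome_name, opcode, opcode_name) or None."""
    # Collapse all whitespace so a line wrapped by a mail client still matches.
    flat = " ".join(log_text.split())
    match = CQE_ERROR.search(flat)
    if match is None:
        return None
    return (
        int(match.group(1), 16),  # numeric syndrome, e.g. 0x81
        match.group(2),           # symbolic syndrome name
        int(match.group(3)),      # numeric opcode
        match.group(4),           # symbolic opcode name
    )


# The sender output from this post, including the mail-wrapped line break.
sample = """Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
opcode=0=VAPI_CQE_SQ_SEND_DATA)"""

print(parse_cq_error(sample))
```

Running the same function over logs from several node pairs makes it easier to see whether the retry-exceeded syndrome is tied to particular nodes.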



Any idea what could be wrong?

Thank you,
-Amit



Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006 12:57:03 AM:

> Hi Amit,
>
> > > Getting "VAPI_PORT_ACTIVE" and "VAPI_PORT_ERROR" events
> > > looks like a system setup issue. These events are generated
> > > by the VAPI layer when a port goes down and comes back up.
> > >
> >
> > Does it mean that the VAPI library setup needs to be revisited on our
> > system?
> > Or is this something we can ignore? Also, jobs that hit these
> > errors never get terminated.
>
> Thanks for reporting this problem.
>
> Actually, VAPI_PORT_ACTIVE means that the port status
> of the HCA is fluctuating. This is a typical symptom of a bad
> connector at either the HCA end or the switch end. Do you see this error
> with the same two nodes?  Because of this asynchronous event, the QPs will
> go into the error state, and I do not expect the processes to run to
> completion.
>
> IMO, you will be able to see the errors in the VAPI-level tests too, like
> perf_main.  Can you please try perf_main on the two nodes which are
> showing the error and update us on your findings?
>
> Thanks again,
>
> -- Abhinav
>
> >
> > Thank you,
> > Amit
> >
> >
> > > Thanks.
> > >
> > > Lei
> > >
> > > ----- Original Message -----
> > > >
> > > > Hi MVAPICH----2-0.9.3
> > > >
> > > > Kernel Version    :     2.4.21-20.ELsmp
> > > > Arch        :     x86_64
> > > > Compiler    :     INTEL8.1
> > > > mvapich2    :     0.9.3
> > > >
> > > >
> > > > Another Scenario of errors:
> > > >
> > > >
> > > > Thank you for any feedback,
> > > > -Amit
> > > >
> > > > <===================
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > sched_setaffinity: Bad address
> > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > # Size          Bandwidth (MB/s)
> > > > 1               0.640062
> > > > 2               1.280123
> > > > 4               2.560247
> > > > 8               5.120615
> > > > 16              10.240986
> > > > 32              20.481973
> > > > 64              40.963946
> > > > 128             81.927891
> > > > 256             163.855783
> > > > 512             327.719380
> > > > 1024            655.485649
> > > > 2048            655.423131
> > > > 4096            655.427038
> > > > 8192            1048.682011
> > > > 16384           1048.727022
> > > > 32768           1048.677010
> > > > 65536           1048.680135
> > > > 131072          1048.687949
> > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > 262144          986.999817
> > > > 524288          828.593573
> > > > 1048576         803.785572
> > > > 2097152         816.001536
> > > > 4194304         822.250940
> > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
> > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
> > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > <==================
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss at mail.cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > >
> >
> >
>
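
To check whether the port events in a run like the quoted osu_bw output come from the same ranks (and therefore the same nodes) each time, a small script along these lines may help. It is a sketch only: the function names port_events_by_rank and flapping_ranks are hypothetical, and the pattern covers just the two event names that appear in this thread.

```python
import re
from collections import Counter

# Matches only the two asynchronous events seen in this thread (assumption:
# the "rank N, Got an asynchronous event: NAME" wording is stable).
EVENT = re.compile(
    r"rank (\d+), Got an asynchronous event:\s*(VAPI_PORT_(?:ACTIVE|ERROR))"
)


def port_events_by_rank(log_text):
    """Count port events per rank; tolerates several events on one line."""
    counts = {}
    for rank, event in EVENT.findall(log_text):
        counts.setdefault(int(rank), Counter())[event] += 1
    return counts


def flapping_ranks(log_text):
    """Ranks that reported both a port error and a port-active event."""
    return sorted(
        rank
        for rank, events in port_events_by_rank(log_text).items()
        if events["VAPI_PORT_ERROR"] and events["VAPI_PORT_ACTIVE"]
    )


# Event lines taken from the quoted run, with two events run together on
# one line the way interleaved output from several ranks can appear.
sample = (
    "rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE\n"
    "rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE\n"
    "rank 6, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)"
    "rank 7, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)\n"
)

print(flapping_ranks(sample))  # prints [6, 7]
```

If the same ranks show up across repeated runs, that points at the cabling or ports of the nodes hosting them, consistent with the bad-connector explanation above.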


