[mvapich-discuss] Errors with MVAPICH2-0.9.3

Amit H Kumar AHKumar at odu.edu
Tue Sep 5 15:25:27 EDT 2006


Thank you, Abhinav.

Will try to dig into it and let you guys know what we find.

-Amit

mvapich-discuss-bounces at cse.ohio-state.edu wrote on 09/05/2006 01:59:34 PM:

> Hi Amit,
>
> Thanks for the update. As I had suspected, we are seeing the
> error in data transmission with the VAPI-level test, perf_main.
>
> >
> > ************* RC BW Test started  *********************
> >
> > Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
> > opcode=0=VAPI_CQE_SQ_SEND_DATA)
> > PERF_poll_cq: completion with error (12) detected
> >
>
> The error means that even after multiple retries the remote destination
> could not be reached (as indicated by VAPI_RETRY_EXC_ERR). As a result,
> the QP is broken. Once a QP is in the error state, any descriptors
> posted further on this QP will also complete in error.
>
> The remote side is waiting for data that never arrives
> because of this broken QP, and hence it hangs.
>
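> For reference, this is roughly the kind of send-completion check that a
> VAPI-level test like perf_main performs. A minimal sketch, assuming the
> Mellanox VAPI headers; check_send_completion, hca_hndl, and cq_hndl are
> illustrative names, not perf_main's own code:
>
>     /* Poll the send CQ once and classify the completion status.
>      * hca_hndl/cq_hndl are hypothetical, already-created handles. */
>     #include <stdio.h>
>     #include <vapi.h>
>
>     static int check_send_completion(VAPI_hca_hndl_t hca_hndl,
>                                      VAPI_cq_hndl_t  cq_hndl)
>     {
>         VAPI_wc_desc_t wc;
>         VAPI_ret_t ret = VAPI_poll_cq(hca_hndl, cq_hndl, &wc);
>
>         if (ret == VAPI_CQ_EMPTY)
>             return 0;                  /* nothing has completed yet */
>         if (ret != VAPI_OK) {
>             fprintf(stderr, "VAPI_poll_cq failed (%d)\n", (int) ret);
>             return -1;
>         }
>         if (wc.status == VAPI_RETRY_EXC_ERR) {
>             /* Transport retry count exceeded: the peer never ACKed.
>              * The QP moves to the error state, and every descriptor
>              * still posted on it will now be flushed in error. */
>             fprintf(stderr, "send completed in error: retry exceeded\n");
>             return -1;
>         }
>         return (wc.status == VAPI_SUCCESS) ? 1 : -1;
>     }
>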
> >
> >
> > RECEIVER:
> > ==========
> >  perf_main -a172.25.23.254
> >
> > ********************************************
> > *********  perf_main version 8.0   *********
> > *********  CPU is: 1.00 MHz     *********
> > ********************************************
> >
> >
> >
> > ************* RC BW Test started  *********************
> >
> > ..... Nothing shows up; it hangs ...
> >
> >
> >
> > Any idea what could be wrong?
>
> I would suggest you contact your system administrator/vendor to verify
> the proper functioning of the switch ports, cables, and HCA ports on
> the machines with which you are seeing the errors.
>
> Please let us know your findings.
>
> Thanks,
>
> -- Abhinav
> >
> > Thank you,
> > -Amit
> >
> >
> >
> > Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006
> > 12:57:03 AM:
> >
> > > Hi Amit,
> > >
> > > > > The problem of getting "VAPI_PORT_ACTIVE"/"VAPI_PORT_ERROR" events
> > > > > seems like a system setup issue. These are the events
> > > > > generated by VAPI layer once a port comes down and up.
> > > > >
> > > >
> > > > Does it mean that the VAPI library setup needs to be revisited on
> > > > our system, or is this something which we can ignore? Also, jobs
> > > > resulting in these errors never get terminated.
> > >
> > > Thanks for reporting this problem.
> > >
> > > Actually, VAPI_PORT_ACTIVE means that the port status
> > > of the HCA is fluctuating. This is a typical symptom of a bad
> > > connector, either at the HCA end or the switch end. Do you see this
> > > error with the same two nodes? Because of this asynchronous event,
> > > the QPs will go into the error state, and I do not expect the
> > > processes to run to completion.
> > >
> > > IMO, you will be able to see the errors with the VAPI-level tests
> > > too, like perf_main. Can you please try perf_main on the two nodes
> > > which are showing the error and update us on your findings?
> > >
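> > > As a rough illustration of where those messages come from: the MPI
> > > library registers an asynchronous event handler with the HCA, and
> > > VAPI invokes it whenever a port goes down or comes back up. A
> > > minimal sketch, assuming the VAPI headers; my_async_handler and the
> > > registration site are illustrative, not MVAPICH's exact code:
> > >
> > >     #include <stdio.h>
> > >     #include <vapi.h>
> > >
> > >     static void my_async_handler(VAPI_hca_hndl_t hca,
> > >                                  VAPI_event_record_t *event,
> > >                                  void *priv)
> > >     {
> > >         switch (event->type) {
> > >         case VAPI_PORT_ACTIVE:
> > >             fprintf(stderr, "async event: port came (back) up\n");
> > >             break;
> > >         case VAPI_PORT_ERROR:
> > >             /* QPs bound to this port transition to the error state */
> > >             fprintf(stderr, "async event: port went down\n");
> > >             break;
> > >         default:
> > >             break;
> > >         }
> > >     }
> > >
> > >     /* registered once, right after the HCA is opened:
> > >      *   VAPI_set_async_event_handler(hca_hndl, my_async_handler, NULL);
> > >      */
> > >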
> > > Thanks again,
> > >
> > > -- Abhinav
> > >
> > > >
> > > > Thank you,
> > > > Amit
> > > >
> > > >
> > > > > Thanks.
> > > > >
> > > > > Lei
> > > > >
> > > > > ----- Original Message -----
> > > > > >
> > > > > > Hi, MVAPICH2-0.9.3
> > > > > >
> > > > > > Kernel Version    :     2.4.21-20.ELsmp
> > > > > > Arch        :     x86_64
> > > > > > Compiler    :     INTEL8.1
> > > > > > mvapich2    :     0.9.3
> > > > > >
> > > > > >
> > > > > > Another Scenario of errors:
> > > > > >
> > > > > >
> > > > > > Thank you for any feedback,
> > > > > > -Amit
> > > > > >
> > > > > > <===================
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > > > # Size          Bandwidth (MB/s)
> > > > > > 1               0.640062
> > > > > > 2               1.280123
> > > > > > 4               2.560247
> > > > > > 8               5.120615
> > > > > > 16              10.240986
> > > > > > 32              20.481973
> > > > > > 64              40.963946
> > > > > > 128             81.927891
> > > > > > 256             163.855783
> > > > > > 512             327.719380
> > > > > > 1024            655.485649
> > > > > > 2048            655.423131
> > > > > > 4096            655.427038
> > > > > > 8192            1048.682011
> > > > > > 16384           1048.727022
> > > > > > 32768           1048.677010
> > > > > > 65536           1048.680135
> > > > > > 131072          1048.687949
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > 262144          986.999817
> > > > > > 524288          828.593573
> > > > > > 1048576         803.785572
> > > > > > 2097152         816.001536
> > > > > > 4194304         822.250940
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > <==================
> > > > > >
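> > > > > > An aside on the sched_setaffinity lines above: "Bad address"
> > > > > > is EFAULT from the affinity syscall, which on 2.4-era systems
> > > > > > was often a glibc/kernel mismatch in the call's expected mask
> > > > > > argument rather than a real application bug. For comparison,
> > > > > > the standard usage pattern looks like this (a sketch, not
> > > > > > MVAPICH's code; note that very old glibc versions exposed a
> > > > > > different sched_setaffinity signature):
> > > > > >
> > > > > >     #define _GNU_SOURCE
> > > > > >     #include <sched.h>
> > > > > >     #include <stdio.h>
> > > > > >
> > > > > >     /* Pin the calling process to one CPU. */
> > > > > >     int pin_to_cpu(int cpu)
> > > > > >     {
> > > > > >         cpu_set_t mask;
> > > > > >         CPU_ZERO(&mask);
> > > > > >         CPU_SET(cpu, &mask);
> > > > > >         if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
> > > > > >             perror("sched_setaffinity");  /* EFAULT => "Bad address" */
> > > > > >             return -1;
> > > > > >         }
> > > > > >         return 0;
> > > > > >     }
> > > > > >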
> > > > >
> > > >
> > >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at mail.cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


