[mvapich-discuss] Erros with MVAPICH2-0.9.3

Abhinav Vishnu vishnu at cse.ohio-state.edu
Thu Sep 7 14:26:28 EDT 2006


Hi Amit,

The status of the second port should not matter as long as the first
port is ACTIVE and you are using the first port for communication.
By default, perf_main uses the first port for communication.

On the other hand, you may want to use the 2nd port for communication
and see if the problem persists.
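For illustration, the selection rule described above amounts to picking the lowest-numbered ACTIVE port, falling back from the default (port 1) when only port 2 is up. This is a hypothetical sketch of that logic, not MVAPICH or perf_main code:

```python
def pick_active_port(port_states):
    """Return the lowest-numbered ACTIVE port, or None if all are down.

    port_states maps port number -> state string as reported by a
    port-query tool (e.g. "ACTIVE", "DOWN", "INIT").
    """
    active = [p for p, s in sorted(port_states.items()) if s == "ACTIVE"]
    return active[0] if active else None

# Amit's setup: port 1 down, port 2 ACTIVE on all nodes -- the default
# choice (port 1) cannot work, so port 2 must be selected explicitly.
print(pick_active_port({1: "DOWN", 2: "ACTIVE"}))   # -> 2
print(pick_active_port({1: "ACTIVE", 2: "DOWN"}))   # -> 1
```

The point is that a tool defaulting to port 1 will fail on a node where only port 2 is ACTIVE, which matches the symptom in this thread.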

Thanks,

-- Abhinav
-------------------------------
Abhinav Vishnu,
Graduate Research Associate,
Department Of Comp. Sc. & Engg.
The Ohio State University.
-------------------------------

On Thu, 7 Sep 2006, Amit H Kumar wrote:

>
> Hi Abhinav,
>
> I am not sure if this is reasonable or not, but giving it a shot:
>
> Of the 2 ports on my HCA, I have PORT 2 ACTIVE on all nodes. Could this
> be a problem?
>
>
> Thank you,
> Amit
>
> mvapich-discuss-bounces at cse.ohio-state.edu wrote on 09/05/2006 01:59:34 PM:
>
> > Hi Amit,
> >
> > Thanks for the update. As I had suspected, we are seeing the
> > error in data transmission with the VAPI-level test, perf_main.
> >
> > >
> > > ************* RC BW Test started  *********************
> > >
> > > Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
> > > opcode=0=VAPI_CQE_SQ_SEND_DATA)
> > > PERF_poll_cq: completion with error (12) detected
> > >
> >
> > The error means that even after multiple retries the remote destination
> > could not be reached (as shown by VAPI_RETRY_EXC_ERR). As a result, the
> > QP is broken. Once a QP is in the broken state, any descriptors posted
> > further on this QP will also complete with an error.
> >
> > The remote side is waiting for data to arrive, which never does
> > due to this broken QP and hence it hangs.
> >
> > >
> > >
> > > RECEIVER:
> > > ==========
> > >  perf_main -a172.25.23.254
> > >
> > > ********************************************
> > > *********  perf_main version 8.0   *********
> > > *********  CPU is: 1.00 MHz     *********
> > > ********************************************
> > >
> > >
> > >
> > > ************* RC BW Test started  *********************
> > >
> > > ..... Nothing shows up it hangs ...
> > >
> > >
> > >
> > > Any idea what could be wrong?
> >
> > I would suggest you contact your system administrator/vendor to verify
> > the proper functioning of the switch ports, cables, and HCA ports for
> > the machines on which you are seeing the errors.
> >
> > Please let us know your findings.
> >
> > Thanks,
> >
> > -- Abhinav
> > >
> > > Thank you,
> > > -Amit
> > >
> > >
> > >
> > > Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006
> > > 12:57:03 AM:
> > >
> > > > Hi Amit,
> > > >
> > > > > > The problem of getting "VAPI_PORT_ACTIVE"/"VAPI_PORT_ERROR"
> > > > > > events seems like a system setup issue. These are the events
> > > > > > generated by the VAPI layer once a port goes down and comes
> > > > > > back up.
> > > > > >
> > > > >
> > > > > Does this mean that the VAPI library setup needs to be revisited
> > > > > on our system, or is this something we can ignore? Also, jobs
> > > > > resulting in these errors never get terminated.
> > > >
> > > > Thanks for reporting this problem.
> > > >
> > > > Actually, VAPI_PORT_ACTIVE means that the port status
> > > > of the HCA is fluctuating. This is a typical symptom of a bad
> > > > connector at either the HCA end or the switch end. Do you see this
> > > > error with the same two nodes? Because of this asynchronous event,
> > > > the QPs will go into the error state, and I do not expect the
> > > > processes to run to completion.
> > > >
> > > > IMO, you will be able to see the errors with the VAPI-level tests
> > > > too, like perf_main. Can you please try perf_main on the two nodes
> > > > which are showing the error and update us on your findings?
> > > >
> > > > Thanks again,
> > > >
> > > > -- Abhinav
> > > >
> > > > >
> > > > > Thank you,
> > > > > Amit
> > > > >
> > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Lei
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Kernel Version    :     2.4.21-20.ELsmp
> > > > > > > Arch        :     x86_64
> > > > > > > Compiler    :     INTEL8.1
> > > > > > > mvapich2    :     0.9.3
> > > > > > >
> > > > > > >
> > > > > > > Another Scenario of errors:
> > > > > > >
> > > > > > >
> > > > > > > Thank you for any feedback,
> > > > > > > -Amit
> > > > > > >
> > > > > > > <===================
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > sched_setaffinity: Bad address
> > > > > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > > > > # Size          Bandwidth (MB/s)
> > > > > > > 1               0.640062
> > > > > > > 2               1.280123
> > > > > > > 4               2.560247
> > > > > > > 8               5.120615
> > > > > > > 16              10.240986
> > > > > > > 32              20.481973
> > > > > > > 64              40.963946
> > > > > > > 128             81.927891
> > > > > > > 256             163.855783
> > > > > > > 512             327.719380
> > > > > > > 1024            655.485649
> > > > > > > 2048            655.423131
> > > > > > > 4096            655.427038
> > > > > > > 8192            1048.682011
> > > > > > > 16384           1048.727022
> > > > > > > 32768           1048.677010
> > > > > > > 65536           1048.680135
> > > > > > > 131072          1048.687949
> > > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > > 262144          986.999817
> > > > > > > 524288          828.593573
> > > > > > > 1048576         803.785572
> > > > > > > 2097152         816.001536
> > > > > > > 4194304         822.250940
> > > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > > <==================
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > mvapich-discuss mailing list
> > > > > > > mvapich-discuss at mail.cse.ohio-state.edu
> > > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
