[mvapich-discuss] Errors with MVAPICH2-0.9.3

Abhinav Vishnu vishnu at cse.ohio-state.edu
Tue Sep 5 13:59:34 EDT 2006


Hi Amit,

Thanks for the update. As I had suspected, we are seeing the
data-transmission error with the VAPI-level test, perf_main.

>
> ************* RC BW Test started  *********************
>
> Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
> opcode=0=VAPI_CQE_SQ_SEND_DATA)
> PERF_poll_cq: completion with error (12) detected
>

The error means that even after multiple retries the remote destination
could not be reached (as indicated by VAPI_RETRY_EXC_ERR). As a result, the QP
is broken. Once a QP is in the error state, any descriptors posted further on
this QP will also complete in error.

The remote side is waiting for data that never arrives because of this
broken QP, and hence it hangs.
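
To make this concrete, here is a minimal sketch (not MVAPICH's actual code) of
polling a completion queue at the VAPI level and treating a non-success status
such as VAPI_RETRY_EXC_ERR as fatal for the QP. The names (VAPI_poll_cq,
VAPI_wc_desc_t, VAPI_CQ_EMPTY, ...) are from the old Mellanox VAPI headers as I
remember them, and poll_one_completion is just a hypothetical helper, so please
verify the exact prototypes against vapi.h on your system:

    #include <stdio.h>
    #include <vapi.h>   /* old Mellanox VAPI header; install path may differ */

    /* Hypothetical helper: poll one completion and classify it.
     * Returns 0 on a good completion, -1 on an error completion
     * (e.g. VAPI_RETRY_EXC_ERR).  After an error completion the QP is in
     * the error state and every descriptor still posted on it is flushed
     * with an error as well. */
    static int poll_one_completion(VAPI_hca_hndl_t hca, VAPI_cq_hndl_t cq)
    {
        VAPI_wc_desc_t wc;
        VAPI_ret_t     ret;

        /* Spin until a completion shows up (real code would back off). */
        do {
            ret = VAPI_poll_cq(hca, cq, &wc);
        } while (ret == VAPI_CQ_EMPTY);

        if (ret != VAPI_OK) {
            fprintf(stderr, "VAPI_poll_cq failed: %d\n", (int) ret);
            return -1;
        }

        if (wc.status != VAPI_SUCCESS) {
            /* This is what perf_main is reporting above: the completion
             * carries an error status, the QP has moved to the error
             * state, and the peer keeps waiting for data that is never
             * delivered. */
            fprintf(stderr, "completion with error: status=%d opcode=%d\n",
                    (int) wc.status, (int) wc.opcode);
            return -1;
        }

        return 0;   /* good completion */
    }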

>
>
> RECEIVER:
> ==========
>  perf_main -a172.25.23.254
>
> ********************************************
> *********  perf_main version 8.0   *********
> *********  CPU is: 1.00 MHz     *********
> ********************************************
>
>
>
> ************* RC BW Test started  *********************
>
> ..... Nothing shows up it hangs ...
>
>
>
> Any idea what could be wrong?

I would suggest that you contact your system administrator/vendor to verify
the proper functioning of the switch ports, cables, and HCA ports on the
machines with which you are seeing these errors.
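
For reference, the "Got an asynchronous event: VAPI_PORT_ACTIVE/VAPI_PORT_ERROR"
messages in your earlier run are printed from an asynchronous event handler
registered at the VAPI level. The sketch below shows roughly what such a handler
looks like; EVAPI_set_async_event_handler and the VAPI_event_record_t fields are
quoted from memory of the Mellanox VAPI/EVAPI headers, so treat the exact
prototypes as assumptions and check them against evapi.h:

    #include <stdio.h>
    #include <vapi.h>
    #include <evapi.h>   /* Mellanox EVAPI extensions; install path may differ */

    /* Sketch of an asynchronous event handler.  A flapping link produces
     * alternating VAPI_PORT_ERROR / VAPI_PORT_ACTIVE events, and the QPs
     * using that port move to the error state. */
    static void async_event_handler(VAPI_hca_hndl_t hca,
                                    VAPI_event_record_t *event,
                                    void *private_data)
    {
        (void) hca;
        (void) private_data;

        switch (event->type) {
        case VAPI_PORT_ERROR:
            fprintf(stderr, "port went down (VAPI_PORT_ERROR)\n");
            break;
        case VAPI_PORT_ACTIVE:
            fprintf(stderr, "port came back up (VAPI_PORT_ACTIVE)\n");
            break;
        default:
            fprintf(stderr, "asynchronous event type %d\n", (int) event->type);
            break;
        }
    }

    /* Registration, assuming an already-opened HCA handle `hca`:
     *
     *     EVAPI_async_handler_hndl_t h;
     *     EVAPI_set_async_event_handler(hca, async_event_handler, NULL, &h);
     */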

Please let us know your findings.

Thanks,

-- Abhinav
>
> Thank you,
> -Amit
>
>
>
> Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006 12:57:03 AM:
>
> > Hi Amit,
> >
> > > > The problem of getting "VAPI_PORT_ACTIVE"/"VAPI_PORT_ERROR" events
> > > > seems like a system setup issue. These are the events
> > > > generated by VAPI layer once a port comes down and up.
> > > >
> > >
> > > Does it mean that the VAPI library setup needs to be revisited on our
> > > system, or is this something we can ignore? Also, jobs resulting in these
> > > errors never get terminated.
> >
> > Thanks for reporting this problem.
> >
> > Actually, VAPI_PORT_ACTIVE means that the port status
> > of the HCA is fluctuating. This is a typical symptom of a bad
> > connector at either the HCA end or the switch end. Do you see this error
> > with the same two nodes? Because of this asynchronous event, the QPs will
> > go into the error state, and I do not expect the processes to run to
> > completion.
> >
> > IMO, you will be able to see the errors with the VAPI-level tests too, such
> > as perf_main. Can you please try perf_main on the two nodes that are
> > showing the error and update us on your findings?
> >
> > Thanks again,
> >
> > -- Abhinav
> >
> > >
> > > Thank you,
> > > Amit
> > >
> > >
> > > > Thanks.
> > > >
> > > > Lei
> > > >
> > > > ----- Original Message -----
> > > > >
> > > > > Hi,
> > > > >
> > > > > MVAPICH2-0.9.3:
> > > > > Kernel Version    :     2.4.21-20.ELsmp
> > > > > Arch        :     x86_64
> > > > > Compiler    :     INTEL8.1
> > > > > mvapich2    :     0.9.3
> > > > >
> > > > >
> > > > > Another Scenario of errors:
> > > > >
> > > > >
> > > > > Thank you for any feedback,
> > > > > -Amit
> > > > >
> > > > > <===================
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > > # Size          Bandwidth (MB/s)
> > > > > 1               0.640062
> > > > > 2               1.280123
> > > > > 4               2.560247
> > > > > 8               5.120615
> > > > > 16              10.240986
> > > > > 32              20.481973
> > > > > 64              40.963946
> > > > > 128             81.927891
> > > > > 256             163.855783
> > > > > 512             327.719380
> > > > > 1024            655.485649
> > > > > 2048            655.423131
> > > > > 4096            655.427038
> > > > > 8192            1048.682011
> > > > > 16384           1048.727022
> > > > > 32768           1048.677010
> > > > > 65536           1048.680135
> > > > > 131072          1048.687949
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > 262144          986.999817
> > > > > 524288          828.593573
> > > > > 1048576         803.785572
> > > > > 2097152         816.001536
> > > > > 4194304         822.250940
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > <==================
> > > > >
> >
>


