[mvapich-discuss] Errors with MVAPICH2-0.9.3
Amit H Kumar
AHKumar at odu.edu
Tue Sep 5 15:25:27 EDT 2006
Thank you Abhinav,
I will dig into it and let you know what I find.
-Amit
mvapich-discuss-bounces at cse.ohio-state.edu wrote on 09/05/2006 01:59:34 PM:
> Hi Amit,
>
> Thanks for the update. As I had suspected, we are seeing the
> error in data transmission with the VAPI-level test, perf_main.
>
> >
> > ************* RC BW Test started *********************
> >
> > Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
> > opcode=0=VAPI_CQE_SQ_SEND_DATA)
> > PERF_poll_cq: completion with error (12) detected
> >
>
> The error means that even after multiple retries the remote destination
> could not be reached (as shown by VAPI_RETRY_EXC_ERR). As a result, the QP
> is broken. Once a QP is in the broken state, any descriptors posted later on
> this QP will also result in an error.
>
> The remote side is waiting for data that never arrives because of
> this broken QP, and hence it hangs.
>
> >
> >
> > RECEIVER:
> > ==========
> > perf_main -a172.25.23.254
> >
> > ********************************************
> > ********* perf_main version 8.0 *********
> > ********* CPU is: 1.00 MHz *********
> > ********************************************
> >
> >
> >
> > ************* RC BW Test started *********************
> >
> > ..... Nothing shows up it hangs ...
> >
> >
> >
> > Any idea what could be wrong?
>
> I suggest contacting your system administrator or vendor to verify that
> the switch ports, cables, and HCA ports are functioning properly on the
> machines where you are seeing the errors.
>
> Please let us know your findings.
>
> Thanks,
>
> -- Abhinav
> >
> > Thank you,
> > -Amit
> >
> >
> >
> > Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006 12:57:03 AM:
> >
> > > Hi Amit,
> > >
> > > > > The problem of getting "VAPI_PORT_ACTIVE" and "VAPI_PORT_ERROR" events
> > > > > seems like a system setup issue. These are the events
> > > > > generated by the VAPI layer when a port goes down and comes back up.
> > > > >
> > > >
> > > > Does this mean that the VAPI library setup needs to be revisited on our
> > > > system, or is this something we can ignore? Also, jobs resulting in these
> > > > errors never get terminated.
> > >
> > > Thanks for reporting this problem.
> > >
> > > Actually, VAPI_PORT_ACTIVE means that the port status
> > > of the HCA is fluctuating. This is a typical symptom of a bad
> > > connector at either the HCA end or the switch end. Do you see this error
> > > with the same two nodes? Because of this asynchronous event, the QPs will
> > > go into the error state, and I do not expect the processes to run to
> > > completion.
> > >
> > > IMO, you will be able to see the errors in the VAPI-level tests too, like
> > > perf_main. Can you please try perf_main on the two nodes which are
> > > showing the error and update us on your findings?
> > >
> > > Thanks again,
> > >
> > > -- Abhinav
> > >
> > > >
> > > > Thank you,
> > > > Amit
> > > >
> > > >
> > > > > Thanks.
> > > > >
> > > > > Lei
> > > > >
> > > > > ----- Original Message -----
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > MVAPICH2-0.9.3
> > > > > >
> > > > > > Kernel Version : 2.4.21-20.ELsmp
> > > > > > Arch : x86_64
> > > > > > Compiler : INTEL8.1
> > > > > > mvapich2 : 0.9.3
> > > > > >
> > > > > >
> > > > > > Another error scenario:
> > > > > >
> > > > > >
> > > > > > Thank you for any feedback,
> > > > > > -Amit
> > > > > >
> > > > > > <===================
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > sched_setaffinity: Bad address
> > > > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > > > # Size Bandwidth (MB/s)
> > > > > > 1 0.640062
> > > > > > 2 1.280123
> > > > > > 4 2.560247
> > > > > > 8 5.120615
> > > > > > 16 10.240986
> > > > > > 32 20.481973
> > > > > > 64 40.963946
> > > > > > 128 81.927891
> > > > > > 256 163.855783
> > > > > > 512 327.719380
> > > > > > 1024 655.485649
> > > > > > 2048 655.423131
> > > > > > 4096 655.427038
> > > > > > 8192 1048.682011
> > > > > > 16384 1048.727022
> > > > > > 32768 1048.677010
> > > > > > 65536 1048.680135
> > > > > > 131072 1048.687949
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > 262144 986.999817
> > > > > > 524288 828.593573
> > > > > > 1048576 803.785572
> > > > > > 2097152 816.001536
> > > > > > 4194304 822.250940
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
> > > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > > <==================
> > > > > >
> > > > > > _______________________________________________
> > > > > > mvapich-discuss mailing list
> > > > > > mvapich-discuss at mail.cse.ohio-state.edu
> > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>
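For readers who want to check whether the port events keep coming from the same ranks, as Abhinav asks in the thread, here is a minimal Python sketch that tallies the asynchronous event lines in a job's output. It assumes one event per line in the "rank N, Got an asynchronous event: ..." format seen in the log, with any "> " quote prefixes already stripped. The function name tally_port_events and the sample text are hypothetical, and mapping ranks back to hosts still has to come from your own launcher or host file.

```python
import re
from collections import Counter

# Matches lines such as "rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE".
# Only the VAPI_PORT_* event name is captured; a trailing "(VAPI_EV_SYNDROME_NONE)"
# is ignored because "(" is not a word character.
EVENT_RE = re.compile(r"rank (\d+), Got an asynchronous event:\s*(VAPI_PORT_\w+)")


def tally_port_events(log_text):
    """Count asynchronous port events per (rank, event name) pair."""
    counts = Counter()
    for rank, event in EVENT_RE.findall(log_text):
        counts[(int(rank), event)] += 1
    return counts


# Hypothetical sample, modeled on the event lines reported in this thread.
sample = """\
rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
262144 986.999817
rank 6, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
rank 7, Got an asynchronous event: VAPI_PORT_ERROR(VAPI_EV_SYNDROME_NONE)
rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
"""

counts = tally_port_events(sample)
for (rank, event), n in sorted(counts.items()):
    print(f"rank {rank}: {event} x{n}")
```

If the same ranks show up across several runs, that points at the hosts (and their cables and switch ports) those ranks were placed on, which is the hardware check suggested above.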
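The bandwidth output in this thread drops from about 1048 MB/s to roughly 800-990 MB/s around the port events. As a rough aid for spotting that kind of drop in longer logs, here is a small Python sketch that reads "size bandwidth" rows and flags any row that falls noticeably below the best value seen so far. The 5% tolerance, the function names, and the sample text are assumptions made for illustration; they are not part of the OSU benchmark or of MVAPICH2, and quote prefixes must be stripped from the log first.

```python
def parse_bandwidth(log_text):
    """Return (size_bytes, mb_per_s) pairs, skipping headers and event lines."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "size bandwidth" row
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue  # two tokens, but not numeric
    return rows


def flag_drops(rows, tolerance=0.05):
    """Return rows more than `tolerance` below the best bandwidth seen so far.

    The 5% default is an arbitrary assumption; tune it for your fabric.
    """
    flagged = []
    peak = 0.0
    for size, bw in rows:
        if bw < peak * (1.0 - tolerance):
            flagged.append((size, bw))
        peak = max(peak, bw)
    return flagged


# Hypothetical sample, using a few rows taken from the log in this thread.
sample = """\
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
65536 1048.680135
131072 1048.687949
rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
262144 986.999817
524288 828.593573
1048576 803.785572
"""

for size, bw in flag_drops(parse_bandwidth(sample)):
    print(f"possible drop at {size} bytes: {bw:.2f} MB/s")
```

A flagged row is only a hint. Bandwidth can also fall at large message sizes for unrelated reasons, so it is worth comparing against a run on a node pair that shows no port events.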