[mvapich-discuss] Errors with MVAPICH2-0.9.3
Abhinav Vishnu
vishnu at cse.ohio-state.edu
Tue Sep 5 13:59:34 EDT 2006
Hi Amit,
Thanks for the update. As I had suspected, we are seeing the
data-transmission error with the VAPI-level test, perf_main.
>
> ************* RC BW Test started *********************
>
> Completion with error on send queue (syndrome=0x81=VAPI_RETRY_EXC_ERR ,
> opcode=0=VAPI_CQE_SQ_SEND_DATA)
> PERF_poll_cq: completion with error (12) detected
>
The error means that even after multiple retries the remote destination
could not be reached (as indicated by VAPI_RETRY_EXC_ERR). As a result, the QP
is broken. Once a QP is in the error state, any descriptors subsequently
posted on it will also fail.
The remote side is waiting for data that never arrives because of this
broken QP, and hence it hangs.
>
>
> RECEIVER:
> ==========
> perf_main -a172.25.23.254
>
> ********************************************
> ********* perf_main version 8.0 *********
> ********* CPU is: 1.00 MHz *********
> ********************************************
>
>
>
> ************* RC BW Test started *********************
>
> ..... Nothing shows up; it hangs ...
>
>
>
> Any idea what could be wrong?
I would suggest contacting your system administrator/vendor to verify
the proper functioning of the switch ports, cables, and HCA ports on the
machines where you are seeing the errors.
Please let us know your findings.
Thanks,
-- Abhinav
>
> Thank you,
> -Amit
>
>
>
> Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote on 09/02/2006 12:57:03 AM:
>
> > Hi Amit,
> >
> > > > The problem of getting "VAPI_PORT_ACTIVE"/"VAPI_PORT_ERROR" events
> > > > seems like a system setup issue. These are the events
> > > > generated by VAPI layer once a port comes down and up.
> > > >
> > >
> > > Does it mean that the VAPI library setup needs to be revisited on our
> > > system, or is this something we can ignore? Also, jobs resulting in these
> > > errors never terminate.
> >
> > Thanks for reporting this problem.
> >
> > Actually, VAPI_PORT_ACTIVE means that the port status
> > of the HCA is fluctuating. This is a typical symptom of a bad
> > connector at either the HCA end or the switch end. Do you see this error
> > with the same two nodes? Because of this asynchronous event, the QPs will
> > go into the error state, and I do not expect the processes to run to
> > completion.
> >
> > In my opinion, you will be able to see the errors with VAPI-level tests
> > too, such as perf_main. Can you please try perf_main on the two nodes
> > that are showing the error and update us on your findings?
> >
> > Thanks again,
> >
> > -- Abhinav
> >
> > >
> > > Thank you,
> > > Amit
> > >
> > >
> > > > Thanks.
> > > >
> > > > Lei
> > > >
> > > > ----- Original Message -----
> > > > >
> > > > > Hi,
> > > > >
> > > > > Kernel Version : 2.4.21-20.ELsmp
> > > > > Arch : x86_64
> > > > > Compiler : INTEL8.1
> > > > > mvapich2 : 0.9.3
> > > > >
> > > > >
> > > > > Another Scenario of errors:
> > > > >
> > > > >
> > > > > Thank you for any feedback,
> > > > > -Amit
> > > > >
> > > > > <===================
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > sched_setaffinity: Bad address
> > > > > # OSU MPI Bandwidth Test (Version 2.2)
> > > > > # Size Bandwidth (MB/s)
> > > > > 1 0.640062
> > > > > 2 1.280123
> > > > > 4 2.560247
> > > > > 8 5.120615
> > > > > 16 10.240986
> > > > > 32 20.481973
> > > > > 64 40.963946
> > > > > 128 81.927891
> > > > > 256 163.855783
> > > > > 512 327.719380
> > > > > 1024 655.485649
> > > > > 2048 655.423131
> > > > > 4096 655.427038
> > > > > 8192 1048.682011
> > > > > 16384 1048.727022
> > > > > 32768 1048.677010
> > > > > 65536 1048.680135
> > > > > 131072 1048.687949
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > 262144 986.999817
> > > > > 524288 828.593573
> > > > > 1048576 803.785572
> > > > > 2097152 816.001536
> > > > > 4194304 822.250940
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE)
> > > > > rank 7, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > rank 6, Got an asynchronous event: VAPI_PORT_ACTIVE
> > > > > <==================
> > > > >
> > > > > _______________________________________________
> > > > > mvapich-discuss mailing list
> > > > > mvapich-discuss at mail.cse.ohio-state.edu
> > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > >
> > > >
> > >
> > >
> >
>