[mvapich-discuss] hydra errors

Walid walid.shaari at gmail.com
Mon Jun 4 08:41:36 EDT 2012


For this specific user, I have been told it happens to him quite often. Between
me and the user there is another layer of application support; I have already
given them the --stdout option and am waiting to hear whether it worked for
them. I am not sure at what stage his job dies. I will see if I can find the
time to reproduce it with his code or with a generic MPI code.
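If I get to that, a first pass at a generic reproducer could look something like
the sketch below. This is only a rough outline: the hostnames, process count,
and benchmark binary are placeholders rather than details from our setup, and
osu_latency is from the OSU micro-benchmarks you mention below.

    # use the same MVAPICH2 1.7 install the user's jobs pick up
    export PATH=/usr/local/mpi/mvapich2/intel10/1.7/bin:$PATH
    # two-process OSU latency test under the hydra launcher
    mpiexec.hydra -n 2 -hosts node01,node02 ./osu_latency > hydra.out 2> hydra.err
    # the same test under mpirun_rsh for comparison
    mpirun_rsh -np 2 node01 node02 ./osu_latency > rsh.out 2> rsh.err

If the hydra run dies with the same asserts while the mpirun_rsh run completes,
that would point at the launcher rather than at the user's application.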

On 4 June 2012 15:17, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:

> Thanks for the reply. Is there a way we can reproduce this? What happens
> when you use mpirun_rsh? Have you tried mvapich2-1.8?
> On Jun 4, 2012 8:09 AM, "Walid" <walid.shaari at gmail.com> wrote:
>
>> Dear Jonathan,
>>
>> It is 1.7, compiled with Intel 10 for application compatibility and system
>> reasons. The configuration is as below:
>>
>>  /usr/local/mpi/mvapich2/intel10/1.7/bin/mpiname -o
>> Configuration
>> --prefix=/usr/local/mpi/mvapich2/intel10/1.7 --with-device=ch3:psm
>> --enable-g=dbg --enable-romio --enable-debuginfo
>> -with-file-system=panfs+nfs+ufs --with-psm-include=/usr/include
>> --with-psm=/usr/lib64
>>
>> On 4 June 2012 02:05, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>>
>>> Thanks for the report, Walid.  Can you tell us the version of MVAPICH2
>>> being used and whether or not this is reproducible with the
>>> OSU-Micro-Benchmarks?  Providing the configuration options used to build
>>> MVAPICH2 as well as architecture information of the machines this is
>>> being run on may be helpful as well.
>>>
>>> On Sun, Jun 03, 2012 at 01:56:55PM +0300, Walid wrote:
>>> > Dear all,
>>> >
>>> > One of the users has reported that almost all of his jobs die when he
>>> > runs them using mvapich2; the error messages are below. He is using a
>>> > simple call:
>>> >
>>> >     mpirun program program-options > output-file
>>> >
>>> > I have asked him to use --stdout=output-file and mpiexec.hydra; he has
>>> > not come back to me yet on whether that was successful, but I wanted to
>>> > check whether these errors have been seen before.
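>>> >
>>> > That is, instead of redirecting with the shell, something along the lines
>>> > of the following (the program name, its options, and the output file name
>>> > are only placeholders):
>>> >
>>> >     mpiexec.hydra --stdout=output-file program program-options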
>>> >
>>> >
>>> > [mpiexec at plci340] stdoe_cb (./ui/utils/uiu.c:309): assert (!closed) failed
>>> > porgram.err:[mpiexec at plci340] control_cb (./pm/pmiserv/pmiserv_cb.c:306): error in the UI defined callback
>>> > H.err:[mpiexec at plci340] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> > SW.err:[mpiexec at plci340] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>> > SW.err:[mpiexec at plci340] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>> >
>>> > Test.err:[mpiexec at ulca103] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
>>> > SW_Test.err:[mpiexec at ulca103] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> > SW_Test.err:[mpiexec at ulca103] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>> > _SW_Test.err:[mpiexec at ulca103] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>> >
>>> > _2.err:[mpiexec at plch419] stdoe_cb (./ui/utils/uiu.c:309): assert (!closed) failed
>>> > VG_2.err:[mpiexec at plch419] control_cb (./pm/pmiserv/pmiserv_cb.c:306): error in the UI defined callback
>>> > _2.err:[mpiexec at plch419] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> > VG_2.err:[mpiexec at plch419] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>> > VG_2.err:[mpiexec at plch419] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>> >
>>> > VG_CO_2.err:[mpiexec at plch374] stdoe_cb (./ui/utils/uiu.c:309): assert (!closed) failed
>>> > CO_2.err:[mpiexec at plch374] control_cb (./pm/pmiserv/pmiserv_cb.c:306): error in the UI defined callback
>>> >  ion available (required by /red/ct2/GP/GP_plch.exe)
>>> >
>>> > [mpiexec at plch416] stdoe_cb (./ui/utils/uiu.c:309): assert (!closed) failed
>>> > [mpiexec at plch416] control_cb (./pm/pmiserv/pmiserv_cb.c:306): error in the UI defined callback
>>> > [mpiexec at plch416] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> > [mpiexec at plch416] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>> > [mpiexec at plch416] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>> >
>>> >
>>> > thank you,
>>> >
>>> > Walid
>>>
>>> > _______________________________________________
>>> > mvapich-discuss mailing list
>>> > mvapich-discuss at cse.ohio-state.edu
>>> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>>>
>>
>>