[mvapich-discuss] Mvapich2 terminates with errors for > 64 procs

Mehmet mbelgin at gmail.com
Thu Sep 29 16:53:07 EDT 2011


Hi Everyone,

I am trying out mvapich2 on two 48-core nodes. It was compiled with gcc
4.4.5 on RHEL6. I noticed that even a simple hello_world fails with more
than 64 processes. If you try a generic mpirun_rsh -np 96 ... this is what
you will get:

Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(413): Initialization failed
(unknown)(): Other MPI error

After some hair pulling, I found this could be due to a failure in on-demand
connection management. I know that disabling "registration caching" when
compiling mvapich2 could cause this, but it is enabled by default and I did
not use any flags to disable it. I cannot think of anything else to check
and would very much appreciate your help.

I tried to work around the problem by raising the on-demand threshold
(MV2_ON_DEMAND_THRESHOLD=96). It made some difference, letting the code run
a little longer, but it eventually crashed with the same errors.
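
A launch with the threshold raised might look like the sketch below. The
host file path and the binary name are placeholders (assumptions, not the
values from the actual run), and the script only prints the command instead
of executing it, so it can be checked before launching.

```shell
#!/bin/sh
# Sketch: build an mpirun_rsh command line with the on-demand threshold raised.
# NP matches the 96 processes from the post; HOSTFILE and APP are hypothetical
# placeholders, not values taken from the original run.
NP=96
HOSTFILE=./hosts
APP=./hello_world

# mpirun_rsh accepts NAME=VALUE environment settings between the launcher
# options and the executable.
CMD="mpirun_rsh -np $NP -hostfile $HOSTFILE MV2_ON_DEMAND_THRESHOLD=$NP $APP"

# Print the command rather than running it (dry run).
echo "$CMD"
```

Setting the threshold to the process count means the job size does not
exceed it, which is the intent of the workaround described above.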

Have you ever seen this happen? Any thoughts?

Thanks in advance!
-Mehmet