[mvapich-discuss] MVAPICH2-PSM 1.4 InfiniPath context sharing problems, including patch

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Feb 1 14:48:12 EST 2010


Hi Ben,

Thanks for your note. To the best of our knowledge, InfiniPath 2.8 is
not publicly available. Thus, MVAPICH2 1.4 has not been tested with it
yet. It has been tested with InfiniPath version 2.2. Do you see any
problem with MVAPICH2 1.4 and InfiniPath version 2.2? Once we have access
to InfiniPath 2.8, we will test it with upcoming versions of MVAPICH2.
Thanks for sending us the design guidelines for InfiniPath 2.8 and the
patch. We will review these and incorporate them into the next MVAPICH2
release as appropriate.

Thanks,

DK



On Mon, 1 Feb 2010, Ben Truscott wrote:

> Dear all
>
> I am using MVAPICH2 1.4 built for the PSM device on a cluster equipped with
> QLogic InfiniPath QLE7140 InfiniBand HCAs. After a recent update of our
> InfiniPath software from version 2.2 to the recently released version 2.8
> (the next major version after 2.2, also known as QLogic OFED+ 1.4) I began
> to notice consistent job failures caused by an inability to acquire the
> proper number of InfiniPath contexts in cases where two or more MPI jobs
> had been queued together on the same node at the same time.
>
> Using the PSM environment variable PSM_VERBOSE_ENV, which is a new addition
> to version 2.8 (PSM_TRACEMASK having disappeared) that prints the effective
> and default values of all variables that affect the operation of PSM, I was
> able to determine that this was due to the effective value for
> PSM_SHAREDCONTEXTS_MAX being set to 16 regardless of the value I had passed
> to the job. In fact the QLE7140 has four hardware contexts, each of which
> can be shared four ways within a single MPI job, but, due to a change in
> the behaviour of PSM from eager sharing to greedy context acquisition in
> the latest version, the specification of PSM_SHAREDCONTEXTS_MAX=16 (default
> value: 4) caused the first job to start on each node to acquire one context
> per process without employing context sharing, thus leaving insufficient
> contexts available for subsequent jobs.
>
> Since I had experienced no problems with the version of PSM supplied with
> the InfiniPath 2.2 distribution, I initially suspected a bug in PSM itself
> and contacted QLogic, but they were unable to reproduce the problem. After
> verifying correct behaviour under OpenMPI I was persuaded that the problem
> must be specific to MVAPICH2 and hence examined the file psm_entry.c, which
> I found to contain a number of logic errors including hard-coded resetting
> of the PSM environment to values that are, in general, likely to give rise
> to problems of the sort that I encountered. I therefore submit the attached
> (commented) patch for your consideration with a view to its possible
> inclusion in the next version of MVAPICH2. Although I hope that its
> original author will not take offence at my saying so, I feel I should note
> as well that this file looks as if it was written very hastily and
> generates more compiler warnings (using Intel C 11.1) than the rest of the
> distribution combined. While the patch is solely intended to correct the
> erroneous context sharing behaviour and, admittedly, does not introduce any
> additional warnings, it may be worthwhile to re-visit psm_entry.c in
> general with a view to re-writing it for a future release.
>
> Best regards,
>
> Yours
>
> Ben Truscott
> School of Chemistry
> University of Bristol (UK)


