[mvapich-discuss] MVAPICH2-PSM 1.4 InfiniPath context sharing problems, including patch
Ben Truscott
b.s.truscott at bristol.ac.uk
Mon Feb 1 11:09:24 EST 2010
Dear all,
I am using MVAPICH2 1.4 built for the PSM device on a cluster equipped with
QLogic InfiniPath QLE7140 InfiniBand HCAs. After updating our InfiniPath
software from version 2.2 to the recently released version 2.8 (the next
major version after 2.2, also known as QLogic OFED+ 1.4), I began to notice
consistent job failures caused by an inability to acquire the required
number of InfiniPath contexts whenever two or more MPI jobs had been
scheduled onto the same node at the same time.
Using the PSM environment variable PSM_VERBOSE_ENV, which is new in version
2.8 (PSM_TRACEMASK having disappeared) and prints the effective and default
values of all variables that affect the operation of PSM, I was able to
determine that this was due to the effective value of
PSM_SHAREDCONTEXTS_MAX being set to 16 regardless of the value I had passed
to the job. In fact the QLE7140 has four hardware contexts, each of which
can be shared four ways within a single MPI job. However, because PSM has
changed from eager sharing to greedy context acquisition in the latest
version, specifying PSM_SHAREDCONTEXTS_MAX=16 (default value: 4) caused the
first job started on each node to acquire one context per process without
employing context sharing, leaving insufficient contexts available for
subsequent jobs.
Since I had experienced no problems with the version of PSM supplied with
the InfiniPath 2.2 distribution, I initially suspected a bug in PSM itself
and contacted QLogic, but they were unable to reproduce the problem. After
verifying correct behaviour under OpenMPI I was persuaded that the problem
must be specific to MVAPICH2 and hence examined the file psm_entry.c, which
I found to contain a number of logic errors including hard-coded resetting
of the PSM environment to values that are, in general, likely to give rise
to problems of the sort that I encountered. I therefore submit the attached
(commented) patch for your consideration with a view to its possible
inclusion in the next version of MVAPICH2. Although I hope that its
original author will not take offence at my saying so, I feel I should note
as well that this file looks as if it was written very hastily and
generates more compiler warnings (using Intel C 11.1) than the rest of the
distribution combined. While the patch is intended solely to correct the
erroneous context-sharing behaviour and does not introduce any additional
warnings, it may be worthwhile to revisit psm_entry.c in general with a
view to rewriting it for a future release.
Best regards,
Ben Truscott
School of Chemistry
University of Bristol (UK)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: psm_entry.patch
Type: application/octet-stream
Size: 5935 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20100201/faa88883/psm_entry-0001.obj