[mvapich-discuss] MVAPICH2-PSM 1.4 InfiniPath context sharing problems, including patch

Ben Truscott b.s.truscott at bristol.ac.uk
Mon Feb 1 11:09:24 EST 2010


Dear all,

I am using MVAPICH2 1.4 built for the PSM device on a cluster equipped with 
QLogic InfiniPath QLE7140 InfiniBand HCAs. After a recent update of our 
InfiniPath software from version 2.2 to the recently released version 2.8 
(the next major version after 2.2, also known as QLogic OFED+ 1.4) I began 
to notice consistent job failures caused by an inability to acquire the 
proper number of InfiniPath contexts in cases where two or more MPI jobs 
had been queued together on the same node at the same time.

Using the PSM environment variable PSM_VERBOSE_ENV, which is a new addition 
to version 2.8 (PSM_TRACEMASK having disappeared) that prints the effective 
and default values of all variables that affect the operation of PSM, I was 
able to determine that this was due to the effective value for 
PSM_SHAREDCONTEXTS_MAX being set to 16 regardless of the value I had passed 
to the job. In fact the QLE7140 has four hardware contexts, each of which 
can be shared four ways within a single MPI job, but, due to a change in 
the behaviour of PSM from eager sharing to greedy context acquisition in 
the latest version, the specification of PSM_SHAREDCONTEXTS_MAX=16 (default 
value: 4) caused the first job to start on each node to acquire one context 
per process without employing context sharing, thus leaving insufficient 
contexts available for subsequent jobs.
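
The arithmetic can be illustrated with a toy model. This is not PSM code; 
it only assumes, as described above, that a job takes one context per 
process until it reaches the smaller of its PSM_SHAREDCONTEXTS_MAX limit 
and the number of contexts still free on the node. The function name and 
its parameters are hypothetical.

```c
#include <assert.h>

/* Toy model only (not PSM source): hardware contexts one job takes on a
 * node, assuming one context per process up to the smaller of the job's
 * PSM_SHAREDCONTEXTS_MAX limit and the contexts still free. */
static int contexts_taken(int procs, int contexts_max, int contexts_free)
{
    assert(procs > 0 && contexts_max > 0 && contexts_free >= 0);
    int limit = contexts_max < contexts_free ? contexts_max : contexts_free;
    return procs < limit ? procs : limit;
}
```

Under this model a four-process job with a limit of 16 takes all four 
contexts of a QLE7140, so a second job finds none free. With a 
hypothetical limit of 2, the same job would take two contexts, share each 
between two processes, and leave two for the next job.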

Since I had experienced no problems with the version of PSM supplied with 
the InfiniPath 2.2 distribution, I initially suspected a bug in PSM itself 
and contacted QLogic, but they were unable to reproduce the problem. After 
verifying correct behaviour under OpenMPI I was persuaded that the problem 
must be specific to MVAPICH2 and hence examined the file psm_entry.c, which 
I found to contain a number of logic errors including hard-coded resetting 
of the PSM environment to values that are, in general, likely to give rise 
to problems of the sort that I encountered. I therefore submit the attached 
(commented) patch for your consideration with a view to its possible 
inclusion in the next version of MVAPICH2. Although I hope that its 
original author will not take offence at my saying so, I feel I should note 
as well that this file looks as if it was written very hastily and 
generates more compiler warnings (using Intel C 11.1) than the rest of the 
distribution combined. The patch is intended solely to correct the 
erroneous context sharing behaviour and, admittedly, does no more for the 
warnings than avoid adding to them, so it may be worthwhile to revisit 
psm_entry.c in general with a view to rewriting it for a future release.
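
For readers without the attachment, the general idea of not overriding 
the user's environment can be sketched as follows. This is not an excerpt 
from psm_entry.c or from the patch; the helper name is hypothetical, and 
the only library call relied on is POSIX setenv() with its overwrite 
argument set to 0.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from psm_entry.c or the attached patch:
 * supply a fallback for a PSM variable only when the user has not set it. */
static int psm_setenv_default(const char *name, const char *fallback)
{
    assert(name != NULL && fallback != NULL);
    /* overwrite == 0: a value already present in the environment is kept */
    return setenv(name, fallback, 0);
}
```

Called as psm_setenv_default("PSM_SHAREDCONTEXTS_MAX", "4"), this would 
leave a value passed to the job untouched and apply the fallback only when 
the variable is absent.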

Best regards,

Yours

Ben Truscott
School of Chemistry
University of Bristol (UK)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: psm_entry.patch
Type: application/octet-stream
Size: 5935 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20100201/faa88883/psm_entry-0001.obj

