[mvapich-discuss] Runtime parameters for Large memory jobs

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Jun 1 17:41:51 EDT 2012


Hi Mehmet,
>
> I have one simple and one difficult question :)
>
> 1) Is there a way to bypass IB using runtime parameters? (for
> troubleshooting etc)

No.

> 2) Could you recommend some runtime parameters to prevent buffer overruns
> etc for large memory jobs? I have a multiple-hundred-core job with
> 6-7GB memory utilization per core, which consistently segfaults. The
> issue is not related to unavailable memory, we have systems to support
> that. Ulimit values also look OK. I saw "MV2_SHMEM_COLL_MAX_MSG_SIZE" for
> example, but the manual includes no details for how it is used (is it
> boolean? If it takes a value, is it KB? Do we need to specify the unit
> after the value, e.g. "5GB"?)

Could you send us the backtrace for the segfaults you are seeing?

The value of MV2_SHMEM_COLL_MAX_MSG_SIZE is specified in bytes. The
default value for this parameter is currently 128 KBytes.
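
Since MV2_SHMEM_COLL_MAX_MSG_SIZE is given in bytes, a size such as
"128KB" has to be converted to a byte count before it is set. The
following is only a minimal sketch of that conversion and is not part of
MVAPICH2: the helper names size_to_bytes and env_assignment are
hypothetical, and it assumes binary multiples (1 KByte = 1024 bytes).

```python
import re

# Assumption: K/M/G are binary multiples (1 KB = 1024 bytes).
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def size_to_bytes(text):
    """Convert a string such as '128KB' or '5 GB' to an integer byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return int(number) * _UNITS[unit]


def env_assignment(text):
    """Build a NAME=value string with the size expressed in bytes."""
    return f"MV2_SHMEM_COLL_MAX_MSG_SIZE={size_to_bytes(text)}"


if __name__ == "__main__":
    # The 128 KByte default mentioned above, expressed in bytes.
    print(env_assignment("128KB"))
```

The resulting NAME=value string can then be placed in the job
environment; how it is passed to the processes depends on the launcher
you use.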

You are using MVAPICH2 1.6, which is very old.

The latest version of MVAPICH2 (1.8) supports jobs running on a larger
number of cores per node (like the 64-core Interlagos nodes you are using).
It also adds UD-based and Hybrid (UD-RC/XRC) modes to handle large jobs
with a reduced memory footprint, and it includes many new features and
bug fixes compared to 1.6. Please upgrade your installation to 1.8, try
your application, and let us know whether your job runs successfully.

Thanks,

DK


> Any suggestions will be very much appreciated!
>
> Thanks,
> -Mehmet
>
>
> PS:  I am using mvapich2 1.6 on 64-core Interlagos nodes
>


