[mvapich-discuss] ipath_update_tid_err: failed: Bad address

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Jan 28 15:42:35 EST 2015


Hi Jeff:
This looks more like an error with the underlying fabric.

We looked over the PSM code on GitHub, and the "Failed to update %d tids"
message appears to come from the "ips_tid_acquire" function. One possibility is
that the WQEs are running out. PSM does not appear to expose any tuning knobs
for this, so we don't believe that much can be done at the upper (MPI) layer.
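
As a small aid to reading the log, and not a diagnosis: "Bad address" is the
standard text for the errno value EFAULT, which suggests that a system call
into the driver rejected a user-space address. A minimal Python sketch to
confirm that mapping on a given node follows; describe_errno is a hypothetical
helper name, not part of PSM or MVAPICH2. The "err=23" on the second log line
may be a PSM-level code rather than an errno, so the sketch does not try to
decode it.

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and C library message for an errno value.

    Hypothetical helper for illustration only; it knows nothing about PSM.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # EFAULT is the errno whose message text is "Bad address" on Linux.
    print(describe_errno(errno.EFAULT))
```

If the printed text matches the message in the log, the first line is most
likely a plain errno report from a failed system call.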

On Wed, Jan 28, 2015 at 07:29:48AM -0800, Jeff Hammond wrote:
> I am running NWChem with ARMCI-MPI3 over MVAPICH2 2.1rc1 on Intel True
> Scale via PSM.
> 
> The following error occurs in the application around the place where
> nontrivial communication starts:
> 
> ehs110.111084ipath_update_tid_err: failed: Bad address
> ehs110.111084Failed to update 32 tids (err=23)
> 
> Do you have any ideas why this happens or suggestions on how to debug
> it?  NWChem often blasts one rank with MPI_Fetch_and_op operations as
> part of its dynamic load-balancer, if that helps at all.
> 
> This is how I built MVAPICH2:
> 
> ../configure --prefix=/home/jrhammon/nwchem-project-dir/builds/mv2-2.1rc1-icc-psm
> --enable-fortran=f77 --enable-g=dbg CC=icc CXX=icpc FC=ifort
> --with-psm=/usr/local/ofed/3.5-2-MIC-rc3 --with-device=ch3:psm
> 
> I compiled ARMCI-MPI (mpi3rma branch) like this:
> 
> ../configure CC=/panfs/projects/nwchem/builds/mv2-2.1rc1-icc-psm/bin/mpicc
> --prefix=/panfs/projects/nwchem/builds/armci-mpi3-mv2-2.1rc1-icc-psm
> 
> Thanks,
> 
> Jeff
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/

-- 
Jonathan Perkins

